/ Blog

CUDA Programming Guide: GPU Architecture, Kernels, Memory, and Profiling

Comprehensive CUDA programming guide: GPU vs CPU architecture, streaming multiprocessors, warps, thread hierarchy, memory types (global, shared, constant), kernel writing, streams, debugging, and performance optimization.

CUDA (Compute Unified Device Architecture) is NVIDIA’s parallel computing platform that enables developers to use GPU hardware for general-purpose computation. Since its introduction in 2007, CUDA has become the dominant programming model for GPU-accelerated scientific computing, deep learning, and data analytics.

GPU vs CPU Architecture

The fundamental architectural difference explains why GPUs excel at certain problems:

CharacteristicCPU (e.g., EPYC 9654)GPU (e.g., H100)
Cores96 high-frequency16,896 CUDA cores
Clock speed2.4–3.7 GHz1.8–3.35 GHz
Memory bandwidth~460 GB/s (DDR5)3.35 TB/s (HBM3)
Cache per coreLarge (L3 ~384 MB)Small (L1 per SM)
Optimization goalLow latency (single thread)High throughput (many threads)
Best use caseComplex branching logicRegular, data-parallel operations

CPUs are designed to minimize latency for a small number of sequential tasks. GPUs are designed to maximize throughput by running thousands of threads simultaneously, tolerating memory latency by switching between warps.

GPU Architecture: SM, Warp, Thread, Block, Grid

Understanding the execution hierarchy is essential for writing efficient CUDA code:

Streaming Multiprocessor (SM): The fundamental compute unit. An H100 has 132 SMs, each with 128 CUDA cores, 4 Tensor Core units, 64 KB L1 cache/shared memory, and a warp scheduler.

Warp: 32 threads that execute in lockstep on a single SM. This is the atomic unit of GPU scheduling. All 32 threads execute the same instruction at the same time (SIMT: Single Instruction Multiple Threads). Divergent code paths (if/else with different branches per thread) cause serialization.

Thread Block: 1–1024 threads organized in 1D, 2D, or 3D layout. All threads in a block share L1 cache/shared memory and can synchronize with __syncthreads(). One block runs entirely on one SM.

Grid: All thread blocks launched by a single kernel call. Blocks are distributed across SMs automatically.

Installation

# Install CUDA Toolkit (Ubuntu 22.04, CUDA 12.2)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
apt-get update && apt-get install -y cuda-toolkit-12-2

# Verify installation
nvcc --version
nvidia-smi

# Set environment variables
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

Writing Your First CUDA Kernel

A CUDA kernel is a function that runs on the GPU, executed by thousands of threads simultaneously:

// vector_add.cu — add two vectors element-wise on the GPU

#include <cuda_runtime.h>
#include <stdio.h>

// __global__ marks this function as a GPU kernel (called from CPU, runs on GPU)
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    // Compute the global thread index
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Bounds check: thread count may exceed vector length
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    int n = 1 << 20;  // 1M elements
    size_t bytes = n * sizeof(float);

    // Allocate host memory
    float *h_a = (float*)malloc(bytes);
    float *h_b = (float*)malloc(bytes);
    float *h_c = (float*)malloc(bytes);

    // Initialize vectors on host
    for (int i = 0; i < n; i++) {
        h_a[i] = (float)i;
        h_b[i] = (float)(n - i);
    }

    // Allocate device memory
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Copy data from host to device
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch kernel: 256 threads per block, enough blocks to cover all elements
    int threads_per_block = 256;
    int blocks_per_grid = (n + threads_per_block - 1) / threads_per_block;

    vector_add<<<blocks_per_grid, threads_per_block>>>(d_a, d_b, d_c, n);

    // Wait for GPU to finish
    cudaDeviceSynchronize();

    // Copy result back to host
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    // Verify
    printf("c[0] = %.1f (expected %.1f)\n", h_c[0], h_a[0] + h_b[0]);

    // Free memory
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);

    return 0;
}
# Compile and run
nvcc -O3 -arch=sm_90 vector_add.cu -o vector_add  # sm_90 = H100
./vector_add

Memory Hierarchy

GPU memory hierarchy is the most important topic for performance optimization:

Memory TypeLocationSizeLatencyScope
RegisterOn-chip~256 KB per SM1 cyclePer thread
Shared memory (L1)On-chip0–228 KB per SM20–30 cyclesPer block
L2 cacheOn-chip50 MB (H100)~200 cyclesAll SMs
Global memory (HBM)Off-chip80 GB (H100)600–800 cyclesAll threads
Constant memoryOff-chip (cached)64 KB1 cycle (if cached)All threads
Texture memoryOff-chip (cached)LimitedLow (spatial locality)All threads

Using shared memory to reduce global memory accesses is the single most impactful optimization for memory-bound kernels:

// Matrix multiplication using shared memory tiling
__global__ void matmul_shared(const float *A, const float *B, float *C, int n) {
    __shared__ float tile_A[32][32];
    __shared__ float tile_B[32][32];

    int row = blockIdx.y * 32 + threadIdx.y;
    int col = blockIdx.x * 32 + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < n / 32; t++) {
        // Load tiles from global memory to shared memory
        tile_A[threadIdx.y][threadIdx.x] = A[row * n + t * 32 + threadIdx.x];
        tile_B[threadIdx.y][threadIdx.x] = B[(t * 32 + threadIdx.y) * n + col];

        __syncthreads();  // Ensure all threads have loaded their tiles

        // Compute partial dot product from tile
        for (int k = 0; k < 32; k++) {
            sum += tile_A[threadIdx.y][k] * tile_B[k][threadIdx.x];
        }

        __syncthreads();  // Ensure computation is done before loading next tile
    }

    if (row < n && col < n) C[row * n + col] = sum;
}

CUDA Streams

Streams allow overlapping data transfer with kernel execution, hiding PCIe latency:

// Create multiple streams for pipeline parallelism
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Stream 1: copy chunk 1 while stream 2 processes chunk 0
cudaMemcpyAsync(d_data1, h_data1, chunk_bytes, cudaMemcpyHostToDevice, stream1);
process_kernel<<<grid, block, 0, stream2>>>(d_result0, d_data0, chunk_size);

// Synchronize all streams
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);

cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);

Debugging and Profiling

# cuda-memcheck: detect memory errors (access violations, race conditions)
cuda-memcheck ./my_app

# Nsight Compute: kernel-level performance profiling
ncu --metrics l1tex__t_bytes,sm__throughput,gpu__compute_memory_throughput \
    ./my_app

# Nsight Systems: system-level profiling (CPU+GPU timeline)
nsys profile --trace=cuda,nvtx,osrt ./my_app
nsys stats report.nsys-rep

Key metrics to examine in Nsight Compute:

  • SM occupancy: Fraction of maximum warps that are active. Target > 50%.
  • Memory throughput: Bandwidth utilization as fraction of peak HBM bandwidth. Target > 70%.
  • Compute throughput: FLOP utilization as fraction of peak CUDA core throughput.
  • Stall reasons: If occupancy is high but throughput is low, stall analysis shows whether it is memory latency, synchronization, or pipeline dependencies.

Common Performance Pitfalls

Warp divergence: Branches where adjacent threads take different paths serializes execution. Restructure data so threads in the same warp take the same branch.

Uncoalesced memory access: Global memory accesses are most efficient when 32 consecutive threads access 32 consecutive memory addresses. Strided or random access patterns drop bandwidth by 10–32x.

Insufficient parallelism: Launching too few threads leaves SMs idle. Rule of thumb: at least 1024 threads per SM × number of SMs on the device.

Excessive host-device transfers: PCIe bandwidth (~32 GB/s PCIe Gen4) is 100x lower than HBM bandwidth. Minimize round trips between host and device; keep data on the GPU across multiple kernel calls.


CUDA programming is a deep subject that rewards careful study of memory access patterns and parallelism structure. For GPU cluster deployment and CUDA application optimization consulting, contact the Mevasis engineering team.