CUDA Programming Guide for HPC: Kernels, Memory Hierarchy, Streams

CUDA (Compute Unified Device Architecture) is NVIDIA’s parallel computing platform that enables developers to use GPU hardware for general-purpose computation. Since its introduction in 2007, CUDA has become the dominant programming model for GPU-accelerated scientific computing, deep learning, and data analytics.

GPU vs CPU Architecture

The fundamental architectural difference explains why GPUs excel at certain problems:

Characteristic	CPU (e.g., EPYC 9654)	GPU (e.g., H100)
Cores	96 high-frequency	16,896 CUDA cores
Clock speed	2.4–3.7 GHz	1.8–3.35 GHz
Memory bandwidth	~460 GB/s (DDR5)	3.35 TB/s (HBM3)
Cache per core	Large (L3 ~384 MB)	Small (L1 per SM)
Optimization goal	Low latency (single thread)	High throughput (many threads)
Best use case	Complex branching logic	Regular, data-parallel operations

CPUs are designed to minimize latency for a small number of sequential tasks. GPUs are designed to maximize throughput by running thousands of threads simultaneously, tolerating memory latency by switching between warps.

GPU Architecture: SM, Warp, Thread, Block, Grid

Understanding the execution hierarchy is essential for writing efficient CUDA code:

Streaming Multiprocessor (SM): The fundamental compute unit. An H100 has 132 SMs, each with 128 CUDA cores, 4 Tensor Core units, 64 KB L1 cache/shared memory, and a warp scheduler.

Warp: 32 threads that execute in lockstep on a single SM. This is the atomic unit of GPU scheduling. All 32 threads execute the same instruction at the same time (SIMT: Single Instruction Multiple Threads). Divergent code paths (if/else with different branches per thread) cause serialization.

Thread Block: 1–1024 threads organized in 1D, 2D, or 3D layout. All threads in a block share L1 cache/shared memory and can synchronize with __syncthreads(). One block runs entirely on one SM.

Grid: All thread blocks launched by a single kernel call. Blocks are distributed across SMs automatically.

Installation

# Install CUDA Toolkit (Ubuntu 22.04, CUDA 12.2)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
apt-get update && apt-get install -y cuda-toolkit-12-2

# Verify installation
nvcc --version
nvidia-smi

# Set environment variables
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

Writing Your First CUDA Kernel

A CUDA kernel is a function that runs on the GPU, executed by thousands of threads simultaneously:

// vector_add.cu — add two vectors element-wise on the GPU

#include <cuda_runtime.h>
#include <stdio.h>

// __global__ marks this function as a GPU kernel (called from CPU, runs on GPU)
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    // Compute the global thread index
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Bounds check: thread count may exceed vector length
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    int n = 1 << 20;  // 1M elements
    size_t bytes = n * sizeof(float);

    // Allocate host memory
    float *h_a = (float*)malloc(bytes);
    float *h_b = (float*)malloc(bytes);
    float *h_c = (float*)malloc(bytes);

    // Initialize vectors on host
    for (int i = 0; i < n; i++) {
        h_a[i] = (float)i;
        h_b[i] = (float)(n - i);
    }

    // Allocate device memory
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Copy data from host to device
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch kernel: 256 threads per block, enough blocks to cover all elements
    int threads_per_block = 256;
    int blocks_per_grid = (n + threads_per_block - 1) / threads_per_block;

    vector_add<<<blocks_per_grid, threads_per_block>>>(d_a, d_b, d_c, n);

    // Wait for GPU to finish
    cudaDeviceSynchronize();

    // Copy result back to host
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    // Verify
    printf("c[0] = %.1f (expected %.1f)\n", h_c[0], h_a[0] + h_b[0]);

    // Free memory
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);

    return 0;
}

# Compile and run
nvcc -O3 -arch=sm_90 vector_add.cu -o vector_add  # sm_90 = H100
./vector_add

Memory Hierarchy

GPU memory hierarchy is the most important topic for performance optimization:

Memory Type	Location	Size	Latency	Scope
Register	On-chip	~256 KB per SM	1 cycle	Per thread
Shared memory (L1)	On-chip	0–228 KB per SM	20–30 cycles	Per block
L2 cache	On-chip	50 MB (H100)	~200 cycles	All SMs
Global memory (HBM)	Off-chip	80 GB (H100)	600–800 cycles	All threads
Constant memory	Off-chip (cached)	64 KB	1 cycle (if cached)	All threads
Texture memory	Off-chip (cached)	Limited	Low (spatial locality)	All threads

Using shared memory to reduce global memory accesses is the single most impactful optimization for memory-bound kernels:

// Matrix multiplication using shared memory tiling
__global__ void matmul_shared(const float *A, const float *B, float *C, int n) {
    __shared__ float tile_A[32][32];
    __shared__ float tile_B[32][32];

    int row = blockIdx.y * 32 + threadIdx.y;
    int col = blockIdx.x * 32 + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < n / 32; t++) {
        // Load tiles from global memory to shared memory
        tile_A[threadIdx.y][threadIdx.x] = A[row * n + t * 32 + threadIdx.x];
        tile_B[threadIdx.y][threadIdx.x] = B[(t * 32 + threadIdx.y) * n + col];

        __syncthreads();  // Ensure all threads have loaded their tiles

        // Compute partial dot product from tile
        for (int k = 0; k < 32; k++) {
            sum += tile_A[threadIdx.y][k] * tile_B[k][threadIdx.x];
        }

        __syncthreads();  // Ensure computation is done before loading next tile
    }

    if (row < n && col < n) C[row * n + col] = sum;
}

CUDA Streams

Streams allow overlapping data transfer with kernel execution, hiding PCIe latency:

// Create multiple streams for pipeline parallelism
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Stream 1: copy chunk 1 while stream 2 processes chunk 0
cudaMemcpyAsync(d_data1, h_data1, chunk_bytes, cudaMemcpyHostToDevice, stream1);
process_kernel<<<grid, block, 0, stream2>>>(d_result0, d_data0, chunk_size);

// Synchronize all streams
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);

cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);

Debugging and Profiling

# cuda-memcheck: detect memory errors (access violations, race conditions)
cuda-memcheck ./my_app

# Nsight Compute: kernel-level performance profiling
ncu --metrics l1tex__t_bytes,sm__throughput,gpu__compute_memory_throughput \
    ./my_app

# Nsight Systems: system-level profiling (CPU+GPU timeline)
nsys profile --trace=cuda,nvtx,osrt ./my_app
nsys stats report.nsys-rep

Key metrics to examine in Nsight Compute:

SM occupancy: Fraction of maximum warps that are active. Target > 50%.
Memory throughput: Bandwidth utilization as fraction of peak HBM bandwidth. Target > 70%.
Compute throughput: FLOP utilization as fraction of peak CUDA core throughput.
Stall reasons: If occupancy is high but throughput is low, stall analysis shows whether it is memory latency, synchronization, or pipeline dependencies.

Common Performance Pitfalls

Warp divergence: Branches where adjacent threads take different paths serializes execution. Restructure data so threads in the same warp take the same branch.

Uncoalesced memory access: Global memory accesses are most efficient when 32 consecutive threads access 32 consecutive memory addresses. Strided or random access patterns drop bandwidth by 10–32x.

Insufficient parallelism: Launching too few threads leaves SMs idle. Rule of thumb: at least 1024 threads per SM × number of SMs on the device.

Excessive host-device transfers: PCIe bandwidth (~32 GB/s PCIe Gen4) is 100x lower than HBM bandwidth. Minimize round trips between host and device; keep data on the GPU across multiple kernel calls.

CUDA programming is a deep subject that rewards careful study of memory access patterns and parallelism structure. For GPU cluster deployment and CUDA application optimization consulting, contact the Mevasis engineering team.

CUDA Programming Guide: GPU Architecture, Kernels, Memory, and Profiling