CUDA Programming Guide: GPU Architecture, Kernels, Memory, and Profiling
Comprehensive CUDA programming guide: GPU vs CPU architecture, streaming multiprocessors, warps, thread hierarchy, memory types (global, shared, constant), kernel writing, streams, debugging, and performance optimization.
CUDA (Compute Unified Device Architecture) is NVIDIA’s parallel computing platform that enables developers to use GPU hardware for general-purpose computation. Since its introduction in 2007, CUDA has become the dominant programming model for GPU-accelerated scientific computing, deep learning, and data analytics.
GPU vs CPU Architecture
The fundamental architectural difference explains why GPUs excel at certain problems:
| Characteristic | CPU (e.g., EPYC 9654) | GPU (e.g., H100) |
|---|---|---|
| Cores | 96 high-frequency | 16,896 CUDA cores |
| Clock speed | 2.4–3.7 GHz | 1.8–3.35 GHz |
| Memory bandwidth | ~460 GB/s (DDR5) | 3.35 TB/s (HBM3) |
| Cache per core | Large (L3 ~384 MB) | Small (L1 per SM) |
| Optimization goal | Low latency (single thread) | High throughput (many threads) |
| Best use case | Complex branching logic | Regular, data-parallel operations |
CPUs are designed to minimize latency for a small number of sequential tasks. GPUs are designed to maximize throughput by running thousands of threads simultaneously, tolerating memory latency by switching between warps.
GPU Architecture: SM, Warp, Thread, Block, Grid
Understanding the execution hierarchy is essential for writing efficient CUDA code:
Streaming Multiprocessor (SM): The fundamental compute unit. An H100 has 132 SMs, each with 128 CUDA cores, 4 Tensor Core units, 64 KB L1 cache/shared memory, and a warp scheduler.
Warp: 32 threads that execute in lockstep on a single SM. This is the atomic unit of GPU scheduling. All 32 threads execute the same instruction at the same time (SIMT: Single Instruction Multiple Threads). Divergent code paths (if/else with different branches per thread) cause serialization.
Thread Block: 1–1024 threads organized in 1D, 2D, or 3D layout. All threads in a block share L1 cache/shared memory and can synchronize with __syncthreads(). One block runs entirely on one SM.
Grid: All thread blocks launched by a single kernel call. Blocks are distributed across SMs automatically.
Installation
# Install CUDA Toolkit (Ubuntu 22.04, CUDA 12.2)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
apt-get update && apt-get install -y cuda-toolkit-12-2
# Verify installation
nvcc --version
nvidia-smi
# Set environment variables
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
Writing Your First CUDA Kernel
A CUDA kernel is a function that runs on the GPU, executed by thousands of threads simultaneously:
// vector_add.cu — add two vectors element-wise on the GPU
#include <cuda_runtime.h>
#include <stdio.h>
// __global__ marks this function as a GPU kernel (called from CPU, runs on GPU)
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
// Compute the global thread index
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// Bounds check: thread count may exceed vector length
if (idx < n) {
c[idx] = a[idx] + b[idx];
}
}
int main() {
int n = 1 << 20; // 1M elements
size_t bytes = n * sizeof(float);
// Allocate host memory
float *h_a = (float*)malloc(bytes);
float *h_b = (float*)malloc(bytes);
float *h_c = (float*)malloc(bytes);
// Initialize vectors on host
for (int i = 0; i < n; i++) {
h_a[i] = (float)i;
h_b[i] = (float)(n - i);
}
// Allocate device memory
float *d_a, *d_b, *d_c;
cudaMalloc(&d_a, bytes);
cudaMalloc(&d_b, bytes);
cudaMalloc(&d_c, bytes);
// Copy data from host to device
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
// Launch kernel: 256 threads per block, enough blocks to cover all elements
int threads_per_block = 256;
int blocks_per_grid = (n + threads_per_block - 1) / threads_per_block;
vector_add<<<blocks_per_grid, threads_per_block>>>(d_a, d_b, d_c, n);
// Wait for GPU to finish
cudaDeviceSynchronize();
// Copy result back to host
cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
// Verify
printf("c[0] = %.1f (expected %.1f)\n", h_c[0], h_a[0] + h_b[0]);
// Free memory
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
free(h_a); free(h_b); free(h_c);
return 0;
}
# Compile and run
nvcc -O3 -arch=sm_90 vector_add.cu -o vector_add # sm_90 = H100
./vector_add
Memory Hierarchy
GPU memory hierarchy is the most important topic for performance optimization:
| Memory Type | Location | Size | Latency | Scope |
|---|---|---|---|---|
| Register | On-chip | ~256 KB per SM | 1 cycle | Per thread |
| Shared memory (L1) | On-chip | 0–228 KB per SM | 20–30 cycles | Per block |
| L2 cache | On-chip | 50 MB (H100) | ~200 cycles | All SMs |
| Global memory (HBM) | Off-chip | 80 GB (H100) | 600–800 cycles | All threads |
| Constant memory | Off-chip (cached) | 64 KB | 1 cycle (if cached) | All threads |
| Texture memory | Off-chip (cached) | Limited | Low (spatial locality) | All threads |
Using shared memory to reduce global memory accesses is the single most impactful optimization for memory-bound kernels:
// Matrix multiplication using shared memory tiling
__global__ void matmul_shared(const float *A, const float *B, float *C, int n) {
__shared__ float tile_A[32][32];
__shared__ float tile_B[32][32];
int row = blockIdx.y * 32 + threadIdx.y;
int col = blockIdx.x * 32 + threadIdx.x;
float sum = 0.0f;
for (int t = 0; t < n / 32; t++) {
// Load tiles from global memory to shared memory
tile_A[threadIdx.y][threadIdx.x] = A[row * n + t * 32 + threadIdx.x];
tile_B[threadIdx.y][threadIdx.x] = B[(t * 32 + threadIdx.y) * n + col];
__syncthreads(); // Ensure all threads have loaded their tiles
// Compute partial dot product from tile
for (int k = 0; k < 32; k++) {
sum += tile_A[threadIdx.y][k] * tile_B[k][threadIdx.x];
}
__syncthreads(); // Ensure computation is done before loading next tile
}
if (row < n && col < n) C[row * n + col] = sum;
}
CUDA Streams
Streams allow overlapping data transfer with kernel execution, hiding PCIe latency:
// Create multiple streams for pipeline parallelism
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
// Stream 1: copy chunk 1 while stream 2 processes chunk 0
cudaMemcpyAsync(d_data1, h_data1, chunk_bytes, cudaMemcpyHostToDevice, stream1);
process_kernel<<<grid, block, 0, stream2>>>(d_result0, d_data0, chunk_size);
// Synchronize all streams
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
Debugging and Profiling
# cuda-memcheck: detect memory errors (access violations, race conditions)
cuda-memcheck ./my_app
# Nsight Compute: kernel-level performance profiling
ncu --metrics l1tex__t_bytes,sm__throughput,gpu__compute_memory_throughput \
./my_app
# Nsight Systems: system-level profiling (CPU+GPU timeline)
nsys profile --trace=cuda,nvtx,osrt ./my_app
nsys stats report.nsys-rep
Key metrics to examine in Nsight Compute:
- SM occupancy: Fraction of maximum warps that are active. Target > 50%.
- Memory throughput: Bandwidth utilization as fraction of peak HBM bandwidth. Target > 70%.
- Compute throughput: FLOP utilization as fraction of peak CUDA core throughput.
- Stall reasons: If occupancy is high but throughput is low, stall analysis shows whether it is memory latency, synchronization, or pipeline dependencies.
Common Performance Pitfalls
Warp divergence: Branches where adjacent threads take different paths serializes execution. Restructure data so threads in the same warp take the same branch.
Uncoalesced memory access: Global memory accesses are most efficient when 32 consecutive threads access 32 consecutive memory addresses. Strided or random access patterns drop bandwidth by 10–32x.
Insufficient parallelism: Launching too few threads leaves SMs idle. Rule of thumb: at least 1024 threads per SM × number of SMs on the device.
Excessive host-device transfers: PCIe bandwidth (~32 GB/s PCIe Gen4) is 100x lower than HBM bandwidth. Minimize round trips between host and device; keep data on the GPU across multiple kernel calls.
CUDA programming is a deep subject that rewards careful study of memory access patterns and parallelism structure. For GPU cluster deployment and CUDA application optimization consulting, contact the Mevasis engineering team.