GPU Memory Management for HPC and AI: Hierarchy, Bottlenecks, and Optimization
GPU memory hierarchy for HPC and AI workloads: HBM3 vs GDDR6 comparison, detecting memory bottlenecks with ncu, mixed precision training, gradient accumulation, gradient checkpointing, Flash Attention, and ZeRO optimizer.
GPU memory is the most constrained resource in modern AI and HPC workloads. A 70-billion-parameter language model in FP32 precision requires 280 GB of GPU memory — far exceeding the 80 GB available in a single H100. Even models that technically fit within GPU memory often fail at training time because optimizer states, gradients, and activations add 3–4× the parameter memory overhead. Understanding GPU memory is fundamental to designing and operating GPU clusters efficiently.
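The arithmetic behind these figures can be sketched in a few lines. This is a rough estimate only: it assumes an Adam-style optimizer that keeps two FP32 moments per parameter (an assumption, not something stated above), counts parameters, gradients, and optimizer states, and ignores activations entirely. The function name is illustrative.

```python
GB = 1e9  # decimal gigabytes, matching the 280 GB figure above

def training_state_gb(num_params, bytes_per_param=4, bytes_per_grad=4,
                      optimizer_bytes_per_param=8):
    """Return (parameters, gradients, optimizer states) in GB, excluding activations.

    optimizer_bytes_per_param=8 assumes two FP32 moments per parameter (Adam-style).
    """
    params = num_params * bytes_per_param / GB
    grads = num_params * bytes_per_grad / GB
    optim = num_params * optimizer_bytes_per_param / GB
    return params, grads, optim

params, grads, optim = training_state_gb(70e9)
total = params + grads + optim
print(f"parameters only: {params:.0f} GB")
print(f"with gradients and optimizer states: {total:.0f} GB "
      f"({total / params:.0f}x the parameter memory)")
```

Under these assumptions, gradients and optimizer states alone add three times the parameter memory, before any activations are counted.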
GPU Memory Hierarchy
Modern GPU memory has multiple levels with very different characteristics:
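Registers (on-chip, per thread): The fastest memory. Each SM has a fixed-size register file (256 KB per SM on H100) that is divided among its resident threads. Local scalar variables typically live in registers. When a kernel needs more registers than are available, values spill to local memory, which resides in global memory, and performance drops sharply.
Shared memory / L1 cache (on-chip, per SM): 256 KB of combined L1 cache and shared memory per SM on H100, of which up to 228 KB can be configured as shared memory. Shared memory is managed explicitly with __shared__ declarations. Its latency is far lower than that of global memory (the commonly quoted 100× figure is a rough upper bound), which pays off whenever data is reused. Critical for matrix multiplication kernels.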
L2 cache (on-chip, all SMs): 50 MB (H100). Automatic, transparent. Benefits from spatial and temporal locality in access patterns.
Global memory (HBM, off-chip): The main GPU memory. 80 GB (H100 SXM) with 3.35 TB/s bandwidth. All arrays and tensors reside here. Every access that misses in both L1 and L2 is served from HBM.
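Why shared memory matters so much for matrix multiplication can be shown with a simple count of global-memory loads. The sketch below is a simplified model, not a kernel: it assumes square n×n matrices, ignores hardware caching in the naive case, and assumes n is a multiple of the tile size. Both function names are illustrative.

```python
def global_loads_naive(n: int) -> int:
    """Each of the n*n outputs reads a full row of A and a full column of B."""
    return n * n * (2 * n)

def global_loads_tiled(n: int, tile: int) -> int:
    """Each tile x tile output block stages n/tile pairs of tiles through shared memory."""
    if n % tile != 0:
        raise ValueError("n must be a multiple of tile in this simplified model")
    output_blocks = (n // tile) ** 2
    tile_pairs_per_block = n // tile
    elements_per_tile_pair = 2 * tile * tile  # one tile of A plus one tile of B
    return output_blocks * tile_pairs_per_block * elements_per_tile_pair

n, tile = 4096, 32
reduction = global_loads_naive(n) / global_loads_tiled(n, tile)
print(f"global loads reduced by {reduction:.0f}x with {tile}x{tile} tiles")
```

In this model the reduction factor equals the tile width, because every element staged in shared memory is reused by a full row or column of the output tile.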
HBM3 vs GDDR6X: Technology Comparison
| Specification | HBM3 (H100 SXM) | GDDR6X (RTX 4090) |
|---|---|---|
| Bandwidth | 3.35 TB/s | 1 TB/s |
| Capacity | 80 GB | 24 GB |
| Bus width | 5120-bit | 384-bit |
| Power | ~90 W | ~40 W |
| Physical form | Stacked on package | Separate chips |
| Latency | Higher | Lower |
HBM3’s bandwidth advantage (about 3.3× over GDDR6X in this comparison) makes it the right choice for memory-bandwidth-bound workloads: large-batch matrix operations, all-reduce across GPUs, and training large models. GDDR6X on consumer GPUs offers lower latency (beneficial for small-batch inference) and lower cost per GB.
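Whether a workload is bandwidth-bound can be estimated with a roofline-style calculation: compare the kernel's arithmetic intensity (FLOPs per byte moved) with the ratio of peak compute to peak bandwidth. The sketch below uses the H100 SXM bandwidth from the table and the roughly 990 TFLOPS dense BF16 figure quoted later in this article. The byte counts assume ideal caching, where each operand is read or written exactly once, so treat the result as a best case. Function names are illustrative.

```python
H100_BANDWIDTH = 3.35e12   # bytes/s, from the table above
H100_BF16_FLOPS = 990e12   # FLOP/s, dense BF16 Tensor Core peak

def ridge_point(peak_flops=H100_BF16_FLOPS, bandwidth=H100_BANDWIDTH):
    """Arithmetic intensity (FLOPs/byte) above which a kernel can be compute-bound."""
    return peak_flops / bandwidth

def elementwise_add_intensity(bytes_per_elem=2):
    # One add per element; two reads and one write per element.
    return 1 / (3 * bytes_per_elem)

def matmul_intensity(n, bytes_per_elem=2):
    # 2*n^3 FLOPs; A and B read once, C written once (ideal caching assumed).
    return (2 * n**3) / (3 * n * n * bytes_per_elem)

def bound_by(intensity):
    return "memory" if intensity < ridge_point() else "compute"

print(f"ridge point: {ridge_point():.0f} FLOPs/byte")
print("BF16 elementwise add:", bound_by(elementwise_add_intensity()))
print("BF16 4096x4096 matmul:", bound_by(matmul_intensity(4096)))
```

Elementwise operations sit far below the ridge point and are limited by HBM bandwidth regardless of how fast the Tensor Cores are, while large matrix multiplications can reach the compute roof.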
Detecting Memory Bottlenecks
Before optimizing, measure whether your kernel is actually memory-bandwidth-bound:
# Profile with Nsight Compute
ncu --metrics \
l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum,\
l1tex__t_bytes_pipe_lsu_mem_global_op_st.sum,\
sm__throughput.avg.pct_of_peak_sustained_elapsed,\
gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed \
./my_app
# Key metrics to examine:
# gpu__compute_memory_throughput > 80%: memory-bound
# sm__throughput > 80% with low memory throughput: compute-bound
# Both low: kernel has synchronization overhead or insufficient parallelism
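The decision rule in the comments above can be written as a small helper for post-processing profiler output. This is only a sketch of that rule: the 80% threshold comes from the comments, while the function name and the idea of feeding it the two percentages by hand are assumptions for illustration, not part of Nsight Compute.

```python
def classify_kernel(sm_throughput_pct: float, memory_throughput_pct: float,
                    threshold: float = 80.0) -> str:
    """Apply the rule of thumb above to two percent-of-peak values from ncu."""
    if memory_throughput_pct > threshold:
        return "memory-bound"
    if sm_throughput_pct > threshold:
        return "compute-bound"
    return "latency-bound (synchronization overhead or insufficient parallelism)"

# Hypothetical readings for three kernels
for sm_pct, mem_pct in [(35.0, 91.0), (88.0, 40.0), (22.0, 18.0)]:
    print(sm_pct, mem_pct, "->", classify_kernel(sm_pct, mem_pct))
```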
For PyTorch workloads:
from torch.profiler import profile, record_function, ProfilerActivity
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    with record_function("model_inference"):
        output = model(input_tensor)
print(prof.key_averages().table(sort_by="cuda_memory_usage", row_limit=10))
Mixed Precision Training
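Running the forward pass in BF16 (bfloat16) instead of FP32 halves activation memory and raises Tensor Core throughput. The H100 SXM has 4 Tensor Cores per SM and delivers about 990 TFLOPS of dense BF16, versus about 495 TFLOPS for TF32 Tensor Core math and about 67 TFLOPS for standard FP32. With autocast, the parameters themselves stay in FP32, so the memory saving applies to activations rather than to weights: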
import torch
model = MyModel().cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# No GradScaler here: loss scaling is only needed for FP16, not for BF16
for batch in dataloader:
    optimizer.zero_grad()
    # Run the forward pass and loss computation under BF16 autocast
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        output = model(batch['input'])
        loss = criterion(output, batch['target'])
    # Backward runs outside the autocast context; gradients are stored in FP32,
    # the dtype of the parameters
    loss.backward()
    # Optional: gradient clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
BF16 is preferred over FP16 for training because it has the same exponent range as FP32, eliminating the overflow/underflow issues that require gradient scaling in FP16 training.
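The range difference can be checked without a GPU. The sketch below emulates BF16 by keeping the top 16 bits of the FP32 encoding (plain truncation, whereas real hardware typically rounds to nearest) and uses the standard library's half-precision format for FP16. The helper names are illustrative.

```python
import struct

def truncate_to_bf16(x: float) -> float:
    """Emulate BF16: keep sign, 8 exponent bits, and the top 7 mantissa bits of FP32."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

def round_to_fp16(x: float) -> float:
    """Round-trip through IEEE half precision; overflow is reported as infinity."""
    try:
        return struct.unpack(">e", struct.pack(">e", x))[0]
    except (OverflowError, struct.error):
        return float("inf") if x > 0 else float("-inf")

# A large activation-like value and a small gradient-like value
for value in (1e5, 1e-8):
    print(f"{value:g}: FP16 -> {round_to_fp16(value)}, BF16 -> {truncate_to_bf16(value)}")
```

The large value overflows in FP16 and the small one flushes to zero, while BF16 keeps both at reduced precision. That lost precision is the price BF16 pays for its wider exponent.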
Gradient Accumulation
When the desired batch does not fit in GPU memory, gradient accumulation simulates a larger effective batch by summing gradients over several forward and backward passes before each optimizer update:
accumulation_steps = 8  # effective batch = batch_size × 8
optimizer.zero_grad()
for step, (data, target) in enumerate(dataloader):
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        output = model(data)
        # Scale so the summed gradients equal the average over the effective batch
        loss = criterion(output, target) / accumulation_steps
    loss.backward()  # gradients accumulate in .grad across micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
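Peak activation memory is that of a single micro-batch of batch_size samples, while the optimizer sees gradients averaged over batch_size × accumulation_steps samples. Convergence closely matches the larger batch, except for layers such as batch normalization whose statistics are computed per micro-batch.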
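The equivalence rests on dividing each micro-batch loss by the number of accumulation steps. A framework-free check on a one-parameter least-squares model illustrates this; the model, the data, and the function names are made up for the example, and equal-sized micro-batches are assumed.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over the given samples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Sum per-micro-batch gradients, each scaled by 1/steps as in the loop above."""
    steps = len(xs) // micro_batch_size  # assumes len(xs) is a multiple of micro_batch_size
    grad = 0.0
    for i in range(steps):
        lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
        grad += mse_grad(w, xs[lo:hi], ys[lo:hi]) / steps
    return grad

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
full = mse_grad(0.5, xs, ys)
accumulated = accumulated_grad(0.5, xs, ys, micro_batch_size=2)
print(f"full batch: {full:.6f}  accumulated: {accumulated:.6f}")
```

The two gradients agree up to floating-point rounding, which is why the optimizer update is the same as for the large batch.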
Gradient Checkpointing
Gradient checkpointing (also called activation recomputation) trades compute for memory. Instead of storing all intermediate activations for backpropagation, only a subset of checkpoints is stored. The non-stored activations are recomputed during the backward pass.
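import torch.nn as nn
from torch.utils.checkpoint import checkpoint, checkpoint_sequential
# Apply to a specific sub-module
class TransformerLayer(nn.Module):
    def __init__(self, attention: nn.Module, ffn: nn.Module):
        super().__init__()
        self.attention = attention
        self.ffn = ffn
    def forward(self, x):
        # Without checkpointing: every intermediate activation is kept for backward
        return self.ffn(self.attention(x))
class CheckpointedTransformer(nn.Module):
    def __init__(self, layer: nn.Module):
        super().__init__()
        self.layer = layer
    def forward(self, x):
        # With checkpointing: only the layer input is saved; the layer's internal
        # activations are recomputed during backward
        return checkpoint(self.layer, x, use_reentrant=False)
# For sequential models (e.g. nn.Sequential): split the layers into 4 checkpointed segments
output = checkpoint_sequential(model.layers, segments=4, input=x, use_reentrant=False)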
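Memory reduction depends on checkpoint granularity. Checkpointing every transformer layer keeps only each layer's input and recomputes the layer's internal activations during the backward pass, so stored activations drop from every intermediate tensor in every layer to one tensor per layer plus the working set of a single layer. Placing checkpoints roughly every √L layers in an L-layer model reduces activation memory from O(L) to O(√L). In both cases activation memory still scales with batch size and sequence length. Compute cost increases by roughly 20–30%, because each checkpointed segment's forward pass runs a second time.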
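The square-root trade-off can be seen in a simplified cost model. The sketch assumes every layer produces the same amount of activation data, counts memory in units of one layer's activations, and charges one stored checkpoint per segment plus the activations of the single segment being recomputed. Real models deviate from this, and the function names are illustrative.

```python
import math

def stored_activations(num_layers: int, num_segments: int) -> int:
    """Peak activations held, in units of one layer's activations (simplified model)."""
    layers_per_segment = math.ceil(num_layers / num_segments)
    return num_segments + layers_per_segment

def best_segment_count(num_layers: int) -> int:
    """Segment count that minimizes peak activation memory under this model."""
    return min(range(1, num_layers + 1),
               key=lambda k: stored_activations(num_layers, k))

layers = 64
k = best_segment_count(layers)
print(f"{layers} layers without checkpointing: {layers} units")
print(f"{layers} layers with {k} segments: {stored_activations(layers, k)} units")
```

For 64 layers the minimum falls at 8 segments, the square root of the layer count, which is the usual starting point when choosing the segments argument of checkpoint_sequential.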
Flash Attention
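Flash Attention (and Flash Attention 2/3) computes exact multi-head attention without materializing the full N×N attention matrix in HBM. It tiles the computation so that each block of queries, keys, and values is processed in on-chip shared memory, which cuts the extra memory needed from O(N²) to O(N) in sequence length N and substantially reduces HBM reads and writes: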
# PyTorch 2.0+ exposes fused attention kernels, including Flash Attention,
# via F.scaled_dot_product_attention
from torch.nn.functional import scaled_dot_product_attention
# Dispatches to a Flash Attention kernel when the inputs are eligible
# (CUDA tensors in FP16/BF16 on a supported GPU); otherwise another backend is used
# query, key, value: [batch, heads, seq, head_dim]
output = scaled_dot_product_attention(
    query, key, value,
    attn_mask=None,
    dropout_p=0.0,
    is_causal=True  # causal masking for language models
)
# Using the flash-attn library directly (FP16/BF16 CUDA tensors only)
import torch
from flash_attn import flash_attn_qkvpacked_func
# q, k, v: [batch, seq, heads, head_dim] (note the different layout from above)
qkv = torch.stack([q, k, v], dim=2)  # [batch, seq, 3, heads, head_dim]
output = flash_attn_qkvpacked_func(qkv, dropout_p=0.0, causal=True)
Flash Attention 2 achieves ~2–4× speedup over standard attention on H100 and reduces memory from O(N²) to O(N) in sequence length, enabling training with much longer sequences.
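The key idea that makes tiling possible is the online softmax: the output for a query can be built block by block while tracking a running maximum and a running normalizer, so the full row of attention scores never has to exist at once. The following is a minimal pure-Python sketch for a single query with scalar values. It illustrates the rescaling trick only, not the actual GPU kernel, and the function names are illustrative.

```python
import math

def attention_row_reference(scores, values):
    """Standard softmax-weighted sum; needs the whole score row at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def attention_row_tiled(scores, values, block_size):
    """Same result, processing one block of keys/values at a time."""
    running_max = -math.inf
    denom = 0.0   # running sum of exp(score - running_max)
    acc = 0.0     # running weighted sum of values
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(s_blk))
        # Rescale what has been accumulated so far to the new maximum
        scale = math.exp(running_max - new_max)  # 0.0 for the first block
        denom *= scale
        acc *= scale
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc += w * v
        running_max = new_max
    return acc / denom

scores = [0.3, 2.5, -1.0, 4.2, 0.0, 3.3, -2.7, 1.1]
values = [1.0, -2.0, 0.5, 3.0, 4.0, -1.5, 2.0, 0.25]
print(attention_row_reference(scores, values))
print(attention_row_tiled(scores, values, block_size=3))
```

Only the three running quantities persist between blocks, which is why the extra memory grows linearly with sequence length instead of quadratically.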
ZeRO Optimizer
ZeRO (Zero Redundancy Optimizer) distributes optimizer state, gradients, and parameters across multiple GPUs instead of replicating them on each GPU:
| ZeRO Stage | Partitioned Data | Memory Savings (model states, upper bound) |
|---|---|---|
| Stage 1 | Optimizer states | Up to 4× |
| Stage 2 | Optimizer states + gradients | Up to 8× |
| Stage 3 | Optimizer states + gradients + parameters | Proportional to GPU count (64× on 64 GPUs) |
# Using DeepSpeed ZeRO Stage 3
import deepspeed
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",  # offload optimizer states to CPU RAM
            "pin_memory": True
        },
        "offload_param": {
            "device": "cpu"  # offload parameters to CPU RAM
        }
    },
    # Training needs an optimizer; AdamW with lr=1e-4 mirrors the earlier example
    # (assumed here, the original config did not specify one)
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "bf16": {"enabled": True},
    "gradient_clipping": 1.0,
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8
}
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config
)
ZeRO Stage 3 with CPU offloading can train 100B+ parameter models on clusters where GPU memory alone would be insufficient.
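The savings in the table above follow from a per-GPU accounting of model states. The sketch below assumes mixed-precision Adam, with 2 bytes per parameter for the BF16/FP16 weights, 2 bytes for gradients, and 12 bytes for optimizer states (FP32 master weights plus two moments). It ignores activations, buffers, and fragmentation. The 7.5B-parameter, 64-GPU setting is only an example, and the function name is illustrative.

```python
def zero_model_state_gb(num_params: float, num_gpus: int, stage: int,
                        optimizer_bytes_per_param: int = 12) -> float:
    """Per-GPU memory for parameters, gradients, and optimizer states, in GB."""
    params = 2 * num_params
    grads = 2 * num_params
    optim = optimizer_bytes_per_param * num_params
    if stage >= 1:
        optim /= num_gpus   # Stage 1 partitions optimizer states
    if stage >= 2:
        grads /= num_gpus   # Stage 2 also partitions gradients
    if stage >= 3:
        params /= num_gpus  # Stage 3 also partitions parameters
    return (params + grads + optim) / 1e9

baseline = zero_model_state_gb(7.5e9, 64, stage=0)
for stage in (0, 1, 2, 3):
    gb = zero_model_state_gb(7.5e9, 64, stage)
    print(f"stage {stage}: {gb:7.2f} GB per GPU ({baseline / gb:.1f}x saving)")
```

Stages 1 and 2 approach their 4× and 8× limits only as the GPU count grows, while the Stage 3 saving equals the number of GPUs.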
GPU memory management is the primary constraint in modern AI workloads. The combination of BF16, gradient checkpointing, Flash Attention, and ZeRO can reduce memory requirements by 10–20×, enabling training of much larger models on existing hardware. For GPU cluster memory architecture and training optimization, contact Mevasis.