GPU Memory Management: HBM3, GDDR6, Mixed Precision, ZeRO Optimizer

GPU memory is often the most constrained resource in modern AI and HPC workloads. A 70-billion-parameter language model in FP32 needs 280 GB for its weights alone (70 × 10⁹ parameters × 4 bytes), far more than the 80 GB available on a single H100. Even models whose weights fit in GPU memory often fail at training time, because optimizer states, gradients, and activations add another 3–4× the parameter memory. Understanding GPU memory is fundamental to designing and operating GPU clusters efficiently.
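The arithmetic behind these figures can be sketched in a few lines. The function below is a rough estimate only, assuming FP32 weights, FP32 gradients, and an Adam-style optimizer that keeps two FP32 states per parameter. The function name and its defaults are illustrative assumptions, and activations are left out because they depend on batch size and sequence length.

```python
def training_memory_gb(num_params, bytes_per_param=4, optimizer_states=2):
    """Rough memory estimate in GB (1 GB = 1e9 bytes) for weights,
    gradients, and optimizer states. Activations are NOT included."""
    weights = num_params * bytes_per_param
    gradients = num_params * bytes_per_param
    # Assumption: the optimizer keeps `optimizer_states` extra copies
    # per parameter (e.g. Adam's first and second moments).
    optimizer = num_params * bytes_per_param * optimizer_states
    return {
        "weights": weights / 1e9,
        "gradients": gradients / 1e9,
        "optimizer": optimizer / 1e9,
        "total": (weights + gradients + optimizer) / 1e9,
    }


estimate = training_memory_gb(70e9)
for name, gigabytes in estimate.items():
    print(f"{name:>10}: {gigabytes:8.1f} GB")
```

Under these assumptions the 70-billion-parameter model needs 280 GB for weights and 1,120 GB in total before activations, so gradients and optimizer states alone add 3× the weight memory.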

GPU Memory Hierarchy

Modern GPU memory has multiple levels with very different characteristics:

Registers (on-chip, per thread): The fastest memory. Each SM has a fixed register file (256 KB per SM on H100) that is partitioned among its resident threads. Variables declared as local scalars typically end up in registers. When a kernel needs more registers than are available, values spill to local memory, which is backed by global memory, and this is a serious performance problem.

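Shared memory / L1 cache (on-chip, per SM): 256 KB combined per SM on H100, of which up to 228 KB can be configured as shared memory. Shared memory is managed explicitly with __shared__ declarations. Its access latency is far lower than that of global memory (often cited as roughly 100× for repeated access patterns), which makes it critical for tiled matrix multiplication kernels.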

L2 cache (on-chip, all SMs): 50 MB (H100). Automatic, transparent. Benefits from spatial and temporal locality in access patterns.

Global memory (HBM — off-chip): The main GPU memory. 80 GB (H100 SXM) with 3.35 TB/s bandwidth. All arrays and tensors reside here. Every cache miss eventually hits HBM.

HBM3 vs GDDR6: Technology Comparison

Specification	HBM3 (H100 SXM)	GDDR6X (RTX 4090)
Bandwidth	3.35 TB/s	1 TB/s
Capacity	80 GB	24 GB
Bus width	5120-bit	384-bit
Power	~90 W	~40 W
Physical form	Stacked on package	Separate chips
Latency	Higher	Lower

HBM3's bandwidth advantage (roughly 3× over GDDR6X in this comparison) makes it the right choice for memory-bandwidth-bound workloads: large batch matrix operations, all-reduce across GPUs, training large models. GDDR6X on consumer GPUs offers lower latency (beneficial for small batch inference) and lower cost per GB.
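One way to judge whether a workload benefits from the extra bandwidth is a roofline-style estimate: compare the kernel's arithmetic intensity (FLOPs per byte moved) with the ratio of peak compute to peak bandwidth. The sketch below uses the H100 figures quoted in this article (3.35 TB/s, about 990 TFLOPS BF16). The function names and the two example kernels are illustrative assumptions, and real kernels rarely reach either peak.

```python
H100_BF16_FLOPS = 990e12      # peak dense BF16 throughput, FLOP/s
H100_HBM3_BANDWIDTH = 3.35e12  # peak memory bandwidth, bytes/s


def ridge_point(peak_flops, bandwidth):
    """Arithmetic intensity (FLOP/byte) above which a kernel is
    limited by compute rather than by memory bandwidth."""
    return peak_flops / bandwidth


def is_memory_bound(flops, bytes_moved, peak_flops, bandwidth):
    """True if the kernel's FLOP/byte ratio is below the ridge point."""
    return flops / bytes_moved < ridge_point(peak_flops, bandwidth)


# Element-wise add of two BF16 vectors: 1 FLOP per element,
# 2 reads + 1 write of 2 bytes each = 6 bytes per element.
n = 1_000_000
vector_add = is_memory_bound(n, 6 * n, H100_BF16_FLOPS, H100_HBM3_BANDWIDTH)

# Square BF16 matmul (m x m): 2*m^3 FLOPs, three m x m matrices of 2 bytes.
m = 8192
matmul = is_memory_bound(2 * m**3, 3 * m * m * 2,
                         H100_BF16_FLOPS, H100_HBM3_BANDWIDTH)

print(f"ridge point: {ridge_point(H100_BF16_FLOPS, H100_HBM3_BANDWIDTH):.1f} FLOP/byte")
print(f"vector add memory-bound: {vector_add}")
print(f"8192^3 matmul memory-bound: {matmul}")
```

The element-wise kernel sits far below the ridge point and is limited by bandwidth, while the large matmul sits above it. This is only a first-order estimate; profiling is still needed to confirm it.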

Detecting Memory Bottlenecks

Before optimizing, measure whether your kernel is actually memory-bandwidth-bound:

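# Profile with Nsight Compute
# The metric list must reach ncu as ONE comma-separated argument, so the
# continuation lines are not indented (leading spaces would split the list).
ncu --metrics \
l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum,\
l1tex__t_bytes_pipe_lsu_mem_global_op_st.sum,\
sm__throughput.avg.pct_of_peak_sustained_elapsed,\
gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed \
./my_app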

# Key metrics to examine:
# gpu__compute_memory_throughput > 80%: memory-bound
# sm__throughput > 80% with low memory throughput: compute-bound
# Both low: kernel has synchronization overhead or insufficient parallelism

For PyTorch workloads:

from torch.profiler import profile, record_function, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    with record_function("model_inference"):
        output = model(input_tensor)

print(prof.key_averages().table(sort_by="cuda_memory_usage", row_limit=10))

Mixed Precision Training

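Switching from FP32 to BF16 (bfloat16) halves the memory of every tensor stored in the lower precision and substantially raises Tensor Core throughput. H100 SXM has 4 Tensor Cores per SM and peaks at about 990 TFLOPS for dense BF16, versus about 495 TFLOPS for TF32 on Tensor Cores and about 67 TFLOPS for standard FP32: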

import torch
from torch.cuda.amp import autocast   # newer PyTorch also offers torch.autocast(device_type="cuda", ...)

model = MyModel().cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# No GradScaler here: loss scaling is only needed for FP16, not for BF16

for batch in dataloader:
    optimizer.zero_grad()

    # Run forward pass in BF16 (weights stay in FP32; autocast casts per op)
    with autocast(dtype=torch.bfloat16):
        output = model(batch['input'])
        loss = criterion(output, batch['target'])

    # Backward runs outside autocast; gradients are stored in FP32,
    # matching the FP32 weights
    loss.backward()

    # Optional: gradient clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    optimizer.step()

BF16 is preferred over FP16 for training because it has the same exponent range as FP32, eliminating the overflow/underflow issues that require gradient scaling in FP16 training.
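The range difference follows directly from how the bits are split between exponent and mantissa. The short calculation below derives the largest finite value and the relative precision of each format from its bit layout (FP32: 8 exponent and 23 mantissa bits, BF16: 8 and 7, FP16: 5 and 10). The helper name is an illustrative assumption.

```python
def max_finite(exponent_bits, mantissa_bits):
    """Largest finite value of an IEEE-style binary float format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # Largest mantissa is just under 2; largest normal exponent equals the bias.
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias


# (exponent bits, mantissa bits) for each format
formats = {"FP32": (8, 23), "BF16": (8, 7), "FP16": (5, 10)}

for name, (exp_bits, man_bits) in formats.items():
    largest = max_finite(exp_bits, man_bits)
    epsilon = 2.0 ** -man_bits   # spacing between 1.0 and the next value
    print(f"{name}: max = {largest:.4e}, epsilon = {epsilon:.2e}")
```

BF16 reaches roughly the same maximum as FP32, while FP16 tops out at 65504, which is why large gradient or activation values overflow in FP16. The trade-off is precision: BF16 has fewer mantissa bits than FP16.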

Gradient Accumulation

When a single batch does not fit in GPU memory, gradient accumulation simulates a larger effective batch by accumulating gradients over multiple forward passes before updating:

accumulation_steps = 8   # effective batch = batch_size × 8
optimizer.zero_grad()

for step, (data, target) in enumerate(dataloader):
    with autocast(dtype=torch.bfloat16):
        output = model(data)
        loss = criterion(output, target) / accumulation_steps

    loss.backward()

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

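This keeps the memory footprint of a single batch of size batch_size while producing the same averaged gradient as a batch of batch_size × accumulation_steps. Layers that depend on per-batch statistics, such as batch normalization, are the exception and still see only the smaller batch.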
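The equivalence can be checked numerically without a GPU. The sketch below uses a hypothetical one-parameter model y = w·x with a mean-squared-error loss (the data values are made up for illustration). It compares the full-batch gradient with the gradient accumulated over equal-sized micro-batches.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]

# Gradient computed on the whole batch at once
full_batch_grad = mse_grad(w, xs, ys)

# Same data split into 4 micro-batches of 2 samples each
accumulation_steps = 4
micro_batch_size = len(xs) // accumulation_steps
accumulated_grad = 0.0
for step in range(accumulation_steps):
    lo = step * micro_batch_size
    hi = lo + micro_batch_size
    # Dividing by accumulation_steps plays the role of
    # `loss / accumulation_steps` in the training loop.
    accumulated_grad += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

print(full_batch_grad, accumulated_grad)
```

The two values agree up to floating-point rounding, provided all micro-batches have the same size.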

Gradient Checkpointing

Gradient checkpointing (also called activation recomputation) trades compute for memory. Instead of storing all intermediate activations for backpropagation, only a subset of checkpoints is stored. The non-stored activations are recomputed during the backward pass.

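import torch.nn as nn
from torch.utils.checkpoint import checkpoint, checkpoint_sequential

# Simplified layer for illustration (no residuals or norms; dim/heads are example values)
class TransformerLayer(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attention = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        # Without checkpointing: all intermediate activations are kept for backward
        attn_out, _ = self.attention(x, x, x, need_weights=False)
        return self.ffn(attn_out)

# Apply checkpointing to a specific sub-module
class CheckpointedTransformer(nn.Module):
    def __init__(self, layer):
        super().__init__()
        self.layer = layer

    def forward(self, x):
        # With checkpointing: only the layer input is stored; the layer's
        # internal activations are recomputed during backward
        return checkpoint(self.layer, x, use_reentrant=False)

# For sequential models: split the layers into 4 checkpointed segments
output = checkpoint_sequential(model.layers, segments=4, input=x, use_reentrant=False)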

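Memory reduction depends on the checkpoint granularity. Checkpointing every transformer layer keeps only each layer's input and recomputes its internal activations. Placing checkpoints roughly every sqrt(layers) layers reduces the number of layer activations held at once from O(layers) to O(sqrt(layers)); the size of each stored activation still scales with batch size and sequence length. Compute cost increases by roughly 20–30%, because the forward pass of each checkpointed segment runs a second time during backward.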
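A simplified count shows where the square-root behavior comes from. The model below is an assumption made for illustration: every layer produces one activation of equal size, one checkpoint is kept per segment, and during backward the activations of a single segment are recomputed and held at a time.

```python
import math


def peak_layer_activations(num_layers, num_segments):
    """Layer activations held at once: one checkpoint per segment plus
    the activations of the single segment being recomputed."""
    return num_segments + math.ceil(num_layers / num_segments)


num_layers = 64

# Try every possible segment count and keep the one with the lowest peak
best_segments = min(
    range(1, num_layers + 1),
    key=lambda segments: peak_layer_activations(num_layers, segments),
)
best_peak = peak_layer_activations(num_layers, best_segments)

print(f"no checkpointing: {num_layers} layer activations")
print(f"best: {best_segments} segments -> {best_peak} layer activations")
```

For 64 layers the minimum is reached at 8 segments, which is sqrt(64), with 16 layer activations held instead of 64. Real models deviate from this count because layers differ in size and frameworks keep additional buffers.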

Flash Attention

Flash Attention (and Flash Attention 2/3) is an algorithm that computes exact multi-head attention without materializing the full N×N attention matrix in HBM. It tiles the computation so that blocks of Q, K, and V are processed in on-chip shared memory, which reduces the extra memory for attention from O(N²) to O(N) and sharply cuts HBM reads and writes:

# PyTorch 2.0+ includes Flash Attention via F.scaled_dot_product_attention
import torch
from torch.nn.functional import scaled_dot_product_attention

# query/key/value layout: [batch, heads, seq, head_dim]
# Automatically uses Flash Attention when available (CUDA + compatible GPU)
output = scaled_dot_product_attention(
    query, key, value,
    attn_mask=None,
    dropout_p=0.0,
    is_causal=True    # causal masking for language models
)

# Using the flash-attn library directly
# q, k, v layout here: [batch, seq, heads, head_dim], in FP16 or BF16
from flash_attn import flash_attn_qkvpacked_func
qkv = torch.stack([q, k, v], dim=2)   # [batch, seq, 3, heads, head_dim]
output = flash_attn_qkvpacked_func(qkv, dropout_p=0.0, causal=True)

Flash Attention 2 achieves ~2–4× speedup over standard attention on H100 and reduces memory from O(N²) to O(N) in sequence length, enabling training with much longer sequences.
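To see why the N×N matrix is the problem, it helps to compute its size. The sketch below estimates the attention score matrix for one layer, assuming 32 heads, batch size 1, and 2-byte (BF16) elements. These values and the function name are illustrative assumptions.

```python
def attention_matrix_bytes(seq_len, num_heads, batch_size=1, bytes_per_element=2):
    """Bytes needed to materialize the full attention score matrix
    (batch x heads x N x N) for a single layer."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element


for seq_len in (2048, 8192, 32768):
    gib = attention_matrix_bytes(seq_len, num_heads=32) / 2**30
    print(f"N = {seq_len:>6}: {gib:6.2f} GiB per layer")
```

Under these assumptions the matrix grows from 0.25 GiB at N = 2048 to 64 GiB at N = 32768 for a single layer, which is why avoiding its materialization matters so much at long sequence lengths.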

ZeRO Optimizer

ZeRO (Zero Redundancy Optimizer) distributes optimizer state, gradients, and parameters across multiple GPUs instead of replicating them on each GPU:

ZeRO Stage	Partitioned Data	Memory Savings (upper bound)
Stage 1	Optimizer states	4×
Stage 2	Optimizer states + gradients	8×
Stage 3	Optimizer states + gradients + parameters	Scales with GPU count (64× on 64 GPUs)

# Using DeepSpeed ZeRO Stage 3
import deepspeed

ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",   # offload optimizer states to CPU RAM
            "pin_memory": True
        },
        "offload_param": {
            "device": "cpu",   # offload parameters to CPU RAM
            "pin_memory": True
        }
    },
    # DeepSpeed builds the optimizer from this block and partitions its states
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "bf16": {"enabled": True},
    "gradient_clipping": 1.0,
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config
)

ZeRO Stage 3 with CPU offloading can train 100B+ parameter models on clusters where GPU memory alone would be insufficient.
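The per-GPU effect of each ZeRO stage can be estimated with simple arithmetic. The sketch below assumes mixed-precision Adam training with 2 bytes per parameter for weights, 2 bytes for gradients, and 12 bytes for optimizer states (FP32 master weights plus two FP32 moments). It ignores activations, communication buffers, and CPU offloading, and the function name is an illustrative assumption.

```python
def zero_bytes_per_param(stage, num_gpus):
    """Approximate model-state bytes per parameter on each GPU."""
    params, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:
        optim /= num_gpus    # Stage 1: partition optimizer states
    if stage >= 2:
        grads /= num_gpus    # Stage 2: also partition gradients
    if stage >= 3:
        params /= num_gpus   # Stage 3: also partition parameters
    return params + grads + optim


num_params = 7.5e9
num_gpus = 64
baseline = zero_bytes_per_param(0, num_gpus)

for stage in range(4):
    per_param = zero_bytes_per_param(stage, num_gpus)
    gigabytes = per_param * num_params / 1e9
    print(f"stage {stage}: {gigabytes:6.1f} GB per GPU "
          f"({baseline / per_param:.1f}x smaller)")
```

With 64 GPUs the reductions come out just under 4× and 8× for Stages 1 and 2, and exactly 64× for Stage 3, because only Stage 3 partitions every component of the model state.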

GPU memory is the primary constraint in modern AI workloads. Combining BF16, gradient checkpointing, Flash Attention, and ZeRO can reduce per-GPU memory requirements by roughly 10–20×, depending on model size and GPU count, enabling training of much larger models on existing hardware. For GPU cluster memory architecture and training optimization, contact Mevasis.
