GPU Cluster Technical Guide: Architecture, Parallelism Strategies, and Best Practices
Large language model training, computational fluid dynamics simulation, and high-throughput inference workloads have long exceeded the capacity of a single GPU. GPU clusters — multiple GPU-equipped servers connected by high-bandwidth networks — have become the foundation of both AI research and scientific computing. This guide covers the architecture, parallelism strategies, scheduling options, and operational best practices for production GPU clusters.
GPU Cluster Architecture: Three Layers
A GPU cluster consists of three tightly integrated layers: compute hardware, network, and storage.
Compute Hardware Layer
NVIDIA DGX H100: Eight H100 80GB GPUs connected via NVLink4 (900 GB/s total bidirectional bandwidth per GPU within the node), two Intel Xeon Platinum 8480C CPUs, and 2 TB of DDR5 system RAM, for a total of 640 GB of GPU memory per node. The NVSwitch fabric in DGX H100 provides all-to-all GPU connectivity at full bandwidth: any GPU can exchange data with any other GPU in the node at up to 900 GB/s (bidirectional).
NVIDIA HGX H100: The GPU baseboard platform behind DGX H100, integrated into servers by major vendors (Dell, HPE, Supermicro, Lenovo). The 8-GPU variant has the same GPU and NVLink configuration, with vendor-specific chassis and CPU choices. Allows custom memory and storage configurations not available in the fixed DGX form factor.
PCIe GPU servers: Standard rack servers with 2–4 GPUs connected via PCIe (not NVLink). Significantly lower inter-GPU bandwidth within the node (~64 GB/s bidirectional for PCIe Gen4 x16 vs. 900 GB/s for NVLink4). Suitable for inference, smaller training jobs, and mixed CPU/GPU workloads.
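To put the bandwidth gap in perspective, the following back-of-the-envelope sketch compares ideal transfer times over the two interconnects using only the figures quoted above. The 10 GB payload is an arbitrary illustrative size, the function names are hypothetical, and real transfers fall short of peak bandwidth.

```python
# Idealized transfer-time comparison; ignores latency and protocol overhead.
NVLINK4_GB_PER_S = 900.0    # per-GPU NVLink4 figure quoted above
PCIE_GEN4_GB_PER_S = 64.0   # approximate PCIe Gen4 figure quoted above


def transfer_time_ms(payload_gb: float, bandwidth_gb_per_s: float) -> float:
    """Return the ideal time in milliseconds to move payload_gb at the given rate."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return payload_gb / bandwidth_gb_per_s * 1000.0


def node_gpu_memory_gb(gpus_per_node: int = 8, memory_per_gpu_gb: int = 80) -> int:
    """Total GPU memory in one node (8 x 80 GB for the DGX H100 described above)."""
    return gpus_per_node * memory_per_gpu_gb


if __name__ == "__main__":
    payload_gb = 10.0  # hypothetical payload size
    print(f"NVLink4:   {transfer_time_ms(payload_gb, NVLINK4_GB_PER_S):.1f} ms")
    print(f"PCIe Gen4: {transfer_time_ms(payload_gb, PCIE_GEN4_GB_PER_S):.1f} ms")
    print(f"GPU memory per node: {node_gpu_memory_gb()} GB")
```

The roughly 14x ratio between the two results is why tensor-parallel workloads, which exchange data inside every layer, are usually kept on NVLink-connected GPUs.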
Network Layer
Inter-node GPU communication (gradient synchronization in distributed training, all-reduce in MPI) is often the dominant performance bottleneck in GPU clusters. Two interconnect options are common:
NVIDIA InfiniBand NDR (400 Gb/s): Lowest latency (<1 µs), highest bandwidth, RDMA-native. NCCL and MPI move data directly over InfiniBand via RDMA, without CPU involvement in the data path. Essential for tightly coupled distributed training.
RoCE v2 (RDMA over Converged Ethernet): Carries RDMA traffic over standard Ethernet. Lower cost if existing 100GbE infrastructure can be reused. Requires careful PFC (Priority Flow Control) and ECN (Explicit Congestion Notification) configuration. Performance can come close to InfiniBand at comparable link speeds when properly tuned.
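A rough way to see why the interconnect matters is the standard bandwidth-only cost model for a ring all-reduce, in which each rank sends and receives 2(N-1)/N times the payload. The sketch below is a minimal estimate under that model: it assumes a single link per rank, ignores latency and overlap with compute, and uses a hypothetical 10 GB gradient payload. Nodes with several network adapters have proportionally more aggregate bandwidth.

```python
def gbit_to_gbyte_per_s(gbit_per_s: float) -> float:
    """Convert a link speed in Gb/s to GB/s (400 Gb/s NDR -> 50 GB/s)."""
    return gbit_per_s / 8.0


def ring_allreduce_seconds(payload_gb: float, num_ranks: int, link_gb_per_s: float) -> float:
    """Bandwidth-only estimate of ring all-reduce time.

    Each rank transfers 2 * (N - 1) / N times the payload over its link.
    """
    if num_ranks < 2:
        return 0.0  # nothing to synchronize with a single rank
    traffic_gb = 2.0 * (num_ranks - 1) / num_ranks * payload_gb
    return traffic_gb / link_gb_per_s


if __name__ == "__main__":
    ndr = gbit_to_gbyte_per_s(400.0)       # InfiniBand NDR
    ethernet = gbit_to_gbyte_per_s(100.0)  # 100GbE
    payload_gb = 10.0                      # hypothetical gradient size
    for name, bw in (("NDR 400 Gb/s", ndr), ("100GbE", ethernet)):
        t = ring_allreduce_seconds(payload_gb, num_ranks=4, link_gb_per_s=bw)
        print(f"{name}: {t:.2f} s per all-reduce")
```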
Storage Layer
Scratch (parallel filesystem): BeeGFS or Lustre for job-local data. Training datasets must be read fast enough to saturate GPU compute — insufficient storage bandwidth starves the GPUs. Target: > 10 GB/s aggregate for standard clusters, > 100 GB/s for large GPU clusters.
NVMe local scratch: Per-node NVMe SSDs (8–16 TB) for pre-staging datasets before training runs. Eliminates parallel filesystem load during training.
Object storage: For large dataset archives and model checkpoints. Integrated via an S3 client or a data caching layer.
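The storage targets above can be sanity-checked against a specific workload with simple arithmetic. The following sketch assumes you know the per-GPU sample rate and the average sample size; all numbers in the usage section are hypothetical placeholders, not measurements.

```python
def required_read_gb_per_s(num_gpus: int, samples_per_s_per_gpu: float,
                           bytes_per_sample: float) -> float:
    """Aggregate read bandwidth needed to keep every GPU fed, in GB/s."""
    return num_gpus * samples_per_s_per_gpu * bytes_per_sample / 1e9


def staging_minutes(dataset_tb: float, read_gb_per_s: float) -> float:
    """Ideal time to pre-stage a dataset onto local NVMe at the given read rate."""
    return dataset_tb * 1000.0 / read_gb_per_s / 60.0


if __name__ == "__main__":
    # Hypothetical workload: 32 GPUs, 1000 samples/s each, 500 KB per sample.
    need = required_read_gb_per_s(32, 1000, 500_000)
    print(f"Required aggregate read bandwidth: {need:.1f} GB/s")
    # Hypothetical 6 TB dataset staged at 10 GB/s.
    print(f"Staging time: {staging_minutes(6, 10):.0f} min")
```

If the required figure exceeds what the parallel filesystem can deliver, pre-staging to local NVMe is the usual fallback.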
Parallelism Strategies
Training large models requires distributing work across GPUs. Four parallelism strategies are used, often in combination:
Data parallelism: Each GPU holds a complete copy of the model but processes different data mini-batches. Gradients are aggregated (all-reduce) across GPUs after each batch. Scales well for small-to-medium models. Implemented via PyTorch DDP (DistributedDataParallel) or Horovod.
Model parallelism: Model layers are distributed across GPUs. With four GPUs and N layers, for example, GPU 0 holds layers 1 to N/4, GPU 1 holds layers N/4+1 to N/2, and so on. Required when the model does not fit in a single GPU’s memory. Inter-GPU communication occurs between adjacent layer groups.
Pipeline parallelism: Like model parallelism, but with micro-batching to keep all GPUs busy simultaneously. While GPU 0 processes micro-batch 1 on its layers, GPU 1 processes micro-batch 0 on its layers. Reduces “pipeline bubble” idle time.
Tensor parallelism: Individual matrix multiplications (e.g., attention heads in transformers) are split across multiple GPUs. Requires all-reduce within each layer. Most efficient when GPUs are connected via NVLink (low latency, high bandwidth).
Modern LLM training combines three of these, an approach called 3D parallelism: tensor parallelism within a node (NVLink), pipeline parallelism across node groups, and data parallelism across pipeline replicas. Pipeline parallelism takes the place of plain model parallelism in this scheme.
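The sketch below illustrates two pieces of arithmetic behind these strategies: how a global rank can be mapped to tensor, pipeline, and data parallel coordinates, and how large the pipeline bubble is for a simple synchronous schedule. The rank layout (tensor index varies fastest, then pipeline, then data) is an assumption chosen so that each tensor-parallel group occupies consecutive ranks; real frameworks may order the dimensions differently. All names are hypothetical.

```python
from typing import NamedTuple


class Coord(NamedTuple):
    dp: int  # data-parallel replica index
    pp: int  # pipeline stage index
    tp: int  # tensor-parallel index within the stage


def rank_to_coord(rank: int, tp_size: int, pp_size: int, dp_size: int) -> Coord:
    """Map a global rank to (dp, pp, tp), with tp varying fastest."""
    world_size = tp_size * pp_size * dp_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} outside world of size {world_size}")
    tp = rank % tp_size
    pp = (rank // tp_size) % pp_size
    dp = rank // (tp_size * pp_size)
    return Coord(dp=dp, pp=pp, tp=tp)


def pipeline_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of a synchronous pipeline: (p - 1) / (m + p - 1)."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


if __name__ == "__main__":
    # 32 GPUs: tensor parallel 8 (one node), pipeline 2, data parallel 2.
    for rank in (0, 7, 8, 31):
        print(rank, rank_to_coord(rank, tp_size=8, pp_size=2, dp_size=2))
    for m in (1, 4, 16):
        print(f"4 stages, {m} micro-batches: bubble = {pipeline_bubble_fraction(4, m):.2f}")
```

The bubble formula shows why micro-batching matters: with a single micro-batch a 4-stage pipeline is idle 75% of the time, and the idle share shrinks as the number of micro-batches grows.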
Job Scheduler: SLURM vs Kubernetes
SLURM is appropriate for:
- Traditional HPC research workloads
- Batch-oriented GPU training jobs
- Organizations already running SLURM for CPU workloads
- Strong fairshare and accounting requirements
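#!/bin/bash
#SBATCH --job-name=llm-train
#SBATCH --nodes=4
# One torchrun launcher per node; torchrun itself spawns the 8 GPU workers
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --time=48:00:00
#SBATCH --partition=gpu
# Use the first allocated node as the rendezvous host (port 29500 is an arbitrary choice)
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
srun torchrun \
    --nnodes=4 \
    --nproc_per_node=8 \
    --rdzv_backend=c10d \
    --rdzv_endpoint="${head_node}:29500" \
    --rdzv_id="$SLURM_JOB_ID" \
    train.py --model llama3-70b --batch-size 256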
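Inside the training script, each worker started by torchrun can discover its place in the job from the environment variables torchrun sets (RANK, LOCAL_RANK, and WORLD_SIZE). The sketch below shows one way train.py might read them; the function name and the fallback defaults for single-process runs are assumptions, not part of the original script.

```python
import os


def read_torchrun_env(env=None):
    """Read the rank information torchrun exports to each worker process.

    Falls back to single-process defaults when the variables are absent,
    which is convenient for local debugging (an assumption of this sketch).
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),              # global rank across all nodes
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # GPU index on this node
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total number of workers
    }


if __name__ == "__main__":
    info = read_torchrun_env()
    print(f"worker {info['rank']} of {info['world_size']}, "
          f"using local GPU {info['local_rank']}")
```

With 4 nodes and 8 workers per node, WORLD_SIZE is 32 and LOCAL_RANK ranges from 0 to 7 on every node.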
Kubernetes with GPU Operator is appropriate for:
- Container-native MLOps pipelines
- Multi-tenant environments with Kubernetes expertise
- Dynamic scaling requirements
- Integration with Kubeflow, Ray, or Argo Workflows
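On Kubernetes, GPUs are requested through the `nvidia.com/gpu` extended resource that the NVIDIA device plugin advertises. As a minimal sketch, the snippet below builds a single-Pod manifest as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The Pod name, image, and command are hypothetical placeholders, and a real multi-node training job would normally go through an operator or workflow engine rather than a bare Pod.

```python
import json


def gpu_pod_manifest(name, image, gpus, command):
    """Build a Pod manifest that requests whole GPUs via nvidia.com/gpu."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "trainer",
                    "image": image,
                    "command": command,
                    # GPUs are requested in the limits section, in whole units
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = gpu_pod_manifest(
        name="llm-train-demo",                         # hypothetical name
        image="registry.example.com/llm-train:latest", # hypothetical image
        gpus=8,
        command=["torchrun", "--nproc_per_node=8", "train.py"],
    )
    print(json.dumps(manifest, indent=2))
```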
Common Problems and Solutions
Network bottleneck (all-reduce slow): Distributed training performance is limited by all-reduce bandwidth. Symptoms: GPU utilization spikes to 100% during forward/backward passes but drops to near 0% during gradient synchronization.
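# Diagnose with NCCL debug output (shows which transport and topology NCCL selected)
NCCL_DEBUG=INFO torchrun --nnodes=4 ... train.py 2>&1 | grep -E "NCCL|Ring"
# Pin NCCL to the InfiniBand adapters (names vary per system; list them with ibstat)
export NCCL_IB_HCA=mlx5_0,mlx5_1
# Keep NCCL's socket/bootstrap traffic off loopback and Docker interfaces
# (a single leading ^ negates the whole comma-separated list)
export NCCL_SOCKET_IFNAME=^lo,docker0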
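One way to quantify the symptom described above is to compare measured throughput against ideal linear scaling, and to estimate how much of each step is not covered by compute. The sketch below assumes you have already measured these values yourself; the function names and sample numbers are illustrative and do not come from a real run.

```python
def scaling_efficiency(throughput_1gpu: float, throughput_ngpu: float,
                       num_gpus: int) -> float:
    """Measured multi-GPU throughput as a fraction of ideal linear scaling."""
    if throughput_1gpu <= 0 or num_gpus < 1:
        raise ValueError("throughput and GPU count must be positive")
    return throughput_ngpu / (throughput_1gpu * num_gpus)


def exposed_sync_fraction(step_seconds: float, compute_seconds: float) -> float:
    """Share of a training step spent outside forward/backward compute."""
    if step_seconds <= 0:
        raise ValueError("step time must be positive")
    return max(0.0, (step_seconds - compute_seconds) / step_seconds)


if __name__ == "__main__":
    # Hypothetical measurements: 100 samples/s on 1 GPU, 2400 samples/s on 32 GPUs.
    print(f"Scaling efficiency: {scaling_efficiency(100, 2400, 32):.0%}")
    # Hypothetical step profile: 1.0 s per step, 0.75 s of it in compute.
    print(f"Exposed sync time:  {exposed_sync_fraction(1.0, 0.75):.0%}")
```

A low efficiency combined with a large exposed fraction points at the network rather than at the input pipeline or the model code.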
GPU Out-Of-Memory (OOM): Model or batch size exceeds available GPU memory.
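# Enable gradient checkpointing (trades extra recomputation for lower activation memory).
# gradient_checkpointing_enable() is a Hugging Face Transformers model method;
# for plain PyTorch modules, wrap blocks with torch.utils.checkpoint.checkpoint instead.
import torch
model.gradient_checkpointing_enable()
# Use BF16 mixed precision (can roughly halve activation memory; weights stay in FP32)
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    output = model(inputs)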
# Reduce batch size and increase gradient accumulation steps
optimizer.zero_grad()
for i, (data, target) in enumerate(loader):
    output = model(data)
    loss = criterion(output, target) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
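The accumulation loop above keeps the effective batch size unchanged as long as the number of accumulation steps is chosen to match. A minimal sketch of that arithmetic follows, using the global batch size of 256 and the 32 GPUs from the SLURM example; the micro-batch size of 2 is a hypothetical value for a model that does not fit larger batches.

```python
def accumulation_steps_for(global_batch: int, micro_batch_per_gpu: int,
                           world_size: int) -> int:
    """Number of accumulation steps needed to reach the target global batch.

    global_batch = micro_batch_per_gpu * world_size * accumulation_steps
    """
    samples_per_pass = micro_batch_per_gpu * world_size
    if samples_per_pass <= 0 or global_batch % samples_per_pass != 0:
        raise ValueError(
            f"global batch {global_batch} is not a multiple of "
            f"{samples_per_pass} samples per forward/backward pass"
        )
    return global_batch // samples_per_pass


if __name__ == "__main__":
    steps = accumulation_steps_for(global_batch=256, micro_batch_per_gpu=2, world_size=32)
    print(f"accumulation_steps = {steps}")
```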
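Thermal throttling: GPUs reduce clock speed when they exceed their thermal slowdown threshold (typically around 83°C for H100; the exact value depends on the SKU and cooling design and is reported by nvidia-smi -q -d TEMPERATURE). Monitor with DCGM: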
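# Real-time GPU temperature and power monitoring
dcgmi dmon -e 150,155 # field 150 = GPU temperature (°C), field 155 = power draw (W)
# seff summarizes a finished job's elapsed time and CPU/memory efficiency; it does not
# report GPU clocks, so compare elapsed time against a known-good run to spot throttling
seff <jobid>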
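Where DCGM is not yet deployed, a quick check can be scripted around nvidia-smi's CSV query mode. The sketch below parses the output of `nvidia-smi --query-gpu=index,temperature.gpu,power.draw --format=csv,noheader,nounits` and flags GPUs at or above a limit. The 83°C default follows the figure quoted above, the function names are hypothetical, and the parser assumes every field is numeric (it does not handle "[N/A]" values).

```python
import subprocess

QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu,power.draw",
    "--format=csv,noheader,nounits",
]


def parse_gpu_readings(csv_text):
    """Parse 'index, temperature, power' lines into a list of dictionaries."""
    readings = []
    for line in csv_text.strip().splitlines():
        index, temp_c, power_w = (field.strip() for field in line.split(","))
        readings.append(
            {"index": int(index), "temp_c": float(temp_c), "power_w": float(power_w)}
        )
    return readings


def hot_gpus(readings, limit_c=83.0):
    """Return the indices of GPUs at or above the temperature limit."""
    return [r["index"] for r in readings if r["temp_c"] >= limit_c]


def query_gpus():
    """Run nvidia-smi on the local node (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True)
    return parse_gpu_readings(result.stdout)


if __name__ == "__main__":
    for index in hot_gpus(query_gpus()):
        print(f"GPU {index} is at or above the thermal limit")
```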
Node failures on long training runs: Training jobs that run for days will eventually hit a node failure. Implement periodic model checkpointing so the job can resume from the last saved step:
# Save checkpoint every N steps (in distributed runs, do this on rank 0 only)
if global_step % save_interval == 0:
    torch.save({
        'step': global_step,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss.item(),  # store a plain float rather than a tensor
    }, f'checkpoint-step-{global_step}.pt')
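Saving is only half of the recovery path: after a failure the job has to find the newest complete checkpoint, and a crash during a save must not leave a truncated file behind. The sketch below shows one possible approach under two assumptions: checkpoints follow the `checkpoint-step-<N>.pt` naming used above, and files are written to a temporary name and then renamed. The helper names are hypothetical.

```python
import os
import re
import tempfile
from pathlib import Path

CHECKPOINT_RE = re.compile(r"^checkpoint-step-(\d+)\.pt$")


def latest_checkpoint(directory):
    """Return the Path of the highest-step checkpoint, or None if there is none."""
    best_step, best_path = -1, None
    for path in Path(directory).iterdir():
        match = CHECKPOINT_RE.match(path.name)
        # Compare steps numerically so that step 100 beats step 20
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


def atomic_write_bytes(path, payload):
    """Write to a temporary sibling file, then rename it into place."""
    tmp_path = Path(str(path) + ".tmp")
    with open(tmp_path, "wb") as handle:
        handle.write(payload)
        handle.flush()
        os.fsync(handle.fileno())
    # The rename is atomic on POSIX when both paths are on the same filesystem
    os.replace(tmp_path, path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        for step in (20, 100):
            atomic_write_bytes(Path(workdir) / f"checkpoint-step-{step}.pt", b"demo")
        print(latest_checkpoint(workdir).name)
```

In a training script the same pattern applies with `torch.save` writing to the temporary path; leftover `.tmp` files from an interrupted save are ignored by `latest_checkpoint` because they do not match the pattern.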
Benchmark Validation
Before production deployment:
# NCCL all-reduce bandwidth test with nccl-tests
git clone https://github.com/NVIDIA/nccl-tests
cd nccl-tests && make MPI=1 CUDA_HOME=/usr/local/cuda
# 32 MPI ranks (4 nodes x 8 GPUs), one GPU per rank, so -g is 1
mpirun -np 32 -hostfile hostfile ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
# HPL for compute validation
mpirun -np 32 -hostfile hostfile ./xhpl # one rank per GPU (4 nodes x 8); target: > 80% of FP64 peak
# MPI bandwidth between nodes
mpirun -np 2 -host gpu01,gpu02 ./IMB-MPI1 PingPong
Expected NCCL all-reduce performance for 4 DGX H100 nodes with NDR InfiniBand: > 150 GB/s bus bandwidth (the busbw column in nccl-tests output) at 8 GB message size.
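When reading nccl-tests output, note that it reports two figures: algorithm bandwidth (algbw, message size divided by elapsed time) and bus bandwidth (busbw), which for all-reduce is algbw multiplied by 2(n-1)/n so that results are comparable across rank counts. The sketch below reproduces that conversion; the timing value in the usage section is a hypothetical example, not a measured result.

```python
def algorithm_bandwidth_gb_per_s(message_bytes: float, seconds: float) -> float:
    """algbw: message size divided by elapsed time, in GB/s."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return message_bytes / seconds / 1e9


def allreduce_bus_bandwidth(algbw_gb_per_s: float, num_ranks: int) -> float:
    """busbw for all-reduce: algbw * 2 * (n - 1) / n."""
    if num_ranks < 1:
        raise ValueError("need at least one rank")
    return algbw_gb_per_s * 2.0 * (num_ranks - 1) / num_ranks


if __name__ == "__main__":
    # Hypothetical run: an 8 GB all-reduce across 32 ranks finishing in 80 ms.
    algbw = algorithm_bandwidth_gb_per_s(8e9, 0.08)
    busbw = allreduce_bus_bandwidth(algbw, 32)
    print(f"algbw = {algbw:.1f} GB/s, busbw = {busbw:.1f} GB/s")
```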
Best Practices
- Software version consistency: Ensure identical CUDA, cuDNN, NCCL, MPI, and PyTorch/TensorFlow versions across all nodes. Version mismatch causes cryptic failures at scale.
- DCGM monitoring: Deploy DCGM Exporter on all GPU nodes. GPU memory errors (ECC), temperature, power, and utilization must be visible in real time.
- Periodic checkpoint discipline: Never start a training run longer than 4 hours without automatic checkpointing.
- GPU resource quotas: In multi-tenant environments, define SLURM GRES limits or Kubernetes ResourceQuotas per team before the first user submits jobs.
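The version-consistency practice above lends itself to an automated check. As a minimal sketch, the function below takes a per-node inventory of component versions and reports every component that is not identical everywhere. How the inventory is collected (for example over SSH or from a configuration management database) is left open, and the node names and version strings in the usage section are made-up examples.

```python
from collections import defaultdict


def find_version_mismatches(node_versions):
    """Return {component: {version: [nodes]}} for components that differ across nodes."""
    seen = defaultdict(lambda: defaultdict(list))
    for node, versions in node_versions.items():
        for component, version in versions.items():
            seen[component][version].append(node)
    return {
        component: {version: sorted(nodes) for version, nodes in by_version.items()}
        for component, by_version in seen.items()
        if len(by_version) > 1  # more than one distinct version in the cluster
    }


if __name__ == "__main__":
    inventory = {  # illustrative values only
        "gpu01": {"cuda": "12.2", "nccl": "2.19.3"},
        "gpu02": {"cuda": "12.2", "nccl": "2.18.3"},
    }
    for component, by_version in find_version_mismatches(inventory).items():
        print(f"{component} differs: {by_version}")
```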
GPU cluster design is an integrated engineering discipline — hardware, network, storage, and software must be optimized together. Contact Mevasis for GPU cluster architecture, deployment, and performance tuning services.