HPC Capacity Planning: Core Formula, Processor Comparison, Storage Tiers, and Network
HPC capacity planning methodology: core capacity formula, Intel vs AMD vs ARM processor comparison, memory sizing by workload type, NUMA impact, 3-tier storage architecture, network technology table, fat-tree oversubscription, GPU capacity planning, Little's Law for queue analysis, and phased growth planning.
Capacity planning for HPC is the process of matching infrastructure investment to workload requirements — both current and projected. Unlike web applications where horizontal scaling is immediate, HPC hardware has 18–24 month procurement cycles. Getting capacity wrong means either chronic queue backlogs or stranded capital in idle servers.
Core Capacity Formula
The foundational calculation for compute node count:
Required cores = (avg_jobs_per_day × avg_cores_per_job × avg_walltime_hours)
÷ 24
× utilization_headroom_factor
utilization_headroom_factor = 1 / target_utilization
Example:
200 jobs/day × 128 cores/job × 6h avg walltime ÷ 24h = 6,400 cores
At target 70% utilization: 6,400 ÷ 0.7 = ~9,143 cores
With 64-core nodes: 143 nodes minimum
Add 20–30% for growth over the planned hardware lifetime (typically 5 years). The actual number to procure should account for the expected year-2 demand, not just year-1.
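The formula and the growth allowance can be combined in a few lines. This is a minimal sketch: the function name `required_nodes` and the `growth_fraction` parameter are illustrative, and it assumes growth is applied as a simple multiplier on the provisioned core count.

```python
import math

def required_nodes(jobs_per_day, cores_per_job, walltime_hours,
                   target_utilization=0.70, cores_per_node=64,
                   growth_fraction=0.0):
    """Return (average busy cores, cores to provision, node count)."""
    # Core-hours demanded per day, spread over 24 hours
    busy_cores = jobs_per_day * cores_per_job * walltime_hours / 24
    # Headroom factor = 1 / target_utilization, then the growth allowance
    provisioned = busy_cores / target_utilization * (1 + growth_fraction)
    nodes = math.ceil(provisioned / cores_per_node)
    return busy_cores, provisioned, nodes

busy, cores, nodes = required_nodes(200, 128, 6)
print(f"{busy:.0f} busy cores, {cores:.0f} to provision, {nodes} nodes")

# Same workload with a 25% growth allowance (assumed mid-point of 20-30%)
print(required_nodes(200, 128, 6, growth_fraction=0.25)[2], "nodes with growth")
```

Running it with the example inputs reproduces the 143-node minimum; the growth allowance raises that to 179 nodes.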
Processor Architecture Comparison
| Architecture | Best For | Cores/Socket | Memory Channels | Key Advantage |
|---|---|---|---|---|
| AMD EPYC 9654 (Genoa) | MPI simulation, genomics | 96 | 12-channel DDR5 | Core density, memory bandwidth |
| AMD EPYC 9755 (Turin) | High-throughput HPC | 128 | 12-channel DDR5 | High core count on full Zen 5 cores |
| Intel Xeon 6 Granite Rapids (6900P) | MKL-optimized codes | 128 | 12-channel DDR5 | Single-thread perf, MKL ecosystem |
| Intel Xeon 6 Sierra Forest | High core count, moderate memory | 144 | 8-channel DDR5 | Power efficiency at scale |
| Ampere Altra Max | Power-constrained, cloud bursting | 128 | 8-channel DDR4 | Lowest power per core |
| Fujitsu A64FX | Bandwidth-bound simulation | 48 | On-package HBM2 (no DIMM channels) | 1 TB/s memory bandwidth |
For new purchases in 2025–2026:
- General HPC simulation (MPI-heavy): AMD EPYC 9004 (Genoa) or 9005 (Turin) series
- MKL-dependent codes (ANSYS, MATLAB, Gaussian): Intel Xeon 6 Granite Rapids
- Power-constrained environments: Ampere Altra
Memory Sizing by Workload Type
Memory requirements vary by more than an order of magnitude across HPC applications:
| Workload | Memory per Core | Notes |
|---|---|---|
| Tight MPI simulation (OpenFOAM, NAMD) | 4–8 GB | Distributed memory limits per-node requirement |
| Shared-memory simulation (ANSYS Mechanical) | 8–32 GB | Entire model may reside in one node’s memory |
| Whole-genome assembly (SPAdes) | 64–512 GB per job | Memory-intensive graph algorithms; figure is total RAM for the job, not per core |
| Large-scale CFD (Fluent, STAR-CCM+) | 8–16 GB | Mesh size determines memory |
| Monte Carlo simulation | 1–4 GB | Embarrassingly parallel, small per-process footprint |
| Deep learning training | 8–16 GB | GPU memory usually more critical |
| Seismic processing | 4–8 GB | Data streaming reduces per-core requirement |
Rule of thumb: Start with 4 GB per core as the minimum. If even 10% of your workload requires more than 16 GB/core, add dedicated high-memory nodes.
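The rule of thumb can be applied to a workload mix as follows. This is a sketch under stated assumptions: `plan_memory` and its parameters are hypothetical names, the mix is expressed as shares of total workload, and standard nodes are sized for the largest per-core requirement that stays at or below the 16 GB/core threshold.

```python
def plan_memory(workload_mix, cores_per_node=64, min_gb_per_core=4,
                high_mem_threshold_gb=16, high_mem_share=0.10):
    """workload_mix: list of (share_of_workload, gb_per_core) tuples."""
    # Share of the workload that exceeds the high-memory threshold
    heavy_share = sum(share for share, gb in workload_mix
                      if gb > high_mem_threshold_gb)
    # Standard nodes cover everything at or below the threshold,
    # never dropping below the 4 GB/core floor
    standard = [gb for _, gb in workload_mix if gb <= high_mem_threshold_gb]
    gb_per_core = max([min_gb_per_core] + standard)
    return {
        "standard_node_gb": gb_per_core * cores_per_node,
        "needs_high_memory_nodes": heavy_share >= high_mem_share,
    }

# Illustrative mix: 60% at 4 GB/core, 25% at 8 GB/core, 15% at 32 GB/core
plan = plan_memory([(0.60, 4), (0.25, 8), (0.15, 32)])
print(plan)
```

With this mix, standard 64-core nodes come out at 512 GB and the 15% high-memory share justifies a dedicated high-memory partition.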
NUMA Impact on Memory Bandwidth
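Modern multi-socket servers have a NUMA (Non-Uniform Memory Access) topology. An AMD EPYC 9654 has 12 CCDs per socket; how many NUMA domains the OS sees depends on BIOS settings (the NPS mode and the L3-as-NUMA option), and with L3-as-NUMA enabled each CCD appears as its own domain, giving 12 per socket. Processes that access memory in a remote NUMA domain pay a 30–50% bandwidth penalty.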
# Check NUMA topology
numactl --hardware
lscpu | grep NUMA
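# NUMA-aware MPI binding example (2× EPYC 9654, 12 NUMA domains per socket)
# 24 MPI ranks per node, one per NUMA domain
#SBATCH --ntasks-per-node=24
# 8 cores per rank: 192 cores per node / 24 ranks
#SBATCH --cpus-per-task=8
# Pass --cpus-per-task explicitly; recent Slurm releases do not inherit it from sbatch
srun --cpus-per-task="$SLURM_CPUS_PER_TASK" --cpu-bind=cores ./simulation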
Applications that ignore NUMA topology on EPYC can see 30–50% lower memory bandwidth than expected. This is a frequent cause of underwhelming benchmark results on high-NUMA systems.
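The rank layout for a one-rank-per-NUMA-domain job follows directly from the node geometry. The helper below is a sketch with a hypothetical name, `numa_rank_layout`, and it assumes cores divide evenly across NUMA domains; the domain count should be taken from `numactl --hardware` on the actual node.

```python
def numa_rank_layout(sockets, cores_per_socket, numa_per_socket):
    """Return (ranks per node, cores per rank) for one MPI rank per NUMA domain."""
    ranks = sockets * numa_per_socket
    total_cores = sockets * cores_per_socket
    if total_cores % ranks:
        raise ValueError("cores do not divide evenly across NUMA domains")
    return ranks, total_cores // ranks

ranks, cpus = numa_rank_layout(sockets=2, cores_per_socket=96, numa_per_socket=12)
print(f"#SBATCH --ntasks-per-node={ranks}")
print(f"#SBATCH --cpus-per-task={cpus}")
```

For a dual-socket EPYC 9654 node exposing 12 domains per socket this gives 24 ranks with 8 cores each; with 4 domains per socket (NPS4) it gives 8 ranks with 24 cores each.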
3-Tier Storage Architecture
| Tier | Technology | Capacity | Throughput | Use Case |
|---|---|---|---|---|
| Hot (scratch) | NVMe SSD parallel FS | 100–500 TB | 50–500 GB/s | Active job I/O |
| Warm (project) | HDD parallel FS | 500 TB–5 PB | 5–50 GB/s | Group/user data |
| Cold (archive) | Tape or object storage | 5–50 PB | 0.5–5 GB/s | Long-term retention |
Sizing scratch storage:
scratch_capacity = concurrent_jobs × max_job_data_size × 3
(3x for input + output + temporary files)
Example: 200 concurrent jobs × 2 TB/job × 3 = 1.2 PB minimum scratch
For genomics and seismic workloads where single jobs may generate 10–100 TB, scratch sizing is the dominant design decision.
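The scratch formula is simple enough to keep alongside the compute model so both update together. A minimal sketch, where `scratch_capacity_tb` is an illustrative name and the second call uses made-up numbers for a large-output workload:

```python
def scratch_capacity_tb(concurrent_jobs, max_job_data_tb, copies=3):
    """Scratch capacity in TB; copies=3 covers input, output, and temporary files."""
    return concurrent_jobs * max_job_data_tb * copies

baseline = scratch_capacity_tb(200, 2)
print(f"{baseline} TB = {baseline / 1000:.1f} PB")

# Hypothetical genomics case: 20 concurrent jobs at 50 TB each
large = scratch_capacity_tb(20, 50)
print(f"{large} TB = {large / 1000:.1f} PB")
```

Even a small number of very large jobs can demand more scratch than hundreds of ordinary ones, which is why the per-job maximum matters more than the average.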
Network Technology and Bandwidth
| Technology | Bandwidth/Port | Latency | Best For |
|---|---|---|---|
| InfiniBand NDR | 400 Gb/s | ~0.6 µs | Large GPU clusters, tight MPI |
| InfiniBand HDR | 200 Gb/s | ~1 µs | Standard HPC, medium clusters |
| RoCE v2 (100 GbE) | 100 Gb/s | ~3 µs | Cost-sensitive, existing Ethernet |
| 25 GbE Ethernet | 25 Gb/s | ~30 µs | Storage network, management |
| 10 GbE Ethernet | 10 Gb/s | ~50 µs | Management only |
Fat-tree oversubscription: A 1:1 (non-blocking) fat-tree requires equal uplink and downlink bandwidth at each aggregation tier. This is the maximum cost point. For most HPC workloads, 2:1 oversubscription is acceptable; only the tightest MPI workloads require non-blocking.
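Non-blocking two-tier fat-tree (N nodes, each at 200 Gb/s):
Leaf tier: N ÷ (ports_per_switch ÷ 2) leaf switches (half of each leaf's ports face nodes, half face the core)
Uplinks needed: N links, N × 200 Gb/s total (equal to the node-facing bandwidth)
Core switches: N ÷ ports_per_core_switch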
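Switch counts for a given oversubscription ratio can be estimated with a short calculation. This sketch assumes a two-tier (leaf/core) fat-tree built from identical switches, where each leaf splits its ports between nodes and uplinks according to the ratio; `two_tier_fat_tree` and its parameters are hypothetical names, and the 40-port default is only an example.

```python
import math

def two_tier_fat_tree(nodes, switch_ports=40, oversubscription=1.0, link_gbps=200):
    """Count switches and uplinks; oversubscription is downlink:uplink per leaf."""
    # Split leaf ports so that down/up never exceeds the requested ratio
    down_ports = math.floor(switch_ports * oversubscription / (oversubscription + 1))
    up_ports = switch_ports - down_ports
    leaves = math.ceil(nodes / down_ports)
    if leaves > switch_ports:
        # Each core switch can reach at most switch_ports leaves
        raise ValueError("too many leaves for two tiers; a third tier is needed")
    uplinks = leaves * up_ports
    return {
        "leaf_switches": leaves,
        "uplinks": uplinks,
        "uplink_gbps": uplinks * link_gbps,
        "core_switches": math.ceil(uplinks / switch_ports),
    }

non_blocking = two_tier_fat_tree(400)
two_to_one = two_tier_fat_tree(400, oversubscription=2.0)
print(non_blocking)
print(two_to_one)
```

For 400 nodes, the non-blocking design needs 20 leaf and 10 core switches, while 2:1 oversubscription needs 16 leaf and 6 core switches, which illustrates where the cost difference comes from.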
GPU Capacity Planning
GPU capacity planning adds two dimensions: GPU memory and inter-GPU bandwidth.
GPU memory per node:
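min_GPU_memory_per_node ≥ parameter_count × bytes_per_parameter
(weights only; training also needs room for gradients, optimizer state, and activations)
LLaMA-3 70B in FP16: 70B × 2 bytes = 140 GB → 2× H100 80GB minimum
GPT-4 class (assumed 1.8T parameters) in FP8: 1.8T × 1 byte = 1.8 TB
→ at least 23× H100 80GB for the weights alone, so the model must be sharded across several nodes (e.g. ZeRO Stage 3, optionally with CPU offloading)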
NVLink bandwidth for tensor parallelism:
Within a single DGX H100 node (8× H100 with NVLink4): 900 GB/s per GPU (bidirectional)
Across nodes via InfiniBand NDR: 400 Gb/s = 50 GB/s per port
Tensor parallelism requires NVLink-grade bandwidth. Only implement tensor parallelism within a node (or DGX pod with NVLink switch). Pipeline parallelism works across nodes via InfiniBand.
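The weight-memory arithmetic and the bandwidth gap can be checked with a short script. This is a sketch that counts model weights only, uses decimal gigabytes, and takes the 1.8T parameter count as the same assumption made in the text; `gpus_for_weights` and `BYTES_PER_PARAM` are illustrative names.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1}

def gpus_for_weights(param_count, precision, gpu_memory_gb=80):
    """Return (weight size in GB, minimum GPUs needed to hold the weights)."""
    weight_bytes = param_count * BYTES_PER_PARAM[precision]
    gpu_bytes = gpu_memory_gb * 10**9
    gpus = -(-weight_bytes // gpu_bytes)  # integer ceiling division
    return weight_bytes / 1e9, gpus

llama_gb, llama_gpus = gpus_for_weights(70_000_000_000, "fp16")
big_gb, big_gpus = gpus_for_weights(1_800_000_000_000, "fp8")
print(f"70B FP16: {llama_gb:.0f} GB -> {llama_gpus} GPUs")
print(f"1.8T FP8: {big_gb:.0f} GB -> {big_gpus} GPUs")

# Bandwidth gap between NVLink (GB/s) and one InfiniBand NDR port (Gb/s -> GB/s)
nvlink_vs_ib = 900 / (400 / 8)
print(f"NVLink is {nvlink_vs_ib:.0f}x one NDR port")
```

The 18× bandwidth gap is the reason tensor parallelism is kept inside the NVLink domain while pipeline stages are placed across nodes.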
Queue Analysis with Little’s Law
Little’s Law from queuing theory provides a sanity check for cluster sizing:
L = λ × W
Where:
L = average number of jobs in system (running + queued)
λ = job arrival rate (jobs/hour)
W = average time in system (wait time + run time)
Example:
λ = 50 jobs/hour
W = 8h average run + 2h average wait = 10 hours
L = 50 × 10 = 500 concurrent jobs in system
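If the average job requires 128 cores:
Cores busy on average = λ × run time × 128 = 50 × 8 × 128 = 51,200 cores
Cores represented by every job in the system = 500 × 128 = 64,000 cores (running + queued)
A cluster with only 32,000 cores can run at most 250 such jobs at once, which completes 250 ÷ 8 ≈ 31 jobs/hour against 50 arriving. The queue then grows without bound, and the steady state that Little’s Law assumes never forms. Capacity has to exceed 51,200 cores just to keep the queue stable; a 2× expansion to 64,000 cores brings utilization to about 80% and leaves headroom to shorten waits, though how far they fall depends on the job-size mix and scheduler policy.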
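The same sanity check can be scripted against measured queue data. The sketch below assumes uniform jobs (same core count and run time) and a steady state; `littles_law` and `max_throughput` are hypothetical helper names, not part of any scheduler API.

```python
def littles_law(arrival_rate, run_hours, wait_hours, cores_per_job):
    """Apply L = lambda * W and split the result into running and queued jobs."""
    in_system = arrival_rate * (run_hours + wait_hours)
    running = arrival_rate * run_hours
    return {
        "jobs_in_system": in_system,
        "jobs_running": running,
        "jobs_queued": in_system - running,
        "busy_cores": running * cores_per_job,
    }

def max_throughput(cluster_cores, cores_per_job, run_hours):
    """Jobs per hour the cluster can complete when every core slot is busy."""
    return (cluster_cores // cores_per_job) / run_hours

stats = littles_law(arrival_rate=50, run_hours=8, wait_hours=2, cores_per_job=128)
print(stats)

for cores in (32_000, 64_000):
    rate = max_throughput(cores, 128, 8)
    print(f"{cores} cores: {rate} jobs/hour, stable={rate > 50}")
```

A cluster is only stable when its maximum throughput exceeds the arrival rate: 32,000 cores completes 31.25 jobs/hour against 50 arriving, while 64,000 cores completes 62.5.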
Phased Growth Planning
Phased procurement reduces upfront capital while maintaining expansion flexibility:
| Phase | Trigger | Action |
|---|---|---|
| Initial deployment | Day 1 | 60% of 3-year projected capacity |
| Phase 2 expansion | Cluster utilization > 85% for 30 days | Add compute nodes to reach 80% of target |
| Phase 3 expansion | Cluster utilization > 85% for 30 days again | Add remaining nodes + storage expansion |
| Technology refresh | Year 4–5 | Replace oldest nodes with next-generation hardware |
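The expansion trigger in the table can be evaluated automatically from daily utilization figures. This sketch assumes "for 30 days" means 30 consecutive days and that utilization is supplied as one fraction per day from the scheduler's accounting data; `expansion_triggered` is an illustrative name.

```python
def expansion_triggered(daily_utilization, threshold=0.85, window_days=30):
    """True if utilization stayed above threshold for window_days in a row."""
    streak = 0
    for value in daily_utilization:
        # Any day at or below the threshold resets the streak
        streak = streak + 1 if value > threshold else 0
        if streak >= window_days:
            return True
    return False

sustained = [0.90] * 30
interrupted = [0.90] * 29 + [0.80] + [0.90] * 29
print(expansion_triggered(sustained), expansion_triggered(interrupted))
```

Requiring a consecutive run avoids triggering a procurement on a short burst, such as a deadline-driven spike that clears within a few weeks.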
The key constraint for expansion planning is network headroom: a 40-port InfiniBand switch with 32 nodes attached has only 8 ports left, and those must cover uplinks as well as future nodes. Design the switch tier to accommodate the full planned capacity at initial deployment.
Capacity planning is not a one-time exercise; it is an ongoing discipline. Queue metrics from SLURM (sreport, sacct) provide the empirical data to validate and refine the model over time.