HPC Capacity Planning Guide: Compute, Memory, Storage, Network Sizing

Capacity planning for HPC is the process of matching infrastructure investment to workload requirements — both current and projected. Unlike web applications where horizontal scaling is immediate, HPC hardware has 18–24 month procurement cycles. Getting capacity wrong means either chronic queue backlogs or stranded capital in idle servers.

Core Capacity Formula

The foundational calculation for compute node count:

Required cores = (avg_jobs_per_day × avg_cores_per_job × avg_walltime_hours) 
                 ÷ 24 
                 × utilization_headroom_factor

utilization_headroom_factor = 1 / target_utilization

Example:
  200 jobs/day × 128 cores/job × 6h avg walltime ÷ 24h = 6,400 cores
  At target 70% utilization: 6,400 ÷ 0.7 = ~9,143 cores
  With 64-core nodes: 143 nodes minimum

Add 20–30% for growth over the planned hardware lifetime (typically 5 years). Size the purchase for expected year-2 demand at a minimum, not just year-1, because the long procurement cycle means extra capacity cannot arrive quickly once queues start to back up.
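The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: the function name required_nodes and its parameters are hypothetical, and the 25% growth figure is simply the midpoint of the range quoted above.

```python
import math

def required_nodes(jobs_per_day, cores_per_job, walltime_hours,
                   target_utilization=0.70, cores_per_node=64, growth=0.0):
    """Return (busy_cores, provisioned_cores, node_count) for a steady workload."""
    # Average cores busy at any instant: daily core-hours spread over 24 hours.
    busy_cores = jobs_per_day * cores_per_job * walltime_hours / 24
    # Add headroom so the cluster sits at the target utilization, then growth.
    provisioned = busy_cores / target_utilization * (1 + growth)
    return busy_cores, provisioned, math.ceil(provisioned / cores_per_node)

busy, provisioned, nodes = required_nodes(200, 128, 6)
print(f"{busy:.0f} busy cores, {provisioned:.0f} provisioned, {nodes} nodes")

_, _, nodes_with_growth = required_nodes(200, 128, 6, growth=0.25)
print(f"{nodes_with_growth} nodes with 25% growth")
```

With the example inputs this reproduces the 6,400 and ~9,143 core figures and the 143-node minimum; adding 25% growth raises the count to 179 nodes.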

Processor Architecture Comparison

Architecture	Best For	Cores/Socket	Memory Channels	Key Advantage
AMD EPYC 9654 (Genoa)	MPI simulation, genomics	96	12-channel DDR5	Core density, memory bandwidth
AMD EPYC 9755 (Turin)	High-throughput HPC	128	12-channel DDR5	Highest core count with full Zen 5 cores
Intel Xeon 6 Granite Rapids (6900P)	MKL-optimized codes	128	12-channel DDR5	Single-thread perf, MKL ecosystem
Intel Xeon 6 Sierra Forest (6700E)	High core count, moderate memory	144	8-channel DDR5	Power efficiency at scale
Ampere Altra Max	Power-constrained, cloud bursting	128	8-channel DDR4	Lowest power per core
Fujitsu A64FX	Bandwidth-bound simulation	48	HBM2 (on-package)	1 TB/s memory bandwidth

For new purchases in 2025–2026:

General HPC simulation (MPI-heavy): AMD EPYC 9004 (Genoa) or 9005 (Turin) series
MKL-dependent codes (ANSYS, MATLAB, Gaussian): Intel Xeon 6 Granite Rapids
Power-constrained environments: Ampere Altra

Memory Sizing by Workload Type

Memory requirements vary by more than an order of magnitude across HPC applications:

Workload	Memory per Core	Notes
Tight MPI simulation (OpenFOAM, NAMD)	4–8 GB	Distributed memory limits per-node requirement
Shared-memory simulation (ANSYS Mechanical)	8–32 GB	Entire model may reside in one node’s memory
Whole-genome assembly (SPAdes)	64–512 GB per job	Node-level footprint rather than per core; memory-intensive graph algorithms
Large-scale CFD (Fluent, STAR-CCM+)	8–16 GB	Mesh size determines memory
Monte Carlo simulation	1–4 GB	Embarrassingly parallel, small per-process footprint
Deep learning training	8–16 GB	GPU memory usually more critical
Seismic processing	4–8 GB	Data streaming reduces per-core requirement

Rule of thumb: Start with 4 GB per core as the minimum. If even 10% of your workload requires more than 16 GB/core, add dedicated high-memory nodes.
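The rule of thumb can be applied to a measured workload mix roughly as follows. This is a sketch under stated assumptions: plan_memory and the example mix are hypothetical, shares are fractions of core-hours, and sizing standard nodes for the largest requirement at or below 16 GB/core is one reasonable reading of the rule, not the only one.

```python
def plan_memory(workload_mix, cores_per_node=64, baseline_gb_per_core=4,
                high_mem_threshold_gb=16, high_mem_share_trigger=0.10):
    """workload_mix: list of (share_of_core_hours, gb_per_core) tuples."""
    total = sum(share for share, _ in workload_mix)
    # Fraction of the workload that exceeds the high-memory threshold.
    high_share = sum(s for s, gb in workload_mix
                     if gb > high_mem_threshold_gb) / total
    # Assumption: standard nodes cover everything at or below the threshold.
    standard_gb_per_core = max(
        [baseline_gb_per_core]
        + [gb for _, gb in workload_mix if gb <= high_mem_threshold_gb])
    return {
        "standard_node_gb": standard_gb_per_core * cores_per_node,
        "high_mem_share": high_share,
        "needs_high_mem_nodes": high_share >= high_mem_share_trigger,
    }

# Hypothetical mix: 60% at 4 GB/core, 25% at 8 GB/core, 15% at 32 GB/core.
plan = plan_memory([(0.60, 4), (0.25, 8), (0.15, 32)])
print(plan)
```

For this mix the standard nodes come out at 512 GB (8 GB/core on 64 cores), and the 15% share above 16 GB/core crosses the 10% trigger for dedicated high-memory nodes.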

NUMA Impact on Memory Bandwidth

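Modern multi-socket servers have NUMA (Non-Uniform Memory Access) topology. On AMD EPYC 9654, the number of NUMA domains per socket depends on BIOS settings: the NPS (nodes per socket) option exposes 1, 2, or 4 domains, and enabling the L3-cache-as-NUMA option exposes one domain per CCD, which gives 12 per socket. Processes that access memory in a remote NUMA domain pay a 30–50% bandwidth penalty.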

# Check NUMA topology
numactl --hardware
lscpu | grep NUMA

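# NUMA-aware MPI binding example (2× EPYC 9654, one NUMA domain per CCD, 12 per socket)
# 24 MPI ranks per node, one per NUMA domain, 8 cores each (192 cores / 24 ranks)
#SBATCH --ntasks-per-node=24
#SBATCH --cpus-per-task=8

# Since Slurm 22.05, srun does not inherit --cpus-per-task from sbatch, so pass it explicitly
srun --cpus-per-task="$SLURM_CPUS_PER_TASK" --cpu-bind=cores ./simulation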

Applications that ignore NUMA topology on EPYC can see 30–50% lower memory bandwidth than expected. This is a frequent cause of underwhelming benchmark results on high-NUMA systems.
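The rank layout for a given node can be derived rather than hand-computed. The helper below is a hypothetical sketch (numa_rank_layout is not part of any tool); it assumes one MPI rank per NUMA domain and that the cores-per-domain value has been read from numactl --hardware on the target node.

```python
def numa_rank_layout(sockets, cores_per_socket, cores_per_domain):
    """Return (ranks_per_node, cpus_per_rank) for one MPI rank per NUMA domain."""
    total_cores = sockets * cores_per_socket
    if total_cores % cores_per_domain:
        raise ValueError("cores per domain must divide the node's core count")
    return total_cores // cores_per_domain, cores_per_domain

# Dual-socket EPYC 9654 with one NUMA domain per 8-core CCD.
ranks, cpus = numa_rank_layout(sockets=2, cores_per_socket=96, cores_per_domain=8)
print(f"#SBATCH --ntasks-per-node={ranks}")
print(f"#SBATCH --cpus-per-task={cpus}")

# Same node with 4 NUMA domains per socket (24 cores per domain).
ranks_nps4, cpus_nps4 = numa_rank_layout(2, 96, 24)
```

The first call yields 24 ranks of 8 cores; with 24-core domains the same node takes 8 ranks of 24 cores, typically with threads inside each rank.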

3-Tier Storage Architecture

Tier	Technology	Capacity	Throughput	Use Case
Hot (scratch)	NVMe SSD parallel FS	100–500 TB	50–500 GB/s	Active job I/O
Warm (project)	HDD parallel FS	500 TB–5 PB	5–50 GB/s	Group/user data
Cold (archive)	Tape or object storage	5–50 PB	0.5–5 GB/s	Long-term retention

Sizing scratch storage:

scratch_capacity = concurrent_jobs × max_job_data_size × 3
                 (3x for input + output + temporary files)

Example: 200 concurrent jobs × 2 TB/job × 3 = 1.2 PB minimum scratch

For genomics and seismic workloads where single jobs may generate 10–100 TB, scratch sizing is the dominant design decision.
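The scratch formula is simple enough to keep as a small function and rerun as job mixes change. A minimal sketch follows; scratch_capacity_tb is a hypothetical name, and the second call uses an assumed genomics-style mix of 20 concurrent jobs at 50 TB each, chosen only to show how quickly large per-job data dominates.

```python
def scratch_capacity_tb(concurrent_jobs, max_job_data_tb, copies=3):
    """Scratch capacity in TB; copies=3 covers input, output, and temporary files."""
    return concurrent_jobs * max_job_data_tb * copies

general = scratch_capacity_tb(200, 2)     # the example above
genomics = scratch_capacity_tb(20, 50)    # assumed large-data job mix
print(f"general: {general} TB, genomics: {genomics} TB")
```

The first case gives the 1.2 PB from the example; the assumed genomics mix needs 3 PB with a tenth as many concurrent jobs.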

Network Technology and Bandwidth

Technology	Bandwidth/Port	Latency	Best For
InfiniBand NDR	400 Gb/s	~0.6 µs	Large GPU clusters, tight MPI
InfiniBand HDR	200 Gb/s	~1 µs	Standard HPC, medium clusters
RoCE v2 (100 GbE)	100 Gb/s	~3 µs	Cost-sensitive, existing Ethernet
25 GbE Ethernet	25 Gb/s	~30 µs	Storage network, management
10 GbE Ethernet	10 Gb/s	~50 µs	Management only

Fat-tree oversubscription: A 1:1 (non-blocking) fat-tree requires equal uplink and downlink bandwidth at each aggregation tier. This is the maximum cost point. For most HPC workloads, 2:1 oversubscription is acceptable; only the tightest MPI workloads require non-blocking.

Non-blocking two-tier fat-tree (N nodes, each at 200 Gb/s):
  Leaf tier: N ÷ (ports_per_switch ÷ 2) leaf switches (half the ports face nodes, half are uplinks)
  Uplinks needed: N links, N × 200 Gb/s total (equal to the downlink bandwidth)
  Core switches: N ÷ ports_per_core_switch
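Switch counts for a given oversubscription ratio can be estimated with a short function. This is a rough sketch, assuming a two-tier leaf/spine fat-tree built from identical switches in which every leaf splits its ports between node downlinks and uplinks; two_tier_fat_tree is a hypothetical helper and ignores cabling symmetry and rail layouts.

```python
import math

def two_tier_fat_tree(nodes, switch_ports=40, oversubscription=1.0):
    """Estimate leaf and spine switch counts for a two-tier fat-tree."""
    # Split each leaf so that downlinks / uplinks does not exceed the ratio.
    down = math.floor(switch_ports * oversubscription / (1 + oversubscription))
    up = switch_ports - down
    leaves = math.ceil(nodes / down)
    if leaves > switch_ports:
        raise ValueError("too many nodes for two tiers; a third tier is needed")
    uplinks = leaves * up
    spines = math.ceil(uplinks / switch_ports)
    return {"leaf_switches": leaves, "uplinks": uplinks, "spine_switches": spines}

non_blocking = two_tier_fat_tree(400)                       # 1:1
oversubscribed = two_tier_fat_tree(400, oversubscription=2.0)  # about 2:1
print(non_blocking, oversubscribed)
```

For 400 nodes on 40-port switches, the non-blocking design needs 20 leaf and 10 spine switches, while the roughly 2:1 design needs 16 and 6, which is where the cost difference comes from.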

GPU Capacity Planning

GPU capacity planning adds two dimensions: GPU memory and inter-GPU bandwidth.

GPU memory per node:

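min_GPU_memory ≥ parameter_count × bytes_per_parameter   (weights only)

Activations, KV cache, and (for training) gradients and optimizer state come on top of this.

LLaMA-3 70B in FP16: 70B × 2 bytes = 140 GB → 2× H100 80GB minimum for the weights alone
GPT-4 class (assumed 1.8T parameters) in FP8: 1.8T × 1 byte = 1.8 TB
  → at least 23× H100 80GB for the weights, so the model must be sharded across nodes
    (for training, e.g. ZeRO Stage 3, optionally with CPU offloading)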
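The weight-memory arithmetic can be checked with a short helper. This sketch counts model weights only and deliberately ignores activations, KV cache, gradients, and optimizer state, so real deployments need more; the function names and the BYTES_PER_PARAM table are illustrative, and the 1.8T parameter count is the same assumption used above.

```python
import math

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1}

def weight_memory_gb(params, precision):
    """Memory for model weights alone, in GB (10^9 bytes)."""
    return params * BYTES_PER_PARAM[precision] / 1e9

def min_gpus_for_weights(params, precision, gpu_memory_gb=80):
    """Smallest GPU count whose combined memory holds the weights."""
    return math.ceil(weight_memory_gb(params, precision) / gpu_memory_gb)

llama_gb = weight_memory_gb(70_000_000_000, "fp16")
llama_gpus = min_gpus_for_weights(70_000_000_000, "fp16")
large_gpus = min_gpus_for_weights(1_800_000_000_000, "fp8")
print(f"70B FP16: {llama_gb:.0f} GB, {llama_gpus} GPUs; 1.8T FP8: {large_gpus} GPUs")
```

The 70B model needs 140 GB and at least 2 GPUs at 80 GB each; the assumed 1.8T model needs at least 23 GPUs before any working memory is counted.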

NVLink bandwidth for tensor parallelism:

Within a single DGX H100 node (8× H100 with NVLink4): 900 GB/s per GPU (bidirectional), all-to-all through NVSwitch
Across nodes via InfiniBand NDR: 400 Gb/s = 50 GB/s per port

Tensor parallelism requires NVLink-grade bandwidth. Only implement tensor parallelism within a node (or DGX pod with NVLink switch). Pipeline parallelism works across nodes via InfiniBand.

Queue Analysis with Little’s Law

Little’s Law from queuing theory provides a sanity check for cluster sizing:

L = λ × W

Where:
  L = average number of jobs in system (running + queued)
  λ = job arrival rate (jobs/hour)
  W = average time in system (wait time + run time)

Example:
  λ = 50 jobs/hour
  W = 8h average run + 2h average wait = 10 hours
  L = 50 × 10 = 500 concurrent jobs in system

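If the average job requires 128 cores:
  Running jobs = λ × run time = 50 × 8 = 400
  Cores busy in steady state = 400 × 128 = 51,200 cores
  Queued jobs = λ × wait time = 50 × 2 = 100

Little's Law relates averages in a stable system; it does not predict wait time on its own. A 32,000-core cluster can run at most 250 such jobs at once, which is 250 ÷ 8h = 31.25 jobs/hour of throughput. That is below the 50 jobs/hour arrival rate, so the queue would grow without bound and L = 500 with W = 10h could not be a steady state on that cluster. Sustaining this load needs at least 51,200 cores, and about 73,000 cores at the 70% target utilization used in the core formula example to keep wait times short.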
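The sanity check can be scripted so it runs against fresh accounting data. The sketch below assumes uniform jobs (same core count and run time), which real workloads are not; littles_law and max_throughput are hypothetical helpers, and the split into running and queued jobs applies Little's Law separately to run time and wait time.

```python
def littles_law(arrival_rate_per_hour, run_hours, wait_hours, cores_per_job):
    """Average jobs running, queued, and in system, plus cores kept busy."""
    running = arrival_rate_per_hour * run_hours
    queued = arrival_rate_per_hour * wait_hours
    return {
        "in_system": running + queued,
        "running": running,
        "queued": queued,
        "busy_cores": running * cores_per_job,
    }

def max_throughput(total_cores, cores_per_job, run_hours):
    """Jobs per hour the cluster can complete when it is always full."""
    return (total_cores // cores_per_job) / run_hours

state = littles_law(50, run_hours=8, wait_hours=2, cores_per_job=128)
capacity = max_throughput(32_000, cores_per_job=128, run_hours=8)
stable = capacity >= 50
print(state, f"throughput {capacity} jobs/h, stable: {stable}")
```

With 50 jobs/hour arriving, 400 jobs run and 100 wait on average, keeping 51,200 cores busy. A 32,000-core cluster tops out at 31.25 jobs/hour, so it cannot keep up with that arrival rate.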

Phased Growth Planning

Phased procurement reduces upfront capital while maintaining expansion flexibility:

Phase	Trigger	Action
Initial deployment	Day 1	60% of 3-year projected capacity
Phase 2 expansion	Cluster utilization > 85% for 30 days	Add compute nodes to reach 80% of target
Phase 3 expansion	Cluster utilization > 85% for 30 days again	Add remaining nodes + storage expansion
Technology refresh	Year 4–5	Replace oldest nodes with next-generation hardware

The key constraint for expansion planning is network headroom: a 40-port InfiniBand switch with 32 nodes attached has only 8 free ports, and any of those used as uplinks to other switches are not available for future nodes. Design the switch tier to accommodate the full planned capacity at initial deployment.
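The utilization trigger in the table can be checked automatically from daily utilization figures. This is a minimal sketch: expansion_triggered is a hypothetical function, the daily values are assumed to be fractions between 0 and 1 derived from scheduler accounting, and "for 30 days" is interpreted here as 30 consecutive days.

```python
def expansion_triggered(daily_utilization, threshold=0.85, window_days=30):
    """True if utilization stayed above the threshold for window_days in a row."""
    streak = 0
    for value in daily_utilization:
        # Any day at or below the threshold resets the run.
        streak = streak + 1 if value > threshold else 0
        if streak >= window_days:
            return True
    return False

sustained = expansion_triggered([0.50] * 10 + [0.90] * 30)
interrupted = expansion_triggered([0.90] * 29 + [0.80] + [0.90] * 29)
print(sustained, interrupted)
```

A single day below the threshold resets the count, so short bursts of heavy use do not trigger a purchase; a rolling-average rule would be a less strict alternative.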

Capacity planning is not a one-time exercise; it is an ongoing discipline. Queue metrics from SLURM (sreport, sacct) provide the empirical data to validate and refine the model over time.
