
HPC Capacity Planning: Core Formula, Processor Comparison, Storage Tiers, and Network

HPC capacity planning methodology: core capacity formula, Intel vs AMD vs ARM processor comparison, memory sizing by workload type, NUMA impact, 3-tier storage architecture, network technology table, fat-tree oversubscription, GPU capacity planning, Little's Law for queue analysis, and phased growth planning.

Capacity planning for HPC is the process of matching infrastructure investment to workload requirements — both current and projected. Unlike web applications where horizontal scaling is immediate, HPC hardware has 18–24 month procurement cycles. Getting capacity wrong means either chronic queue backlogs or stranded capital in idle servers.

Core Capacity Formula

The foundational calculation for compute node count:

Required cores = (avg_jobs_per_day × avg_cores_per_job × avg_walltime_hours) 
                 ÷ 24 
                 × utilization_headroom_factor

utilization_headroom_factor = 1 / target_utilization

Example:
  200 jobs/day × 128 cores/job × 6h avg walltime ÷ 24h = 6,400 cores
  At target 70% utilization: 6,400 ÷ 0.7 = ~9,143 cores
  With 64-core nodes: 143 nodes minimum

Add 20–30% for growth over the planned hardware lifetime (typically 5 years). The actual number to procure should account for the expected year-2 demand, not just year-1.
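The formula and the growth allowance can be checked with a few lines of Python. This is a minimal sketch that reuses the example figures above. The function names and the 25% growth value (the midpoint of the 20–30% range) are illustrative assumptions.

```python
import math

def required_cores(jobs_per_day, cores_per_job, walltime_hours, target_utilization):
    """Average busy cores, scaled up by the utilization headroom factor."""
    busy_cores = jobs_per_day * cores_per_job * walltime_hours / 24
    return busy_cores / target_utilization

def required_nodes(cores, cores_per_node, growth=0.0):
    """Whole nodes needed after applying an optional growth allowance."""
    return math.ceil(cores * (1 + growth) / cores_per_node)

cores = required_cores(200, 128, 6, 0.70)
print(round(cores))                            # cores at 70% target utilization
print(required_nodes(cores, 64))               # nodes without a growth allowance
print(required_nodes(cores, 64, growth=0.25))  # nodes with an assumed 25% growth
```

With these inputs the sketch reproduces the 143-node minimum and gives 179 nodes once the assumed 25% growth is included.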

Processor Architecture Comparison

| Architecture | Best For | Cores/Socket | Memory Channels | Key Advantage |
|---|---|---|---|---|
| AMD EPYC 9654 (Genoa) | MPI simulation, genomics | 96 | 12-channel DDR5 | Core density, memory bandwidth |
| AMD EPYC 9755 (Turin) | High-throughput HPC | 128 | 12-channel DDR5 | Highest core count among the AMD parts listed |
| Intel Xeon 6 Granite Rapids | MKL-optimized codes | 128 | 12-channel DDR5 | Single-thread perf, MKL ecosystem |
| Intel Xeon 6 Sierra Forest | High core count, moderate memory | 144 | 8-channel DDR5 | Power efficiency at scale |
| Ampere Altra Max | Power-constrained, cloud bursting | 128 | 8-channel DDR4 | Lowest power per core |
| Fujitsu A64FX | Bandwidth-bound simulation | 48 | HBM2 (on-package) | 1 TB/s memory bandwidth |

For new purchases in 2025–2026:

  • General HPC simulation (MPI-heavy): AMD EPYC 9004 (Genoa) or 9005 (Turin) series
  • MKL-dependent codes (ANSYS, MATLAB, Gaussian): Intel Xeon 6 Granite Rapids
  • Power-constrained environments: Ampere Altra
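Memory channel count matters because it sets the theoretical peak bandwidth per socket. The sketch below shows the arithmetic, assuming DDR5-4800 DIMMs and 8 bytes per transfer per channel. The speed grade is an illustrative assumption; actual platforms may run faster or slower memory, and sustained bandwidth is always below this peak.

```python
def peak_bandwidth_gbs(channels, mt_per_s, bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth per socket in GB/s."""
    return channels * mt_per_s * bytes_per_transfer / 1000

twelve_channel = peak_bandwidth_gbs(12, 4800)  # assumed DDR5-4800
eight_channel = peak_bandwidth_gbs(8, 4800)    # assumed DDR5-4800

print(twelve_channel, eight_channel)
print(round(twelve_channel / 96, 1))  # GB/s per core on a 96-core socket
```

Dividing by the core count gives a quick bandwidth-per-core figure, which is often the more useful number for bandwidth-bound MPI codes.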

Memory Sizing by Workload Type

Memory requirements vary by more than an order of magnitude across HPC applications:

| Workload | Memory per Core | Notes |
|---|---|---|
| Tight MPI simulation (OpenFOAM, NAMD) | 4–8 GB | Distributed memory limits per-node requirement |
| Shared-memory simulation (ANSYS Mechanical) | 8–32 GB | Entire model may reside in one node’s memory |
| Whole-genome assembly (SPAdes) | 64–512 GB (per job) | Memory-intensive graph algorithms |
| Large-scale CFD (Fluent, STAR-CCM+) | 8–16 GB | Mesh size determines memory |
| Monte Carlo simulation | 1–4 GB | Embarrassingly parallel, small per-process footprint |
| Deep learning training | 8–16 GB | GPU memory usually more critical |
| Seismic processing | 4–8 GB | Data streaming reduces per-core requirement |

Rule of thumb: Start with 4 GB per core as the minimum. If even 10% of your workload requires more than 16 GB/core, add dedicated high-memory nodes.
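The rule of thumb is easy to apply to historical job data. The sketch below assumes you have one memory-per-core figure per job and weights every job equally. Weighting by core-hours instead would be an equally reasonable reading of "10% of your workload". The function name and sample data are hypothetical.

```python
def needs_high_memory_nodes(gb_per_core_by_job, threshold_gb=16, share=0.10):
    """True if at least `share` of jobs need more than `threshold_gb` per core."""
    if not gb_per_core_by_job:
        return False
    heavy = sum(1 for gb in gb_per_core_by_job if gb > threshold_gb)
    return heavy / len(gb_per_core_by_job) >= share

# Hypothetical job mixes: memory per core (GB) for each job
mostly_light = [4] * 95 + [32] * 5
mixed = [4] * 85 + [8] * 5 + [32] * 10

print(needs_high_memory_nodes(mostly_light))
print(needs_high_memory_nodes(mixed))
```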

NUMA Impact on Memory Bandwidth

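Modern multi-socket servers have NUMA (Non-Uniform Memory Access) topology. How many NUMA domains an AMD EPYC 9654 exposes depends on BIOS settings: 1, 2, or 4 per socket with NPS1, NPS2, or NPS4, or one per CCD (12 per socket) when the L3-as-NUMA option is enabled. Processes that access memory in a remote NUMA domain pay a 30–50% bandwidth penalty.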

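# Check NUMA topology (run interactively on a compute node)
numactl --hardware
lscpu | grep NUMA

# NUMA-aware MPI binding example (2× EPYC 9654, 12 NUMA domains per socket)
# In a job script, the #SBATCH lines must come before any command
# 24 MPI ranks per node, one per NUMA domain
#SBATCH --ntasks-per-node=24
#SBATCH --cpus-per-task=8  # 8 cores per NUMA domain (192 cores / 24 domains)

# Bind each rank to its own block of cores. --cpus-per-task is passed explicitly
# because newer Slurm releases do not propagate it from sbatch to srun.
srun --cpus-per-task=$SLURM_CPUS_PER_TASK --cpu-bind=cores ./simulation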

Applications that ignore NUMA topology on EPYC can see 30–50% lower memory bandwidth than expected. This is a frequent cause of underwhelming benchmark results on high-NUMA systems.
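The rank and core counts for a one-rank-per-NUMA-domain layout follow directly from the node topology. The helper below is a hypothetical sketch that derives the two SLURM values from socket, core, and NUMA-domain counts. It assumes cores divide evenly across domains.

```python
def numa_layout(sockets, cores_per_socket, numa_per_socket):
    """One MPI rank per NUMA domain, with all cores of the domain given to that rank."""
    domains = sockets * numa_per_socket
    total_cores = sockets * cores_per_socket
    if total_cores % domains:
        raise ValueError("cores do not divide evenly across NUMA domains")
    return {"ntasks_per_node": domains, "cpus_per_task": total_cores // domains}

print(numa_layout(2, 96, 12))  # dual-socket, 96 cores each, 12 domains per socket
print(numa_layout(2, 96, 4))   # same node with 4 domains per socket
```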

3-Tier Storage Architecture

| Tier | Technology | Capacity | Throughput | Use Case |
|---|---|---|---|---|
| Hot (scratch) | NVMe SSD parallel FS | 100–500 TB | 50–500 GB/s | Active job I/O |
| Warm (project) | HDD parallel FS | 500 TB–5 PB | 5–50 GB/s | Group/user data |
| Cold (archive) | Tape or object storage | 5–50 PB | 0.5–5 GB/s | Long-term retention |

Sizing scratch storage:

scratch_capacity = concurrent_jobs × max_job_data_size × 3
                 (3x for input + output + temporary files)

Example: 200 concurrent jobs × 2 TB/job × 3 = 1.2 PB minimum scratch

For genomics and seismic workloads where single jobs may generate 10–100 TB, scratch sizing is the dominant design decision.
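The scratch formula is simple enough to script for what-if comparisons. This sketch reuses the 3× multiplier from above. The second call uses hypothetical figures for a small number of very large jobs.

```python
def scratch_capacity_tb(concurrent_jobs, max_job_data_tb, multiplier=3):
    """Scratch capacity in TB: input + output + temporary files per job."""
    return concurrent_jobs * max_job_data_tb * multiplier

print(scratch_capacity_tb(200, 2))   # the example above, in TB
print(scratch_capacity_tb(20, 50))   # hypothetical: 20 jobs at 50 TB each
```

Even 20 concurrent jobs at 50 TB each call for more scratch than 200 jobs at 2 TB, which is why large-output workloads dominate the sizing.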

Network Technology and Bandwidth

| Technology | Bandwidth/Port | Latency | Best For |
|---|---|---|---|
| InfiniBand NDR | 400 Gb/s | ~0.6 µs | Large GPU clusters, tight MPI |
| InfiniBand HDR | 200 Gb/s | ~1 µs | Standard HPC, medium clusters |
| RoCE v2 (100 GbE) | 100 Gb/s | ~3 µs | Cost-sensitive, existing Ethernet |
| 25 GbE Ethernet | 25 Gb/s | ~30 µs | Storage network, management |
| 10 GbE Ethernet | 10 Gb/s | ~50 µs | Management only |

Fat-tree oversubscription: A 1:1 (non-blocking) fat-tree requires equal uplink and downlink bandwidth at each aggregation tier. This is the maximum cost point. For most HPC workloads, 2:1 oversubscription is acceptable; only the tightest MPI workloads require non-blocking.

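Non-blocking two-tier fat-tree (N nodes, each at 200 Gb/s):
  Leaf tier: N ÷ (ports_per_switch / 2) leaf switches
             (half of each leaf's ports face nodes, half face uplinks)
  Uplinks needed: N links, N × 200 Gb/s total
  Core switches: N ÷ ports_per_core_switch
  (at 2:1 oversubscription the uplink count drops to N/2)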
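For a two-tier design, switch counts can be estimated from port arithmetic alone. The sketch below assumes one link per node, the same switch radix at leaf and core (spine) tier, and a leaf port split of downlinks to uplinks equal to the oversubscription ratio. It is a port-count estimate only: it does not check cabling symmetry, and the function and key names are hypothetical.

```python
import math

def fat_tree(nodes, radix, oversubscription=1.0):
    """Estimate switch counts for a two-tier fat-tree.

    nodes: compute nodes, one link each
    radix: ports per switch (same model assumed for both tiers)
    oversubscription: downlink:uplink ratio per leaf (1.0 = non-blocking)
    """
    # Split each leaf's ports between node-facing downlinks and uplinks
    down = math.floor(radix * oversubscription / (1 + oversubscription))
    up = math.ceil(down / oversubscription)
    leaves = math.ceil(nodes / down)
    if leaves > radix:
        # Every spine switch needs a port for every leaf switch
        raise ValueError("too many leaf switches for two tiers; a third tier is needed")
    uplinks = leaves * up
    spines = math.ceil(uplinks / radix)  # minimum by port count
    return {"leaf_switches": leaves, "uplinks": uplinks, "spine_switches": spines}

print(fat_tree(800, 40))                        # non-blocking
print(fat_tree(800, 40, oversubscription=2.0))  # 2:1 oversubscribed
```

For 800 nodes on 40-port switches, moving from non-blocking to 2:1 cuts the uplink count roughly in half and removes about a quarter of the leaf switches and nearly half of the spine switches in this estimate.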

GPU Capacity Planning

GPU capacity planning adds two dimensions: GPU memory and inter-GPU bandwidth.

GPU memory per node:

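min_GPU_memory_per_node ≥ parameter_count × bytes_per_parameter   (weights only)

LLaMA-3 70B in FP16: 70B × 2 bytes = 140 GB → 2× H100 80GB minimum
  (weights only; KV cache and activations need additional headroom)
GPT-4 class (assumed 1.8T parameters) in FP8: 1.8T × 1 byte = 1.8 TB
  → exceeds any single node; the weights alone span at least 23× 80 GB GPUs,
    so this requires ZeRO Stage 3 + CPU offloading on a GPU cluster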
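The same arithmetic in Python, as a rough sketch. It counts weights only for the two examples above. The 16 bytes per parameter used for the training estimate is the commonly cited figure for mixed-precision Adam (FP16 weights and gradients plus FP32 master weights and two optimizer states) and is an assumption here, not a number from this article. Activations are ignored.

```python
import math

def weights_gb(params_billion, bytes_per_param):
    """Memory for model state in GB (1e9 parameters × bytes, over 1e9 bytes per GB)."""
    return params_billion * bytes_per_param

def min_gpus(params_billion, bytes_per_param, gpu_gb=80):
    """Smallest GPU count whose combined memory holds the model state."""
    return math.ceil(weights_gb(params_billion, bytes_per_param) / gpu_gb)

print(min_gpus(70, 2))     # 70B parameters, FP16 weights only
print(min_gpus(1800, 1))   # assumed 1.8T parameters, FP8 weights only
print(min_gpus(70, 16))    # 70B training state at an assumed 16 bytes/parameter
```

Training the 70B model needs roughly seven times the GPU count that inference-style weight storage suggests, so plan training capacity from the larger figure.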

NVLink bandwidth for tensor parallelism:

Within a single DGX H100 node (8× H100 with NVLink4): 900 GB/s per GPU (bidirectional, all-to-all)
Across nodes via InfiniBand NDR: 400 Gb/s = 50 GB/s per port

Tensor parallelism requires NVLink-grade bandwidth. Only implement tensor parallelism within a node (or DGX pod with NVLink switch). Pipeline parallelism works across nodes via InfiniBand.
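The gap between the two fabrics is easier to appreciate as a transfer time. This sketch compares the quoted peak figures for moving a hypothetical 1 GB tensor shard. Real transfers see lower effective bandwidth on both fabrics, so treat the result as a ratio rather than a prediction.

```python
NVLINK_GBS = 900.0    # per-GPU NVLink4 figure quoted above
IB_NDR_GBS = 400 / 8  # 400 Gb/s per port converted to GB/s

def transfer_ms(size_gb, bandwidth_gbs):
    """Ideal transfer time in milliseconds at peak bandwidth."""
    return size_gb / bandwidth_gbs * 1000

ratio = NVLINK_GBS / IB_NDR_GBS
print(ratio)
print(round(transfer_ms(1.0, NVLINK_GBS), 2), transfer_ms(1.0, IB_NDR_GBS))
```

Tensor parallelism exchanges activations at every layer, so an 18× slower link lands directly on the critical path. Pipeline parallelism communicates far less often and tolerates the slower fabric.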

Queue Analysis with Little’s Law

Little’s Law from queuing theory provides a sanity check for cluster sizing:

L = λ × W

Where:
  L = average number of jobs in system (running + queued)
  λ = job arrival rate (jobs/hour)
  W = average time in system (wait time + run time)

Example:
  λ = 50 jobs/hour
  W = 8h average run + 2h average wait = 10 hours
  L = 50 × 10 = 500 concurrent jobs in system

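If the average job requires 128 cores:
  Running jobs = λ × run time = 50 × 8 = 400
  Cores busy on average = 400 × 128 = 51,200
  Cores to run all 500 jobs in the system at once = 500 × 128 = 64,000 (no queueing)

A 32,000-core cluster can run at most 250 such jobs at a time, so it completes at most 250 ÷ 8h ≈ 31 jobs/hour. That is below the arrival rate of 50 jobs/hour, so the queue grows without bound and no stable wait time exists. Sustaining this workload takes at least 51,200 cores, and keeping waits under 1h takes headroom beyond that, which is consistent with a roughly 2× expansion to 64,000 cores.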
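Little's Law also applies to each stage separately (queue and execution), which makes it a handy stability check. The sketch below uses the example figures and assumes every job needs 128 cores and runs for 8 hours. The helper names are hypothetical.

```python
def littles_law(arrival_rate, time_in_stage):
    """Average number of jobs in a stage: L = λ × W."""
    return arrival_rate * time_in_stage

def max_throughput(total_cores, cores_per_job, run_hours):
    """Highest sustainable completion rate in jobs/hour."""
    return (total_cores // cores_per_job) / run_hours

jobs_in_system = littles_law(50, 10)  # running + queued
jobs_running = littles_law(50, 8)     # execution stage only
jobs_queued = littles_law(50, 2)      # queue stage only
busy_cores = jobs_running * 128

print(jobs_in_system, jobs_running, jobs_queued, busy_cores)
print(max_throughput(32_000, 128, 8))  # compare against λ = 50 jobs/hour
```

If the maximum throughput is below the arrival rate, the queue cannot reach a steady state and wait times keep growing until more capacity is added or demand falls.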

Phased Growth Planning

Phased procurement reduces upfront capital while maintaining expansion flexibility:

| Phase | Trigger | Action |
|---|---|---|
| Initial deployment | Day 1 | 60% of 3-year projected capacity |
| Phase 2 expansion | Queue utilization > 85% for 30 days | Add compute nodes to reach 80% of target |
| Phase 3 expansion | Queue utilization > 85% again | Add remaining nodes + storage expansion |
| Technology refresh | Year 4–5 | Replace oldest nodes with next-generation hardware |
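The expansion trigger can be evaluated automatically from daily utilization figures. This sketch reads "for 30 days" as 30 consecutive days, which is an assumption. A rolling average over the window would be a softer alternative. The function name and sample series are hypothetical.

```python
def expansion_triggered(daily_utilization, threshold=0.85, window_days=30):
    """True once utilization exceeds the threshold on `window_days` consecutive days."""
    streak = 0
    for value in daily_utilization:
        streak = streak + 1 if value > threshold else 0
        if streak >= window_days:
            return True
    return False

sustained = [0.90] * 30
interrupted = [0.90] * 29 + [0.80] + [0.90] * 29

print(expansion_triggered(sustained))
print(expansion_triggered(interrupted))
```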

The key constraint for expansion planning is network headroom. A 40-port InfiniBand switch with 32 nodes attached has only 8 free ports left for future nodes and any uplinks. Design the switch tier to accommodate the full planned capacity at initial deployment.


Capacity planning is not a one-time exercise — it is an ongoing discipline. Queue metrics from SLURM (sreport, sacct) provide the empirical data to validate and refine the model over time. Contact Mevasis for capacity planning consulting and HPC cluster sizing analysis.