HPC Cluster Sizing: Workload Analysis, Node Count, Storage, and Network Planning
How to size an HPC cluster: workload analysis methodology, compute node count, memory requirements, storage tier design, network capacity, and redundancy planning.
The most expensive mistake in HPC procurement is buying the wrong size cluster — either too small for the workload (leading to chronic queue backlogs) or too large (wasting capital on idle capacity). Proper sizing starts with workload analysis, not hardware catalogs.
Step 1: Workload Analysis
Before specifying a single server, collect data on what will actually run on the cluster:
Application profile:
- Which codes will run? (CFD: OpenFOAM, ANSYS; genomics: GATK, BWA; ML: PyTorch, TensorFlow)
- Are they MPI-parallel, GPU-accelerated, single-threaded, or memory-bound?
- What are typical input/output data sizes?
User and throughput requirements:
- How many concurrent users?
- Expected jobs per day and average job size (cores × hours)?
- Target queue wait time (30 min? 4 hours? overnight?)
- Peak vs. average demand ratio?
Growth trajectory:
- Expected growth in user count and compute demand over 3–5 years?
- Budget for initial deployment vs. expansion?
Without this data, cluster sizing is guesswork. Even rough estimates from pilot surveys or similar installations at peer institutions are better than none.
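As a rough illustration of turning raw job records into the quantities listed above, the sketch below summarizes a handful of records into jobs per day, average job size, and a peak-to-average ratio. The record layout and the sample values are hypothetical; real figures would come from scheduler accounting data or user surveys.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical job records: (submit_day, cores, walltime_hours).
jobs = [
    ("2024-03-01", 64, 8.0),
    ("2024-03-01", 128, 2.0),
    ("2024-03-02", 64, 12.0),
    ("2024-03-02", 32, 4.0),
    ("2024-03-02", 256, 6.0),
    ("2024-03-03", 64, 8.0),
]


def summarize(records):
    """Reduce per-job records to the averages used for sizing."""
    core_hours_by_day = defaultdict(float)
    for day, cores, hours in records:
        core_hours_by_day[day] += cores * hours
    daily = list(core_hours_by_day.values())
    return {
        "jobs_per_day": len(records) / len(daily),
        "avg_cores_per_job": mean(c for _, c, _ in records),
        "avg_walltime_hours": mean(h for _, _, h in records),
        "avg_core_hours_per_day": mean(daily),
        # Busiest day relative to the average day.
        "peak_to_average": max(daily) / mean(daily),
    }


summary = summarize(jobs)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Core-hours per day is the most direct input to the capacity calculation, since the product of two averages (cores and walltime) can differ from the average core-hours per job.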
Step 2: Compute Node Count
Once you know the workload profile, calculate the required compute capacity:
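Required total cores = (jobs_per_day × avg_cores_per_job × avg_walltime_hours) / 24 × utilization_target_factor
Here utilization_target_factor is a headroom multiplier, typically 1.2 to 1.5, which corresponds to a target utilization of roughly 67–83%.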
For example: 100 jobs/day × 64 cores/job × 8 hours average walltime / 24 hours × 1.3 = ~2,773 cores. With 64-core nodes, that is 44 nodes minimum. Add nodes for the expected growth rate.
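The arithmetic in this example can be reproduced with a few lines of Python. This is a minimal sketch; the function and parameter names are illustrative, and `headroom` plays the role of the utilization target factor.

```python
import math


def required_cores(jobs_per_day, avg_cores_per_job, avg_walltime_hours, headroom=1.3):
    """Average cores kept busy over a 24-hour day, scaled by a headroom factor."""
    core_hours_per_day = jobs_per_day * avg_cores_per_job * avg_walltime_hours
    return core_hours_per_day / 24 * headroom


def required_nodes(total_cores, cores_per_node):
    """Round up, since a partial node cannot be purchased."""
    return math.ceil(total_cores / cores_per_node)


cores = required_cores(100, 64, 8, headroom=1.3)
nodes = required_nodes(cores, 64)
print(f"{cores:.0f} cores, {nodes} nodes")
```

Rerunning the calculation with headroom values of 1.2 and 1.5 gives a quick sense of how sensitive the node count is to the utilization target.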
Node count also determines network topology. A single 48-port switch can connect up to 48 compute nodes with no second tier. Beyond that, a two-tier fat-tree is needed, which changes the cost structure significantly, so plan node counts in multiples of the ports available per switch.
Step 3: Memory per Node
Memory requirements are workload-specific:
| Workload Type | Memory per Core |
|---|---|
| MPI parallel CFD / FEM | 4–8 GB |
| Molecular dynamics (GROMACS, NAMD) | 4–8 GB |
| Genome assembly (whole-genome) | 16–64 GB |
| Deep learning (training) | 8–16 GB |
| Monte Carlo simulation | 2–4 GB |
| CFD with large mesh (Fluent, OpenFOAM) | 8–16 GB |
Modern dual-socket servers ship with 256 GB to 2 TB of RAM. For most HPC workloads, 256–512 GB per node is the sweet spot. If even a small fraction of jobs require > 512 GB, add dedicated high-memory nodes rather than populating all nodes with expensive large DIMMs.
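To see how the per-core figures translate into per-node memory, the sketch below multiplies the table ranges by a node's core count and flags workloads whose upper bound exceeds a standard node. The dictionary keys are hypothetical labels for the table rows, and the 512 GB threshold is taken from the paragraph above.

```python
# Memory-per-core ranges in GB, copied from the table above.
MEM_PER_CORE_GB = {
    "mpi_cfd_fem": (4, 8),
    "molecular_dynamics": (4, 8),
    "genome_assembly": (16, 64),
    "deep_learning_training": (8, 16),
    "monte_carlo": (2, 4),
    "cfd_large_mesh": (8, 16),
}


def node_memory_range_gb(workload, cores_per_node):
    """Return the (low, high) memory requirement per node in GB."""
    low, high = MEM_PER_CORE_GB[workload]
    return low * cores_per_node, high * cores_per_node


def needs_high_memory_nodes(workload, cores_per_node, standard_node_gb=512):
    """True if the upper end of the range does not fit a standard node."""
    _, high = node_memory_range_gb(workload, cores_per_node)
    return high > standard_node_gb


for workload in MEM_PER_CORE_GB:
    low, high = node_memory_range_gb(workload, 64)
    flag = needs_high_memory_nodes(workload, 64)
    print(f"{workload}: {low}-{high} GB per 64-core node, high-memory nodes: {flag}")
```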
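NUMA topology matters: depending on BIOS settings, AMD EPYC 9004 (Genoa) can expose up to 12 NUMA domains per socket (one per CCD with the L3-as-NUMA option; NPS4 mode gives 4). Applications that are not NUMA-aware end up accessing remote memory at roughly half the bandwidth of local memory. Ensure MPI process binding matches the NUMA topology.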
Step 4: Storage Tier Design
HPC storage has three tiers with different performance and cost profiles:
Scratch (hot tier): Parallel filesystem (BeeGFS or Lustre) on NVMe SSDs. Used for active job I/O. Size: 50–100 TB per cluster is typical; more for genomics or seismic workloads. Throughput target: 10–50 GB/s aggregate.
Project/work (warm tier): Capacity storage on HDDs, parallel or NFS. Persistent user and group data. Size depends heavily on data retention policies — 500 TB to several PB.
Archive (cold tier): Tape library or object storage. Long-term retention at low cost per TB. Accessed infrequently; throughput is not critical.
Sizing rules of thumb:
- Scratch: 10× the size of a typical job’s largest dataset
- Project storage: 5 TB per active researcher (varies widely by domain)
- Archive: 2–3× project storage, grows 30–50% per year
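The rules of thumb above can be combined into a quick estimate. The following is a sketch only: the function name, the default archive multiple of 2.5×, the 40% growth rate, and the example inputs are assumptions picked from within the ranges given, not recommendations.

```python
def storage_estimate_tb(largest_dataset_tb, active_researchers,
                        archive_multiple=2.5, archive_growth=0.4, years=3):
    """Apply the scratch, project, and archive rules of thumb (all sizes in TB)."""
    scratch = 10 * largest_dataset_tb          # 10x a typical job's largest dataset
    project = 5 * active_researchers           # 5 TB per active researcher
    archive_now = archive_multiple * project   # 2-3x project storage
    # Compound growth of the archive over the planning horizon.
    archive_later = archive_now * (1 + archive_growth) ** years
    return {
        "scratch_tb": scratch,
        "project_tb": project,
        "archive_tb": archive_now,
        "archive_after_growth_tb": archive_later,
    }


estimate = storage_estimate_tb(largest_dataset_tb=8, active_researchers=100)
for tier, size in estimate.items():
    print(f"{tier}: {size:,.0f} TB")
```

Because the archive compounds year over year, the growth rate tends to dominate the long-term figure more than the initial multiple does.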
Step 5: Network Sizing
Network capacity must match compute and storage tiers:
MPI network (InfiniBand or RoCE):
- Bandwidth: 1:1 (non-blocking) is ideal; 2:1 oversubscription is acceptable for most HPC workloads
- HDR200 (200 Gb/s) per port for standard HPC; NDR400 (400 Gb/s) for large GPU clusters
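- Leaf switch count: N nodes / (switch_port_count / 2), rounded up, for a non-blocking two-tier fat-tree; spine switches are additional (about half the leaf count when all switches have the same port count)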
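As a rough sketch of the fat-tree arithmetic, the function below counts leaf and spine switches for a non-blocking two-tier design. It assumes every switch has the same, even port count and that each leaf splits its ports evenly between nodes and uplinks; the function name is illustrative.

```python
import math


def fat_tree_switches(endpoints, ports_per_switch):
    """Return (leaf_count, spine_count) for a non-blocking two-tier fat-tree."""
    if endpoints <= ports_per_switch:
        return 1, 0  # everything fits on a single switch, no spine tier
    down = ports_per_switch // 2  # node-facing ports per leaf; the rest are uplinks
    leaves = math.ceil(endpoints / down)
    if leaves > ports_per_switch:
        raise ValueError("too many endpoints for a two-tier fat-tree")
    # Every leaf uplink must terminate on a spine port.
    spines = math.ceil(leaves * down / ports_per_switch)
    return leaves, spines


print(fat_tree_switches(48, 48))  # fits on one switch
print(fat_tree_switches(72, 40))  # needs a leaf and spine tier
```

Crossing the single-switch boundary adds both leaf and spine switches at once, which is why node counts just above a port multiple are disproportionately expensive.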
Storage network:
- Must not share bandwidth with MPI network
- 25 GbE or 100 GbE from each storage server to a dedicated storage switch
- Aggregate storage network bandwidth should exceed parallel filesystem target throughput
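A quick way to check the last point is to compare the line-rate ceiling of the storage network with the filesystem throughput target. The sketch below ignores protocol overhead, and the server counts, link counts, and 50 GB/s target are hypothetical inputs.

```python
def storage_network_gbytes_per_s(servers, links_per_server, link_gbit_per_s):
    """Line-rate ceiling in GB/s (8 bits per byte, no protocol overhead)."""
    return servers * links_per_server * link_gbit_per_s / 8


def network_exceeds_target(servers, links_per_server, link_gbit_per_s,
                           target_gbytes_per_s):
    """True only if the network ceiling is strictly above the target."""
    ceiling = storage_network_gbytes_per_s(servers, links_per_server, link_gbit_per_s)
    return ceiling > target_gbytes_per_s


# Eight storage servers with two links each, against a 50 GB/s target.
print(storage_network_gbytes_per_s(8, 2, 25), network_exceeds_target(8, 2, 25, 50))
print(storage_network_gbytes_per_s(8, 2, 100), network_exceeds_target(8, 2, 100, 50))
```

Real throughput will sit below the line-rate figure, so a network that only just matches the target on paper should be treated as undersized.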
Management network:
- 1 GbE per node is sufficient
- Out-of-band IPMI/BMC on separate management switch
Step 6: Redundancy and Reliability Planning
Reliability requirements drive additional cost:
Management node HA: Active-passive slurmctld with shared storage. Without this, a management node failure halts job scheduling.
Storage redundancy: RAID within storage nodes (RAID-6 for HDDs, RAID-1 for OS drives). BeeGFS Buddy Mirroring or Lustre File Level Redundancy (FLR) mirroring for filesystem-level redundancy.
Network redundancy: Dual uplinks from each leaf switch to core switches. Dual HCAs on GPU nodes (essential for InfiniBand-dependent AI workloads).
Power: Dual PSU per server. UPS covering at minimum the management nodes and storage. Generator for extended outages if uptime SLAs require it.
Sizing Example: 64-Node Research Cluster
| Component | Specification | Count | Notes |
|---|---|---|---|
| Login nodes | 32c / 256 GB / 2×25GbE | 2 | Active-passive HA |
| Management node | 16c / 128 GB / 2×10GbE | 2 | Active-passive HA |
| Compute nodes | 128c / 512 GB / 1×HDR200 | 64 | AMD EPYC 9754 |
| GPU nodes | 64c / 512 GB / 8×H100 / 1×NDR400 | 8 | For ML workloads |
| BeeGFS storage | 4-port NVMe / 32 TB per server | 8 | 256 TB raw |
| InfiniBand switch | 40-port HDR200 | 6 | Non-blocking fat-tree (4 leaf + 2 spine) |
| Storage switch | 48-port 25GbE | 1 | Dedicated to BeeGFS |
| Management switch | 48-port 1GbE | 1 | IPMI and admin |
This configuration delivers approximately 8,192 CPU cores across the compute nodes (64 × 128), 640 GB of GPU memory per GPU node (8 × 80 GB), and ~200 GB/s aggregate storage bandwidth, which suits a medium-sized research institution.
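The core, GPU memory, and raw capacity figures follow directly from the table and can be checked with a short script. The 80 GB per H100 is an assumption inferred from the 640 GB per-node figure; variable names are illustrative.

```python
# Values taken from the sizing table above.
compute_nodes, cores_per_compute_node = 64, 128
gpus_per_node, gpu_memory_gb = 8, 80  # assumes the 80 GB H100 variant
storage_servers, tb_per_server = 8, 32

total_compute_cores = compute_nodes * cores_per_compute_node
gpu_memory_per_node_gb = gpus_per_node * gpu_memory_gb
raw_storage_tb = storage_servers * tb_per_server

print(f"Compute cores:        {total_compute_cores:,}")
print(f"GPU memory per node:  {gpu_memory_per_node_gb} GB")
print(f"Raw BeeGFS capacity:  {raw_storage_tb} TB")
```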
Cluster sizing done right prevents the two most common and costly outcomes: premature saturation that forces expensive emergency expansions, and over-specification that leaves capital idle. Contact Mevasis for a workload-driven sizing analysis and reference architecture.