HPC Cluster Sizing Guide: How Many Nodes, RAM, Storage, and Network

The most expensive mistake in HPC procurement is buying the wrong size cluster — either too small for the workload (leading to chronic queue backlogs) or too large (wasting capital on idle capacity). Proper sizing starts with workload analysis, not hardware catalogs.

Step 1: Workload Analysis

Before specifying a single server, collect data on what will actually run on the cluster:

Application profile:

Which codes will run? (CFD: OpenFOAM, ANSYS; genomics: GATK, BWA; ML: PyTorch, TensorFlow)
Are they MPI-parallel, GPU-accelerated, single-threaded, or memory-bound?
What are typical input/output data sizes?

User and throughput requirements:

How many concurrent users?
Expected jobs per day and average job size (cores × hours)?
Target queue wait time (30 min? 4 hours? overnight?)
Peak vs. average demand ratio?

Growth trajectory:

Expected growth in user count and compute demand over 3–5 years?
Budget for initial deployment vs. expansion?

Without this data, cluster sizing is guesswork. Even rough estimates from pilot surveys or similar installations at peer institutions are better than none.
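
As a rough illustration of turning collected job data into the throughput figures listed above, here is a minimal Python sketch. The job records are hypothetical; real values would come from scheduler accounting logs or a user survey. It also shows that summing core-hours per job can differ noticeably from multiplying the averages, which is worth knowing when only averages are available.

```python
from statistics import mean

# Hypothetical job records as (cores, walltime_hours) pairs for one day.
jobs = [(64, 8.0), (128, 2.0), (16, 24.0), (64, 4.0)]
days_observed = 1

jobs_per_day = len(jobs) / days_observed
avg_cores = mean(cores for cores, _ in jobs)
avg_walltime = mean(hours for _, hours in jobs)

# Direct sum of core-hours: the most faithful measure of daily demand.
core_hours_per_day = sum(cores * hours for cores, hours in jobs) / days_observed

# Product of averages: an approximation that drifts when job width and
# walltime are correlated (here, the widest job is also the shortest).
approx_core_hours = jobs_per_day * avg_cores * avg_walltime

print(core_hours_per_day)  # 1408.0
print(approx_core_hours)   # 2584.0
```

When per-job records are available, the direct sum is the safer input for capacity calculations.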

Step 2: Compute Node Count

Once you know the workload profile, calculate the required compute capacity:

Required total cores = (jobs_per_day × avg_cores_per_job × avg_walltime_hours) / 24 hours
                     × utilization_target_factor (typically 1.2 to 1.5)

For example: 100 jobs/day × 64 cores/job × 8 hours average walltime / 24 hours × 1.3 ≈ 2,773 cores. With 64-core nodes, that is 2,773 / 64 ≈ 43.3, which rounds up to a minimum of 44 nodes. Add further nodes to cover the expected growth rate.
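
The calculation above can be sketched in Python. This is a minimal sketch, not a sizing tool: the function names `required_cores` and `nodes_needed` are illustrative, and the optional `annual_growth` and `years` parameters assume compound growth, which is only one way to model expansion.

```python
import math


def required_cores(jobs_per_day, avg_cores_per_job, avg_walltime_hours,
                   utilization_target_factor=1.3):
    """Average number of busy cores, scaled by the utilization factor."""
    core_hours_per_day = jobs_per_day * avg_cores_per_job * avg_walltime_hours
    return core_hours_per_day / 24 * utilization_target_factor


def nodes_needed(total_cores, cores_per_node, annual_growth=0.0, years=0):
    """Round up to whole nodes; compound growth is an assumption."""
    grown = total_cores * (1 + annual_growth) ** years
    return math.ceil(grown / cores_per_node)


cores = required_cores(100, 64, 8, 1.3)
print(round(cores))             # 2773
print(nodes_needed(cores, 64))  # 44
```

Passing, say, `annual_growth=0.2, years=3` shows how quickly the node count rises once growth is included.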

Node count also determines network topology. A 48-port switch can accommodate 48 compute nodes in a simple leaf configuration. Beyond that, a two-tier fat-tree is needed, which changes the cost structure significantly. Plan in multiples of switch port count.

Step 3: Memory per Node

Memory requirements are workload-specific:

Workload Type	Memory per Core
MPI parallel CFD / FEM	4–8 GB
Molecular dynamics (GROMACS, NAMD)	4–8 GB
Genome assembly (whole-genome)	16–64 GB
Deep learning (training)	8–16 GB
Monte Carlo simulation	2–4 GB
CFD with large mesh (Fluent, OpenFOAM)	8–16 GB

Modern dual-socket servers ship with 256 GB to 2 TB of RAM. For most HPC workloads, 256–512 GB per node is the sweet spot. If even a small fraction of jobs require > 512 GB, add dedicated high-memory nodes rather than populating all nodes with expensive large DIMMs.
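
A per-node memory figure follows from cores per node multiplied by the per-core requirement in the table. The sketch below rounds that product up to the next capacity in an assumed list of common configurations; the list is illustrative, not a vendor catalog.

```python
# Assumed common per-node capacities in GB (illustrative only).
STANDARD_CAPACITIES_GB = [256, 384, 512, 768, 1024, 1536, 2048]


def memory_per_node_gb(cores_per_node, gb_per_core):
    """Smallest assumed standard capacity covering cores x GB per core."""
    needed = cores_per_node * gb_per_core
    for capacity in STANDARD_CAPACITIES_GB:
        if capacity >= needed:
            return capacity
    raise ValueError(f"{needed} GB exceeds the largest listed capacity")


print(memory_per_node_gb(64, 4))   # 256
print(memory_per_node_gb(64, 8))   # 512
print(memory_per_node_gb(64, 16))  # 1024
```

A 64-core node at 4–8 GB per core lands in the 256–512 GB range, while 16 GB per core already pushes it into high-memory territory.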

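NUMA topology matters: AMD EPYC 9004 (Genoa) exposes up to 4 NUMA domains per socket with the NPS4 BIOS setting, or one per CCD (up to 12 per socket) when the L3-cache-as-NUMA option is enabled. Applications unaware of NUMA access remote memory at roughly half the bandwidth of local memory. Ensure MPI process binding matches the NUMA topology.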

Step 4: Storage Tier Design

HPC storage has three tiers with different performance and cost profiles:

Scratch (hot tier): Parallel filesystem (BeeGFS or Lustre) on NVMe SSDs. Used for active job I/O. Size: 50–100 TB per cluster is typical; more for genomics or seismic workloads. Throughput target: 10–50 GB/s aggregate.

Project/work (warm tier): Capacity storage on HDDs, parallel or NFS. Persistent user and group data. Size depends heavily on data retention policies — 500 TB to several PB.

Archive (cold tier): Tape library or object storage. Long-term retention at low cost per TB. Accessed infrequently; throughput is not critical.

Sizing rules of thumb:

Scratch: 10× the size of a typical job’s largest dataset
Project storage: 5 TB per active researcher (varies widely by domain)
Archive: 2–3× project storage, grows 30–50% per year
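
The rules of thumb above can be combined into a quick estimate. This is a sketch under stated assumptions: the inputs (an 8 TB dataset, 100 researchers) are hypothetical, and the defaults of 2.5× for the archive multiplier and 40% for annual growth are simply midpoints of the ranges given.

```python
def storage_estimate_tb(largest_dataset_tb, active_researchers,
                        archive_multiplier=2.5, archive_growth=0.4, years=3):
    """Apply the scratch / project / archive rules of thumb (all in TB)."""
    scratch = 10 * largest_dataset_tb
    project = 5 * active_researchers
    archive_now = archive_multiplier * project
    # Compound the archive growth rate over the planning horizon.
    archive_future = archive_now * (1 + archive_growth) ** years
    return {
        "scratch_tb": scratch,
        "project_tb": project,
        "archive_tb": archive_now,
        "archive_future_tb": round(archive_future),
    }


# Hypothetical site: 8 TB largest dataset, 100 active researchers.
estimate = storage_estimate_tb(8, 100)
print(estimate["scratch_tb"])         # 80
print(estimate["project_tb"])         # 500
print(estimate["archive_tb"])         # 1250.0
print(estimate["archive_future_tb"])  # 3430
```

The archive tier nearly triples in three years at 40% growth, so it is usually the tier to plan expansion for first.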

Step 5: Network Sizing

Network capacity must match compute and storage tiers:

MPI network (InfiniBand or RoCE):

Bandwidth: a 1:1 (non-blocking) ratio is ideal; 2:1 oversubscription is acceptable for most HPC workloads
HDR200 (200 Gb/s) per port for standard HPC; NDR400 (400 Gb/s) for large GPU clusters
Leaf switch count: N nodes / (switch_port_count / 2), rounded up, for a non-blocking fat-tree; spine switches are additional
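
The switch-count rule can be sketched as follows. The sketch also counts the spine switches that a two-tier non-blocking fat-tree needs, which goes one step beyond the rule as stated. It assumes identical switches in both tiers and fully provisioned uplinks on every leaf; `fat_tree_switches` is an illustrative name.

```python
import math


def fat_tree_switches(nodes, ports_per_switch):
    """Leaf and spine counts for a two-tier non-blocking fat-tree."""
    if nodes <= ports_per_switch:
        # Everything fits on one switch; no spine tier is needed.
        return {"leaf": 1, "spine": 0}
    down_ports = ports_per_switch // 2          # half the ports face nodes
    leaves = math.ceil(nodes / down_ports)
    if leaves > ports_per_switch:
        raise ValueError("too many nodes for a two-tier fat-tree")
    uplinks = leaves * (ports_per_switch - down_ports)
    spines = math.ceil(uplinks / ports_per_switch)
    return {"leaf": leaves, "spine": spines}


print(fat_tree_switches(48, 48))  # {'leaf': 1, 'spine': 0}
print(fat_tree_switches(72, 40))  # {'leaf': 4, 'spine': 2}
```

Crossing the single-switch boundary is the expensive step: 72 endpoints on 40-port switches need six switches, not two.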

Storage network:

Must not share bandwidth with MPI network
25 GbE or 100 GbE from each storage server to a dedicated storage switch
Aggregate storage network bandwidth should exceed parallel filesystem target throughput
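
The last point is easy to check numerically. The sketch below compares raw link capacity against a filesystem throughput target; it ignores protocol overhead, so real headroom will be somewhat smaller. The server and link counts in the example are hypothetical.

```python
def storage_network_gbytes_per_s(servers, links_per_server, link_gbit_per_s):
    """Aggregate raw link capacity in GB/s (8 bits per byte, no overhead)."""
    return servers * links_per_server * link_gbit_per_s / 8


def meets_target(servers, links_per_server, link_gbit_per_s,
                 target_gbytes_per_s):
    """True if raw network capacity exceeds the filesystem target."""
    capacity = storage_network_gbytes_per_s(
        servers, links_per_server, link_gbit_per_s)
    return capacity > target_gbytes_per_s


# Hypothetical: 8 storage servers, one link each.
print(storage_network_gbytes_per_s(8, 1, 25))  # 25.0
print(meets_target(8, 1, 25, 10))              # True
print(meets_target(8, 1, 25, 50))              # False
print(meets_target(8, 1, 100, 50))             # True
```

With one 25 GbE link per server, eight servers cover the low end of a 10–50 GB/s target but not the high end; 100 GbE links or multiple links per server close the gap.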

Management network:

1 GbE per node is sufficient
Out-of-band IPMI/BMC on separate management switch

Step 6: Redundancy and Reliability Planning

Reliability requirements drive additional cost:

Management node HA: Active-passive slurmctld with shared storage. Without this, a management node failure halts job scheduling.

Storage redundancy: RAID within storage nodes (RAID-6 for HDDs, RAID-1 for OS drives). BeeGFS Buddy Mirroring or Lustre File Level Redundancy (FLR) for filesystem-level redundancy.

Network redundancy: Dual uplinks from each leaf switch to core switches. Dual HCAs on GPU nodes (essential for InfiniBand-dependent AI workloads).

Power: Dual PSU per server. UPS covering at minimum the management nodes and storage. Generator for extended outages if uptime SLAs require it.

Sizing Example: 64-Node Research Cluster

Component	Specification	Count	Notes
Login nodes	32c / 256 GB / 2×25GbE	2	Active-passive HA
Management node	16c / 128 GB / 2×10GbE	2	Active-passive HA
Compute nodes	128c / 512 GB / 1×HDR200	64	AMD EPYC 9754
GPU nodes	64c / 512 GB / 8×H100 / 1×NDR400	8	For ML workloads
BeeGFS storage	4-port NVMe / 32 TB per server	8	256 TB raw
InfiniBand switch	40-port HDR200	6	4 leaf + 2 spine, non-blocking fat-tree for 72 nodes
Storage switch	48-port 25GbE	1	Dedicated to BeeGFS
Management switch	48-port 1GbE	1	IPMI and admin

This configuration delivers 8,192 CPU cores across the compute nodes (64 × 128), 640 GB of GPU memory per GPU node, and ~200 GB/s aggregate storage bandwidth, which suits a medium-sized research institution.
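
The headline totals can be checked against the table with a few lines of Python. The 80 GB per H100 figure is an assumption (the 80 GB variant of the card); the table itself does not state it.

```python
# Values taken from the sizing table.
compute_nodes, cores_per_compute_node = 64, 128
gpus_per_node = 8
storage_servers, tb_per_storage_server = 8, 32

# Assumption: each H100 carries 80 GB of memory.
gpu_memory_gb = 80

compute_cores = compute_nodes * cores_per_compute_node
gpu_memory_per_node_gb = gpus_per_node * gpu_memory_gb
raw_storage_tb = storage_servers * tb_per_storage_server

print(compute_cores)           # 8192
print(gpu_memory_per_node_gb)  # 640
print(raw_storage_tb)          # 256
```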

Cluster sizing done right prevents the two most common and costly outcomes: premature saturation that forces expensive emergency expansions, and over-specification that leaves capital idle. Contact Mevasis for a workload-driven sizing analysis and reference architecture.
