A well-planned HPC cluster is an infrastructure investment that delivers years of high performance, while poor architectural decisions lead to problems that are costly to correct later. This guide covers the steps required to build a small-to-medium HPC cluster (8–128 nodes) from the ground up.
Start Here: Workload Analysis
Before selecting hardware, answer these critical questions:
- What applications will run? MPI-based simulation? GPU machine learning? Memory-intensive genomics?
- What are average and peak utilization patterns? Sustained high load or periodic burst?
- How many users and groups work concurrently?
- What is the I/O profile? Checkpoint frequency, file sizes, read/write ratio?
- What are security and compliance requirements? Isolated network, encryption, audit logging?
These answers directly determine hardware selection, network architecture, and storage design.
Layer 1: Hardware Selection
Compute Nodes
CPU node selection criteria:
| Criterion | Recommendation |
|---|---|
| Core count | AMD EPYC 9004 (Genoa): 96–128 cores/socket; Intel Xeon Sapphire Rapids: 60 cores/socket |
| Memory | Minimum 4 GB/core; genomics/computational chemistry: 8–16 GB |
| Memory channels | EPYC: 12× DDR5; Xeon: 8× |
| PCIe lanes | 128+ lanes of PCIe 5.0 if GPU expansion is planned |
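The per-node memory floor follows directly from the core count. As a sizing check for a hypothetical dual-socket node built on 96-core Genoa parts:

```shell
# Minimum RAM for a dual-socket, 96-core/socket node at the 4 GB/core floor
SOCKETS=2
CORES_PER_SOCKET=96
GB_PER_CORE=4
echo "$(( SOCKETS * CORES_PER_SOCKET * GB_PER_CORE )) GB minimum per node"
# prints: 768 GB minimum per node
```

At the 8–16 GB/core target for genomics or computational chemistry, the same node would need 1.5–3 TB.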
GPU nodes:
- NVIDIA HGX H100 (8× H100 SXM5, NVLink): Large-scale AI/ML or simulation
- NVIDIA HGX A100 (8× A100): Lower cost for existing Ampere deployments
- NVIDIA L40S (PCIe, 48 GB): Inference and mid-scale training at lower price point
ECC memory is mandatory: undetected bit errors that silently corrupt scientific results are unacceptable in production HPC.
Management Nodes
Deploy at least two service (head/login) nodes for high availability:
- Login node: User sessions, job submission, data transfer
- Management node: slurmctld, DNS, LDAP/AD, monitoring
- Storage node (separate or integrated): NFS or parallel filesystem
All management nodes must have OOB (Out-of-Band) management cards (IPMI/iDRAC/iLO) for remote access without physical presence.
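Once BMCs are cabled, remote control can be exercised with `ipmitool`. A sketch, assuming a dedicated OOB subnet; the address and credentials below are placeholders:

```shell
# Query a node's BMC over the OOB network (host, user, and password are placeholders)
ipmitool -I lanplus -H 10.10.0.101 -U admin -P "$BMC_PASSWORD" power status
# Review the hardware event log for ECC, thermal, or PSU events
ipmitool -I lanplus -H 10.10.0.101 -U admin -P "$BMC_PASSWORD" sel list
```

The same interface handles power cycling (`power cycle`) and serial-over-LAN console access, which is what makes hands-off recovery of a wedged node possible.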
Layer 2: Network Architecture
HPC networks typically consist of two separate fabrics:
Management Network (1GbE or 10GbE Ethernet)
- OS provisioning, IPMI access, NFS, monitoring
- All nodes included; isolated from compute network for security
High-Speed Compute Network
For MPI and parallel workloads:
- ≤ 32 nodes, budget-constrained: 25GbE or RoCE 100GbE
- 32–256 nodes, mid-range budget: InfiniBand HDR200
- 256+ nodes or latency-critical: InfiniBand NDR400 with fat-tree topology
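For InfiniBand deployments, the fabric should be verified before any MPI benchmarking. A sketch using the standard `infiniband-diags` and `perftest` tools (node names are placeholders):

```shell
# Local HCA state: port should report LinkUp at the expected rate (e.g. 200 Gb/s for HDR)
ibstat
# Enumerate hosts visible to the subnet manager
ibhosts
# Point-to-point bandwidth between two nodes:
ib_write_bw              # run on compute01 (server side)
ib_write_bw compute01    # run on compute02 (client side)
```

Measured bandwidth well below the link rate usually points to cabling, firmware, or PCIe placement problems and is far cheaper to find now than after go-live.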
Layer 3: Storage System
Storage is the most frequently underestimated HPC component. Insufficient storage performance leaves compute resources idle.
Storage Tiers
| Tier | Technology | Purpose |
|---|---|---|
| Scratch (temporary) | NVMe SSD-based BeeGFS/Lustre | Active compute data |
| Home | NFS-backed NAS | User scripts, small files |
| Archive | High-capacity HDD or object storage | Completed project data |
Parallel Filesystem Choice
Lustre: The de facto standard for large deployments, widely used across TOP500 systems. High configuration complexity; requires specialist administration.
BeeGFS: Easier setup and management; ideal for mid-scale deployments (8–256 nodes). Built-in replication with Buddy Mirroring.
Minimum scratch storage targets:
- Read: 10 GB/s (per 4,000-core cluster)
- Write: 5 GB/s
- Metadata: 50,000+ IOPS
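These targets should be validated from the compute side during acceptance testing. A sketch using `fio`; the `/scratch/bench` path and job geometry are assumptions, and `numjobs` should be scaled with the number of client nodes:

```shell
# Sequential write throughput against scratch (run from a compute node)
fio --name=seqwrite --directory=/scratch/bench --rw=write --bs=1M \
    --size=8G --numjobs=8 --direct=1 --ioengine=libaio --group_reporting
# Sequential read throughput
fio --name=seqread --directory=/scratch/bench --rw=read --bs=1M \
    --size=8G --numjobs=8 --direct=1 --ioengine=libaio --group_reporting
```

For metadata IOPS, a dedicated metadata benchmark such as `mdtest` gives a more representative number than `fio` alone.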
Layer 4: Software Stack
Operating System
Rocky Linux 9 or AlmaLinux 9 are recommended for new deployments (CentOS 7 is EOL). Both are RHEL-based with enterprise support options.
Cluster Management: Warewulf 4
Node provisioning, PXE boot, and image management:
```shell
# Register the nodes, then assign the boot image
wwctl node add compute[01-32] --netdev eth0
# Hardware addresses are unique per node; set each one individually
wwctl node set compute01 --netdev eth0 --hwaddr AA:BB:CC:DD:EE:01
wwctl node set compute[01-32] --container rocky-9-hpc
```
Job Scheduler: SLURM
```ini
# /etc/slurm/slurm.conf essentials
# (NodeName definitions with CPUs/RealMemory are omitted for brevity)
ClusterName=mycluster
SlurmctldHost=mgmt01
AuthType=auth/munge
MpiDefault=pmix
PartitionName=compute Nodes=compute[01-32] Default=YES MaxTime=72:00:00
PartitionName=gpu Nodes=gpu[01-08] Default=NO MaxTime=48:00:00
PartitionName=debug Nodes=compute[01-02] Default=NO MaxTime=01:00:00
```
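A user job submitted against these partitions might look like the following; the job name, task geometry, and application binary are placeholders:

```shell
#!/bin/bash
#SBATCH --job-name=sim-run
#SBATCH --partition=compute
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=96
#SBATCH --time=24:00:00

# Load the toolchain (module names as provided by the site's Lmod tree)
module load gcc/12.3 openmpi/4.1.5
# srun launches the MPI ranks under SLURM's PMIx integration
srun ./my_mpi_app
```

Submitted with `sbatch job.sh`; the `debug` partition's 1-hour limit makes it the right target while iterating on scripts like this one.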
Module System: Lmod
Managing parallel application versions for users:
```shell
module load gcc/12.3 openmpi/4.1.5
module load python/3.11 cuda/12.3
module list      # show currently loaded modules
module avail     # list all modules available on the system
```
MPI Libraries
- OpenMPI 5.x: General-purpose, good documentation, UCX/PMIx integration
- MVAPICH2: Optimized for InfiniBand; strong GPU-MPI scenarios
- Intel MPI: Good performance on Intel Xeon; free with oneAPI
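Whichever library is chosen, the build-and-launch path can be sanity-checked end to end with a minimal program. A sketch, assuming the MPI compiler wrappers are on `PATH`; the rank count is illustrative:

```shell
# Write, build, and run a minimal MPI program
cat > mpi_hello.c <<'EOF'
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF
mpicc mpi_hello.c -o mpi_hello
mpirun -np 4 ./mpi_hello
```

Running this across two or more nodes (e.g. via `srun`) also exercises the high-speed fabric rather than just shared memory.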
Layer 5: Security
Network Segmentation
```
Internet
    ↓
[Firewall]
    ↓
[Login/DMZ Zone] — User SSH access
    ↓
[Management Network] — Admin traffic (isolated)
    ↓
[Compute Network] — HPC workloads (isolated)
```
Authentication
- Centralized user management: LDAP/Active Directory integration
- SSH key-based authentication; password login disabled
- MFA (Multi-Factor Authentication) recommended for login nodes
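A minimal hardening sketch for the login nodes' SSH daemon; the group names are hypothetical and site-specific:

```
# /etc/ssh/sshd_config (login nodes)
PasswordAuthentication no
KbdInteractiveAuthentication no
PermitRootLogin no
AllowGroups hpc-users hpc-admins   # hypothetical group names
```

Compute nodes typically go further and accept SSH only from the management network, or only for users with a running job on that node (e.g. via `pam_slurm_adopt`).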
Monitoring Stack
- Prometheus + Grafana: Node and cluster metrics
- DCGM Exporter: GPU health and performance
- Slurm exporter: Queue and resource utilization
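Before wiring scrape jobs into Prometheus, each exporter can be checked directly. A sketch; the hostnames are placeholders, while 9100 and 9400 are the default ports of node_exporter and dcgm-exporter respectively:

```shell
# Node metrics from a compute node
curl -sf http://compute01:9100/metrics | head -n 3
# GPU metrics from a GPU node
curl -sf http://gpu01:9400/metrics | grep -m1 DCGM
```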
Typical Installation Timeline
| Phase | Duration | Description |
|---|---|---|
| Procurement | 6–14 weeks | Servers, switches, cable lead times |
| Physical installation | 1–2 weeks | Rack, cabling, power |
| OS and provisioning | 1 week | Warewulf, PXE boot, image config |
| Network configuration | 1 week | Switches, InfiniBand subnet manager |
| SLURM and software | 1–2 weeks | Scheduler, modules, applications |
| Testing and acceptance | 1–2 weeks | Benchmarks, stress testing, user acceptance |
| Total | 11–22 weeks | Sum of the phase ranges above |
Mevasis HPC Installation Services
Mevasis provides turnkey HPC cluster installation services: needs analysis, hardware procurement, physical installation, software configuration, and go-live support. Technical support and maintenance services are also available post-installation.
Frequently Asked Questions
What node count qualifies as “small”? General convention: 1–32 nodes is small, 32–256 is medium, 256+ is large. Even small deployments benefit from parallel storage and high-speed interconnect.
Open source or commercial cluster management software? Warewulf (open source) and Bright Cluster Manager (commercial) are the most common options. Warewulf is sufficient for small-to-medium deployments; Bright’s GUI and support advantages are worth evaluating for large clusters with limited internal capacity.
Who should manage the software stack? Production environments require at least one dedicated HPC system administrator. Alternatively, managed service can be procured from specialist vendors like Mevasis.