
HPC Cluster Setup Guide: From Hardware Selection to Software Stack

Step-by-step guide to building an HPC cluster: hardware selection, network design, storage architecture, operating system, and software stack configuration.

Building an HPC cluster — when planned correctly — creates an infrastructure investment that delivers years of high performance. Poor architectural decisions lead to problems that are costly to correct. This guide covers the steps required to build a small-to-medium HPC cluster (8–128 nodes) from the ground up.

Start Here: Workload Analysis

Before selecting hardware, answer these critical questions:

  • What applications will run? MPI-based simulation? GPU machine learning? Memory-intensive genomics?
  • What are average and peak utilization patterns? Sustained high load or periodic burst?
  • How many users and groups work concurrently?
  • What is the I/O profile? Checkpoint frequency, file sizes, read/write ratio?
  • What are security and compliance requirements? Isolated network, encryption, audit logging?

These answers directly determine hardware selection, network architecture, and storage design.

Layer 1: Hardware Selection

Compute Nodes

CPU node selection criteria:

  • Core count: AMD EPYC 9004 (Genoa): 96–128 cores/socket; Intel Xeon Sapphire Rapids: up to 60 cores/socket
  • Memory: minimum 4 GB/core; 8–16 GB/core for genomics and computational chemistry
  • Memory channels: EPYC: 12× DDR5; Xeon: 8× DDR5
  • PCIe lanes: 128+ PCIe 5.0 lanes if GPU expansion is planned

GPU nodes:

  • NVIDIA HGX H100 (8× H100 SXM5, NVLink): Large-scale AI/ML or simulation
  • NVIDIA HGX A100 (8× A100): Lower cost for existing Ampere deployments
  • NVIDIA L40S (PCIe, 48 GB): Inference and mid-scale training at lower price point

ECC memory is mandatory: undetected bit errors that silently corrupt scientific results are unacceptable in production HPC.
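
For GPU nodes it is also worth confirming the interconnect topology at acceptance time; a minimal sketch using NVIDIA's standard tooling:

# Report GPU health and driver status
nvidia-smi
# Show the GPU-to-GPU interconnect matrix; HGX/SXM systems should report NVLink paths rather than PCIe-only
nvidia-smi topo -m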

Management Nodes

Provision at least two management (head/login) nodes for high availability:

  • Login node: User sessions, job submission, data transfer
  • Management node: slurmctld, DNS, LDAP/AD, monitoring
  • Storage node (separate or integrated): NFS or parallel filesystem

All management nodes must have OOB (Out-of-Band) management cards (IPMI/iDRAC/iLO) for remote access without physical presence.
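
A quick way to confirm OOB access works before go-live is to query each BMC remotely; a minimal sketch with ipmitool, where the address and credentials are placeholders:

# Check chassis power state through the BMC
ipmitool -I lanplus -H 10.0.0.101 -U admin -P 'changeme' power status
# Review the hardware event log without touching the OS
ipmitool -I lanplus -H 10.0.0.101 -U admin -P 'changeme' sel list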

Layer 2: Network Architecture

HPC networks typically consist of two separate fabrics:

Management Network (1GbE or 10GbE Ethernet)

  • OS provisioning, IPMI access, NFS, monitoring
  • All nodes included; isolated from compute network for security

High-Speed Compute Network

For MPI and parallel workloads (a quick fabric health check follows this list):

  • ≤ 32 nodes, budget-constrained: 25GbE or RoCE 100GbE
  • 32–256 nodes, mid-range budget: InfiniBand HDR200
  • 256+ nodes or latency-critical: InfiniBand NDR400 with fat-tree topology
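
Whichever fabric is chosen, verify that every node negotiates the expected link state and rate after bring-up. A minimal sketch for an InfiniBand fabric, assuming the standard infiniband-diags tools are installed:

# Show local HCA port state and rate (expect "State: Active" at the purchased speed)
ibstat
# List the hosts and switches visible on the fabric
ibnodes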

Layer 3: Storage System

Storage is the most frequently underestimated HPC component. Insufficient storage performance leaves compute resources idle.

Storage Tiers

  • Scratch (temporary): NVMe SSD-based BeeGFS/Lustre for active compute data
  • Home: NFS-backed NAS for user scripts and small files
  • Archive: high-capacity HDD or object storage for completed project data

Parallel Filesystem Choice

Lustre: The industry standard for large deployments, used by roughly 70% of TOP500 systems. High configuration complexity; requires specialist administration.

BeeGFS: Easier setup and management; ideal for mid-scale deployments (8–256 nodes). Built-in replication with Buddy Mirroring.
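
On Lustre, large shared files benefit from striping across multiple OSTs. A small illustration with the lfs tool, where the directory and stripe parameters are only examples:

# New files in this directory will be striped across 4 OSTs with a 1 MiB stripe size
lfs setstripe -c 4 -S 1M /scratch/myproject
# Confirm the layout new files will inherit
lfs getstripe /scratch/myproject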

Minimum scratch storage targets (for a cluster on the order of 4,000 cores; the IOR sketch after this list shows one way to validate them):

  • Read: 10 GB/s
  • Write: 5 GB/s
  • Metadata: 50,000+ IOPS
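
One way to validate these targets during acceptance testing is a parallel I/O benchmark such as IOR (with mdtest for metadata), assuming an MPI environment is loaded and /scratch is the parallel filesystem mount:

# 64-rank file-per-process run: write then read, 1 MiB transfers, 4 GiB per rank
mpirun -np 64 ior -w -r -t 1m -b 4g -F -o /scratch/ior_test
# Metadata rates: create/stat/remove 1,000 items per rank in an existing test directory
mpirun -np 16 mdtest -n 1000 -d /scratch/mdtest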

Layer 4: Software Stack

Operating System

Rocky Linux 9 or AlmaLinux 9 are recommended for new deployments (CentOS 7 is EOL). Both are RHEL-based with enterprise support options.

Cluster Management: Warewulf 4

Node provisioning, PXE boot, and image management:

# Register the compute nodes and their provisioning NIC (the MAC shown is a placeholder; each node needs its own)
wwctl node add compute[01-32] --netdev eth0 --hwaddr AA:BB:CC:DD:EE:FF
# Assign the container image the nodes will boot from
wwctl node set compute[01-32] --container rocky-9-hpc
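
Before nodes can be assigned a container, the image must be imported and the provisioning services configured. A sketch of the usual sequence, where the registry path is illustrative:

# Import a Rocky Linux 9 image as the node container (registry path is an example)
wwctl container import docker://ghcr.io/warewulf/warewulf-rockylinux:9 rocky-9-hpc
# Rebuild overlays and (re)configure DHCP, TFTP, and NFS for provisioning
wwctl overlay build
wwctl configure --all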

Job Scheduler: SLURM

# /etc/slurm/slurm.conf essentials
ClusterName=mycluster
SlurmctldHost=mgmt01
AuthType=auth/munge
MpiDefault=pmix
GresTypes=gpu

# Node definitions (CPU, memory, and GPU counts are examples; match your hardware)
NodeName=compute[01-32] CPUs=128 RealMemory=515000 State=UNKNOWN
NodeName=gpu[01-08]     CPUs=64  RealMemory=1031000 Gres=gpu:8 State=UNKNOWN

# Partitions referencing the node definitions above
PartitionName=compute Nodes=compute[01-32] Default=YES MaxTime=72:00:00
PartitionName=gpu     Nodes=gpu[01-08]     Default=NO  MaxTime=48:00:00
PartitionName=debug   Nodes=compute[01-02] Default=NO  MaxTime=01:00:00
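
With the partitions in place, users submit work through sbatch. A minimal batch script against the compute partition above, where the resource figures and application binary are placeholders:

#!/bin/bash
#SBATCH --job-name=mpi_test
#SBATCH --partition=compute
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --time=01:00:00

module load gcc/12.3 openmpi/4.1.5
srun ./my_mpi_app   # placeholder binary; srun launches the MPI ranks via PMIx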

Module System: Lmod

Lmod manages multiple compiler and application versions for users through environment modules:

# Load a compiler, MPI, and application toolchain for the current session
module load gcc/12.3 openmpi/4.1.5
module load python/3.11 cuda/12.3
# Show what is loaded and what is available
module list
module avail

MPI Libraries

  • OpenMPI 5.x: General-purpose, good documentation, UCX/PMIx integration (a quick sanity check follows this list)
  • MVAPICH2: Optimized for InfiniBand; strong support for GPU-aware MPI
  • Intel MPI: Good performance on Intel Xeon; free as part of the oneAPI toolkits
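
A short end-to-end check confirms that the compiler, module environment, and interconnect work together. A sketch assuming OpenMPI with UCX and a trivial MPI source file (hello.c is a placeholder):

# Compile and run a trivial MPI program over UCX
module load gcc/12.3 openmpi/4.1.5
mpicc -o hello hello.c
mpirun -np 8 --mca pml ucx ./hello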

Layer 5: Security

Network Segmentation

Internet
    ↓
[Firewall]
    ↓
[Login/DMZ Zone] — User SSH access
    ↓
[Management Network] — Admin traffic (isolated)
    ↓
[Compute Network] — HPC workloads (isolated)
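
Enforcing these boundaries happens on the perimeter firewall and on the login nodes themselves. A minimal login-node sketch with firewalld, assuming only SSH should be reachable from the outside zone:

# Allow SSH and drop everything else arriving on the user-facing zone
firewall-cmd --permanent --zone=public --add-service=ssh
firewall-cmd --permanent --zone=public --set-target=DROP
firewall-cmd --reload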

Authentication

  • Centralized user management: LDAP/Active Directory integration
  • SSH key-based authentication; password login disabled (see the sshd_config excerpt after this list)
  • MFA (Multi-Factor Authentication) recommended for login nodes
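
The SSH policy above maps to a few sshd_config directives on the login nodes; an illustrative excerpt (MFA additionally requires a PAM integration, not shown here):

# /etc/ssh/sshd_config excerpt for login nodes
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no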

Monitoring Stack

  • Prometheus + Grafana: Node and cluster metrics (a minimal scrape configuration follows this list)
  • DCGM Exporter: GPU health and performance
  • Slurm exporter: Queue and resource utilization
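
A minimal scrape configuration tying these exporters together might look like the following, where the hostnames are placeholders and the ports are the exporters' usual defaults:

# prometheus.yml excerpt
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['compute01:9100', 'compute02:9100', 'mgmt01:9100']
  - job_name: dcgm
    static_configs:
      - targets: ['gpu01:9400']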

Typical Installation Timeline

  • Procurement (6–14 weeks): Servers, switches, cable lead times
  • Physical installation (1–2 weeks): Rack, cabling, power
  • OS and provisioning (1 week): Warewulf, PXE boot, image configuration
  • Network configuration (1 week): Switches, InfiniBand subnet manager
  • SLURM and software (1–2 weeks): Scheduler, modules, applications
  • Testing and acceptance (1–2 weeks): Benchmarks, stress testing, user acceptance
  • Total: 11–23 weeks

Mevasis HPC Installation Services

Mevasis provides turnkey HPC cluster installation services: needs analysis, hardware procurement, physical installation, software configuration, and go-live support. Technical support and maintenance services are also available post-installation.


Frequently Asked Questions

What node count qualifies as “small”? General convention: 1–32 nodes is small, 32–256 is medium, 256+ is large. Even small deployments benefit from parallel storage and high-speed interconnect.

Open source or commercial cluster management software? Warewulf (open source) and Bright Cluster Manager (commercial) are the most common options. Warewulf is sufficient for small-to-medium deployments; Bright’s GUI and support advantages are worth evaluating for large clusters with limited internal capacity.

Who should manage the software stack? Production environments require at least one dedicated HPC system administrator. Alternatively, a managed service can be procured from a specialist vendor such as Mevasis.