A well-planned HPC cluster is an infrastructure investment that delivers years of high performance, while poor architectural decisions lead to problems that are costly to correct later. This guide covers the steps required to build a small-to-medium HPC cluster (8–128 nodes) from the ground up.
Start Here: Workload Analysis
Before selecting hardware, answer these critical questions:
- What applications will run? MPI-based simulation? GPU machine learning? Memory-intensive genomics?
- What are average and peak utilization patterns? Sustained high load or periodic burst?
- How many users and groups work concurrently?
- What is the I/O profile? Checkpoint frequency, file sizes, read/write ratio?
- What are security and compliance requirements? Isolated network, encryption, audit logging?
These answers directly determine hardware selection, network architecture, and storage design.
Layer 1: Hardware Selection
Compute Nodes
CPU node selection criteria:
| Criterion | Recommendation |
|---|---|
| Core count | AMD EPYC 9004 (Genoa): 96–128 cores/socket; Intel Xeon Sapphire Rapids: 60 cores/socket |
| Memory | Minimum 4 GB/core; genomics/computational chemistry: 8–16 GB |
| Memory channels | EPYC: 12× DDR5; Xeon: 8× |
| PCIe lanes | 128+ lanes of PCIe 5.0 if GPU expansion is planned |
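The per-node memory floor follows directly from the core count. As a sizing check for a hypothetical dual-socket node built on 96-core Genoa parts:

```shell
# Minimum RAM for a dual-socket, 96-core/socket node at the 4 GB/core floor
SOCKETS=2
CORES_PER_SOCKET=96
GB_PER_CORE=4
echo "$(( SOCKETS * CORES_PER_SOCKET * GB_PER_CORE )) GB minimum per node"
# prints: 768 GB minimum per node
```

At the 8–16 GB/core target for genomics or computational chemistry, the same node would need 1.5–3 TB.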
GPU nodes:
- NVIDIA HGX H100 (8× H100 SXM5, NVLink): Large-scale AI/ML or simulation
- NVIDIA HGX A100 (8× A100): Lower cost for existing Ampere deployments
- NVIDIA L40S (PCIe, 48 GB): Inference and mid-scale training at lower price point
ECC memory is mandatory: undetected bit errors that silently corrupt scientific results are unacceptable in production HPC.
Management Nodes
Deploy at least two service (head/login) nodes for high availability:
- Login node: User sessions, job submission, data transfer
- Management node: slurmctld, DNS, LDAP/AD, monitoring
- Storage node (separate or integrated): NFS or parallel filesystem
All management nodes must have OOB (Out-of-Band) management cards (IPMI/iDRAC/iLO) for remote access without physical presence.
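Once BMCs are cabled, remote control can be exercised with `ipmitool`. A sketch, assuming a dedicated OOB subnet; the address and credentials below are placeholders:

```shell
# Query a node's BMC over the OOB network (host, user, and password are placeholders)
ipmitool -I lanplus -H 10.10.0.101 -U admin -P "$BMC_PASSWORD" power status
# Review the hardware event log for ECC, thermal, or PSU events
ipmitool -I lanplus -H 10.10.0.101 -U admin -P "$BMC_PASSWORD" sel list
```

The same interface handles power cycling (`power cycle`) and serial-over-LAN console access, which is what makes hands-off recovery of a wedged node possible.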
Layer 2: Network Architecture
HPC networks typically consist of two separate fabrics:
Management Network (1GbE or 10GbE Ethernet)
- OS provisioning, IPMI access, NFS, monitoring
- All nodes included; isolated from compute network for security
High-Speed Compute Network
For MPI and parallel workloads:
- ≤ 32 nodes, budget-constrained: 25GbE or RoCE 100GbE
- 32–256 nodes, mid-range budget: InfiniBand HDR200
- 256+ nodes or latency-critical: InfiniBand NDR400 with fat-tree topology
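For InfiniBand deployments, the fabric should be verified before any MPI benchmarking. A sketch using the standard `infiniband-diags` and `perftest` tools (node names are placeholders):

```shell
# Local HCA state: port should report LinkUp at the expected rate (e.g. 200 Gb/s for HDR)
ibstat
# Enumerate hosts visible to the subnet manager
ibhosts
# Point-to-point bandwidth between two nodes:
ib_write_bw              # run on compute01 (server side)
ib_write_bw compute01    # run on compute02 (client side)
```

Measured bandwidth well below the link rate usually points to cabling, firmware, or PCIe placement problems and is far cheaper to find now than after go-live.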
Layer 3: Storage System
Storage is the most frequently underestimated HPC component. Insufficient storage performance leaves compute resources idle.
Storage Tiers
| Tier | Technology | Purpose |
|---|---|---|
| Scratch (temporary) | NVMe SSD-based BeeGFS/Lustre | Active compute data |
| Home | NFS-backed NAS | User scripts, small files |
| Archive | High-capacity HDD or object storage | Completed project data |
Parallel Filesystem Choice
Lustre: The de facto standard for large deployments, widely used across TOP500 systems. High configuration complexity; requires specialist administration.
BeeGFS: Easier setup and management; ideal for mid-scale deployments (8–256 nodes). Built-in replication with Buddy Mirroring.
Minimum scratch storage targets:
- Read: 10 GB/s (per 4,000-core cluster)
- Write: 5 GB/s
- Metadata: 50,000+ IOPS
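These targets should be validated from the compute side during acceptance testing. A sketch using `fio`; the `/scratch/bench` path and job geometry are assumptions, and `numjobs` should be scaled with the number of client nodes:

```shell
# Sequential write throughput against scratch (run from a compute node)
fio --name=seqwrite --directory=/scratch/bench --rw=write --bs=1M \
    --size=8G --numjobs=8 --direct=1 --ioengine=libaio --group_reporting
# Sequential read throughput
fio --name=seqread --directory=/scratch/bench --rw=read --bs=1M \
    --size=8G --numjobs=8 --direct=1 --ioengine=libaio --group_reporting
```

For metadata IOPS, a dedicated metadata benchmark such as `mdtest` gives a more representative number than `fio` alone.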
Layer 4: Software Stack
Operating System
Rocky Linux 9 or AlmaLinux 9 are recommended for new deployments (CentOS 7 is EOL). Both are RHEL-based with enterprise support options.
Cluster Management: Warewulf 4
Node provisioning, PXE boot, and image management:
```shell
# Register the nodes, then assign the boot image
wwctl node add compute[01-32] --netdev eth0
# Hardware addresses are unique per node; set each one individually
wwctl node set compute01 --netdev eth0 --hwaddr AA:BB:CC:DD:EE:01
wwctl node set compute[01-32] --container rocky-9-hpc
```
Job Scheduler: SLURM
```ini
# /etc/slurm/slurm.conf essentials
# (NodeName definitions with CPUs/RealMemory are omitted for brevity)
ClusterName=mycluster
SlurmctldHost=mgmt01
AuthType=auth/munge
MpiDefault=pmix
PartitionName=compute Nodes=compute[01-32] Default=YES MaxTime=72:00:00
PartitionName=gpu Nodes=gpu[01-08] Default=NO MaxTime=48:00:00
PartitionName=debug Nodes=compute[01-02] Default=NO MaxTime=01:00:00
```
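A user job submitted against these partitions might look like the following; the job name, task geometry, and application binary are placeholders:

```shell
#!/bin/bash
#SBATCH --job-name=sim-run
#SBATCH --partition=compute
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=96
#SBATCH --time=24:00:00

# Load the toolchain (module names as provided by the site's Lmod tree)
module load gcc/12.3 openmpi/4.1.5
# srun launches the MPI ranks under SLURM's PMIx integration
srun ./my_mpi_app
```

Submitted with `sbatch job.sh`; the `debug` partition's 1-hour limit makes it the right target while iterating on scripts like this one.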
Module System: Lmod
Managing parallel application versions for users:
```shell
module load gcc/12.3 openmpi/4.1.5
module load python/3.11 cuda/12.3
module list      # show currently loaded modules
module avail     # list all modules available on the system
```
MPI Libraries
- OpenMPI 5.x: General-purpose, good documentation, UCX/PMIx integration
- MVAPICH2: Optimized for InfiniBand; strong GPU-MPI scenarios
- Intel MPI: Good performance on Intel Xeon; free with oneAPI
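Whichever library is chosen, the build-and-launch path can be sanity-checked end to end with a minimal program. A sketch, assuming the MPI compiler wrappers are on `PATH`; the rank count is illustrative:

```shell
# Write, build, and run a minimal MPI program
cat > mpi_hello.c <<'EOF'
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF
mpicc mpi_hello.c -o mpi_hello
mpirun -np 4 ./mpi_hello
```

Running this across two or more nodes (e.g. via `srun`) also exercises the high-speed fabric rather than just shared memory.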
Layer 5: Security
Network Segmentation
```
Internet
    ↓
[Firewall]
    ↓
[Login/DMZ Zone] — User SSH access
    ↓
[Management Network] — Admin traffic (isolated)
    ↓
[Compute Network] — HPC workloads (isolated)
```
Authentication
- Centralized user management: LDAP/Active Directory integration
- SSH key-based authentication; password login disabled
- MFA (Multi-Factor Authentication) recommended for login nodes
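A minimal hardening sketch for the login nodes' SSH daemon; the group names are hypothetical and site-specific:

```
# /etc/ssh/sshd_config (login nodes)
PasswordAuthentication no
KbdInteractiveAuthentication no
PermitRootLogin no
AllowGroups hpc-users hpc-admins   # hypothetical group names
```

Compute nodes typically go further and accept SSH only from the management network, or only for users with a running job on that node (e.g. via `pam_slurm_adopt`).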
Monitoring Stack
- Prometheus + Grafana: Node and cluster metrics
- DCGM Exporter: GPU health and performance
- Slurm exporter: Queue and resource utilization
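Before wiring scrape jobs into Prometheus, each exporter can be checked directly. A sketch; the hostnames are placeholders, while 9100 and 9400 are the default ports of node_exporter and dcgm-exporter respectively:

```shell
# Node metrics from a compute node
curl -sf http://compute01:9100/metrics | head -n 3
# GPU metrics from a GPU node
curl -sf http://gpu01:9400/metrics | grep -m1 DCGM
```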
Typical Installation Timeline
| Phase | Duration | Description |
|---|---|---|
| Procurement | 6–14 weeks | Servers, switches, cable lead times |
| Physical installation | 1–2 weeks | Rack, cabling, power |
| OS and provisioning | 1 week | Warewulf, PXE boot, image config |
| Network configuration | 1 week | Switches, InfiniBand subnet manager |
| SLURM and software | 1–2 weeks | Scheduler, modules, applications |
| Testing and acceptance | 1–2 weeks | Benchmarks, stress testing, user acceptance |
| Total | 11–22 weeks | Sum of the phase ranges above |
Mevasis HPC Installation Services
Mevasis provides turnkey HPC cluster installation services: needs analysis, hardware procurement, physical installation, software configuration, and go-live support. Technical support and maintenance services are also available post-installation.
Frequently Asked Questions
What node count qualifies as “small”? General convention: 1–32 nodes is small, 32–256 is medium, 256+ is large. Even small deployments benefit from parallel storage and high-speed interconnect.
Open source or commercial cluster management software? Warewulf (open source) and Bright Cluster Manager (commercial) are the most common options. Warewulf is sufficient for small-to-medium deployments; Bright’s GUI and support advantages are worth evaluating for large clusters with limited internal capacity.
Who should manage the software stack? Production environments require at least one dedicated HPC system administrator. Alternatively, managed service can be procured from specialist vendors like Mevasis.