HPC Cluster Architecture: Login, Compute, Storage Nodes Explained

An HPC cluster is not a pile of servers connected to a switch — it is a carefully layered system where each node type has a distinct role, and where the network and storage tiers are designed together with the compute layer. Understanding this architecture is the prerequisite for making good hardware selection, sizing, and operations decisions.

Node Types and Their Roles

Login Nodes

Login nodes are the users' entry point to the cluster. They provide:

SSH access from external networks
Job submission via sbatch, squeue, scancel
Interactive shells for small test runs and data preparation
Module environment loading (module load openmpi/4.1)
Access to the shared parallel filesystem

Login nodes must never be used for production computation. Running heavy workloads on login nodes degrades the experience for all users and can destabilize job submission infrastructure. Enforce this with per-user CPU and memory limits on login nodes, for example ulimits or cgroups.
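
A minimal sketch of the ulimit approach, assuming a hypothetical script at /etc/profile.d/login-limits.sh that login shells source. The function name and the limit values are illustrative assumptions, not recommendations. ulimit applies per process, so it does not stop a user from starting many processes; sites that need a per-user cap usually add cgroup limits on top.

```shell
#!/bin/bash
# Hypothetical /etc/profile.d/login-limits.sh for login nodes.
# All values are illustrative assumptions; tune them per site.

apply_login_limits() {
    local cpu_seconds="${1:-1800}"   # CPU time per process, in seconds
    local vmem_kib="${2:-8388608}"   # virtual memory per process, in KiB (8 GiB)

    # Soft limits: a process that exceeds the CPU limit receives SIGXCPU,
    # and allocations beyond the memory limit fail.
    ulimit -S -t "$cpu_seconds"
    ulimit -S -v "$vmem_kib"
}
```

Calling apply_login_limits at the end of the script would apply the defaults to every login shell. Because these are soft limits, users can raise them up to the hard limit, so a site would also set hard limits through its PAM limits configuration.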

Sizing: 2–4 cores per expected concurrent user, 4–8 GB RAM per core, fast SSD for home directories. Redundancy (2 login nodes behind a load balancer) is strongly recommended.
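
The sizing rule above is simple arithmetic, sketched here as a small helper. The function name and the example user count are assumptions for illustration; the multipliers come straight from the rule of thumb.

```shell
#!/bin/bash
# Login node sizing from the rule of thumb above:
# 2-4 cores per concurrent user, 4-8 GB RAM per core.

login_node_sizing() {
    local users="$1"
    local cores_min=$(( users * 2 ))
    local cores_max=$(( users * 4 ))
    local ram_min=$(( cores_min * 4 ))   # GB, low end: fewest cores at 4 GB each
    local ram_max=$(( cores_max * 8 ))   # GB, high end: most cores at 8 GB each
    echo "cores=${cores_min}-${cores_max} ram_gb=${ram_min}-${ram_max}"
}

# Example: 10 expected concurrent users
login_node_sizing 10
```

For 10 concurrent users this prints cores=20-40 ram_gb=80-320, a total that can be split across the two redundant login nodes.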

Management Nodes

The management node (or controller node) runs cluster control services:

SLURM controller (slurmctld)
SLURM accounting database (slurmdbd)
LDAP or FreeIPA for user authentication
Monitoring stack (Prometheus, Alertmanager)
Configuration management (Ansible, Salt, Puppet)
Network boot and provisioning (xCAT, Warewulf)

Management nodes rarely need high CPU or RAM, but they must be highly available. If the management node fails and takes slurmctld down with it, no new jobs are scheduled until the controller returns, although jobs that are already running continue. Run slurmctld in active-passive HA mode, or at minimum ensure a fast restore from backup.

Compute Nodes

Compute nodes are where the science happens. They run user jobs under SLURM control. Key design considerations:

CPU: AMD EPYC or Intel Xeon chosen for the workload — see the CPU selection guide for details.
Memory: Typically 4–8 GB per core for general HPC, 16–32 GB per core for memory-intensive applications (genomics, large CFD).
Local storage: Fast NVMe scratch per node for job-local temporary files (avoids parallel filesystem contention).
Network: InfiniBand HCA for MPI traffic; separate 1GbE or 10GbE for management.
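
To turn the per-core memory guideline into a per-node figure, multiply by the core count. The sketch below assumes a hypothetical 64-core node; the function name is illustrative.

```shell
#!/bin/bash
# Per-node memory from cores per node and GB per core.

node_memory_gb() {
    local cores="$1" gb_per_core="$2"
    echo $(( cores * gb_per_core ))
}

# Hypothetical 64-core compute node
echo "general HPC:      $(node_memory_gb 64 4)-$(node_memory_gb 64 8) GB"
echo "memory-intensive: $(node_memory_gb 64 16)-$(node_memory_gb 64 32) GB"
```

Under that assumption a general-purpose node needs 256 to 512 GB, while a memory-intensive node needs 1024 to 2048 GB.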

All compute nodes should be as identical as possible to simplify provisioning, firmware management, and troubleshooting. Heterogeneous hardware is unavoidable as clusters grow, but each “generation” should be internally uniform.

GPU Nodes

GPU nodes are a specialized subclass of compute nodes with one or more accelerators (NVIDIA H100, A100, L40S). They require:

PCIe Gen4/5 or NVLink for GPU-to-GPU communication within the node
InfiniBand for GPU-to-GPU communication across nodes (via NCCL/RDMA)
CUDA toolkit and GPU-aware MPI libraries
DCGM for monitoring

SLURM manages GPUs as Generic Resources (GRES). Users request GPUs explicitly: #SBATCH --gres=gpu:a100:2.
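
A short job script can confirm that the requested GPUs were actually assigned. This is a sketch under two assumptions: the partition name gpu is hypothetical, and the site's GRES configuration exports CUDA_VISIBLE_DEVICES to the job, which is SLURM's usual behaviour for GPU GRES. The helper name is illustrative.

```shell
#!/bin/bash
#SBATCH --job-name=gpu_check
#SBATCH --partition=gpu          # hypothetical partition name
#SBATCH --nodes=1
#SBATCH --gres=gpu:a100:2
#SBATCH --time=00:10:00

# Count the entries in the comma-separated CUDA_VISIBLE_DEVICES list.
count_visible_gpus() {
    local devices="${CUDA_VISIBLE_DEVICES:-}"
    if [ -z "$devices" ]; then
        echo 0
        return
    fi
    # Turn the list into one entry per line and count the non-empty lines.
    printf '%s\n' "$devices" | tr ',' '\n' | grep -c .
}

echo "GPUs visible to this job: $(count_visible_gpus)"
```

Outside a GPU allocation the variable is normally unset and the script reports zero, which makes the same check usable as a quick sanity test on a login node.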

Storage Nodes

Storage nodes run the parallel filesystem services (BeeGFS, Lustre) or NFS/SMB for home directories. Key distinctions:

Scratch/work storage: Parallel filesystem on high-speed NVMe or SAS SSDs, mounted on all compute nodes. For short-lived job I/O.
Project/group storage: Capacity storage on HDDs for persistent data, managed with quotas.
Archive storage: Tape library or object storage for long-term data retention.

Network Architecture

A production HPC cluster uses at least three separate network fabrics:

Management network (1GbE or 10GbE Ethernet): IPMI/BMC out-of-band management, PXE boot, OS installation. Must remain functional even when the compute network fails.

Compute/MPI network (InfiniBand or RoCE): All MPI inter-process communication. Low latency (< 2 µs) and high bandwidth (100–400 Gb/s) are essential. Never shared with storage or management traffic.

Storage network (10GbE or 25GbE Ethernet, or InfiniBand): BeeGFS or Lustre I/O traffic. Dedicated to prevent storage contention from affecting MPI performance.

Example SLURM Job Script

#!/bin/bash
#SBATCH --job-name=cfd_run
#SBATCH --partition=compute
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=64
#SBATCH --mem=256G
#SBATCH --time=24:00:00
#SBATCH --output=cfd_%j.out
#SBATCH --error=cfd_%j.err

# Load software environment
module load openmpi/4.1.6
module load openfoam/v2312

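# Use local NVMe scratch for temporary files. The batch script itself runs
# only on the first node, so create the directory on every allocated node.
export TMPDIR="/scratch/${SLURM_JOB_ID}"
srun --ntasks="$SLURM_JOB_NUM_NODES" --ntasks-per-node=1 mkdir -p "$TMPDIR"

# Run CFD simulation (8 nodes x 64 tasks per node = 512 ranks)
cd "$SLURM_SUBMIT_DIR"
mpirun -np 512 simpleFoam -parallel

# Clean up local scratch on every node
srun --ntasks="$SLURM_JOB_NUM_NODES" --ntasks-per-node=1 rm -rf "$TMPDIR"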
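
The job script above would be submitted from a login node with sbatch. The wrapper below is a sketch that captures the job ID for later squeue and scancel calls; the function name and the file name cfd_run.sh are assumptions for illustration.

```shell
#!/bin/bash
# Submit a job script and print only the numeric job ID.
# `sbatch --parsable` prints "jobid" or "jobid;cluster".

submit_job() {
    local script="$1"
    local out
    out="$(sbatch --parsable "$script")" || return 1
    echo "${out%%;*}"    # strip the optional ";cluster" suffix
}

# Usage on a login node (cfd_run.sh is a hypothetical file name):
#   jobid="$(submit_job cfd_run.sh)"
#   squeue -j "$jobid"     # check the job's queue state
#   scancel "$jobid"       # cancel the job if needed
```

Capturing the ID at submission time avoids scraping it from squeue output afterwards.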

Architectural Design Principles

Separation of concerns: Never mix management, compute, and storage traffic on the same physical network. Interference between traffic types leads to unpredictable performance degradation.

Homogeneity within generations: Identical hardware within each generation simplifies firmware management, performance debugging, and node replacement. Mixing CPU generations in a single partition makes job runtimes depend on which nodes the scheduler happens to assign, and a tightly coupled MPI job runs only as fast as its slowest node.

Plan for failure: Storage nodes and management nodes need HA configurations. Compute node failures are expected: SLURM takes failed nodes out of service (down or drained) and, where requeue is enabled, puts the affected jobs back in the queue.
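
Administrators can also drain a node by hand when it shows early signs of trouble, before it fails under a running job. The wrappers below are a sketch around scontrol update; the function names, node name, and reason text are illustrative assumptions.

```shell
#!/bin/bash
# Drain a node: it accepts no new jobs, and running jobs finish first.
drain_node() {
    local node="$1" reason="$2"
    scontrol update NodeName="$node" State=DRAIN Reason="$reason"
}

# Return a repaired node to service.
resume_node() {
    local node="$1"
    scontrol update NodeName="$node" State=RESUME
}

# Usage (hypothetical node name and reason):
#   drain_node node042 "ECC errors on DIMM A3"
#   resume_node node042
```

Recording a specific reason pays off later, because the reason is stored with the node state and tells the next operator why the node is out of service.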

Scale-out, not scale-up: HPC performance scales better by adding more nodes than by buying larger individual servers. Design the network and storage to support expansion without redesign.

Baseline monitoring from day one: Deploy Prometheus, Grafana, and SLURM exporter before the first production job. Debugging performance problems without historical metrics is an order of magnitude harder.

Every architectural decision in an HPC cluster has downstream consequences on performance, cost, and operational complexity. Mevasis applies this architectural framework on every cluster deployment. Visit our HPC Cluster Solutions page or contact us to discuss your specific requirements.
