CPU Cluster Technical Guide: AMD EPYC vs Intel Xeon, SLURM Configuration, and Workloads
Technical guide for CPU-based HPC clusters: AMD EPYC vs Intel Xeon comparison, suitable workloads (MPI simulations, CFD, Monte Carlo), SLURM configuration, InfiniBand, BeeGFS vs Lustre, NUMA optimization, and benchmark validation.
While GPU clusters dominate AI/ML workloads, a large share of scientific HPC jobs still runs on CPU clusters: fluid dynamics simulations, finite element analysis, Monte Carlo methods, bioinformatics pipelines, and quantum chemistry calculations. CPU clusters remain the workhorse of research computing, and the processor architecture you choose for your workload significantly affects both performance and cost.
Processor Architecture Comparison
AMD EPYC (Genoa/Turin Architecture)
AMD EPYC 9004 series (Genoa) and 9005 series (Turin) have become a common choice for new HPC deployments:
- Up to 192 cores per socket (9965 / Turin)
- DDR5 memory with 12-channel memory controller
- 128 PCIe Gen5 lanes per socket
- CCD (Core Complex Die) chiplet architecture with up to 12 CCDs per Genoa package
- NUMA topology: 1, 2, or 4 NUMA domains per socket (NPS1/NPS2/NPS4), or up to 12 when the BIOS "L3 as NUMA" option exposes each CCD as its own domain
The high core count makes EPYC attractive for workloads that scale well across many cores. The chiplet architecture makes NUMA effects more pronounced than on monolithic designs, so explicit binding (numactl or the MPI launcher's binding options) is essential.
Best for: MPI simulations with many independent processes, Monte Carlo sampling, bioinformatics (BWA, GATK), EDA/CAE software that is licensed per socket.
Intel Xeon 6 (Sierra Forest / Granite Rapids)
Intel Xeon 6 (Sierra Forest and Granite Rapids) continues to hold advantages in specific workloads:
- Up to 128 cores per socket (Granite Rapids)
- Integrated AMX (Advanced Matrix Extensions) for AI inference on CPUs (P-core parts such as Granite Rapids)
- Strong single-threaded performance
- Mature ecosystem with Intel oneAPI, MKL, and the icx/ifx compilers (successors to the classic ICC)
Best for: Applications that rely on Intel MKL (BLAS, LAPACK, FFT), workloads with strong single-thread performance requirements, mixed HPC/AI inference workloads.
ARM-based HPC (Ampere Altra, AWS Graviton3, Fujitsu A64FX)
ARM processors are increasingly viable for HPC:
- Ampere Altra Max: up to 128 cores per socket (80 on the original Altra), low power, single-NUMA simplicity
- AWS Graviton3: strong HPC performance in cloud burst scenarios
- Fujitsu A64FX: 512-bit SVE SIMD, native HBM2 memory, used in Fugaku
Best for: Power-constrained deployments, cloud bursting, applications with portable vectorization.
Workloads Well-Suited for CPU Clusters
| Workload | CPU Architecture | Key Optimization |
|---|---|---|
| CFD (OpenFOAM, Fluent, STAR-CCM+) | EPYC/Xeon | MPI binding, InfiniBand, large memory |
| FEM (ANSYS Mechanical, LS-DYNA) | Xeon + MKL | Intel MKL BLAS, PML memory |
| Molecular dynamics (GROMACS, NAMD) | EPYC | AVX-512, InfiniBand, NVMe scratch |
| Monte Carlo (MCNP, OpenMC) | EPYC (high core count) | Thread-parallel, large memory |
| Genomics (BWA, GATK, SPAdes) | EPYC high-memory | Large RAM per node, fast scratch |
| Quantum chemistry (Gaussian, ORCA) | Xeon + MKL | Fast memory bandwidth, NUMA binding |
| Seismic processing (RSF, Madagascar) | EPYC | MPI + OpenMP hybrid, parallel I/O |
SLURM Configuration for CPU Clusters
# /etc/slurm/slurm.conf
ClusterName=hpc-cpu
# SlurmctldHost replaces the deprecated ControlMachine/ControlAddr pair
SlurmctldHost=mgmt01(10.0.1.10)
# Enable cgroup resource isolation
TaskPlugin=task/cgroup,task/affinity
ProctrackType=proctrack/cgroup
# CPU affinity for MPI jobs is provided by the task/affinity plugin above
# (TaskAffinity is not a slurm.conf parameter)
# Node definitions (dual-socket AMD EPYC 9654 example: 2 x 96 cores, 384 GB)
NodeName=cn[01-64] \
CPUs=192 \
Boards=1 \
SocketsPerBoard=2 \
CoresPerSocket=96 \
ThreadsPerCore=1 \
RealMemory=384000 \
State=UNKNOWN
# Partition design
PartitionName=short Nodes=cn[01-64] Default=YES MaxTime=24:00:00 State=UP
PartitionName=long Nodes=cn[01-64] Default=NO MaxTime=7-00:00:00 State=UP
PartitionName=debug Nodes=cn[01-02] Default=NO MaxTime=00:30:00 MaxNodes=2 State=UP
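A mismatch between CPUs and the socket/core/thread counts is an easy mistake in a node definition. The sketch below checks that arithmetic for the example node; the function names `expected_cpus` and `check_node_cpus` are hypothetical helpers, not Slurm tools. On a real node, `slurmd -C` prints the hardware configuration that Slurm detects, which is usually the better starting point.

```shell
# expected_cpus BOARDS SOCKETS_PER_BOARD CORES_PER_SOCKET THREADS_PER_CORE
# Prints the CPU count implied by the topology fields.
expected_cpus() {
    echo $(( $1 * $2 * $3 * $4 ))
}

# check_node_cpus CPUS BOARDS SOCKETS_PER_BOARD CORES_PER_SOCKET THREADS_PER_CORE
# Succeeds when CPUs matches the topology, fails otherwise.
check_node_cpus() {
    local want
    want=$(expected_cpus "$2" "$3" "$4" "$5")
    if [ "$1" -eq "$want" ]; then
        echo "ok: CPUs=$1"
    else
        echo "mismatch: CPUs=$1 but topology implies $want"
        return 1
    fi
}

# Values from the node definition: 1 board, 2 sockets, 96 cores, 1 thread per core
check_node_cpus 192 1 2 96 1
```

With ThreadsPerCore=2 (SMT enabled) the same node would need CPUs=384.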
NUMA Optimization
AMD EPYC’s multi-NUMA design requires careful process pinning for optimal memory bandwidth:
# Show NUMA topology
numactl --hardware
lscpu | grep NUMA
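# Run GROMACS with NUMA-aware MPI binding: 32 ranks x 6 cores = 192 cores
# (Open MPI 4.x syntax; Open MPI 5 names the socket object "package")
mpirun -np 32 \
--map-by socket:PE=6 \
--bind-to core \
--rank-by core \
gmx_mpi mdrun -ntomp 6 -v -deffnm production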
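#!/bin/bash
# SLURM job with explicit core binding: one MPI rank per core
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=192
#SBATCH --cpus-per-task=1
# --mem=0 requests all memory on each node; --mem=384G (393216 MB) would exceed RealMemory=384000
#SBATCH --mem=0
export OMP_NUM_THREADS=1
srun --cpu-bind=cores ./my_simulation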
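Memory can also be requested per core with sbatch's `--mem-per-cpu` option, which keeps the request valid when the task count changes. A minimal sketch, assuming the RealMemory=384000 and CPUs=192 values from the node definition; `mem_per_cpu_mb` is a hypothetical helper name:

```shell
# mem_per_cpu_mb REALMEMORY_MB CPUS
# Prints the whole number of MB available per core, rounded down.
mem_per_cpu_mb() {
    echo $(( $1 / $2 ))
}

# Emit a directive sized for the example node (384000 MB over 192 cores)
echo "#SBATCH --mem-per-cpu=$(mem_per_cpu_mb 384000 192)M"
```

Rounding down keeps the total request at or below RealMemory, so the scheduler can still place one task on every core.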
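For hybrid MPI+OpenMP applications, map one MPI rank to each NUMA domain and fill that domain with OpenMP threads. The example assumes NPS1, where each socket is a single NUMA domain:
# 2-socket EPYC 9654 (NPS1): 2 MPI ranks per node, 96 threads each
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=96
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PLACES=cores
export OMP_PROC_BIND=close
# Slurm 22.05 and later: srun no longer inherits --cpus-per-task from sbatch
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
srun --cpu-bind=sockets ./openfoam_hybrid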
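The one-rank-per-NUMA-domain rule generalizes to other NUMA settings. The sketch below derives the job directives from the core count and the number of NUMA domains per node; `hybrid_layout` is a hypothetical helper, and the domain counts (2 for NPS1, 8 for NPS4 on two sockets) are assumptions to confirm with `numactl --hardware`.

```shell
# hybrid_layout CORES_PER_NODE NUMA_DOMAINS_PER_NODE
# Prints sbatch/OpenMP settings for one MPI rank per NUMA domain.
hybrid_layout() {
    local cores=$1 domains=$2
    if [ "$domains" -le 0 ] || [ $((cores % domains)) -ne 0 ]; then
        echo "error: $cores cores do not split evenly into $domains domains" >&2
        return 1
    fi
    local threads=$((cores / domains))
    printf '#SBATCH --ntasks-per-node=%d\n' "$domains"
    printf '#SBATCH --cpus-per-task=%d\n' "$threads"
    printf 'export OMP_NUM_THREADS=%d\n' "$threads"
}

hybrid_layout 192 2   # NPS1 on two sockets: 2 ranks x 96 threads
hybrid_layout 192 8   # NPS4 on two sockets: 8 ranks x 24 threads
```

More ranks with fewer threads each usually keeps memory accesses local to the rank's own domain, at the cost of more MPI traffic inside the node.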
Network: InfiniBand for MPI
Tightly coupled MPI simulations need a low-latency interconnect between nodes, and InfiniBand is the usual choice:
# Verify InfiniBand is active and at expected speed
ibstat
# -v is required: the default output omits width, speed, and message size
ibv_devinfo -v | grep -E "state|max_msg_sz|active_width|active_speed"
# Test bandwidth between two nodes
# Server side:
ib_write_bw -d mlx5_0 -i 1
# Client side:
ib_write_bw -d mlx5_0 -i 1 <server_hostname>
# Expected for HDR200: ~23 GB/s effective
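To judge a measured result, compare it with the raw link rate. The sketch below converts a link speed in Gb/s to GB/s and expresses a measurement as a percentage of it; the helper names are hypothetical, and the measured value is assumed to be in GB/s already (ib_write_bw reports MB/s by default, or Gb/s with `--report_gbits`).

```shell
# link_gbytes_per_s LINK_GBITS
# Prints the raw link rate in GB/s (8 bits per byte).
link_gbytes_per_s() {
    awk -v gbits="$1" 'BEGIN { printf "%.1f\n", gbits / 8 }'
}

# link_efficiency MEASURED_GBYTES LINK_GBITS
# Prints the measured bandwidth as a percentage of the raw link rate.
link_efficiency() {
    awk -v m="$1" -v gbits="$2" 'BEGIN { printf "%.0f\n", 100 * m / (gbits / 8) }'
}

link_gbytes_per_s 200    # HDR raw rate in GB/s
link_efficiency 23 200   # the ~23 GB/s figure as a share of that rate
```

A result far below the expected share often points to a link that negotiated a lower width or speed, which the ibstat output would show.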
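Configure Open MPI to use InfiniBand explicitly. Open MPI 4.x and later reach InfiniBand through UCX; the older openib BTL was removed in Open MPI 5:
# ~/.openmpi/mca-params.conf
# Select the UCX PML, which drives InfiniBand natively
pml = ucx
# Disable the TCP transport so jobs do not silently fall back to Ethernet
btl = ^tcp
# Open MPI 4.x only, legacy openib BTL:
# btl_openib_allow_ib = 1
# btl_openib_receive_queues = P,128,256,192,128:S,2048,256,64,32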
Storage: BeeGFS vs Lustre
Both parallel filesystems are suitable for CPU cluster scratch storage:
| Feature | BeeGFS | Lustre |
|---|---|---|
| Setup complexity | Lower | Higher |
| Management | beegfs-ctl | lctl / lfs |
| Scalability | Petabyte scale | Multi-PB |
| HA (mirroring) | Buddy Mirror | Failover pairs on shared storage; File Level Redundancy (FLR) |
| Metadata scaling | Multiple metadata servers | Single MDT, or multiple MDTs with DNE |
| Small file performance | Good | Variable |
| Typical cluster size | 8–500 nodes | 100–100,000 nodes |
For clusters up to ~500 nodes, BeeGFS offers lower operational overhead. For very large clusters or those requiring enterprise support, Lustre is the traditional choice.
Benchmark Validation
Before production deployment, run these benchmarks to validate performance:
# HPL efficiency check
mpirun -np 192 ./xhpl # target: > 80% of peak TFlops
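# STREAM memory bandwidth (all cores of the dual-socket node, threads pinned)
export OMP_NUM_THREADS=192
export OMP_PLACES=cores
export OMP_PROC_BIND=spread
./stream # target: > 80% of theoretical peak memory bandwidth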
# IMB network latency
mpirun -np 2 -host cn01,cn02 ./IMB-MPI1 PingPong
# target: < 2 µs for InfiniBand HDR
# IOR parallel I/O
# -e forces fsync after writes; -C shifts ranks on read-back to avoid reading from cache
mpirun -np 64 ./ior -t 1m -b 8g -s 1 -F -w -r -e -C -o /scratch/ior_test
# target: aggregate > design bandwidth
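The HPL and STREAM targets are percentages of a theoretical peak, so it helps to compute that peak first. The sketch below does so for the example node under stated assumptions: a 2.4 GHz base clock, 16 double-precision FLOPs per cycle per core, and 12 channels of DDR5-4800 per socket. These figures are assumptions about the EPYC 9654 rather than values from this guide, the helper names are hypothetical, and the measured values in the usage lines are placeholders, not benchmark results.

```shell
# peak_gflops SOCKETS CORES_PER_SOCKET GHZ FLOPS_PER_CYCLE
# Prints theoretical double-precision peak in GFLOPS.
peak_gflops() {
    awk -v s="$1" -v c="$2" -v f="$3" -v w="$4" \
        'BEGIN { printf "%.1f\n", s * c * f * w }'
}

# peak_mem_gbs SOCKETS CHANNELS_PER_SOCKET MEGATRANSFERS_PER_S
# Prints theoretical memory bandwidth in GB/s (8 bytes per transfer).
peak_mem_gbs() {
    awk -v s="$1" -v ch="$2" -v mt="$3" \
        'BEGIN { printf "%.1f\n", s * ch * mt * 8 / 1000 }'
}

# percent_of MEASURED PEAK
# Prints MEASURED as a percentage of PEAK.
percent_of() {
    awk -v m="$1" -v p="$2" 'BEGIN { printf "%.1f\n", 100 * m / p }'
}

peak_gflops 2 96 2.4 16     # assumed dual-socket node peak, GFLOPS
peak_mem_gbs 2 12 4800      # assumed dual-socket node peak, GB/s
percent_of 6000 7372.8      # placeholder HPL result in GFLOPS
```

Using the base clock gives a conservative peak; sustained all-core clocks under AVX-512 load differ by part and cooling, so the resulting efficiency is only as good as the clock assumption.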
CPU cluster design requires matching processor architecture, memory configuration, interconnect, and storage to your specific workload mix.