CPU Cluster Technical Guide: AMD EPYC vs Intel Xeon, SLURM Configuration, and Workloads

Technical guide for CPU-based HPC clusters: AMD EPYC vs Intel Xeon comparison, suitable workloads (MPI simulations, CFD, Monte Carlo), SLURM configuration, InfiniBand, BeeGFS vs Lustre, NUMA optimization, and benchmark validation.

While GPU clusters dominate AI/ML workloads, a large share of scientific HPC jobs still run on CPU clusters: fluid dynamics simulations, finite element analysis, Monte Carlo methods, bioinformatics pipelines, and quantum chemistry calculations. CPU clusters remain the workhorse of research computing, and choosing the right processor architecture for your workload has a significant impact on both performance and cost.

Processor Architecture Comparison

AMD EPYC (Genoa/Turin Architecture)

AMD EPYC 9004 series (Genoa) and 9005 series (Turin) have become a leading choice for new HPC deployments:

  • Up to 192 cores per socket (9965 / Turin)
  • DDR5 memory with 12-channel memory controller
  • 128 PCIe Gen5 lanes per socket
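  • CCD (Core Complex Die) chiplet architecture with up to 12 CCDs per Genoa package
  • NUMA topology: 1, 2, or 4 NUMA domains per socket (NPS1/NPS2/NPS4 BIOS setting), or one per CCD (up to 12 on Genoa) when the L3-as-NUMA BIOS option is enabled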

The high core count makes EPYC attractive for workloads that scale well across many cores. Because of the chiplet architecture, NUMA effects are more pronounced than on monolithic designs, so binding processes and their memory (for example with numactl) is essential.
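To put the 12-channel memory controller in numbers, the sketch below estimates theoretical peak memory bandwidth per socket. It assumes DDR5-4800 and 8 bytes per channel per transfer; the function name `peak_mem_bw_gbs` is illustrative and not part of any tool.

```bash
# Theoretical peak memory bandwidth per socket, in GB/s.
# Assumption: each DDR5 channel moves 8 bytes per transfer (64-bit data path).
# Usage: peak_mem_bw_gbs <channels> <transfer_rate_MT/s>
peak_mem_bw_gbs() {
  local channels=$1 mts=$2
  awk -v c="$channels" -v r="$mts" 'BEGIN { printf "%.1f\n", c * r * 8 / 1000 }'
}

peak_mem_bw_gbs 12 4800   # 12 channels of DDR5-4800 -> 460.8
```

Measured bandwidth will be lower than this figure, and it only approaches it when every NUMA domain has threads reading from its own local memory.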

Best for: MPI simulations with many independent processes, Monte Carlo sampling, bioinformatics (BWA, GATK), EDA/CAE licensing that charges per socket.

Intel Xeon 6 (Sierra Forest / Granite Rapids)

Intel Xeon 6 (Sierra Forest and Granite Rapids) continues to hold advantages in specific workloads:

  • Up to 128 cores per socket (Granite Rapids)
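  • Integrated AMX (Advanced Matrix Extensions) for AI inference on CPUs (P-core parts such as Granite Rapids)
  • Strong single-threaded performance
  • Mature ecosystem with Intel oneAPI, MKL, and the Intel compilers (icx/ifx, the successors to the classic ICC)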

Best for: Applications that rely on Intel MKL (BLAS, LAPACK, FFT), workloads with strong single-thread performance requirements, mixed HPC/AI inference workloads.

ARM-based HPC (Ampere Altra, AWS Graviton3, Fujitsu A64FX)

ARM processors are increasingly viable for HPC:

  • Ampere Altra Max: up to 128 cores per socket, low power, single-NUMA simplicity
  • AWS Graviton3: strong HPC performance in cloud burst scenarios
  • Fujitsu A64FX: 512-bit SVE SIMD, native HBM2 memory, used in Fugaku

Best for: Power-constrained deployments, cloud bursting, applications with portable vectorization.

Workloads Well-Suited for CPU Clusters

| Workload | CPU Architecture | Key Optimization |
|---|---|---|
| CFD (OpenFOAM, Fluent, STAR-CCM+) | EPYC/Xeon | MPI binding, InfiniBand, large memory |
| FEM (ANSYS Mechanical, LS-DYNA) | Xeon + MKL | Intel MKL BLAS, PML memory |
| Molecular dynamics (GROMACS, NAMD) | EPYC | AVX-512, InfiniBand, NVMe scratch |
| Monte Carlo (MCNP, OpenMC) | EPYC (high core count) | Thread-parallel, large memory |
| Genomics (BWA, GATK, SPAdes) | EPYC high-memory | Large RAM per node, fast scratch |
| Quantum chemistry (Gaussian, ORCA) | Xeon + MKL | Fast memory bandwidth, NUMA binding |
| Seismic processing (RSF, Madagascar) | EPYC | MPI + OpenMP hybrid, parallel I/O |

SLURM Configuration for CPU Clusters

# /etc/slurm/slurm.conf
ClusterName=hpc-cpu
# SlurmctldHost replaces the deprecated ControlMachine/ControlAddr pair
SlurmctldHost=mgmt01(10.0.1.10)

# Enable cgroup resource isolation
TaskPlugin=task/cgroup,task/affinity
ProctrackType=proctrack/cgroup

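# CPU affinity for MPI jobs comes from task/affinity in TaskPlugin above.
# TaskAffinity was a cgroup.conf option in older Slurm releases, not a slurm.conf parameter.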

# Node definitions (dual-socket AMD EPYC 9654 example: 2 x 96 cores, 384 GB)
NodeName=cn[01-64] \
  CPUs=192 \
  Boards=1 \
  SocketsPerBoard=2 \
  CoresPerSocket=96 \
  ThreadsPerCore=1 \
  RealMemory=384000 \
  State=UNKNOWN

# Partition design
PartitionName=short  Nodes=cn[01-64] Default=YES MaxTime=24:00:00 State=UP
PartitionName=long   Nodes=cn[01-64] Default=NO  MaxTime=7-00:00:00 State=UP
PartitionName=debug  Nodes=cn[01-02] Default=NO  MaxTime=00:30:00 MaxNodes=2
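A mismatch between CPUs and the socket/core/thread counts is an easy mistake to make in a node definition. The sketch below checks the arithmetic before the file is deployed; `check_node_cpus` is a hypothetical helper, not a Slurm command.

```bash
# Check that CPUs equals Boards x SocketsPerBoard x CoresPerSocket x ThreadsPerCore.
# Usage: check_node_cpus <CPUs> <Boards> <SocketsPerBoard> <CoresPerSocket> <ThreadsPerCore>
check_node_cpus() {
  local cpus=$1 boards=$2 sockets=$3 cores=$4 threads=$5
  local expected=$(( boards * sockets * cores * threads ))
  if (( cpus == expected )); then
    echo "OK: CPUs=$cpus"
  else
    echo "MISMATCH: CPUs=$cpus, topology implies $expected"
    return 1
  fi
}

check_node_cpus 192 1 2 96 1   # values from the cn[01-64] definition
```

On a live compute node, `slurmd -C` prints the hardware configuration as Slurm detects it, which can be compared against the values in the node definition.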

NUMA Optimization

AMD EPYC’s multi-NUMA design requires careful process pinning for optimal memory bandwidth:

# Show NUMA topology
numactl --hardware
lscpu | grep NUMA

# Run GROMACS with NUMA-aware MPI binding:
# 16 ranks placed round-robin across NUMA domains, each bound to 6 cores and running 6 OpenMP threads
mpirun -np 16 \
  --map-by numa:PE=6 \
  --bind-to core \
  gmx_mpi mdrun -ntomp 6 -v -deffnm production

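#!/bin/bash
# SLURM job with explicit core binding: one MPI rank per core on 192-core nodes
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=192
#SBATCH --cpus-per-task=1
# --mem=0 requests all memory on each node; 384G would exceed RealMemory=384000 (MB)
#SBATCH --mem=0

export OMP_NUM_THREADS=1
srun --cpu-bind=core ./my_simulation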

For hybrid MPI+OpenMP applications, map one MPI process per NUMA domain and use OpenMP threads to fill the NUMA domain:

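# 2-socket EPYC 9654 in NPS1 (one NUMA domain per socket): 2 MPI ranks per node, 96 threads each
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=96
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=close
export OMP_PLACES=cores
# Since Slurm 22.05, srun no longer inherits --cpus-per-task from sbatch, so pass it explicitly
srun --cpus-per-task=$SLURM_CPUS_PER_TASK --cpu-bind=socket ./openfoam_hybrid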
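The rank and thread counts above follow from the node's NUMA layout. As a rough sketch, assuming the number of NUMA domains per socket is known (on EPYC this depends on the BIOS NPS setting), the layout can be derived as follows; `hybrid_layout` is an illustrative name.

```bash
# Derive a hybrid MPI+OpenMP layout: one rank per NUMA domain,
# one OpenMP thread per core in that domain.
# Usage: hybrid_layout <sockets> <cores_per_socket> <numa_domains_per_socket>
hybrid_layout() {
  local sockets=$1 cores=$2 nps=$3
  if (( cores % nps != 0 )); then
    echo "cores per socket ($cores) not divisible by NUMA domains ($nps)" >&2
    return 1
  fi
  echo "--ntasks-per-node=$(( sockets * nps )) --cpus-per-task=$(( cores / nps ))"
}

hybrid_layout 2 96 1   # one domain per socket:   --ntasks-per-node=2 --cpus-per-task=96
hybrid_layout 2 96 4   # four domains per socket: --ntasks-per-node=8 --cpus-per-task=24
```

The second call shows how the same node would be laid out with four NUMA domains per socket: more ranks, each with fewer threads, so that no rank's threads span a domain boundary.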

Network: InfiniBand for MPI

CPU clusters running tightly coupled MPI simulations need a low-latency interconnect for inter-node communication, and InfiniBand is the usual choice:

# Verify InfiniBand is active and at expected speed
ibstat
ibv_devinfo -v | grep -E "state|max_msg_sz|active_width|active_speed"

# Test bandwidth between two nodes
# Server side:
ib_write_bw -d mlx5_0 -i 1

# Client side:
ib_write_bw -d mlx5_0 -i 1 <server_hostname>

# Expected for HDR200: ~23 GB/s effective
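The expected figure can be sanity-checked against the link's line rate. The helper below is a minimal sketch (the name `link_efficiency` is made up for illustration) that converts a line rate in Gb/s to GB/s and reports what fraction of it a measured bandwidth represents.

```bash
# Convert an InfiniBand line rate (Gb/s) to GB/s and report what fraction
# of that peak a measured bandwidth (GB/s) represents.
# Usage: link_efficiency <line_rate_Gbps> <measured_GBps>
link_efficiency() {
  awk -v rate="$1" -v meas="$2" 'BEGIN {
    peak = rate / 8                      # 8 bits per byte
    printf "peak=%.1f GB/s efficiency=%.0f%%\n", peak, 100 * meas / peak
  }'
}

link_efficiency 200 23   # HDR: peak=25.0 GB/s efficiency=92%
```

A result far below the line rate usually points at a link that negotiated a lower width or speed, which the `ibstat` output would confirm.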

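Configure Open MPI to use InfiniBand explicitly:

# ~/.openmpi/mca-params.conf
# Current Open MPI releases reach InfiniBand through UCX
pml = ucx
# Exclude the TCP BTL so MPI traffic cannot silently fall back to Ethernet
btl = ^tcp
# Legacy openib BTL settings (Open MPI 4.0.x only; the openib BTL was removed in 5.0):
# btl_openib_allow_ib = 1
# btl_openib_receive_queues = P,128,256,192,128:S,2048,256,64,32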

Storage: BeeGFS vs Lustre

Both parallel filesystems are suitable for CPU cluster scratch storage:

| Feature | BeeGFS | Lustre |
|---|---|---|
| Setup complexity | Lower | Higher |
| Management | beegfs-ctl | lctl / lfs |
| Scalability | Up to ~PB | Multi-PB |
| HA (mirroring) | Buddy Mirror | Server failover pairs, File Level Redundancy (FLR) mirroring |
| Metadata scaling | Multiple metadata servers | Single or distributed MDT |
| Small file performance | Good | Variable |
| Typical cluster size | 8–500 nodes | 100–100,000 nodes |

For clusters up to ~500 nodes, BeeGFS offers lower operational overhead. For very large clusters or those requiring enterprise support, Lustre is the traditional choice.

Benchmark Validation

Before production deployment, run these benchmarks to validate performance:

# HPL efficiency check
mpirun -np 192 ./xhpl    # target: > 80% of peak TFlops

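# STREAM memory bandwidth (threads pinned and spread across both sockets)
export OMP_NUM_THREADS=96
export OMP_PROC_BIND=spread
export OMP_PLACES=cores
./stream    # target: roughly 80% or more of theoretical peak bandwidth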

# IMB network latency
mpirun -np 2 -host cn01,cn02 ./IMB-MPI1 PingPong
# target: < 2 µs for InfiniBand HDR

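# IOR parallel I/O
# -e fsyncs after the write phase; -C shifts task order for the read phase so
# data is not read back from the writing node's page cache (needs 2+ nodes)
mpirun -np 64 ./ior -t 1m -b 8g -s 1 -F -C -e -w -r -o /scratch/ior_test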
# target: aggregate > design bandwidth
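The HPL target is a percentage of theoretical peak, so the peak has to be computed first. The sketch below assumes a dual-socket EPYC 9654 node at its 2.4 GHz base clock and 16 double-precision FLOPs per cycle per core; the measured value of 6000 GFLOPS is hypothetical, and both function names are illustrative.

```bash
# Theoretical double-precision peak for a node, in GFLOPS (integer, truncated).
# Assumptions: clock in MHz; flops_per_cycle is per core (16 is a common
# figure for cores with two 256-bit FMA pipes, such as Zen 4).
# Usage: peak_gflops <total_cores> <clock_MHz> <flops_per_cycle>
peak_gflops() {
  echo $(( $1 * $2 * $3 / 1000 ))
}

# Measured result as an integer percentage of peak (truncated).
# Usage: efficiency_pct <measured> <peak>
efficiency_pct() {
  echo $(( 100 * $1 / $2 ))
}

peak=$(peak_gflops 192 2400 16)   # 2 x 96 cores at 2.4 GHz -> 7372
efficiency_pct 6000 "$peak"       # hypothetical 6000 GFLOPS HPL result -> 81
```

Using the base clock gives a conservative peak; sustained all-core clocks under HPL can differ, so treat the percentage as a guide for spotting misconfigured nodes, not as an exact figure.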

CPU cluster design requires matching processor architecture, memory configuration, interconnect, and storage to your specific workload mix. Contact Mevasis for a workload-driven architecture recommendation and sizing analysis.