CPU HPC Cluster Guide: AMD EPYC vs Intel Xeon for HPC Workloads

While GPU clusters dominate AI/ML workloads, much of scientific HPC still runs on CPU clusters: fluid dynamics simulations, finite element analysis, Monte Carlo methods, bioinformatics pipelines, and quantum chemistry calculations. CPU clusters remain the workhorse of research computing, and the processor architecture you choose affects both performance and cost.

Processor Architecture Comparison

AMD EPYC (Genoa/Turin Architecture)

AMD EPYC 9004 series (Genoa) and 9005 series (Turin) are a common choice for new HPC deployments:

Up to 192 cores per socket (9965 / Turin)
DDR5 memory with 12-channel memory controller
128 PCIe Gen5 lanes per socket
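CCD (Core Complex Die) chiplet architecture with up to 12 CCDs per Genoa package (8 cores per CCD)
NUMA topology: 1, 2, or 4 NUMA domains per socket depending on the NPS (nodes per socket) BIOS setting, or one domain per CCD (up to 12 on Genoa) when the L3-as-NUMA option is enabled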

The high core count makes EPYC attractive for workloads that scale well across many cores. Because cores and memory channels are spread across chiplets, NUMA effects are more pronounced than on monolithic designs, so binding processes and their memory to NUMA domains (for example with numactl) is essential.
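
As a rough illustration of the binding mentioned above, the sketch below builds a numactl prefix that pins a process and its memory allocations to one NUMA domain. The function name numa_prefix and the round-robin placement (local rank modulo domain count) are assumptions made for this example; a block placement may suit some codes better.

```shell
# Print a numactl prefix that pins a process and its memory to one NUMA domain.
#   $1 = local rank on the node (e.g. $SLURM_LOCALID or $OMPI_COMM_WORLD_LOCAL_RANK)
#   $2 = number of NUMA domains on the node (first line of `numactl --hardware`)
# Placement is round-robin: local rank modulo domain count (an assumption).
numa_prefix() {
  local node=$(( $1 % $2 ))
  echo "numactl --cpunodebind=${node} --membind=${node}"
}

# Local rank 5 on a node that exposes 4 NUMA domains lands on domain 1.
numa_prefix 5 4
```

In a wrapper script the printed prefix would be placed in front of the application command, so each rank starts already confined to its domain.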

Best for: MPI simulations with many independent processes, Monte Carlo sampling, bioinformatics (BWA, GATK), EDA/CAE software that is licensed per socket.

Intel Xeon 6 (Sierra Forest / Granite Rapids)

Intel Xeon 6 (Sierra Forest and Granite Rapids) continues to hold advantages in specific workloads:

Up to 128 P-cores per socket (Granite Rapids)
Integrated AMX (Advanced Matrix Extensions) for AI inference on CPUs (P-core Granite Rapids parts)
Strong single-threaded performance
Mature ecosystem with Intel oneAPI: oneMKL and the icx/ifx compilers (successors to the classic ICC)

Best for: Applications that rely on Intel MKL (BLAS, LAPACK, FFT), workloads with strong single-thread performance requirements, mixed HPC/AI inference workloads.

ARM-based HPC (Ampere Altra, AWS Graviton3, Fujitsu A64FX)

ARM processors are increasingly viable for HPC:

Ampere Altra Max: up to 128 cores per socket, low power, single-NUMA simplicity
AWS Graviton3: strong HPC performance in cloud burst scenarios
Fujitsu A64FX: 512-bit SVE SIMD, native HBM2 memory, used in Fugaku

Best for: Power-constrained deployments, cloud bursting, applications with portable vectorization.

Workloads Well-Suited for CPU Clusters

Workload	CPU Architecture	Key Optimization
CFD (OpenFOAM, Fluent, STAR-CCM+)	EPYC/Xeon	MPI binding, InfiniBand, large memory
FEM (ANSYS Mechanical, LS-DYNA)	Xeon + MKL	Intel MKL BLAS, PML memory
Molecular dynamics (GROMACS, NAMD)	EPYC	AVX-512, InfiniBand, NVMe scratch
Monte Carlo (MCNP, OpenMC)	EPYC (high core count)	Thread-parallel, large memory
Genomics (BWA, GATK, SPAdes)	EPYC high-memory	Large RAM per node, fast scratch
Quantum chemistry (Gaussian, ORCA)	Xeon + MKL	Fast memory bandwidth, NUMA binding
Seismic processing (RSF, Madagascar)	EPYC	MPI + OpenMP hybrid, parallel I/O

SLURM Configuration for CPU Clusters

# /etc/slurm/slurm.conf
ClusterName=hpc-cpu
# SlurmctldHost replaces the deprecated ControlMachine/ControlAddr pair
SlurmctldHost=mgmt01(10.0.1.10)

# Enable cgroup resource isolation
TaskPlugin=task/cgroup,task/affinity
ProctrackType=proctrack/cgroup

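# CPU affinity for MPI jobs is provided by task/affinity above
# (TaskAffinity is not a slurm.conf parameter).
# Schedule individual cores and memory as consumable resources.
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory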

# Node definitions (dual-socket AMD EPYC 9654 example: 2 x 96 cores, 384 GB; RealMemory is in MB)
NodeName=cn[01-64] \
  CPUs=192 \
  Boards=1 \
  SocketsPerBoard=2 \
  CoresPerSocket=96 \
  ThreadsPerCore=1 \
  RealMemory=384000 \
  State=UNKNOWN

# Partition design
PartitionName=short  Nodes=cn[01-64] Default=YES MaxTime=24:00:00 State=UP
PartitionName=long   Nodes=cn[01-64] Default=NO  MaxTime=7-00:00:00 State=UP
PartitionName=debug  Nodes=cn[01-02] Default=NO  MaxTime=00:30:00 MaxNodes=2 State=UP
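
A node definition is only accepted cleanly when CPUs agrees with the declared topology. The sketch below checks that arithmetic before the file is deployed. The helper name check_node_cpus is hypothetical; on a real node, `slurmd -C` prints the hardware configuration that slurmd detects, which is the more authoritative source.

```shell
# Check that CPUs equals Boards x SocketsPerBoard x CoresPerSocket x ThreadsPerCore.
#   $1 = CPUs, $2 = Boards, $3 = SocketsPerBoard, $4 = CoresPerSocket, $5 = ThreadsPerCore
check_node_cpus() {
  local expected=$(( $2 * $3 * $4 * $5 ))
  if [ "$1" -eq "$expected" ]; then
    echo "OK: CPUs=$1"
  else
    echo "MISMATCH: CPUs=$1, topology implies ${expected}"
    return 1
  fi
}

# Values from the NodeName line above: 1 board, 2 sockets, 96 cores, 1 thread per core.
check_node_cpus 192 1 2 96 1
```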

NUMA Optimization

AMD EPYC’s multi-NUMA design requires careful process pinning for optimal memory bandwidth:

# Show NUMA topology
numactl --hardware
lscpu | grep NUMA

# Run GROMACS with NUMA-aware MPI binding
# (16 ranks x 6 cores = 96 cores; scale -np to fill a larger node)
mpirun -np 16 \
  --map-by socket:PE=6 \
  --bind-to core \
  --rank-by core \
  gmx_mpi mdrun -ntomp 6 -v -deffnm production

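#!/bin/bash
# SLURM job with explicit core binding: one MPI rank per core on a 2 x 96-core node
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=192
#SBATCH --cpus-per-task=1
# --mem=0 requests all memory on each node; 384G (393216 MB) would exceed RealMemory=384000
#SBATCH --mem=0

export OMP_NUM_THREADS=1
srun --cpu-bind=cores ./my_simulation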
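
Memory requests are a common reason a job never starts: Slurm reads RealMemory in megabytes, while a G suffix on --mem means multiples of 1024 MB. The sketch below converts a request and compares it with a node's RealMemory. The helper names mem_to_mb and fits_node are hypothetical.

```shell
# Convert a Slurm-style memory string (384G, 512000M, 1T, or plain MB) to megabytes.
mem_to_mb() {
  case "$1" in
    *[Kk]) echo $(( ${1%?} / 1024 )) ;;
    *[Mm]) echo "${1%?}" ;;
    *[Gg]) echo $(( ${1%?} * 1024 )) ;;
    *[Tt]) echo $(( ${1%?} * 1024 * 1024 )) ;;
    *)     echo "$1" ;;   # no suffix: the number is already in megabytes
  esac
}

# Succeed if request $1 fits within a node whose RealMemory is $2 (MB).
fits_node() {
  [ "$(mem_to_mb "$1")" -le "$2" ]
}

mem_to_mb 384G   # prints 393216, which is larger than RealMemory=384000
```

With the node definition used in this guide, a request of 384G would not fit, while 370G (378880 MB) would.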

For hybrid MPI+OpenMP applications, map one MPI process per NUMA domain and use OpenMP threads to fill the NUMA domain:

# 2-socket EPYC 9654 in NPS1 (one NUMA domain per socket): 2 MPI ranks per node, 96 threads each
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=96
export OMP_NUM_THREADS=96
export OMP_PLACES=cores
export OMP_PROC_BIND=close
srun --cpu-bind=sockets ./openfoam_hybrid
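
The one-rank-per-NUMA-domain rule can be turned into job parameters mechanically. The following sketch derives the two sbatch options from the core count and the number of NUMA domains per node. The function name hybrid_layout is hypothetical, and the domain count should be taken from `numactl --hardware` on the actual nodes.

```shell
# Print sbatch options that place one MPI rank on each NUMA domain.
#   $1 = cores per node, $2 = NUMA domains per node
hybrid_layout() {
  if [ $(( $1 % $2 )) -ne 0 ]; then
    echo "error: cores do not divide evenly across NUMA domains" >&2
    return 1
  fi
  echo "--ntasks-per-node=$2 --cpus-per-task=$(( $1 / $2 ))"
}

hybrid_layout 192 2   # NPS1, two sockets: 2 ranks x 96 threads
hybrid_layout 192 8   # NPS4, two sockets: 8 ranks x 24 threads
```

OMP_NUM_THREADS would then be set to the same value as --cpus-per-task.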

Network: InfiniBand for MPI

CPU clusters running tightly coupled MPI simulations need a low-latency fabric for inter-node communication, and InfiniBand is the usual choice:

# Verify InfiniBand is active and at expected speed
ibstat
# (-v is needed: max_msg_sz, active_width and active_speed appear only in verbose output)
ibv_devinfo -v | grep -E "state|max_msg_sz|active_width|active_speed"

# Test bandwidth between two nodes
# Server side:
ib_write_bw -d mlx5_0 -i 1

# Client side:
ib_write_bw -d mlx5_0 -i 1 <server_hostname>

# Expected for HDR (200 Gb/s, i.e. 25 GB/s line rate): ~23 GB/s effective
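
To judge a measured result, it helps to express it as a fraction of line rate. The sketch below does that conversion; it assumes the measured figure has already been converted to GB/s, and the function name ib_efficiency is hypothetical.

```shell
# Percentage of link line rate achieved.
#   $1 = measured bandwidth in GB/s, $2 = link rate in Gb/s (200 for HDR)
ib_efficiency() {
  awk -v measured="$1" -v link_gbps="$2" \
    'BEGIN { printf "%.1f\n", 100 * measured / (link_gbps / 8) }'
}

ib_efficiency 23 200   # prints 92.0
```

A result far below this range usually points to a link that negotiated a lower width or speed, which the ibstat output above would show.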

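Configure Open MPI to use InfiniBand explicitly. Open MPI 4.x and later drive InfiniBand through UCX; the older openib BTL is disabled for InfiniBand by default in 4.x and was removed in 5.0:

# ~/.openmpi/mca-params.conf
# Select the UCX PML (InfiniBand via UCX)
pml = ucx
# Disable the TCP BTL so jobs do not silently fall back to Ethernet
btl = ^tcp
# Legacy openib BTL settings, only relevant to older Open MPI releases:
# btl_openib_allow_ib = 1
# btl_openib_receive_queues = P,128,256,192,128:S,2048,256,64,32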

Storage: BeeGFS vs Lustre

Both parallel filesystems are suitable for CPU cluster scratch storage:

Feature	BeeGFS	Lustre
Setup complexity	Lower	Higher
Management	beegfs-ctl	lctl / lfs
Scalability	Petabyte scale	Multi-petabyte
HA	Buddy Mirroring	Failover server pairs on shared storage; optional file-level mirroring (FLR)
Metadata scaling	Multiple metadata servers	Single MDT, or multiple MDTs with DNE
Small file performance	Good	Variable
Typical cluster size	8–500 nodes	100–100,000 nodes

For clusters up to ~500 nodes, BeeGFS offers lower operational overhead. For very large clusters or those requiring enterprise support, Lustre is the traditional choice.

Benchmark Validation

Before production deployment, run these benchmarks to validate performance:

# HPL efficiency check
mpirun -np 192 ./xhpl    # target: > 80% of peak TFlops

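# STREAM memory bandwidth (spread threads across both sockets)
export OMP_NUM_THREADS=96
export OMP_PLACES=cores
export OMP_PROC_BIND=spread
./stream    # target: roughly 80% or more of theoretical peak bandwidth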

# IMB network latency
mpirun -np 2 -host cn01,cn02 ./IMB-MPI1 PingPong
# target: < 2 µs for InfiniBand HDR

# IOR parallel I/O
mpirun -np 64 ./ior -t 1m -b 8g -s 1 -F -w -r -o /scratch/ior_test
# target: aggregate > design bandwidth
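
The HPL and STREAM targets above are percentages of theoretical peak, so the peak figures have to be computed first. The sketch below shows the arithmetic for the dual-socket EPYC 9654 node used in this guide. The inputs are assumptions to check against the vendor datasheet: 2.4 GHz base clock, 16 double-precision FLOPs per cycle per core, and DDR5-4800 on 12 channels per socket. The function names are hypothetical.

```shell
# Theoretical peak in GFLOP/s: sockets x cores per socket x GHz x FLOPs per cycle per core.
peak_gflops() {
  awk -v sockets="$1" -v cores="$2" -v ghz="$3" -v fpc="$4" \
    'BEGIN { printf "%.1f\n", sockets * cores * ghz * fpc }'
}

# Theoretical memory bandwidth in GB/s: sockets x channels x MT/s x 8 bytes per transfer.
peak_mem_gbs() {
  awk -v sockets="$1" -v channels="$2" -v mts="$3" \
    'BEGIN { printf "%.1f\n", sockets * channels * mts * 8 / 1000 }'
}

peak_gflops 2 96 2.4 16    # prints 7372.8 (about 7.4 TFLOP/s per node)
peak_mem_gbs 2 12 4800     # prints 921.6 (GB/s per node)
```

Under these assumptions, an 80% HPL target is roughly 5.9 TFLOP/s per node. Sustained clocks under an all-core AVX-512 load may differ from the base clock, so treat the peak as a reference point rather than a guarantee.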

CPU cluster design requires matching processor architecture, memory configuration, interconnect, and storage to your specific workload mix. Contact Mevasis for a workload-driven architecture recommendation and sizing analysis.
