HPC Cluster Benchmarking: HPL, STREAM, IMB-MPI1, and IOR
How to benchmark an HPC cluster with HPL/LINPACK, STREAM memory bandwidth, IMB-MPI1 network, and IOR parallel I/O tests. Interpretation of results and performance expectations.
Benchmarking an HPC cluster serves two purposes: verifying that the hardware and software stack are configured correctly before production, and establishing a performance baseline against which future changes can be measured. This guide covers four benchmark categories (compute, memory bandwidth, interconnect, and parallel I/O) that together give a complete picture of cluster health.
HPL / LINPACK: Peak Compute Performance
HPL (High Performance LINPACK) measures floating-point performance by solving a dense linear system. The result, reported in GFLOPS or TFLOPS (Rmax), is the figure used for TOP500 rankings. Dividing it by the theoretical peak (Rpeak) gives the efficiency, which tells you how close the cluster runs to what the hardware can deliver.
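# Install dependencies (RHEL/CentOS)
# openblas-devel is an optimized BLAS (it may require the EPEL or CRB/PowerTools repository);
# the reference blas-devel package also works but is far slower
yum install openmpi-devel openblas-devel
# Download HPL, then build it against your MPI and BLAS (see the INSTALL file in the tarball)
# so that ./xhpl exists in the run directory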
wget https://www.netlib.org/benchmark/hpl/hpl-2.3.tar.gz
tar xzf hpl-2.3.tar.gz
cd hpl-2.3
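# Configure HPL.dat for a 2-node, 64-core test
# N should be approximately sqrt(0.8 * total_RAM_bytes / 8), rounded down to a multiple of NB
# For 2 nodes x 256 GB RAM = 512 GB -> sqrt(0.8 * 512e9 / 8) ≈ 226000
# HPL reads HPL.dat by line position, so the four header lines must stay in place
# This file sweeps 2 N x 2 NB x 3 PFACT x 2 NBMIN x 3 RFACT = 72 runs; trim the lists once the best combination is known
cat > HPL.dat << 'EOF'
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out # Output file name (if any)
6 # Device out (6=stdout,7=stderr,file)
2 # Number of problem sizes
112896 225792 # N values (multiples of every NB; the larger one uses ~80% of RAM)
2 # Number of NB block sizes
192 256 # NB values
1 # PMAP process mapping (0=row-major,1=column-major)
1 # Number of process grids
8 # P values (rows)
8 # Q values (cols); P x Q must equal the MPI rank count
16.0 # Threshold
3 # Number of panel fact
0 1 2 # PFACTs
2 # Number of recursive stopping criterion
2 4 # NBMINs
1 # Number of panels in recursion
2 # NDIVs
3 # Number of recursive panel fact
0 1 2 # RFACTs
1 # Number of broadcast
0 # BCASTs
1 # Number of lookahead depth
0 # DEPTHs
2 # SWAP
64 # swapping threshold
0 # L1 in (0=transposed,1=no-transposed) form
0 # U in (0=transposed,1=no-transposed) form
1 # Equilibration (0=no,1=yes)
8 # memory alignment in double (> 0)
EOF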
mpirun -np 64 --hostfile hostfile ./xhpl
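The sizing rule for N can be turned into a small helper. The following is a minimal sketch, assuming `awk` is available. The function name `hpl_problem_size` and its arguments are illustrative and not part of HPL. It rounds N down to a multiple of NB so the matrix divides evenly into blocks.

```shell
# hpl_problem_size RAM_BYTES NB [FRACTION]
# Prints the largest multiple of NB that fits in FRACTION (default 0.8) of RAM,
# using N = sqrt(FRACTION * RAM_BYTES / 8) for 8-byte doubles.
hpl_problem_size() {
    awk -v ram="$1" -v nb="$2" -v frac="${3:-0.8}" 'BEGIN {
        n = sqrt(frac * ram / 8)
        printf "%d\n", int(n / nb) * nb
    }'
}

# 2 nodes x 256 GB = 512e9 bytes, NB = 256
hpl_problem_size 512000000000 256    # prints 226048
```

Leaving roughly 20% of RAM free keeps the operating system and MPI buffers from pushing the run into swap, which would distort the result far more than a slightly smaller N does.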
Interpreting HPL results:
| Efficiency | Interpretation |
|---|---|
| > 85% of peak | Excellent — well-tuned BLAS and MPI |
| 70–85% of peak | Good for most production workloads |
| 50–70% of peak | Investigate: memory bandwidth, MPI collective performance |
| < 50% of peak | Likely misconfiguration: NUMA binding, InfiniBand, BLAS library |
The most common cause of low HPL efficiency is using a generic BLAS library instead of a vendor-optimized one (Intel MKL for Intel CPUs, AMD AOCL for AMD EPYC, OpenBLAS as a portable alternative).
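To place a run in the efficiency table, the measured HPL result has to be divided by the theoretical peak. The sketch below assumes the usual estimate Rpeak = nodes × cores per node × clock (GHz) × double-precision FLOPs per cycle per core. The function name and the sample numbers are hypothetical, and the FLOPs-per-cycle figure depends on the CPU (for example 16 with two AVX2 FMA units, 32 with two AVX-512 FMA units).

```shell
# hpl_efficiency RMAX_GFLOPS NODES CORES_PER_NODE GHZ FLOPS_PER_CYCLE
# Prints the theoretical peak and the measured result as a percentage of it.
hpl_efficiency() {
    awk -v rmax="$1" -v nodes="$2" -v cores="$3" -v ghz="$4" -v fpc="$5" 'BEGIN {
        rpeak = nodes * cores * ghz * fpc
        printf "Rpeak = %.1f GFLOPS, efficiency = %.1f%%\n", rpeak, 100 * rmax / rpeak
    }'
}

# Hypothetical run: 2 nodes x 32 cores at 2.0 GHz, 32 FLOPs/cycle/core, 3400 GFLOPS measured
hpl_efficiency 3400 2 32 2.0 32    # prints: Rpeak = 4096.0 GFLOPS, efficiency = 83.0%
```

Use the clock frequency the CPU actually sustains under vector load where it is known. Nominal base frequency can make the efficiency look better or worse than it is.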
STREAM: Memory Bandwidth
HPL is compute-bound; many scientific applications are memory-bandwidth-bound. STREAM measures sustained memory bandwidth with four operations: Copy, Scale, Add, and Triad.
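# Compile with OpenMP and optimization flags
# -mcmodel=medium is needed because the three static arrays total 2.4 GB (more than 2 GB)
# Each array should be at least 4x the combined last-level cache; raise STREAM_ARRAY_SIZE on large-cache CPUs
gcc -O3 -march=native -fopenmp -mcmodel=medium -DSTREAM_ARRAY_SIZE=100000000 stream.c -o stream
# Run with all cores (nproc counts hardware threads; halve it if SMT is enabled)
export OMP_NUM_THREADS=$(nproc)
export OMP_PLACES=cores
export OMP_PROC_BIND=close
./stream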
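Expected STREAM Triad values (single dual-socket node, all memory channels populated). Theoretical peak is sockets × channels × 8 bytes × transfer rate. As a rule of thumb, a well-configured node sustains roughly 80% of it:
| CPU | Memory (per socket) | Theoretical peak (2 sockets) | Expected Triad (~80% of peak) |
|---|---|---|---|
| AMD EPYC 9654 (96c) | DDR5-4800 12-channel | 921.6 GB/s | ~740 GB/s |
| Intel Xeon 8480+ (56c) | DDR5-4800 8-channel | 614.4 GB/s | ~490 GB/s |
| AMD EPYC 7763 (64c) | DDR4-3200 8-channel | 409.6 GB/s | ~330 GB/s |
| Intel Xeon 6338 (32c) | DDR4-3200 8-channel | 409.6 GB/s | ~330 GB/s |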
If STREAM Triad falls more than 20% below expected values, check NUMA memory locality, memory channel population (all channels must be populated), and memory speed in BIOS.
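A useful reference point when judging a Triad result is the theoretical memory bandwidth of the node: sockets × channels per socket × transfer rate (MT/s) × 8 bytes per transfer. The helpers below are a sketch with illustrative names, and the 330 GB/s measurement is a hypothetical value. Note that STREAM itself reports MB/s, so divide its output by 1000 first.

```shell
# mem_peak_gbs SOCKETS CHANNELS_PER_SOCKET MEGATRANSFERS_PER_SEC
# Theoretical peak bandwidth in GB/s (each channel moves 8 bytes per transfer).
mem_peak_gbs() {
    awk -v s="$1" -v c="$2" -v mts="$3" 'BEGIN { printf "%.1f\n", s * c * mts * 8 / 1000 }'
}

# triad_pct_of_peak TRIAD_GBS PEAK_GBS
# Measured Triad as a rounded percentage of the theoretical peak.
triad_pct_of_peak() {
    awk -v t="$1" -v p="$2" 'BEGIN { printf "%.0f\n", 100 * t / p }'
}

peak=$(mem_peak_gbs 2 8 3200)    # dual socket, 8 channels of DDR4-3200 -> 409.6
echo "theoretical peak: ${peak} GB/s"
echo "Triad at 330 GB/s is $(triad_pct_of_peak 330 "$peak")% of peak"
```

A node with one unpopulated channel per socket loses that channel's share of the peak outright, which is why channel population is the first thing to check when the percentage is low.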
IMB-MPI1: Network and MPI Performance
The Intel MPI Benchmarks (IMB-MPI1) measure point-to-point and collective MPI operation latency and bandwidth.
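# Install Intel MPI Benchmarks
git clone https://github.com/intel/mpi-benchmarks
# The Makefile defaults to the Intel compiler wrappers; override them when building with Open MPI
cd mpi-benchmarks && make CC=mpicc CXX=mpicxx IMB-MPI1
# Point-to-point: latency and bandwidth between 2 nodes
# --map-by node places one rank on each node; without it both ranks may land on the same host
mpirun -np 2 --map-by node --hostfile hostfile ./IMB-MPI1 PingPong
# Collective: AllReduce with varying message sizes
mpirun -np 64 --hostfile hostfile ./IMB-MPI1 Allreduce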
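Expected InfiniBand HDR (200 Gb/s) results. IMB reports only timings for collectives, so the AllReduce bandwidth row is the message size divided by the reported t_avg:
| Metric | Expected Value |
|---|---|
| PingPong latency (0 bytes) | 1–2 µs |
| PingPong bandwidth (4 MB) | > 20 GB/s |
| AllReduce latency (8 bytes, 64 ranks) | < 5 µs |
| AllReduce effective bandwidth (1 MB, 64 ranks) | > 15 GB/s |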
Latency values above 5 µs for zero-byte PingPong indicate a problem: wrong network interface selected, RDMA not enabled, or CPU frequency scaling interfering. Run ibstat to confirm that the InfiniBand links are active at the expected speed (a 4x HDR link reports a rate of 200 Gb/s).
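The zero-byte latency check can be automated by filtering the PingPong table. The sketch below assumes the usual four-column PingPong layout (`#bytes #repetitions t[usec] Mbytes/sec`). The function name `check_pingpong_latency` is illustrative, and the default 5 µs limit is the threshold mentioned above.

```shell
# check_pingpong_latency [LIMIT_US]
# Reads IMB-MPI1 PingPong output on stdin and checks the 0-byte row.
# Exit status: 0 = within limit, 1 = above limit, 2 = 0-byte row not found.
check_pingpong_latency() {
    awk -v limit="${1:-5}" '
        # Data rows have four fields and a purely numeric first column
        NF == 4 && $1 ~ /^[0-9]+$/ && $1 == 0 { lat = $3; found = 1; exit }
        END {
            if (!found) { print "no 0-byte row found"; exit 2 }
            status = (lat > limit) ? "FAIL" : "OK"
            printf "0-byte latency: %s us (limit %s us): %s\n", lat, limit, status
            exit (status == "FAIL")
        }'
}

# Usage (not executed here):
#   mpirun -np 2 --map-by node --hostfile hostfile ./IMB-MPI1 PingPong | check_pingpong_latency 5
```

Returning distinct exit codes lets the same function gate a deployment script without parsing its text output.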
IOR: Parallel Filesystem I/O
IOR tests parallel I/O performance in patterns that match real HPC workloads, with many processes reading and writing simultaneously.
# Build IOR
git clone https://github.com/hpc/ior
cd ior && ./bootstrap && ./configure --with-mpiio && make
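# Sequential write then read: 64 processes, 1 GiB per process, 1 MiB transfers
# -t transfer size, -b block size per process, -s segment count, -F one file per process
# -e fsyncs when the write phase closes; -k keeps the files for the random-read test below
mpirun -np 64 --hostfile hostfile \
./src/ior \
-t 1m -b 1g -s 1 \
-F -e \
-w -r \
-o /mnt/beegfs/scratch/ior_test/testfile \
-k
# Random read test: -z selects random offsets (without it the 4 KiB reads are sequential)
# -b and -s must match the files written above (1 GiB per process)
mpirun -np 64 --hostfile hostfile \
./src/ior \
-t 4k -b 1g -s 1 \
-z -F -r \
-o /mnt/beegfs/scratch/ior_test/testfile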
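Typical BeeGFS IOR results (8 storage servers). A single 10 GbE link carries at most 1.25 GB/s, so eight servers with one 10 GbE port each top out at 10 GB/s aggregate. The upper ends of the ranges below therefore require bonded or faster links:
| Operation | Transfer Size | Expected Aggregate |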
|---|---|---|
| Sequential write | 1 MB | 8–12 GB/s |
| Sequential read | 1 MB | 10–14 GB/s |
| Random write | 4 KB | 500 MB/s – 1 GB/s |
| Random read | 4 KB | 1–2 GB/s |
If sequential write is much lower than expected, check: network interface speed on storage servers, BeeGFS stripe count vs. process count ratio, and whether the filesystem has sufficient metadata capacity to handle the file creation rate.
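When checking the network interface speed on the storage servers, it helps to compute the aggregate ceiling the links allow, since no IOR result can exceed it. The helper below is a sketch with an illustrative name. It uses raw line rate (Gb/s divided by 8) and ignores protocol overhead, so real results land somewhat below the figure it prints.

```shell
# storage_net_ceiling_gbs SERVERS LINK_GBIT_PER_SEC [LINKS_PER_SERVER]
# Aggregate line-rate ceiling in GB/s across all storage servers.
storage_net_ceiling_gbs() {
    awk -v n="$1" -v gbit="$2" -v links="${3:-1}" 'BEGIN { printf "%.2f\n", n * links * gbit / 8 }'
}

storage_net_ceiling_gbs 8 10      # 8 servers, one 10 GbE port each: prints 10.00
storage_net_ceiling_gbs 8 25 2    # 8 servers, two 25 GbE ports each: prints 50.00
```

If the measured sequential bandwidth sits close to this ceiling, the network is the limit and tuning stripe counts will not help. If it sits far below, look at the disks, the striping, and the metadata service instead.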
Benchmark Checklist Before Production
Before declaring an HPC cluster production-ready, all four benchmark categories should meet or exceed the targets established in the requirements phase:
- HPL efficiency > 80% of theoretical peak FLOPS
- STREAM Triad within 10% of the expected value for the platform
- IMB-MPI1 PingPong latency below 3 µs over InfiniBand
- IOR aggregate bandwidth matches storage tier design target
Any shortfall indicates a configuration issue that is much cheaper to resolve before users begin loading the system with real workloads.
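The checklist can be captured in a small gate script so that acceptance runs give an unambiguous pass or fail. This is a sketch only: the function name, argument order, and sample values are hypothetical, the first three thresholds are the ones listed above, and the IOR target has to come from the site's own storage design.

```shell
# acceptance_check HPL_EFF_PCT TRIAD_PCT_OF_EXPECTED PINGPONG_US IOR_GBS IOR_TARGET_GBS
# Prints PASS/FAIL per benchmark; exit status is 1 if any check fails.
acceptance_check() {
    awk -v hpl="$1" -v triad="$2" -v lat="$3" -v ior="$4" -v target="$5" '
    function report(name, ok) {
        printf "%-10s %s\n", name, ok ? "PASS" : "FAIL"
        if (!ok) failed = 1
    }
    BEGIN {
        report("HPL",      hpl > 80)        # efficiency above 80% of peak
        report("STREAM",   triad >= 90)     # within 10% of the expected Triad
        report("PingPong", lat < 3)         # 0-byte latency below 3 us
        report("IOR",      ior >= target)   # aggregate bandwidth meets design target
        exit failed
    }'
}

# Hypothetical measurements: 83% HPL, 95% of expected Triad, 1.4 us, 11.2 GB/s against a 10 GB/s target
acceptance_check 83 95 1.4 11.2 10
```

Recording the inputs to this check alongside the date and software versions gives the baseline that later changes are measured against.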
For benchmark-driven HPC cluster validation and performance tuning, contact the Mevasis engineering team. We run all four benchmark suites as part of every cluster deployment acceptance procedure.