
HPC Cluster Benchmarking: HPL, STREAM, IMB-MPI1, and IOR

How to benchmark an HPC cluster with HPL/LINPACK (compute), STREAM (memory bandwidth), IMB-MPI1 (network), and IOR (parallel I/O), and how to interpret the results against realistic performance expectations.

Benchmarking an HPC cluster serves two purposes: verifying that the hardware and software stack are configured correctly before production, and establishing a performance baseline against which future changes can be measured. This guide covers the four benchmark categories that together give a complete picture of cluster health.

HPL / LINPACK: Peak Compute Performance

HPL (High Performance LINPACK) measures floating-point performance by solving a dense linear system. The result in GFlops or TFlops is used for Top500 rankings and tells you how close the cluster is to its theoretical peak.

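# Install dependencies (RHEL/CentOS)
# Note: blas-devel is the unoptimized reference BLAS. It is enough for a
# functional test; link a vendor-optimized BLAS for real measurements.
yum install openmpi-devel blas-devel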

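# Download and unpack HPL, then build xhpl against your MPI and BLAS
# (the build steps depend on your toolchain; see the INSTALL file in the tarball)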
wget https://www.netlib.org/benchmark/hpl/hpl-2.3.tar.gz
tar xzf hpl-2.3.tar.gz
cd hpl-2.3

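# Configure HPL.dat for a 2-node, 64-core test
# N should be approximately sqrt(0.8 * total_RAM_bytes / 8)
# For 2 nodes x 256 GB RAM = 512 GB -> N ≈ 226000
# NB is normally in the 128-384 range; single-digit values cripple performance
# Note: the lists below sweep 2 N x 2 NB x 3 PFACT x 2 NBMIN x 3 RFACT = 72 runs;
# trim them to single values for a quick check
cat > HPL.dat << 'EOF'
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out              # output file name (if any)
6                    # device out (6=stdout,7=stderr,file)
2                    # Number of problem sizes
128000 224000        # N values
2                    # Number of NBs
192 256              # NB values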
1                    # PMAP process mapping
1                    # Number of process grids
8                    # P values (rows)
8                    # Q values (cols)
16.0                 # Threshold
3                    # Number of panel fact
0 1 2                # PFACTs
2                    # Number of recursive stopping criterion
2 4                  # NBMINs
1                    # Number of panels in recursion
2                    # NDIVs
3                    # Number of recursive panel fact
0 1 2                # RFACTs
1                    # Number of broadcast
0                    # BCASTs
1                    # Number of lookahead depth
0                    # DEPTHs
2                    # SWAP
64                   # swapping threshold
0                    # L1 in (0=transposed,1=no-transposed) form
0                    # U  in (0=transposed,1=no-transposed) form
1                    # Equilibration (0=no,1=yes)
8                    # memory alignment in double (> 0)
EOF

mpirun -np 64 --hostfile hostfile ./xhpl
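The sizing rule from the HPL.dat comments can be sketched as a small shell helper. This is only an illustration: the function name hpl_n is hypothetical, memory is taken in decimal GB, and the 0.8 fraction is the rule of thumb quoted above rather than a hard limit. It also rounds N down to a multiple of NB, which is a common convention, not a requirement.

```shell
# hpl_n TOTAL_RAM_GB NB [FRACTION]
# Prints the largest N that is a multiple of NB and whose N x N matrix of
# 8-byte doubles fits in FRACTION (default 0.8) of TOTAL_RAM_GB (decimal GB).
# hpl_n is a hypothetical helper name, not part of HPL.
hpl_n() {
  awk -v gb="$1" -v nb="$2" -v frac="${3:-0.8}" 'BEGIN {
    n = sqrt(frac * gb * 1e9 / 8)      # raw problem size from the sizing rule
    printf "%d\n", int(n / nb) * nb    # round down to a multiple of NB
  }'
}

# 2 nodes x 256 GB with NB = 256
hpl_n 512 256    # prints 226048
```

If the nodes need more headroom for the OS or a parallel filesystem client, pass a smaller fraction as the third argument, for example hpl_n 512 256 0.7.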

Interpreting HPL results:

Efficiency      | Interpretation
----------------|----------------------------------------------------------------
> 85% of peak   | Excellent: well-tuned BLAS and MPI
70–85% of peak  | Good for most production workloads
50–70% of peak  | Investigate: memory bandwidth, MPI collective performance
< 50% of peak   | Likely misconfiguration: NUMA binding, InfiniBand, BLAS library

The most common cause of low HPL efficiency is using a generic BLAS library instead of a vendor-optimized one (Intel MKL for Intel CPUs, AMD AOCL for AMD EPYC, OpenBLAS as a portable alternative).
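The efficiency percentages above are relative to theoretical peak, which is nodes × sockets × cores × clock × FLOPs per cycle. A minimal sketch of that arithmetic follows. The function names are hypothetical, and so are the example figures (32 cores at a sustained 2.0 GHz, 1700 GFlops measured). The FLOPs-per-cycle value depends on the CPU: 16 for double precision on AVX2 with two FMA units, 32 on AVX-512 with two FMA units.

```shell
# rpeak_gflops NODES SOCKETS CORES_PER_SOCKET GHZ FLOPS_PER_CYCLE
# Theoretical peak in GFlops. Use the sustained all-core clock under vector
# load, not the single-core boost clock. (Hypothetical helper name.)
rpeak_gflops() {
  awk -v n="$1" -v s="$2" -v c="$3" -v f="$4" -v k="$5" \
    'BEGIN { printf "%.1f\n", n * s * c * f * k }'
}

# hpl_efficiency RMAX_GFLOPS RPEAK_GFLOPS
# Measured HPL result as a percentage of theoretical peak.
hpl_efficiency() {
  awk -v rmax="$1" -v rpeak="$2" 'BEGIN { printf "%.1f\n", 100 * rmax / rpeak }'
}

# Example with assumed hardware: 2 nodes, 1 socket each, 32 cores, 2.0 GHz, AVX2
rpeak_gflops 2 1 32 2.0 16     # prints 2048.0
hpl_efficiency 1700 2048       # prints 83.0
```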

STREAM: Memory Bandwidth

HPL is compute-bound; many scientific applications are memory-bandwidth-bound. STREAM measures sustained memory bandwidth with four operations: Copy, Scale, Add, and Triad.

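# Compile with OpenMP and optimization flags
# -mcmodel=medium is needed because the three static arrays total 2.4 GB,
# above the 2 GB limit of the default x86-64 code model.
# Each array should be at least 4x the combined last-level cache size;
# raise STREAM_ARRAY_SIZE on CPUs with very large L3 caches.
gcc -O3 -fopenmp -mcmodel=medium -DSTREAM_ARRAY_SIZE=100000000 stream.c -o stream

# Run with all cores
# nproc counts hardware threads; with SMT enabled, use the physical core count
export OMP_NUM_THREADS=$(nproc)
export OMP_PROC_BIND=close
./stream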

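Expected STREAM Triad values (single dual-socket node):

CPU (per socket)        | Memory (per socket)   | Theoretical peak (2 sockets) | Expected Triad
------------------------|-----------------------|------------------------------|---------------
AMD EPYC 9654 (96c)     | DDR5-4800, 12-channel | 921.6 GB/s                   | ~800 GB/s
Intel Xeon 8480+ (56c)  | DDR5-4800, 8-channel  | 614.4 GB/s                   | ~480 GB/s
AMD EPYC 7763 (64c)     | DDR4-3200, 8-channel  | 409.6 GB/s                   | ~380 GB/s
Intel Xeon 6338 (32c)   | DDR4-3200, 8-channel  | 409.6 GB/s                   | ~310 GB/s

Theoretical peak is sockets × channels × transfer rate × 8 bytes (for example, 2 × 8 × 3200 MT/s × 8 B = 409.6 GB/s). Measured Triad always lands below it. The Intel figures here are estimates at roughly 75 to 80% of peak; treat them as starting points and replace them with vendor or site reference numbers where available.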

If STREAM Triad falls more than 20% below expected values, check NUMA memory locality, memory channel population (all channels must be populated), and memory speed in BIOS.
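The 20% rule is easy to apply with two small helpers: one for the theoretical peak of a memory configuration, one for how far a measured Triad falls below the expected value. This is a sketch with hypothetical function names, and the measured value in the example (290 GB/s) is invented for illustration.

```shell
# mem_peak_gbs SOCKETS CHANNELS_PER_SOCKET MT_PER_S
# Theoretical memory bandwidth in GB/s: each channel moves 8 bytes per transfer.
mem_peak_gbs() {
  awk -v s="$1" -v c="$2" -v mt="$3" \
    'BEGIN { printf "%.1f\n", s * c * mt * 8 / 1000 }'
}

# stream_shortfall MEASURED_GBS EXPECTED_GBS
# Percentage by which the measured Triad falls below the expected value.
stream_shortfall() {
  awk -v m="$1" -v e="$2" 'BEGIN { printf "%.1f\n", 100 * (e - m) / e }'
}

mem_peak_gbs 2 8 3200       # prints 409.6 (dual socket, 8-channel DDR4-3200)
stream_shortfall 290 380    # prints 23.7, beyond the 20% threshold
```

A shortfall that is close to a simple fraction often points at the cause: about 50% low suggests all threads are hitting one NUMA node, while one missing DIMM on an 8-channel socket costs roughly 12.5% of that socket's peak.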

IMB-MPI1: Network and MPI Performance

The Intel MPI Benchmarks (IMB-MPI1) measure point-to-point and collective MPI operation latency and bandwidth.

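# Install Intel MPI Benchmarks
# Set CC/CXX to your MPI compiler wrappers (mpicc/mpicxx for Open MPI)
git clone https://github.com/intel/mpi-benchmarks
cd mpi-benchmarks && make CC=mpicc CXX=mpicxx IMB-MPI1

# Point-to-point: latency and bandwidth between 2 nodes
# --map-by node places one rank on each node; without it both ranks can land
# on the first host and the test measures shared memory instead of the network
mpirun -np 2 --map-by node --hostfile hostfile ./IMB-MPI1 PingPong

# Collective: Allreduce with varying message sizes
mpirun -np 64 --hostfile hostfile ./IMB-MPI1 Allreduce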

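Expected InfiniBand HDR (200 Gb/s) results:

Metric                                 | Expected Value
---------------------------------------|---------------
PingPong latency (0 bytes)             | 1–2 µs
PingPong bandwidth (4 MB)              | > 20 GB/s
Allreduce latency (8 bytes, 64 ranks)  | < 5 µs
Allreduce bandwidth (1 MB, 64 ranks)   | > 15 GB/s

IMB-MPI1 reports only timings for collectives, so the Allreduce bandwidth figure is derived as message size divided by t_avg.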

Zero-byte PingPong latency above 5 µs indicates a problem: the wrong network interface is selected, RDMA is not enabled, or CPU frequency scaling is interfering. Run ibstat to confirm that the InfiniBand links are active at the expected rate (4x HDR = 200 Gb/s).
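To automate this check, the zero-byte latency can be pulled out of the PingPong output with awk. The sketch below assumes the usual four-column PingPong table (#bytes, #repetitions, t[usec], Mbytes/sec). The function names are hypothetical and the sample output is illustrative, not a real measurement.

```shell
# pingpong_latency: read IMB-MPI1 PingPong output on stdin and print the
# t[usec] column of the 0-byte row. (Hypothetical helper name.)
pingpong_latency() {
  awk '$1 == "0" && NF == 4 { print $3; exit }'
}

# latency_ok LATENCY_US LIMIT_US: succeeds when latency is within the limit.
latency_ok() {
  awk -v lat="$1" -v lim="$2" 'BEGIN { exit !(lat <= lim) }'
}

# Illustrative sample of PingPong output; in practice pipe in the real run:
#   mpirun -np 2 --map-by node --hostfile hostfile ./IMB-MPI1 PingPong | pingpong_latency
lat=$(pingpong_latency <<'EOF'
# Benchmarking PingPong
# #processes = 2
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         1.21         0.00
            1         1000         1.23         0.81
EOF
)
echo "$lat"    # prints 1.21
if latency_ok "$lat" 5; then echo "latency OK"; else echo "latency too high"; fi
```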

IOR: Parallel Filesystem I/O

IOR tests parallel I/O performance in patterns that match real HPC workloads — many processes reading/writing simultaneously.

# Build IOR
git clone https://github.com/hpc/ior
cd ior && ./bootstrap && ./configure --with-mpiio && make

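# Sequential write and read-back: 64 processes, 1 GB per process, 1 MB transfers
# -t is the transfer size, -b the block size per process, -F one file per process
# -e fsyncs on close so the write figure includes flushing; -k keeps the files
# for the random read test below
mpirun -np 64 --hostfile hostfile \
  ./src/ior \
  -t 1m -b 1g -s 1 \
  -F \
  -w -r -e \
  -o /mnt/beegfs/scratch/ior_test/testfile \
  -k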

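# Random read test: -z randomizes offsets within each file
# -b and -s must match the write test so reads stay inside the existing 1 GB files
mpirun -np 64 --hostfile hostfile \
  ./src/ior \
  -t 4k -b 1g -s 1 \
  -z \
  -F -r \
  -o /mnt/beegfs/scratch/ior_test/testfile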

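Typical BeeGFS IOR results (8 storage servers, 10 GbE):

Operation         | Transfer Size | Expected Aggregate
------------------|---------------|--------------------------
Sequential write  | 1 MB          | 8–10 GB/s
Sequential read   | 1 MB          | ~10 GB/s (network-bound)
Random write      | 4 KB          | 500 MB/s – 1 GB/s
Random read       | 4 KB          | 1–2 GB/s

Assuming one 10 GbE link per storage server, the network caps aggregate throughput at about 10 GB/s (8 × 1.25 GB/s). Sequential read results above that ceiling are being served from client-side cache, not from the storage servers.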

If sequential write is much lower than expected, check: network interface speed on storage servers, BeeGFS stripe count vs. process count ratio, and whether the filesystem has sufficient metadata capacity to handle the file creation rate.
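Before digging into stripe settings, it helps to compare the measured aggregate against the network ceiling of the storage tier and to confirm how much data the test actually moves. The sketch below uses hypothetical function names and assumes one network link per storage server.

```shell
# net_ceiling_gbs SERVERS LINK_GBIT_PER_S
# Upper bound on aggregate throughput in GB/s (8 bits per byte, no protocol overhead).
net_ceiling_gbs() {
  awk -v n="$1" -v g="$2" 'BEGIN { printf "%.2f\n", n * g / 8 }'
}

# ior_total_gb PROCESSES BLOCK_GB SEGMENTS
# Total data volume of an IOR run: each process writes BLOCK_GB x SEGMENTS.
ior_total_gb() {
  echo $(( $1 * $2 * $3 ))
}

net_ceiling_gbs 8 10    # prints 10.00 (8 servers on 10 GbE)
ior_total_gb 64 1 1     # prints 64 (the sequential test above)
```

If the total volume is smaller than the combined RAM of the client nodes, read-back results are likely inflated by caching; increasing -b or the process count is the usual remedy.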

Benchmark Checklist Before Production

Before declaring an HPC cluster production-ready, all four benchmark categories should meet or exceed the targets established in the requirements phase:

  1. HPL efficiency > 80% of theoretical peak FLOPS
  2. STREAM Triad within 10% of the expected value for the platform
  3. IMB-MPI1 PingPong latency below 3 µs over InfiniBand
  4. IOR aggregate bandwidth matches storage tier design target

Any shortfall indicates a configuration issue that is much cheaper to resolve before users begin loading the system with real workloads.
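The checklist lends itself to a simple acceptance gate. The following is a minimal sketch, not a complete acceptance procedure: check is a hypothetical helper, and the measured values in the example calls are placeholders to be replaced with real results.

```shell
# check NAME MEASURED OP TARGET
# OP is "ge" (measured must be >= target) or "le" (measured must be <= target).
# Prints PASS or FAIL with the name and returns non-zero on failure.
check() {
  if awk -v m="$2" -v op="$3" -v t="$4" \
       'BEGIN { exit !((op == "ge" && m >= t) || (op == "le" && m <= t)) }'; then
    echo "PASS $1"
  else
    echo "FAIL $1"
    return 1
  fi
}

# Placeholder measurements checked against the targets from the list above
check "HPL efficiency (%)" 83.0 ge 80        # prints PASS HPL efficiency (%)
check "PingPong latency (us)" 1.4 le 3       # prints PASS PingPong latency (us)
```

Because check returns a non-zero status on failure, the calls can be chained with && in a deployment script so that the first missed target stops the acceptance run.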


For benchmark-driven HPC cluster validation and performance tuning, contact the Mevasis engineering team. We run all four benchmark suites as part of every cluster deployment acceptance procedure.