MPI Programming Guide for HPC: OpenMPI, MVAPICH2, SLURM Integration

MPI (Message Passing Interface) is the standard communication library for distributed-memory parallel computing on HPC clusters. While GPU frameworks and shared-memory threading dominate single-node parallelism, MPI remains the only standard way to parallelize computation across multiple physical servers — which is what HPC clusters exist to enable.

MPI vs OpenMP: When to Use Each

The two most important parallel programming models for HPC:

Aspect	MPI	OpenMP
Memory model	Distributed (each process has private memory)	Shared (all threads share memory)
Scope	Across nodes in a cluster	Within a single node (multiple cores)
Communication	Explicit send/receive or collective operations	Shared variables, no explicit communication
Scalability	Thousands of nodes	Limited to cores on one node
Programming complexity	Higher (explicit communication)	Lower (pragma directives)
Best for	Large-scale cluster simulations	Shared-memory parallelism, loop acceleration

Many production HPC codes use hybrid MPI+OpenMP: one MPI process per NUMA domain (or per socket), with OpenMP threads filling the available cores within each MPI process. This reduces inter-node message count while still utilizing all compute cores.

OpenMPI Installation

# Ubuntu/Debian
apt-get update && apt-get install -y \
  openmpi-bin libopenmpi-dev \
  libpmix-dev    # for SLURM PMIx integration

# RHEL/Rocky Linux
dnf install openmpi openmpi-devel

# Or compile from source (for InfiniBand-optimized build)
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.6.tar.gz
tar xzf openmpi-4.1.6.tar.gz && cd openmpi-4.1.6
./configure \
  --prefix=/opt/openmpi-4.1.6 \
  --with-ucx=/opt/ucx \          # UCX for InfiniBand/RoCE
  --with-pmix \                   # PMIx for SLURM integration
  --enable-mpi-cxx \
  --enable-shared
make -j8 && make install

# Load module (if using environment modules)
module load openmpi/4.1.6
mpirun --version

For InfiniBand clusters, always compile OpenMPI with UCX (Unified Communication X) support. UCX selects the optimal transport (InfiniBand RDMA, shared memory, TCP) automatically:

# Verify UCX/InfiniBand integration
ucx_info -d | grep ib
ompi_info | grep "Open UCX"

Rank and Communicator Concepts

MPI_COMM_WORLD: The default communicator that includes all processes in an MPI job. Every MPI application starts with this communicator.

Rank: A unique integer identifier (0 to N-1) assigned to each process within a communicator. Process 0 is conventionally the “root” that collects results or performs I/O.

Communicator: A group of processes that can communicate with each other. Custom communicators can be created for subsets of processes (useful for domain decomposition).

Essential MPI Functions

Function	Purpose
`MPI_Init`	Initialize MPI (must be first MPI call)
`MPI_Finalize`	Clean up MPI (must be last MPI call)
`MPI_Comm_rank`	Get this process’s rank
`MPI_Comm_size`	Get total number of processes
`MPI_Send`	Blocking send to specific rank
`MPI_Recv`	Blocking receive from specific rank
`MPI_Isend`	Non-blocking send
`MPI_Irecv`	Non-blocking receive
`MPI_Bcast`	Broadcast from root to all
`MPI_Scatter`	Distribute array from root to all
`MPI_Gather`	Collect values from all to root
`MPI_Reduce`	Reduce values from all to root (sum, max, etc.)
`MPI_Allreduce`	Reduce values and distribute result to all
`MPI_Barrier`	Synchronize all processes

Hello World in MPI/C

/* hello_mpi.c — MPI Hello World */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
    int rank, size;
    char hostname[256];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    gethostname(hostname, sizeof(hostname));
    printf("Hello from rank %d of %d on host %s\n", rank, size, hostname);

    MPI_Finalize();
    return 0;
}

# Compile
mpicc -o hello_mpi hello_mpi.c

# Run on 4 processes (local)
mpirun -np 4 ./hello_mpi

# Run across multiple nodes
mpirun -np 16 --hostfile hosts.txt ./hello_mpi
# hosts.txt: one hostname per line, with optional slot count
# node01 slots=8
# node02 slots=8

Point-to-Point Communication

/* ping_pong.c — Measure latency and bandwidth between ranks 0 and 1 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char *argv[]) {
    int rank, size;
    int n = 1000000;    // 1M doubles = 8 MB message
    double *buf;
    double t_start, t_end, latency, bandwidth;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        fprintf(stderr, "Needs at least 2 processes\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    buf = (double *)malloc(n * sizeof(double));

    if (rank == 0) {
        t_start = MPI_Wtime();

        MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        t_end = MPI_Wtime();
        latency = (t_end - t_start) / 2.0;
        bandwidth = (n * sizeof(double)) / latency / 1e9;   // GB/s

        printf("Latency: %.2f µs, Bandwidth: %.2f GB/s\n",
               latency * 1e6, bandwidth);
    } else if (rank == 1) {
        MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

MPI_Reduce: Pi Calculation Example

This classic example demonstrates collective reduction where each process computes a portion of the Pi approximation:

/* mpi_pi.c — Compute Pi using Leibniz formula with MPI_Reduce */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    long long n = 1000000000LL;   // 1 billion terms
    double local_sum = 0.0, pi = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank computes its portion of the sum */
    for (long long i = rank; i < n; i += size) {
        double sign = (i % 2 == 0) ? 1.0 : -1.0;
        local_sum += sign / (2.0 * i + 1.0);
    }

    /* Sum all local results to rank 0 */
    MPI_Reduce(&local_sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        pi *= 4.0;
        printf("Computed Pi = %.15f (actual = 3.141592653589793)\n", pi);
    }

    MPI_Finalize();
    return 0;
}

mpicc -O3 -o mpi_pi mpi_pi.c
mpirun -np 64 ./mpi_pi

With 64 MPI processes on a single node (or across nodes), this runs 64× faster than the serial version due to perfect parallelism (no communication overhead except the final MPI_Reduce).

SLURM Job Submission with MPI

#!/bin/bash
#SBATCH --job-name=mpi_simulation
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=64
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G
#SBATCH --time=06:00:00
#SBATCH --partition=compute

# Load MPI environment
module load openmpi/4.1.6

# Run with srun (preferred for SLURM+MPI integration via PMIx)
srun ./my_simulation --config=run.cfg

# Alternative: mpirun (use when srun has issues with specific applications)
mpirun -np $SLURM_NTASKS \
  --map-by socket \
  --bind-to core \
  ./my_simulation --config=run.cfg

Key SLURM environment variables available inside MPI jobs:

Variable	Description
`$SLURM_JOB_ID`	Job ID
`$SLURM_NTASKS`	Total number of MPI processes
`$SLURM_NTASKS_PER_NODE`	MPI processes per node
`$SLURM_NNODES`	Total number of nodes
`$SLURM_NODELIST`	List of allocated nodes
`$SLURM_SUBMIT_DIR`	Directory where sbatch was run
`$SLURM_JOB_NODELIST`	Node list (same as NODELIST)

Performance Tips

Use srun instead of mpirun for SLURM jobs:

# srun uses SLURM's PMIx/PMI2 for process management, which is more reliable
srun --mpi=pmix ./simulation
# vs mpirun which manages its own process launch

CPU binding for NUMA-aware performance:

# Bind each MPI process to a specific core (prevents OS migration)
mpirun --bind-to core --map-by core ./simulation

# For hybrid MPI+OpenMP: one MPI per socket, OpenMP fills the socket
#SBATCH --ntasks-per-node=2  (2 sockets per node)
#SBATCH --cpus-per-task=96   (96 cores per socket on EPYC 9654)
export OMP_NUM_THREADS=96
srun --cpu-bind=socket ./hybrid_sim

Avoid small-message ping-pong overhead with aggregation:

# Bad: many small MPI_Send calls per timestep
for i in neighbors:
    MPI_Send(small_buf, 100, MPI_DOUBLE, i, tag, comm)

# Better: pack all neighbor data into one message
MPI_Send(packed_buf, total_size, MPI_DOUBLE, neighbor, tag, comm)

Common Errors

Deadlock from mismatched send/receive:

// WRONG: All ranks send before receiving — deadlock
MPI_Send(buf, n, MPI_DOUBLE, neighbor, tag, comm);   // blocks until received
MPI_Recv(buf, n, MPI_DOUBLE, neighbor, tag, comm, &status);

// CORRECT: Use non-blocking sends
MPI_Request req;
MPI_Isend(buf, n, MPI_DOUBLE, neighbor, tag, comm, &req);
MPI_Recv(rbuf, n, MPI_DOUBLE, neighbor, tag, comm, &status);
MPI_Wait(&req, &status);

// Or use MPI_Sendrecv which handles the ordering
MPI_Sendrecv(sbuf, n, MPI_DOUBLE, dest, stag,
             rbuf, n, MPI_DOUBLE, src,  rtag,
             comm, &status);

Missing MPI_Finalize:

// Always call MPI_Finalize before exit, even on error
if (error_condition) {
    MPI_Finalize();    // clean shutdown
    return 1;
}
// Or use MPI_Abort for fatal errors
MPI_Abort(MPI_COMM_WORLD, 1);

Process count mismatch:

# Application hardcodes a process count that doesn't match sbatch
# Always use $SLURM_NTASKS or argv-based process count
mpirun -np $SLURM_NTASKS ./simulation    # correct
mpirun -np 128 ./simulation              # wrong if sbatch has --ntasks != 128

MPI is a deep standard with hundreds of functions beyond what is covered here. The key to efficient MPI programming is understanding collective operations (which are highly optimized over InfiniBand) and minimizing the number of synchronization points per unit of computation. Contact Mevasis for MPI application optimization and HPC cluster deployment services.

MPI Programming Guide: Installation, Communicators, Collective Operations, and SLURM