
MPI Programming Guide: Installation, Communicators, Collective Operations, and SLURM

Comprehensive MPI programming guide: MPI vs OpenMP comparison, OpenMPI installation, rank and communicator concepts, MPI function table, hello world in C, point-to-point communication, MPI_Reduce Pi calculation, SLURM job submission with MPI, environment variables, performance tips, and common errors.

MPI (Message Passing Interface) is the standard message-passing API for distributed-memory parallel computing on HPC clusters, implemented by libraries such as Open MPI. While GPU frameworks and shared-memory threading dominate single-node parallelism, MPI remains the de facto standard for parallelizing computation across multiple physical servers, which is what HPC clusters exist to enable.

MPI vs OpenMP: When to Use Each

The two most important parallel programming models for HPC:

| Aspect | MPI | OpenMP |
|---|---|---|
| Memory model | Distributed (each process has private memory) | Shared (all threads share memory) |
| Scope | Across nodes in a cluster | Within a single node (multiple cores) |
| Communication | Explicit send/receive or collective operations | Shared variables, no explicit communication |
| Scalability | Thousands of nodes | Limited to cores on one node |
| Programming complexity | Higher (explicit communication) | Lower (pragma directives) |
| Best for | Large-scale cluster simulations | Shared-memory parallelism, loop acceleration |

Many production HPC codes use hybrid MPI+OpenMP: one MPI process per NUMA domain (or per socket), with OpenMP threads filling the available cores within each MPI process. This reduces inter-node message count while still utilizing all compute cores.

OpenMPI Installation

# Ubuntu/Debian
apt-get update && apt-get install -y \
  openmpi-bin libopenmpi-dev \
  libpmix-dev    # for SLURM PMIx integration

# RHEL/Rocky Linux
dnf install openmpi openmpi-devel

# Or compile from source (for InfiniBand-optimized build)
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.6.tar.gz
tar xzf openmpi-4.1.6.tar.gz && cd openmpi-4.1.6
# --with-ucx: UCX for InfiniBand/RoCE
# --with-pmix: PMIx for SLURM integration
./configure \
  --prefix=/opt/openmpi-4.1.6 \
  --with-ucx=/opt/ucx \
  --with-pmix \
  --enable-mpi-cxx \
  --enable-shared
make -j8 && make install

# Load module (if using environment modules)
module load openmpi/4.1.6
mpirun --version

For InfiniBand clusters, always compile OpenMPI with UCX (Unified Communication X) support. UCX selects the optimal transport (InfiniBand RDMA, shared memory, TCP) automatically:

# Verify UCX/InfiniBand integration
ucx_info -d | grep ib
ompi_info | grep -i ucx

Rank and Communicator Concepts

MPI_COMM_WORLD: The default communicator that includes all processes in an MPI job. Every MPI application starts with this communicator.

Rank: A unique integer identifier (0 to N-1) assigned to each process within a communicator. Process 0 is conventionally the “root” that collects results or performs I/O.

Communicator: A group of processes that can communicate with each other. Custom communicators can be created for subsets of processes (useful for domain decomposition).
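
As one illustration of a custom communicator, MPI_Comm_split (a standard MPI call that this guide does not otherwise cover) groups ranks by an integer "color" and orders them inside each group by a "key". The sketch below assumes a hypothetical 2D process grid with `cols` columns, laid out row-major, and shows only the color/key arithmetic for building one communicator per grid row. `row_color` and `row_key` are made-up helper names; the MPI call itself appears in a comment so the snippet compiles without an MPI installation.

```c
#include <assert.h>

/* Hypothetical helper: ranks in the same row of a row-major process grid
   with `cols` columns share a color, so they end up in the same communicator. */
int row_color(int world_rank, int cols) {
    assert(world_rank >= 0 && cols > 0);
    return world_rank / cols;      /* grid row index */
}

/* Hypothetical helper: the key orders ranks inside the new communicator.
   Using the column index makes the new rank equal to the column. */
int row_key(int world_rank, int cols) {
    assert(world_rank >= 0 && cols > 0);
    return world_rank % cols;      /* grid column index */
}

/* Inside an MPI program (after MPI_Init), the values would be used as:
 *
 *   MPI_Comm row_comm;
 *   MPI_Comm_split(MPI_COMM_WORLD,
 *                  row_color(rank, cols), row_key(rank, cols),
 *                  &row_comm);
 *   ... collectives on row_comm involve only that grid row ...
 *   MPI_Comm_free(&row_comm);
 */
```

With 8 ranks and cols = 4, world ranks 0 to 3 land in one communicator and ranks 4 to 7 in another, each numbered 0 to 3 internally.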

Essential MPI Functions

| Function | Purpose |
|---|---|
| MPI_Init | Initialize MPI (must be first MPI call) |
| MPI_Finalize | Clean up MPI (must be last MPI call) |
| MPI_Comm_rank | Get this process’s rank |
| MPI_Comm_size | Get total number of processes |
| MPI_Send | Blocking send to specific rank |
| MPI_Recv | Blocking receive from specific rank |
| MPI_Isend | Non-blocking send |
| MPI_Irecv | Non-blocking receive |
| MPI_Bcast | Broadcast from root to all |
| MPI_Scatter | Distribute array from root to all |
| MPI_Gather | Collect values from all to root |
| MPI_Reduce | Reduce values from all to root (sum, max, etc.) |
| MPI_Allreduce | Reduce values and distribute result to all |
| MPI_Barrier | Synchronize all processes |
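
MPI_Scatter hands every rank the same number of elements, so it fits only when the array length divides evenly by the process count. When it does not, the usual approach is MPI_Scatterv (not listed in the table), which takes per-rank counts and displacements. The sketch below shows one common way to compute them, giving the first `n % size` ranks one extra element. `block_counts` is a hypothetical helper name, not an MPI function.

```c
#include <assert.h>

/* Fill counts[r] and displs[r] for r in [0, size) so that n elements are
   split into contiguous blocks whose sizes differ by at most one.
   Hypothetical helper; the arrays have the layout that the sendcounts and
   displs arguments of MPI_Scatterv expect. */
void block_counts(int n, int size, int *counts, int *displs) {
    assert(n >= 0 && size > 0);
    int base = n / size;       /* every rank gets at least this many */
    int extra = n % size;      /* the first `extra` ranks get one more */
    int offset = 0;
    for (int r = 0; r < size; r++) {
        counts[r] = base + (r < extra ? 1 : 0);
        displs[r] = offset;    /* start index of rank r's block */
        offset += counts[r];
    }
}
```

For n = 10 and size = 3 this yields counts {4, 3, 3} and displacements {0, 4, 7}.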

Hello World in MPI/C

/* hello_mpi.c — MPI Hello World */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>   /* gethostname */

int main(int argc, char *argv[]) {
    int rank, size;
    char hostname[256];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    gethostname(hostname, sizeof(hostname));
    printf("Hello from rank %d of %d on host %s\n", rank, size, hostname);

    MPI_Finalize();
    return 0;
}

# Compile
mpicc -o hello_mpi hello_mpi.c

# Run on 4 processes (local)
mpirun -np 4 ./hello_mpi

# Run across multiple nodes
mpirun -np 16 --hostfile hosts.txt ./hello_mpi
# hosts.txt: one hostname per line, with optional slot count
# node01 slots=8
# node02 slots=8

Point-to-Point Communication

/* ping_pong.c: Measure one-way transfer time and bandwidth between ranks 0 and 1 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, size;
    int n = 1000000;    // 1M doubles = 8 MB message
    double *buf;
    double t_start, t_end, one_way, bandwidth;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        fprintf(stderr, "Needs at least 2 processes\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* calloc zero-fills the buffer so no uninitialized memory is sent */
    buf = (double *)calloc(n, sizeof(double));
    if (buf == NULL) {
        fprintf(stderr, "Rank %d: allocation failed\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    if (rank == 0) {
        t_start = MPI_Wtime();

        MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        t_end = MPI_Wtime();
        /* Half the round trip approximates the one-way transfer time. For an
           8 MB message this is dominated by bandwidth; to measure latency,
           rerun with a very small message (e.g. n = 1). A single round trip
           is noisy, so average over many iterations for stable numbers. */
        one_way = (t_end - t_start) / 2.0;
        bandwidth = (n * sizeof(double)) / one_way / 1e9;   // GB/s

        printf("One-way time: %.2f µs, Bandwidth: %.2f GB/s\n",
               one_way * 1e6, bandwidth);
    } else if (rank == 1) {
        MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

MPI_Reduce: Pi Calculation Example

This classic example demonstrates collective reduction where each process computes a portion of the Pi approximation:

/* mpi_pi.c — Compute Pi using Leibniz formula with MPI_Reduce */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    long long n = 1000000000LL;   // 1 billion terms
    double local_sum = 0.0, pi = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank computes its portion of the sum */
    for (long long i = rank; i < n; i += size) {
        double sign = (i % 2 == 0) ? 1.0 : -1.0;
        local_sum += sign / (2.0 * i + 1.0);
    }

    /* Sum all local results to rank 0 */
    MPI_Reduce(&local_sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        pi *= 4.0;
        printf("Computed Pi = %.15f (actual = 3.141592653589793)\n", pi);
    }

    MPI_Finalize();
    return 0;
}

mpicc -O3 -o mpi_pi mpi_pi.c
mpirun -np 64 ./mpi_pi

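With 64 MPI processes, each on its own core (on a single node or across nodes), this runs close to 64× faster than the serial version: the loop iterations are independent, and the only communication is the final MPI_Reduce. Because the summation order depends on the process count, the last few digits of the result can differ between runs with different -np values.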
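
The decomposition used in mpi_pi.c can be checked without MPI. The sketch below reproduces the per-rank loop as a plain C function and adds up the partial sums in a serial loop, standing in for what MPI_Reduce with MPI_SUM does. `leibniz_partial` and `simulate_reduce` are hypothetical names introduced here, not part of the original program.

```c
#include <assert.h>

/* Partial Leibniz sum owned by one rank:
   terms rank, rank + size, rank + 2*size, ... below n. */
double leibniz_partial(int rank, int size, long long n) {
    assert(size > 0 && rank >= 0 && rank < size);
    double local_sum = 0.0;
    for (long long i = rank; i < n; i += size) {
        double sign = (i % 2 == 0) ? 1.0 : -1.0;
        local_sum += sign / (2.0 * i + 1.0);
    }
    return local_sum;
}

/* Serial stand-in for MPI_Reduce(..., MPI_SUM, ...): add every rank's part,
   then scale, since the Leibniz series converges to pi / 4. */
double simulate_reduce(int size, long long n) {
    double total = 0.0;
    for (int rank = 0; rank < size; rank++) {
        total += leibniz_partial(rank, size, n);
    }
    return 4.0 * total;
}
```

Calling simulate_reduce(1, n) and simulate_reduce(4, n) gives results that agree to many digits but not necessarily bit for bit, because floating-point addition is not associative and the two calls add the terms in a different order.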

SLURM Job Submission with MPI

#!/bin/bash
#SBATCH --job-name=mpi_simulation
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=64
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G
#SBATCH --time=06:00:00
#SBATCH --partition=compute

# Load MPI environment
module load openmpi/4.1.6

# Run with srun (preferred for SLURM+MPI integration via PMIx)
srun ./my_simulation --config=run.cfg

# Alternative: mpirun (use when srun has issues with specific applications).
# Use one launcher or the other, not both, or the simulation runs twice.
# mpirun -np $SLURM_NTASKS \
#   --map-by socket \
#   --bind-to core \
#   ./my_simulation --config=run.cfg

Key SLURM environment variables available inside MPI jobs:

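| Variable | Description |
|---|---|
| $SLURM_JOB_ID | Job ID |
| $SLURM_NTASKS | Total number of MPI processes |
| $SLURM_NTASKS_PER_NODE | MPI processes per node (set only when --ntasks-per-node is given) |
| $SLURM_NNODES | Total number of nodes |
| $SLURM_NODELIST | List of allocated nodes (older alias of $SLURM_JOB_NODELIST) |
| $SLURM_SUBMIT_DIR | Directory where sbatch was run |
| $SLURM_JOB_NODELIST | List of allocated nodes (same value as $SLURM_NODELIST) |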

Performance Tips

Use srun instead of mpirun for SLURM jobs:

# srun uses SLURM's PMIx/PMI2 for process management, which is more reliable
srun --mpi=pmix ./simulation
# vs mpirun which manages its own process launch

CPU binding for NUMA-aware performance:

# Bind each MPI process to a specific core (prevents OS migration)
mpirun --bind-to core --map-by core ./simulation

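# For hybrid MPI+OpenMP: one MPI rank per socket, OpenMP threads fill the socket.
# The two directives below belong in the #SBATCH header at the top of the script.
# 2 sockets per node, so 2 MPI ranks per node:
#SBATCH --ntasks-per-node=2
# 96 cores per socket on EPYC 9654:
#SBATCH --cpus-per-task=96
export OMP_NUM_THREADS=96
# Some SLURM releases do not pass --cpus-per-task from sbatch to srun, so repeat it
srun --cpus-per-task=96 --cpu-bind=sockets ./hybrid_sim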

Avoid small-message ping-pong overhead with aggregation:

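/* Illustrative fragment: num_fields, field, and packed_buf are placeholder names */

/* Bad: many small MPI_Send calls to the same neighbor per timestep */
for (int i = 0; i < num_fields; i++) {
    MPI_Send(field[i], 100, MPI_DOUBLE, neighbor, tag + i, comm);
}

/* Better: pack all data bound for that neighbor into one message */
MPI_Send(packed_buf, total_size, MPI_DOUBLE, neighbor, tag, comm);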
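
The benefit of aggregation can be estimated with the common latency-bandwidth cost model, in which sending m bytes takes roughly alpha + m / beta seconds (alpha is the fixed per-message latency, beta is the bandwidth in bytes per second). The sketch below compares k separate messages with one packed message. The function names are hypothetical, and the figures in the note after the code are assumed values for illustration, not measurements.

```c
#include <assert.h>

/* Estimated time for one message of `bytes` bytes under the
   latency-bandwidth model: alpha seconds fixed cost plus bytes / beta. */
double message_time(double alpha, double beta, double bytes) {
    assert(alpha >= 0.0 && beta > 0.0 && bytes >= 0.0);
    return alpha + bytes / beta;
}

/* k separate messages pay the latency k times. */
double separate_time(double alpha, double beta, int k, double bytes_each) {
    assert(k >= 0);
    return k * message_time(alpha, beta, bytes_each);
}

/* One packed message carries the same payload but pays the latency once. */
double packed_time(double alpha, double beta, int k, double bytes_each) {
    assert(k >= 0);
    return message_time(alpha, beta, k * bytes_each);
}
```

Assuming alpha = 2e-6 s, beta = 1e10 bytes/s, and 50 messages of 800 bytes (100 doubles each), the model gives about 104 µs for separate sends versus about 6 µs for one packed send. Real networks deviate from this model, so treat the ratio as a rough guide only.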

Common Errors

Deadlock from mismatched send/receive:

// WRONG: every rank sends before receiving. MPI_Send may block until the
// matching receive is posted (typical for large messages), so all ranks
// can end up waiting on each other forever.
MPI_Send(sbuf, n, MPI_DOUBLE, neighbor, tag, comm);
MPI_Recv(rbuf, n, MPI_DOUBLE, neighbor, tag, comm, MPI_STATUS_IGNORE);

// CORRECT: start a non-blocking send, receive, then wait for the send.
// sbuf must not be modified until MPI_Wait returns.
MPI_Request req;
MPI_Isend(sbuf, n, MPI_DOUBLE, neighbor, tag, comm, &req);
MPI_Recv(rbuf, n, MPI_DOUBLE, neighbor, tag, comm, MPI_STATUS_IGNORE);
MPI_Wait(&req, MPI_STATUS_IGNORE);

// Or use MPI_Sendrecv which handles the ordering
MPI_Sendrecv(sbuf, n, MPI_DOUBLE, dest, stag,
             rbuf, n, MPI_DOUBLE, src,  rtag,
             comm, MPI_STATUS_IGNORE);

Missing MPI_Finalize:

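// If every rank detects the same error (e.g. a bad config file read by all),
// shut down cleanly. MPI_Finalize is collective, so all ranks must reach it.
if (error_condition) {
    MPI_Finalize();    // clean shutdown
    return 1;
}
// If only some ranks hit the error, use MPI_Abort to terminate the whole job;
// a lone MPI_Finalize would leave the other ranks hanging.
MPI_Abort(MPI_COMM_WORLD, 1);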

Process count mismatch:

# Application hardcodes a process count that doesn't match sbatch
# Always use $SLURM_NTASKS or argv-based process count
mpirun -np $SLURM_NTASKS ./simulation    # correct
mpirun -np 128 ./simulation              # wrong if sbatch has --ntasks != 128
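
A program can also guard against this mismatch itself. The sketch below reads SLURM_NTASKS from the environment and falls back to a caller-supplied value when the variable is missing or malformed, for example when the program runs outside SLURM. `parse_ntasks` and `slurm_ntasks_or` are hypothetical helper names, and the sanity cap of one billion tasks is an arbitrary assumption.

```c
#include <assert.h>
#include <stdlib.h>

/* Parse a task count from text; return `fallback` if the text is missing
   or is not a positive integer. Hypothetical helper. */
int parse_ntasks(const char *text, int fallback) {
    assert(fallback > 0);
    if (text == NULL || *text == '\0') return fallback;
    char *end = NULL;
    long value = strtol(text, &end, 10);
    /* Reject trailing garbage, non-positive values, and absurdly large ones */
    if (*end != '\0' || value <= 0 || value > 1000000000L) return fallback;
    return (int)value;
}

/* Task count exported by SLURM, or `fallback` when the variable is unusable. */
int slurm_ntasks_or(int fallback) {
    return parse_ntasks(getenv("SLURM_NTASKS"), fallback);
}
```

After MPI_Comm_size, a program could compare its size with slurm_ntasks_or(size) and print a warning on rank 0 when the two differ, which catches a hardcoded -np that does not match the allocation.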

MPI is a deep standard with hundreds of functions beyond what is covered here. The key to efficient MPI programming is understanding collective operations (which are highly optimized over InfiniBand) and minimizing the number of synchronization points per unit of computation. Contact Mevasis for MPI application optimization and HPC cluster deployment services.