MPI Programming Guide: Installation, Communicators, Collective Operations, and SLURM
Comprehensive MPI programming guide: MPI vs OpenMP comparison, OpenMPI installation, rank and communicator concepts, MPI function table, hello world in C, point-to-point communication, MPI_Reduce Pi calculation, SLURM job submission with MPI, environment variables, performance tips, and common errors.
MPI (Message Passing Interface) is the standard communication library for distributed-memory parallel computing on HPC clusters. While GPU frameworks and shared-memory threading dominate single-node parallelism, MPI remains the only standard way to parallelize computation across multiple physical servers — which is what HPC clusters exist to enable.
MPI vs OpenMP: When to Use Each
The two most important parallel programming models for HPC:
| Aspect | MPI | OpenMP |
|---|---|---|
| Memory model | Distributed (each process has private memory) | Shared (all threads share memory) |
| Scope | Across nodes in a cluster | Within a single node (multiple cores) |
| Communication | Explicit send/receive or collective operations | Shared variables, no explicit communication |
| Scalability | Thousands of nodes | Limited to cores on one node |
| Programming complexity | Higher (explicit communication) | Lower (pragma directives) |
| Best for | Large-scale cluster simulations | Shared-memory parallelism, loop acceleration |
Many production HPC codes use hybrid MPI+OpenMP: one MPI process per NUMA domain (or per socket), with OpenMP threads filling the available cores within each MPI process. This reduces inter-node message count while still utilizing all compute cores.
OpenMPI Installation
# Ubuntu/Debian
apt-get update && apt-get install -y \
openmpi-bin libopenmpi-dev \
libpmix-dev # for SLURM PMIx integration
# RHEL/Rocky Linux
dnf install openmpi openmpi-devel
# Or compile from source (for InfiniBand-optimized build)
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.6.tar.gz
tar xzf openmpi-4.1.6.tar.gz && cd openmpi-4.1.6
./configure \
--prefix=/opt/openmpi-4.1.6 \
--with-ucx=/opt/ucx \ # UCX for InfiniBand/RoCE
--with-pmix \ # PMIx for SLURM integration
--enable-mpi-cxx \
--enable-shared
make -j8 && make install
# Load module (if using environment modules)
module load openmpi/4.1.6
mpirun --version
For InfiniBand clusters, always compile OpenMPI with UCX (Unified Communication X) support. UCX selects the optimal transport (InfiniBand RDMA, shared memory, TCP) automatically:
# Verify UCX/InfiniBand integration
ucx_info -d | grep ib
ompi_info | grep "Open UCX"
Rank and Communicator Concepts
MPI_COMM_WORLD: The default communicator that includes all processes in an MPI job. Every MPI application starts with this communicator.
Rank: A unique integer identifier (0 to N-1) assigned to each process within a communicator. Process 0 is conventionally the “root” that collects results or performs I/O.
Communicator: A group of processes that can communicate with each other. Custom communicators can be created for subsets of processes (useful for domain decomposition).
Essential MPI Functions
| Function | Purpose |
|---|---|
MPI_Init | Initialize MPI (must be first MPI call) |
MPI_Finalize | Clean up MPI (must be last MPI call) |
MPI_Comm_rank | Get this process’s rank |
MPI_Comm_size | Get total number of processes |
MPI_Send | Blocking send to specific rank |
MPI_Recv | Blocking receive from specific rank |
MPI_Isend | Non-blocking send |
MPI_Irecv | Non-blocking receive |
MPI_Bcast | Broadcast from root to all |
MPI_Scatter | Distribute array from root to all |
MPI_Gather | Collect values from all to root |
MPI_Reduce | Reduce values from all to root (sum, max, etc.) |
MPI_Allreduce | Reduce values and distribute result to all |
MPI_Barrier | Synchronize all processes |
Hello World in MPI/C
/* hello_mpi.c — MPI Hello World */
#include <mpi.h>
#include <stdio.h>
#include <string.h>
int main(int argc, char *argv[]) {
int rank, size;
char hostname[256];
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
gethostname(hostname, sizeof(hostname));
printf("Hello from rank %d of %d on host %s\n", rank, size, hostname);
MPI_Finalize();
return 0;
}
# Compile
mpicc -o hello_mpi hello_mpi.c
# Run on 4 processes (local)
mpirun -np 4 ./hello_mpi
# Run across multiple nodes
mpirun -np 16 --hostfile hosts.txt ./hello_mpi
# hosts.txt: one hostname per line, with optional slot count
# node01 slots=8
# node02 slots=8
Point-to-Point Communication
/* ping_pong.c — Measure latency and bandwidth between ranks 0 and 1 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main(int argc, char *argv[]) {
int rank, size;
int n = 1000000; // 1M doubles = 8 MB message
double *buf;
double t_start, t_end, latency, bandwidth;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (size < 2) {
fprintf(stderr, "Needs at least 2 processes\n");
MPI_Abort(MPI_COMM_WORLD, 1);
}
buf = (double *)malloc(n * sizeof(double));
if (rank == 0) {
t_start = MPI_Wtime();
MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
MPI_Recv(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
t_end = MPI_Wtime();
latency = (t_end - t_start) / 2.0;
bandwidth = (n * sizeof(double)) / latency / 1e9; // GB/s
printf("Latency: %.2f µs, Bandwidth: %.2f GB/s\n",
latency * 1e6, bandwidth);
} else if (rank == 1) {
MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Send(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
}
free(buf);
MPI_Finalize();
return 0;
}
MPI_Reduce: Pi Calculation Example
This classic example demonstrates collective reduction where each process computes a portion of the Pi approximation:
/* mpi_pi.c — Compute Pi using Leibniz formula with MPI_Reduce */
#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
int rank, size;
long long n = 1000000000LL; // 1 billion terms
double local_sum = 0.0, pi = 0.0;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
/* Each rank computes its portion of the sum */
for (long long i = rank; i < n; i += size) {
double sign = (i % 2 == 0) ? 1.0 : -1.0;
local_sum += sign / (2.0 * i + 1.0);
}
/* Sum all local results to rank 0 */
MPI_Reduce(&local_sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0) {
pi *= 4.0;
printf("Computed Pi = %.15f (actual = 3.141592653589793)\n", pi);
}
MPI_Finalize();
return 0;
}
mpicc -O3 -o mpi_pi mpi_pi.c
mpirun -np 64 ./mpi_pi
With 64 MPI processes on a single node (or across nodes), this runs 64× faster than the serial version due to perfect parallelism (no communication overhead except the final MPI_Reduce).
SLURM Job Submission with MPI
#!/bin/bash
#SBATCH --job-name=mpi_simulation
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=64
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G
#SBATCH --time=06:00:00
#SBATCH --partition=compute
# Load MPI environment
module load openmpi/4.1.6
# Run with srun (preferred for SLURM+MPI integration via PMIx)
srun ./my_simulation --config=run.cfg
# Alternative: mpirun (use when srun has issues with specific applications)
mpirun -np $SLURM_NTASKS \
--map-by socket \
--bind-to core \
./my_simulation --config=run.cfg
Key SLURM environment variables available inside MPI jobs:
| Variable | Description |
|---|---|
$SLURM_JOB_ID | Job ID |
$SLURM_NTASKS | Total number of MPI processes |
$SLURM_NTASKS_PER_NODE | MPI processes per node |
$SLURM_NNODES | Total number of nodes |
$SLURM_NODELIST | List of allocated nodes |
$SLURM_SUBMIT_DIR | Directory where sbatch was run |
$SLURM_JOB_NODELIST | Node list (same as NODELIST) |
Performance Tips
Use srun instead of mpirun for SLURM jobs:
# srun uses SLURM's PMIx/PMI2 for process management, which is more reliable
srun --mpi=pmix ./simulation
# vs mpirun which manages its own process launch
CPU binding for NUMA-aware performance:
# Bind each MPI process to a specific core (prevents OS migration)
mpirun --bind-to core --map-by core ./simulation
# For hybrid MPI+OpenMP: one MPI per socket, OpenMP fills the socket
#SBATCH --ntasks-per-node=2 (2 sockets per node)
#SBATCH --cpus-per-task=96 (96 cores per socket on EPYC 9654)
export OMP_NUM_THREADS=96
srun --cpu-bind=socket ./hybrid_sim
Avoid small-message ping-pong overhead with aggregation:
# Bad: many small MPI_Send calls per timestep
for i in neighbors:
MPI_Send(small_buf, 100, MPI_DOUBLE, i, tag, comm)
# Better: pack all neighbor data into one message
MPI_Send(packed_buf, total_size, MPI_DOUBLE, neighbor, tag, comm)
Common Errors
Deadlock from mismatched send/receive:
// WRONG: All ranks send before receiving — deadlock
MPI_Send(buf, n, MPI_DOUBLE, neighbor, tag, comm); // blocks until received
MPI_Recv(buf, n, MPI_DOUBLE, neighbor, tag, comm, &status);
// CORRECT: Use non-blocking sends
MPI_Request req;
MPI_Isend(buf, n, MPI_DOUBLE, neighbor, tag, comm, &req);
MPI_Recv(rbuf, n, MPI_DOUBLE, neighbor, tag, comm, &status);
MPI_Wait(&req, &status);
// Or use MPI_Sendrecv which handles the ordering
MPI_Sendrecv(sbuf, n, MPI_DOUBLE, dest, stag,
rbuf, n, MPI_DOUBLE, src, rtag,
comm, &status);
Missing MPI_Finalize:
// Always call MPI_Finalize before exit, even on error
if (error_condition) {
MPI_Finalize(); // clean shutdown
return 1;
}
// Or use MPI_Abort for fatal errors
MPI_Abort(MPI_COMM_WORLD, 1);
Process count mismatch:
# Application hardcodes a process count that doesn't match sbatch
# Always use $SLURM_NTASKS or argv-based process count
mpirun -np $SLURM_NTASKS ./simulation # correct
mpirun -np 128 ./simulation # wrong if sbatch has --ntasks != 128
MPI is a deep standard with hundreds of functions beyond what is covered here. The key to efficient MPI programming is understanding collective operations (which are highly optimized over InfiniBand) and minimizing the number of synchronization points per unit of computation. Contact Mevasis for MPI application optimization and HPC cluster deployment services.