InfiniBand Technical Guide: Fabric Architecture, Subnet Manager, Installation, and Troubleshooting
InfiniBand fabric architecture technical guide: why InfiniBand (RDMA, OS bypass), HDR200 vs NDR400 comparison table, fat-tree topology, Subnet Manager (OpenSM) configuration, 4-phase installation process, acceptance tests, common problems, and production monitoring.
InfiniBand is the network fabric that makes large-scale parallel HPC possible. While Ethernet handles enterprise traffic adequately, the fundamental architecture of TCP/IP networking, with its OS kernel processing overhead, packet buffering, and millisecond-scale congestion recovery, is poorly matched to the microsecond synchronization requirements of tightly coupled MPI workloads. InfiniBand takes a different approach: RDMA (Remote Direct Memory Access) over a dedicated fabric with hardware-based flow control.
Why InfiniBand?
The fundamental problem with Ethernet for MPI: Every message in a standard TCP/IP stack requires the operating system kernel to copy data between buffers, manage packet framing, and process acknowledgments. For small MPI messages (4–256 bytes) exchanged thousands of times per second, this kernel overhead is the dominant cost — not network transit time.
InfiniBand’s solution is RDMA: Remote Direct Memory Access lets a node read from or write to another node’s memory directly. The adapters move the data, so the transfer bypasses the operating system kernel on both sides and does not interrupt the remote CPU. Data flows from one application buffer, across the InfiniBand fabric, and into another application buffer. Result: 1–2 µs end-to-end latency vs. 50–200 µs for standard Ethernet.
Impact on HPC workloads:
- MPI parallel simulations: Every barrier synchronization and point-to-point transfer benefits directly from sub-2 µs latency. For a simulation with 1000 MPI sync points per second, the difference is 2 ms vs. 200 ms per second of overhead.
- Distributed deep learning: All-reduce operations in NCCL for gradient synchronization are bandwidth-bound. InfiniBand NDR at 400 Gb/s delivers 50 GB/s per port, 4× the 12.5 GB/s of standard 100 GbE.
- Large-scale genomics: Distributed memory algorithms like De Bruijn graph assembly require continuous data exchange across nodes. InfiniBand makes this practical at scale.
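The overhead and bandwidth figures in this list follow from simple arithmetic. A minimal sketch in shell, assuming the per-message latencies quoted above; the function names are illustrative and not part of any InfiniBand tool:

```shell
# Hypothetical helpers for back-of-the-envelope fabric arithmetic.

# Synchronization overhead in ms per second of runtime:
#   sync points per second x per-message latency (µs) / 1000
sync_overhead_ms() {
    awk -v n="$1" -v lat_us="$2" 'BEGIN { printf "%.1f\n", n * lat_us / 1000 }'
}

# Convert a link rate in Gb/s to GB/s (8 bits per byte, encoding overhead ignored).
gbit_to_gbyte() {
    awk -v g="$1" 'BEGIN { printf "%.1f\n", g / 8 }'
}

sync_overhead_ms 1000 2      # InfiniBand at 2 µs per sync   -> 2.0
sync_overhead_ms 1000 200    # Ethernet at 200 µs per sync   -> 200.0
gbit_to_gbyte 400            # NDR port                      -> 50.0
gbit_to_gbyte 100            # 100 GbE port                  -> 12.5
```

Plugging in measured latencies from your own cluster gives a quick estimate of whether a workload is latency-bound or bandwidth-bound.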
Speed Tier Selection: HDR200 vs NDR400
| Standard | Port Speed | Typical Use Case | Notes |
|---|---|---|---|
| HDR (High Data Rate) | 200 Gb/s | Enterprise HPC, university clusters | Mature ecosystem, lower cost |
| NDR (Next Data Rate) | 400 Gb/s | Large GPU clusters, AI infrastructure | Latest generation, higher port cost |
| HDR100 | 100 Gb/s | Small clusters, transition systems | Half of HDR200 via cable bifurcation |
| EDR | 100 Gb/s | Legacy systems | Not recommended for new deployment |
HDR200 vs NDR400 decision:
HDR200 (NVIDIA Mellanox ConnectX-6 HCA + Quantum switches) remains the right choice for most enterprise HPC deployments in 2025. The ecosystem is mature, driver support is stable, and the price-per-port is well-established. For clusters up to 500 nodes running CPU simulations, HDR200 is rarely a bottleneck.
NDR400 (NVIDIA ConnectX-7 HCA + Quantum-2 switches) is appropriate for:
- GPU clusters with all-to-all communication (NCCL all-reduce)
- Fabrics with more than 128 GPU nodes that require non-blocking bandwidth
- Large-scale investments that need to be future-proofed
The cost difference per port is significant (~40% premium for NDR over HDR). Justify the premium with workload bandwidth measurements.
Fat-Tree Topology
InfiniBand fabrics almost universally use fat-tree topology:
             [Core Switch]
          /        |        \
 [Leaf-1]      [Leaf-2]       [Leaf-3]
 /  |  |  \    /  |  |  \     /  |   |   \
N1 N2 N3 N4   N5 N6 N7 N8   N9 N10 N11 N12
Key properties:
- Each leaf switch connects compute nodes to the fabric
- Core switches connect leaf switches to each other
- Any two nodes are at most three switches apart (leaf → core → leaf), which means at most two inter-switch links on the path
- A fat-tree is non-blocking when every leaf switch has as much uplink bandwidth as downlink bandwidth
Oversubscription ratio is the critical design parameter:
- 1:1 (non-blocking): Every node can simultaneously communicate with any other at full port speed. Maximum cost.
- 2:1: Each leaf switch has half as much uplink bandwidth as downlink bandwidth, so peak bandwidth between leaf switches is half of non-blocking. Acceptable for most HPC.
- 4:1 or higher: Only for loosely-coupled workloads; tight MPI will experience significant congestion.
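The ratio itself is just downlink ports divided by uplink ports on a leaf switch, assuming all ports run at the same speed. A small sketch with a hypothetical helper name, using a 40-port leaf switch as the assumed example:

```shell
# oversub_ratio DOWNLINKS UPLINKS
# Prints the leaf-switch oversubscription ratio as "X.XX:1".
# Assumes every port runs at the same speed.
oversub_ratio() {
    awk -v d="$1" -v u="$2" 'BEGIN {
        if (u <= 0) { print "error: uplinks must be > 0" > "/dev/stderr"; exit 1 }
        printf "%.2f:1\n", d / u
    }'
}

oversub_ratio 20 20   # 20 nodes, 20 uplinks -> 1.00:1 (non-blocking)
oversub_ratio 26 13   # 26 nodes, 13 uplinks -> 2.00:1
oversub_ratio 32 8    # 32 nodes,  8 uplinks -> 4.00:1
```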
Subnet Manager Configuration
Every InfiniBand subnet requires exactly one master Subnet Manager (SM); any additional SMs remain in standby. The master SM discovers the fabric topology, assigns LIDs (Local Identifiers) to every port, computes routing tables, and distributes them to all switches.
OpenSM is the most widely used software SM, suitable for clusters up to several hundred nodes:
# Install OpenSM
apt-get install opensm # Debian/Ubuntu
yum install opensm # RHEL/CentOS
# Start and enable OpenSM
systemctl enable --now opensm
# Verify OpenSM is running and has completed LID assignment
tail -n 20 /var/log/opensm/opensm.log # path varies by package; the upstream default is /var/log/opensm.log
# Look for "SUBNET UP", which OpenSM logs when a fabric sweep completes
# Configure OpenSM for adaptive routing (performance optimization)
# /etc/opensm/opensm.conf
# Adaptive routing with the fat-tree engine. ar_ftree ships with NVIDIA's OpenSM build (MLNX_OFED/UFM); upstream OpenSM provides ftree instead.
routing_engine ar_ftree
# Don't reassign LIDs on restart (reduces fabric disruption)
reassign_lids FALSE
For large fabrics (> 500 nodes), hardware-based SM (built into enterprise InfiniBand switches) is preferred. Run OpenSM in standby mode on a host for failover.
# Run OpenSM in standby (SM priority = 1, active SM = priority 14)
opensm -p 1 -B # priority 1, daemon mode (-B/--daemon; -d sets debug flags instead)
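The priority values matter because the SM with the highest priority (0–15) becomes master, and a tie is broken in favor of the lowest GUID. A minimal sketch of that election rule, assuming a hypothetical input format of one "priority guid" pair per line, with GUIDs written as equal-length lowercase hex so that a plain string sort orders them correctly:

```shell
# elect_master: reads "PRIORITY GUID" lines on stdin and prints the winner.
# Highest priority first; equal priorities fall back to the lowest GUID.
# GUIDs below are made up for illustration.
elect_master() {
    LC_ALL=C sort -k1,1nr -k2,2 | head -n 1
}

printf '%s\n' \
    "1 0x0002c90300aaaa01" \
    "14 0x0002c90300bbbb02" | elect_master
# -> 14 0x0002c90300bbbb02
```

This is why the standby host is given a low priority: it takes over only when no higher-priority SM is reachable.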
Installation: 4-Phase Process
Phase 1 — Design and Capacity Planning
Before purchasing hardware, answer:
- How many compute nodes now and in 3 years?
- What is the traffic pattern (all-to-all vs. nearest-neighbor vs. sparse)?
- What is the per-node GPU count (determines aggregate bandwidth requirement)?
- What oversubscription ratio is acceptable for the workload?
These answers determine switch model, port count, and HCA selection. Selecting too few switch ports at initial deployment forces a complete fabric redesign at scale.
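A rough switch count can be derived from the answers above. The sketch below is a first-pass estimate for a two-level fat-tree only, assuming identical switches throughout and a hypothetical helper name; it ignores cabling symmetry and spare ports, so treat the output as a starting point for vendor sizing rather than a final design:

```shell
# size_fabric NODES RADIX DOWN
#   NODES: compute nodes to connect
#   RADIX: ports per switch (same model for leaf and core)
#   DOWN:  ports per leaf switch used for nodes; the rest are uplinks
size_fabric() {
    awk -v n="$1" -v radix="$2" -v down="$3" '
    function ceil(x) { return (x == int(x)) ? x : int(x) + 1 }
    BEGIN {
        up = radix - down
        if (down <= 0 || up <= 0) {
            print "error: need 0 < DOWN < RADIX" > "/dev/stderr"
            exit 1
        }
        leaves = ceil(n / down)              # enough leaf ports for all nodes
        cores  = ceil(leaves * up / radix)   # enough core ports for all uplinks
        printf "leaves=%d cores=%d oversubscription=%.2f:1\n", leaves, cores, down / up
    }'
}

size_fabric 400 40 20   # -> leaves=20 cores=10 oversubscription=1.00:1
size_fabric 400 40 26   # -> leaves=16 cores=6 oversubscription=1.86:1
```

Running it with the three-year node count rather than the day-one count shows whether the initial switch purchase leaves room to grow.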
Phase 2 — Physical Installation
# After physical cabling:
# 1. Power on switches before compute nodes
# 2. Verify all ports show link-up in switch management console
# 3. Check cable labeling matches as-built diagram
# After server HCAs are installed:
lspci | grep -i mellanox # verify HCA is recognized by OS
ibstat # verify port is Active
Test every InfiniBand cable and transceiver before the cluster goes into service. A poorly seated QSFP28/QSFP56 cable that shows as “Active” at 50% of the expected speed is much harder to find once the cluster is populated.
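A port that is Active but running below its expected rate can be caught automatically by filtering ibstat output. The following is a sketch with a hypothetical function name, assuming the usual ibstat layout of "State:" and "Rate:" lines under each "Port N:" heading:

```shell
# check_ibstat EXPECTED_RATE
# Reads `ibstat` output on stdin and flags every port that is not Active
# or whose rate (Gb/s) is below EXPECTED_RATE. Returns non-zero if any
# port needs attention.
# Usage: ibstat | check_ibstat 200
check_ibstat() {
    awk -v want="$1" '
    /^CA /                { ca = $2; gsub("\047", "", ca) }   # strip quotes around CA name
    /^[ \t]*Port [0-9]+:/ { port = $2; sub(/:/, "", port) }
    /^[ \t]*State:/       { state = $2 }
    /^[ \t]*Rate:/        {
        status = (state == "Active" && $2 + 0 >= want + 0) ? "OK" : "CHECK"
        printf "%s port %s: state=%s rate=%s %s\n", ca, port, state, $2, status
        if (status != "OK") bad = 1
    }
    END { exit (bad ? 1 : 0) }
    '
}
```

Running this on every node right after cabling, for example through a parallel shell, gives a list of links to reseat before any workload is scheduled.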
Phase 3 — MLNX_OFED Installation
MLNX_OFED (Mellanox OpenFabrics Enterprise Distribution) provides all InfiniBand drivers and user-space libraries:
# Download MLNX_OFED (match to OS version and CUDA version if GPU nodes)
# From https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/
# Install
tar xzf MLNX_OFED_LINUX-24.01-0.3.3.1-ubuntu22.04-x86_64.tgz
cd MLNX_OFED_LINUX-24.01-0.3.3.1-ubuntu22.04-x86_64
./mlnxofedinstall --with-nfsrdma --with-nvmf
# Verify InfiniBand stack
ibstat # Shows port state, speed, LID
ibv_devinfo # Shows device capabilities
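# Set HPC-oriented parameters
# Module parameters differ between driver releases; confirm enable_roce is listed by `modinfo -p mlx5_core` before relying on it
echo "options mlx5_core enable_roce=N" > /etc/modprobe.d/mlx5.conf
echo "net.ipv4.tcp_timestamps = 0" >> /etc/sysctl.d/mlx5.conf
sysctl --system # apply the sysctl change without a reboot (affects TCP traffic such as IPoIB only)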
Phase 4 — Acceptance Tests
Run all acceptance tests before declaring the fabric production-ready:
# Bandwidth test between two nodes. On server1 (server side):
ib_write_bw -d mlx5_0 -i 1 --duration 30
# On server2 (client):
ib_write_bw -d mlx5_0 -i 1 --duration 30 server1
# Expected for HDR200: ~23 GB/s effective (of 25 GB/s = 200 Gb/s theoretical)
# Latency test
ib_write_lat -d mlx5_0 -i 1 # server
ib_write_lat -d mlx5_0 -i 1 server1 # client
# Expected: < 1.5 µs for HDR, < 1 µs for NDR
# Collective test across the full fabric (Allreduce; add the Alltoall benchmark to stress bisection bandwidth)
mpirun -np 64 -hostfile hostfile ./IMB-MPI1 Allreduce
# Fabric health check
ibnetdiscover | grep "^Switch" # switch entries begin with "Switch"; confirm every switch is listed
ibdiagnet # comprehensive fabric diagnostics
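The bandwidth expectation can be turned into a simple pass/fail check. This is a sketch only: bw_efficiency is a hypothetical helper, the 90% default threshold is an assumption taken from the troubleshooting guidance in this guide, and the measured value still has to be read from the ib_write_bw report:

```shell
# bw_efficiency MEASURED_GBps LINK_Gbps [MIN_PERCENT]
# Compares a measured bandwidth (GB/s) against the link's line rate (Gb/s)
# and returns non-zero when it falls below MIN_PERCENT (default 90).
bw_efficiency() {
    awk -v m="$1" -v link="$2" -v min="${3:-90}" 'BEGIN {
        pct = m / (link / 8) * 100
        ok = (pct >= min)
        printf "%.1f%% of line rate: %s\n", pct, (ok ? "PASS" : "FAIL")
        exit (ok ? 0 : 1)
    }'
}

bw_efficiency 23 200   # -> 92.0% of line rate: PASS
```

Recording the result for every node pair tested makes it easier to spot a single slow link later.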
Common Problems
Port link-down after installation: Check SFP/QSFP transceiver seating (remove and reinsert), cable continuity, and switch port speed auto-negotiation settings. The ibstat port state should be “Active”. “Down” indicates a physical layer issue; “Initializing” means the physical link is up but no Subnet Manager has configured the port yet, so check that an SM is running.
Bandwidth below 90% of theoretical: Common causes are a PCIe generation mismatch (an HCA that needs Gen4 but negotiates Gen3 is bottlenecked at the HCA-to-CPU interface), wrong link width (check for a 1× link where 4× is expected), and an oversubscribed fabric path (sender and receiver on different leaf switches connected via congested uplinks).
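The PCIe bottleneck is easy to quantify. The sketch below computes raw PCIe bandwidth from the per-lane transfer rate and 128b/130b encoding used by Gen3 and later; pcie_gbps is a hypothetical helper, and the result ignores packet header overhead, so usable bandwidth is somewhat lower:

```shell
# pcie_gbps GENERATION LANES
# Raw PCIe bandwidth in Gb/s: GT/s per lane x lanes x 128/130 encoding.
pcie_gbps() {
    awk -v gen="$1" -v lanes="$2" 'BEGIN {
        gt[3] = 8; gt[4] = 16; gt[5] = 32       # GT/s per lane
        if (!(gen in gt)) {
            print "error: only Gen3 to Gen5 handled" > "/dev/stderr"
            exit 1
        }
        printf "%.1f\n", gt[gen] * lanes * 128 / 130
    }'
}

pcie_gbps 3 16   # -> 126.0  (below a 200 Gb/s HDR port)
pcie_gbps 4 16   # -> 252.1  (enough for HDR)
pcie_gbps 5 16   # -> 504.1  (enough for NDR at 400 Gb/s)
```

The negotiated generation and width of a slot appear in the LnkSta field of `lspci -vv`, which can be compared against LnkCap for the HCA.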
Multiple Subnet Manager conflict: If two SM instances are both in “Master” state, the fabric will oscillate and performance will be erratic. Run sminfo on any node — output shows which SM is currently master and its priority. Disable the unwanted SM.
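When collecting sminfo results from several nodes, it helps to reduce each line to the fields that matter. A sketch with a hypothetical helper name, assuming the common single-line sminfo format in which the keywords "lid" and "priority" are each followed by their value and the state name is the last field; the sample line is illustrative, not captured output:

```shell
# parse_sminfo: reads sminfo output on stdin and prints the SM LID,
# priority, and state name in a compact form.
# Usage: sminfo | parse_sminfo
parse_sminfo() {
    awk '{
        lid = prio = ""
        for (i = 1; i < NF; i++) {
            if ($i == "lid" && lid == "") lid = $(i + 1)
            if ($i == "priority") prio = $(i + 1)
        }
        printf "lid=%s priority=%s state=%s\n", lid, prio, $NF
    }'
}

echo "sminfo: sm lid 1 sm guid 0x2c9030000f0e10, activity count 3204 priority 14 state 3 SMINFO_MASTER" | parse_sminfo
# -> lid=1 priority=14 state=SMINFO_MASTER
```

If different nodes report different master LIDs, more than one SM believes it is master.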
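RDMA connection refused: Often caused by iptables rules that block traffic on the IPoIB interface. Native InfiniBand traffic uses its own protocol stack and is not filtered by iptables, but rdma_cm resolves addresses over IPoIB, and many tools exchange connection parameters over an ordinary TCP socket first. Ensure that iptables rules matching the IPoIB interface do not drop that traffic.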
Production Monitoring
# Continuous port error monitoring (run via cron or monitoring agent)
perfquery -a # aggregated counters for all ports of the local node
perfquery -x # extended (64-bit) data counters
# Key error counters to monitor:
# SymbolErrorCounter > 0: physical layer errors (cable/transceiver)
# PortRcvErrors > 0: receive errors
# PortXmitDiscards > 0: transmit queue discards (congestion)
# PortRcvRemotePhysicalErrors > 0: remote physical errors
# Prometheus: monitor via NVIDIA UFM or infiniband_exporter
# Alert: any port with increasing SymbolErrorCounter rate
# Fabric bandwidth utilization
perfquery -x -R # reset the extended counters only
sleep 60
perfquery -x | grep -E "PortXmitData|PortRcvData" # traffic over 60 seconds, counted in 4-byte units
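PortXmitData and PortRcvData count data in units of four bytes, so a counter difference has to be scaled before it reads as bandwidth. A minimal sketch of the conversion, with a hypothetical helper name and made-up sample values:

```shell
# counter_gbps BEFORE AFTER SECONDS
# Converts two readings of PortXmitData or PortRcvData, taken SECONDS apart,
# into average throughput in Gb/s. Each counter unit is 4 bytes (32 bits).
# awk uses double precision, so very large 64-bit values lose exactness.
counter_gbps() {
    awk -v before="$1" -v after="$2" -v secs="$3" 'BEGIN {
        printf "%.2f\n", (after - before) * 4 * 8 / secs / 1e9
    }'
}

counter_gbps 0 37500000000 60   # -> 20.00 (Gb/s averaged over 60 s)
```

Comparing the result with the port's line rate gives the utilization figure to feed into dashboards or alerts.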
InfiniBand fabric quality is the single most significant factor in whether an HPC cluster achieves its designed performance. Proper installation, acceptance testing, and ongoing monitoring ensure the fabric delivers on the hardware investment. Contact Mevasis for InfiniBand design, deployment, and commissioning services.