HPC Network Design Guide: Fat-Tree, InfiniBand, RoCE v2 Selection

No matter how powerful the compute nodes, an HPC cluster’s true performance is bounded by its network. MPI-parallel applications synchronize across nodes thousands of times per second; AI training collectives transfer gigabytes between GPUs with every backward pass. Network design for HPC is not the same discipline as enterprise network design — the performance requirements and failure modes are fundamentally different.

Why HPC Networking Is a Separate Discipline

Enterprise networks are designed for:

Availability: Any-to-any connectivity that survives component failures
Throughput: High aggregate bandwidth for many independent flows
Cost efficiency: Best bandwidth per dollar

HPC networks are designed for:

Latency: Message delivery between any two nodes in about a microsecond, ideally less
Bandwidth per node: Full line-rate access to the fabric for every node simultaneously
Non-blocking: Two nodes can communicate at full speed regardless of what other nodes are doing

These requirements conflict with cost efficiency. In a non-blocking fat-tree built from identical switches, every leaf switch dedicates as many ports to uplinks as to compute nodes, so a two-tier fabric consumes about three switch ports per node and a three-tier fabric about five. That is expensive by enterprise standards, but necessary for tight MPI performance.

Fat-Tree: Why This Topology?

Fat-tree is the standard HPC network topology for clusters from tens to thousands of nodes. Classical tree topologies have a bandwidth bottleneck at the root, because traffic between branches funnels through progressively fewer links. Fat-tree solves this by keeping the aggregate uplink bandwidth at each tier equal to the aggregate downlink bandwidth, so every node has equal access to full fabric bandwidth regardless of where it sits in the topology.

Three-tier fat-tree structure:

Tier 1 — Leaf (Top-of-Rack) switches: Collect compute nodes within each rack. Each compute node connects to a leaf switch with a high-speed port (for example 100 GbE or 200 Gb/s InfiniBand). In a non-blocking design, leaf switches have equal numbers of downlinks (to compute) and uplinks (to aggregation).

Tier 2 — Aggregation switches: Connect leaf switches to each other and to core switches. The oversubscription ratio at this tier is a critical design parameter:

1:1 = no oversubscription, non-blocking (maximum cost, maximum performance)
2:1 = 50% of peak bandwidth between leaf groups when all nodes transmit at once (typical for most HPC)
4:1 = 25% of peak bandwidth under the same load (acceptable only for loosely-coupled workloads)

Tier 3 — Core switches: Connect all aggregation switches. High port density and low latency are the priority here. For InfiniBand, these are typically director-class switches (NVIDIA Quantum-2 NDR).

                    [Core]           [Core]
                   /     \          /     \
          [Aggr-1]         [Aggr-2]         [Aggr-3]
          /  |  \          /  |  \          /  |  \
      [L1] [L2] [L3]  [L4] [L5] [L6]  [L7] [L8] [L9]
      |||   |||  |||   |||  |||  |||   |||  |||  |||
     nodes nodes nodes ...

The diagram is simplified: in practice each leaf switch connects to multiple aggregation switches (not just one) to provide multiple paths between any two nodes. ECMP (Equal-Cost Multi-Path) on Ethernet, or adaptive routing on InfiniBand, spreads traffic across all of these paths simultaneously.
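To make the cost of a non-blocking design concrete, the following is a minimal Python sketch that sizes a fat-tree built entirely from identical switches. It assumes the textbook construction (every switch splits its ports evenly between downlinks and uplinks, and all links run at the same speed); the function name and the 40-port radix in the example are illustrative choices, not values taken from a specific product or from this guide. For the three-tier case, a "pod" means one group of leaf and aggregation switches.

```python
def fat_tree_capacity(radix: int, tiers: int) -> dict:
    """Size a non-blocking fat-tree built from identical switches.

    radix: ports per switch (even, so ports split evenly into down/up).
    tiers: 2 (leaf + spine) or 3 (leaf + aggregation + core).
    """
    if radix < 2 or radix % 2:
        raise ValueError("radix must be an even number >= 2")
    if tiers not in (2, 3):
        raise ValueError("only 2- and 3-tier fabrics are modelled")

    half = radix // 2
    if tiers == 2:
        hosts = radix * half                    # radix leaves, half hosts each
        switches = radix + half                 # radix leaves + half spines
    else:
        hosts = radix * half * half             # radix pods x half leaves x half hosts
        switches = radix * radix + half * half  # pod switches + core switches
    return {
        "hosts": hosts,
        "switches": switches,
        # Total switch ports in the fabric divided by the nodes it serves.
        "switch_ports_per_host": switches * radix / hosts,
    }


if __name__ == "__main__":
    # Hypothetical 40-port switches, chosen only for illustration.
    for tiers in (2, 3):
        print(tiers, fat_tree_capacity(radix=40, tiers=tiers))
```

Under these assumptions the port cost per node is independent of the switch radix: three switch ports per node for two tiers and five for three tiers. Oversubscribed designs reduce that figure by removing uplinks.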

VLAN Segmentation: Four Network Segments

A single physical fabric serves multiple logical purposes, separated by VLAN:

Management VLAN (VLAN 10): IPMI/BMC out-of-band management, PXE boot traffic, DHCP. Must remain functional even when compute nodes are failing. Accessible only from dedicated admin hosts. If a compute node floods the management network with traffic, other nodes must still be manageable via IPMI.

Compute/MPI VLAN (VLAN 20): All MPI inter-process communication. This is the performance-critical segment. L2 flat (no routing between compute nodes) to minimize hop count and latency. Jumbo frames (MTU 9000) mandatory. Never shared with other traffic types.

Storage VLAN (VLAN 30): BeeGFS or Lustre parallel filesystem I/O. Reachable from compute nodes, but carried on its own VLAN, separate from both MPI and management traffic. Compute nodes access the storage servers via this segment. Storage traffic peaks during checkpoint writes from running jobs, so do not let it compete with MPI traffic.

User/External VLAN (VLAN 40): User SSH access via login nodes. Filtered by firewall. Never allows direct access to compute or management VLANs.
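One way to keep this segmentation from eroding over time is to check proposed inter-VLAN firewall rules against the plan before they are applied. The sketch below is a hypothetical helper, not part of any switch or firewall API: the VLAN IDs and the two forbidden flows come from the descriptions above, while the rule format (pairs of source and destination VLAN IDs) and the function name are assumptions made for illustration.

```python
# VLAN IDs and roles as listed in this section.
VLANS = {10: "management", 20: "compute", 30: "storage", 40: "user"}

# Flows the text rules out explicitly, as (source, destination) VLAN names.
FORBIDDEN_FLOWS = {("user", "compute"), ("user", "management")}


def policy_violations(proposed_rules):
    """Return the proposed (src_vlan_id, dst_vlan_id) rules that break the plan.

    Each result is (src_vlan_id, dst_vlan_id, reason). The rule format is a
    hypothetical simplification of a real firewall rule set.
    """
    bad = []
    for src_id, dst_id in proposed_rules:
        if src_id not in VLANS or dst_id not in VLANS:
            bad.append((src_id, dst_id, "unknown VLAN"))
        elif (VLANS[src_id], VLANS[dst_id]) in FORBIDDEN_FLOWS:
            bad.append((src_id, dst_id, "forbidden by segmentation policy"))
    return bad


if __name__ == "__main__":
    # A rule opening user -> compute traffic should be flagged.
    print(policy_violations([(40, 20), (20, 30)]))
```

A check like this fits naturally into change review: run it on the proposed rule set and reject the change if the returned list is not empty.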

InfiniBand vs. RoCE v2: Decision Framework

The choice between InfiniBand and RoCE v2 for the compute network is one of the most important HPC infrastructure decisions.

InfiniBand (HDR 200 Gb/s or NDR 400 Gb/s):

Native RDMA: MPI and NCCL move data without kernel involvement on the data path
1–2 µs latency (HDR200), < 1 µs (NDR400)
Purpose-built fabric with separate InfiniBand switches
Higher cost per port, but lower communication overhead for tightly coupled workloads
Subnet Manager (OpenSM or hardware SM) required for fabric management

RoCE v2 (RDMA over Converged Ethernet):

RDMA over standard 25/100 GbE infrastructure
3–5 µs latency (well-tuned), much higher if misconfigured
Reuses existing Ethernet switch investment
Requires PFC (Priority Flow Control) and ECN (Explicit Congestion Notification)
Risk: a PFC storm on an improperly configured RoCE fabric can take down the entire network

Decision guide:

Scenario	Recommendation
New deployment, MPI-heavy simulation	InfiniBand HDR or NDR
New deployment, large GPU AI cluster	InfiniBand NDR
Existing 100 GbE, tight budget	RoCE v2 (with careful configuration)
Loosely-coupled workloads, bursting	RoCE v2 or standard Ethernet

Mevasis 5-Phase Network Deployment Methodology

Phase 1 — Workload profiling: Measure current or model projected network traffic patterns. MPI-heavy simulation and AI all-reduce collectives have very different traffic profiles. This determines topology, port count, and speed selection.

Phase 2 — Design documentation: Produce a formal design document: physical topology diagram, VLAN assignment, IP addressing scheme, routing policy. Identify all single points of failure and evaluate redundancy options vs. cost.

Phase 3 — Infrastructure as Code configuration: Configure switches using Ansible playbooks (Mellanox/NVIDIA Onyx, Arista EOS, Cumulus Linux). Configuration as code ensures repeatability and makes change review possible before applying to production.

Phase 4 — Acceptance testing: Run ib_write_bw, ib_read_lat, iperf3, nuttcp, and nccl-tests to measure actual bandwidth and latency on every link. Compare against design targets. Any port that fails to reach 90% of theoretical bandwidth is investigated before sign-off.
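The 90% sign-off rule is easy to automate once the per-port results have been collected. The following is a minimal sketch that assumes the measurements are already available as a mapping from port name to measured bandwidth in Gb/s; the port names, the input format, and the function name are hypothetical, and parsing the output of the benchmark tools themselves is left out.

```python
def failing_ports(measured_gbps, line_rate_gbps, threshold=0.90):
    """Return the ports whose measured bandwidth is below threshold * line rate.

    measured_gbps: dict mapping a port label to measured bandwidth in Gb/s.
    line_rate_gbps: theoretical bandwidth of the link in Gb/s.
    threshold: required fraction of line rate (0.90 follows the rule above).
    """
    if line_rate_gbps <= 0:
        raise ValueError("line_rate_gbps must be positive")
    minimum = threshold * line_rate_gbps
    return sorted(port for port, bw in measured_gbps.items() if bw < minimum)


if __name__ == "__main__":
    # Hypothetical results for three 200 Gb/s links.
    results = {"leaf1/p1": 196.2, "leaf1/p2": 171.0, "leaf2/p1": 181.0}
    print(failing_ports(results, line_rate_gbps=200))
```

With a 200 Gb/s line rate the cutoff is 180 Gb/s, so only the port measured at 171.0 Gb/s in the example is reported for investigation.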

Phase 5 — Production monitoring: Deploy Prometheus with SNMP Exporter for switch metrics and integrate it with Grafana for dashboards. Configure alerts for port link-down events, error counters above threshold, and bandwidth utilization approaching saturation.

Common Problems and Solutions

Oversubscription surprise: When an aggregation switch has more downlink bandwidth than uplink bandwidth, jobs that communicate across racks get dramatically lower bandwidth than intra-rack jobs. The result is non-obvious performance variability: the same job runs much slower when the scheduler places it on certain node combinations. Prevention: document the oversubscription ratio at design time; verify during acceptance testing.
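The oversubscription ratio can be computed directly from a switch's port plan. This is a minimal sketch; the function names and the port counts in the example are illustrative assumptions. The worst case models every node behind the switch sending across the uplinks at the same time.

```python
def oversubscription_ratio(down_ports, down_gbps, up_ports, up_gbps):
    """Ratio of total downlink bandwidth to total uplink bandwidth."""
    uplink_total = up_ports * up_gbps
    if uplink_total <= 0:
        raise ValueError("switch needs at least one uplink with positive speed")
    return (down_ports * down_gbps) / uplink_total


def worst_case_node_gbps(node_gbps, ratio):
    """Bandwidth each node gets across the uplinks when all nodes send at once.

    A ratio below 1 means spare uplink capacity, so the node's own port speed
    remains the limit.
    """
    return node_gbps / max(ratio, 1.0)


if __name__ == "__main__":
    # Hypothetical switch: 32 x 200 Gb/s downlinks, 16 x 200 Gb/s uplinks.
    ratio = oversubscription_ratio(32, 200, 16, 200)
    print(ratio, worst_case_node_gbps(200, ratio))
```

In the example the switch is 2:1 oversubscribed, so a node with a 200 Gb/s port can count on only 100 Gb/s to other racks under full load. Recording this number in the design document gives acceptance testing a concrete target for cross-rack measurements.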

MTU mismatch: If any device in the MPI path has MTU 1500 when compute nodes expect MTU 9000, jumbo frames are silently fragmented or dropped. The symptom is lower-than-expected bandwidth with no obvious error. Prevention: verify MTU on every interface in the path with ip link show and ping -M do -s 8972 <remote>.
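The 8972 in the ping command is not arbitrary: it is the 9000-byte MTU minus the 20-byte IPv4 header (without options) and the 8-byte ICMP header. The sketch below derives the payload size for any MTU and builds the matching Linux ping command line; the helper names and the host name node02 are hypothetical, and the command is only constructed here, not executed.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def ping_payload_for_mtu(mtu):
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload <= 0:
        raise ValueError("MTU too small to carry an ICMP echo")
    return payload


def mtu_probe_command(remote, mtu=9000, count=3):
    """Build a ping command that fails if any hop cannot pass the given MTU.

    -M do sets the Don't Fragment bit, so an undersized hop produces an
    error instead of silent fragmentation.
    """
    return ["ping", "-M", "do", "-s", str(ping_payload_for_mtu(mtu)),
            "-c", str(count), remote]


if __name__ == "__main__":
    # node02 is a placeholder host name.
    print(" ".join(mtu_probe_command("node02")))
```

The returned list can be passed to subprocess.run to execute the probe from each compute node against its peers; a non-zero exit status points to a hop with a smaller MTU.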

PFC storm on RoCE: Improperly configured Priority Flow Control causes backpressure to cascade through the fabric, eventually blocking all traffic. Prevention: enable ECN before enabling PFC; configure per-priority queuing; deploy fabric monitoring that detects PFC pause frame storms before they escalate.

HPC network design is a force multiplier: a well-designed fabric makes every compute investment pay off; a poorly designed one limits every application to a fraction of its potential. For HPC network architecture design, configuration, and commissioning, contact the Mevasis team.
