HPC Network Design: Fat-Tree Topology, VLAN Segmentation, and InfiniBand vs RoCE
HPC network design technical guide: why HPC networking is a separate discipline, fat-tree topology with three layers, four VLAN segments, InfiniBand vs. RoCE v2 decision criteria, Mevasis 5-phase methodology, and common problems including oversubscription, MTU mismatch, and PFC storms.
No matter how powerful the compute nodes, an HPC cluster’s true performance is bounded by its network. MPI-parallel applications synchronize across nodes thousands of times per second; AI training collectives transfer gigabytes between GPUs with every backward pass. Network design for HPC is not the same discipline as enterprise network design — the performance requirements and failure modes are fundamentally different.
Why HPC Networking Is a Separate Discipline
Enterprise networks are designed for:
- Availability: Any-to-any connectivity that survives component failures
- Throughput: High aggregate bandwidth for many independent flows
- Cost efficiency: Best bandwidth per dollar
HPC networks are designed for:
- Latency: Message delivery between any two nodes in the low single-digit microseconds or less
- Bandwidth per node: Full line-rate access to the fabric for every node simultaneously
- Non-blocking: Two nodes can communicate at full speed regardless of what other nodes are doing
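These requirements conflict with cost efficiency. A non-blocking fabric must provide as much uplink bandwidth as node-facing bandwidth at every tier, so the switch port count grows much faster than the node count: a two-tier non-blocking fabric for 100 nodes needs roughly 300 switch ports (100 node-facing leaf ports, 100 leaf uplinks, and 100 spine ports). That is expensive by enterprise standards, but necessary for tightly coupled MPI performance.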
Fat-Tree: Why This Topology?
Fat-tree is the standard HPC network topology for clusters from tens to thousands of nodes. Classical tree topologies have a bandwidth bottleneck at the root. Fat-tree solves this by making the uplink count at each tier equal to the downlink count — every node has equal access to full fabric bandwidth regardless of where it sits in the topology.
Three-tier fat-tree structure:
Tier 1 — Leaf (Top-of-Rack) switches: Collect compute nodes within each rack. Each compute node connects to a leaf switch with a high-speed port (100 GbE, 200 Gb/s InfiniBand). Leaf switches have equal numbers of downlinks (to compute) and uplinks (to aggregation).
Tier 2 — Aggregation switches: Connect leaf switches to each other and to core switches. The oversubscription ratio at this tier is a critical design parameter:
- 1:1 oversubscription = non-blocking (maximum cost, maximum performance)
- 2:1 = 50% of peak bandwidth between leaf groups (typical for most HPC)
- 4:1 = 25% (acceptable only for loosely-coupled workloads)
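The ratios above follow from simple port arithmetic. As a minimal sketch (the function names and the example port counts are illustrative assumptions, not taken from a specific switch model), the oversubscription ratio is the total downlink bandwidth divided by the total uplink bandwidth:

```python
def oversubscription_ratio(downlinks: int, downlink_gbps: float,
                           uplinks: int, uplink_gbps: float) -> float:
    """Total downlink bandwidth divided by total uplink bandwidth."""
    if uplinks <= 0 or uplink_gbps <= 0:
        raise ValueError("a switch needs at least one uplink with positive speed")
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)


def worst_case_share(ratio: float) -> float:
    """Fraction of line rate each downlink gets when all send upstream at once."""
    return min(1.0, 1.0 / ratio)


# Hypothetical switch: 32 x 200 Gb/s downlinks and 16 x 200 Gb/s uplinks.
ratio = oversubscription_ratio(32, 200, 16, 200)
print(f"{ratio:.0f}:1 oversubscription, {worst_case_share(ratio):.0%} worst-case share")
```

The worst case only applies when every downlink sends across the uplinks at the same time; traffic that stays below the switch is unaffected.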
Tier 3 — Core switches: Connect all aggregation switches. High port density and low latency are the priority here. For InfiniBand, these are typically director-class switches (NVIDIA Quantum-2 NDR).
               [Core]         [Core]
             /        \     /        \
       [Aggr-1]       [Aggr-2]       [Aggr-3]
      /   |    \     /   |    \     /   |    \
    [L1] [L2] [L3] [L4] [L5] [L6] [L7] [L8] [L9]
     |||  |||  |||  |||  |||  |||  |||  |||  |||
    nodes nodes nodes ...
Each leaf switch connects to multiple aggregation switches (not just one) to provide multiple paths between any two nodes. ECMP (Equal-Cost Multi-Path) or InfiniBand adaptive routing uses all paths simultaneously.
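To get a feel for how such a fabric scales, the sketch below sizes a three-tier fat-tree. It assumes the classic textbook construction in which every switch has the same port count `k` and the fabric is fully non-blocking; real deployments often mix switch sizes or accept some oversubscription, so treat the numbers as an upper bound rather than a design. The function name is illustrative.

```python
def fat_tree_capacity(k: int) -> dict:
    """Size a non-blocking three-tier fat-tree built from identical k-port switches."""
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even port count of at least 2")
    half = k // 2
    return {
        # k pods, each with k/2 leaf switches serving k/2 nodes apiece
        "nodes": k * half * half,
        "leaf_switches": k * half,
        "aggregation_switches": k * half,
        "core_switches": half * half,
        # k/2 aggregation choices times k/2 core choices per aggregation switch
        "paths_between_pods": half * half,
    }


for ports in (4, 40, 64):
    print(ports, fat_tree_capacity(ports))
```

The `paths_between_pods` figure is the number of equal-cost routes that ECMP or adaptive routing can spread traffic across for two nodes in different pods.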
VLAN Segmentation: Four Network Segments
A single physical fabric serves multiple logical purposes, separated by VLAN:
Management VLAN (VLAN 10): IPMI/BMC out-of-band management, PXE boot traffic, DHCP. Must remain functional even when compute nodes are failing. Accessible only from dedicated admin hosts. If a compute node floods the management network with traffic, other nodes must still be manageable via IPMI.
Compute/MPI VLAN (VLAN 20): All MPI inter-process communication. This is the performance-critical segment. It is a flat Layer 2 segment (no routing between compute nodes) to minimize hop count and latency. Jumbo frames (MTU 9000) are mandatory. Never shared with other traffic types.
Storage VLAN (VLAN 30): BeeGFS or Lustre parallel filesystem I/O. Reachable from compute nodes but kept separate from both MPI and management traffic. Compute nodes access the storage servers via this segment. Storage traffic peaks during checkpoint writes from running jobs, so do not let it compete with MPI traffic.
User/External VLAN (VLAN 40): User SSH access via login nodes. Filtered by firewall. Never allows direct access to compute or management VLANs.
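The four segments can be written down as data and checked before any switch is configured. The sketch below is one possible representation, not a prescribed format. The VLAN IDs and the compute MTU come from the text above; the MTU values for the management, storage, and user segments are assumptions, as are all class and function names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str
    vlan_id: int
    mtu: int


PLAN = (
    Segment("management", 10, 1500),  # MTU is an assumption
    Segment("compute", 20, 9000),     # jumbo frames required by the text
    Segment("storage", 30, 9000),     # MTU is an assumption
    Segment("user", 40, 1500),        # MTU is an assumption
)

# Flows the text rules out: users never reach compute or management directly.
FORBIDDEN_FLOWS = {("user", "compute"), ("user", "management")}


def validate_plan(plan) -> list:
    """Return a list of human-readable problems; an empty list means the plan passes."""
    problems = []
    ids = [segment.vlan_id for segment in plan]
    if len(ids) != len(set(ids)):
        problems.append("duplicate VLAN id")
    for segment in plan:
        if not 1 <= segment.vlan_id <= 4094:
            problems.append(f"{segment.name}: VLAN id {segment.vlan_id} out of range")
        if segment.name == "compute" and segment.mtu < 9000:
            problems.append("compute: jumbo frames (MTU 9000) required")
    return problems


def flow_allowed(src: str, dst: str) -> bool:
    """True unless the source-to-destination pair is explicitly forbidden."""
    return (src, dst) not in FORBIDDEN_FLOWS


print(validate_plan(PLAN))
print(flow_allowed("user", "compute"))
```

Keeping the plan as data makes it easy to feed the same values into switch configuration templates and firewall rules.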
InfiniBand vs. RoCE v2: Decision Framework
The choice between InfiniBand and RoCE v2 for the compute network is one of the most important HPC infrastructure decisions.
InfiniBand (HDR 200 Gb/s or NDR 400 Gb/s):
- Native RDMA — MPI and NCCL communicate without OS involvement
- 1–2 µs latency (HDR200), < 1 µs (NDR400)
- Purpose-built fabric with separate InfiniBand switches
- Higher cost per port but lower total overhead for tight workloads
- Subnet Manager (OpenSM or hardware SM) required for fabric management
RoCE v2 (RDMA over Converged Ethernet):
- RDMA over standard 25/100 GbE infrastructure
- 3–5 µs latency (well-tuned), much higher if misconfigured
- Reuses existing Ethernet switch investment
- Requires PFC (Priority Flow Control) and ECN (Explicit Congestion Notification)
- A PFC storm on an improperly configured RoCE fabric can take down the entire network
Decision guide:
| Scenario | Recommendation |
|---|---|
| New deployment, MPI-heavy simulation | InfiniBand HDR or NDR |
| New deployment, large GPU AI cluster | InfiniBand NDR |
| Existing 100 GbE, tight budget | RoCE v2 (with careful configuration) |
| Loosely-coupled workloads, bursting | RoCE v2 or standard Ethernet |
Mevasis 5-Phase Network Deployment Methodology
Phase 1 — Workload profiling: Measure current or model projected network traffic patterns. MPI-heavy simulation and AI all-reduce collectives have very different traffic profiles. This determines topology, port count, and speed selection.
Phase 2 — Design documentation: Produce a formal design document: physical topology diagram, VLAN assignment, IP addressing scheme, routing policy. Identify all single points of failure and evaluate redundancy options vs. cost.
Phase 3 — Infrastructure as Code configuration: Configure switches using Ansible playbooks (Mellanox/NVIDIA Onyx, Arista EOS, Cumulus Linux). Configuration as code ensures repeatability and makes change review possible before applying to production.
Phase 4 — Acceptance testing: Run ib_write_bw, ib_read_lat, iperf3, nuttcp, and nccl-tests to measure actual bandwidth and latency on every link. Compare against design targets. Any port that fails to reach 90% of theoretical bandwidth is investigated before sign-off.
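The 90% sign-off rule is easy to automate once per-port results are collected. The sketch below assumes the measurements have already been parsed into a dictionary of port name to measured Gb/s; the port names, the sample values, and the function name are hypothetical.

```python
def ports_below_target(measured_gbps: dict, line_rate_gbps: float,
                       min_fraction: float = 0.90) -> list:
    """Return the ports whose measured bandwidth is under the acceptance floor."""
    floor = line_rate_gbps * min_fraction
    return sorted(port for port, bw in measured_gbps.items() if bw < floor)


# Hypothetical results for three 200 Gb/s ports.
results = {"leaf1/p1": 196.2, "leaf1/p2": 171.0, "leaf2/p1": 188.4}
for port in ports_below_target(results, line_rate_gbps=200):
    print(f"investigate {port}: {results[port]} Gb/s")
```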
Phase 5 — Production monitoring: Deploy Prometheus with SNMP Exporter for switch metrics, integrate with Grafana for dashboards. Configure alerts for: port link-down events, error counters above threshold, bandwidth utilization approaching saturation.
Common Problems and Solutions
Oversubscription surprise: When an aggregation switch has more downlink bandwidth than uplink bandwidth, jobs that communicate across racks experience dramatically lower bandwidth than intra-rack jobs. This creates non-obvious performance variability that manifests as jobs on certain node combinations being much slower. Prevention: document the oversubscription ratio at design time; verify during acceptance testing.
MTU mismatch: If any device in the MPI path has MTU 1500 when compute nodes expect MTU 9000, jumbo frames are silently fragmented or dropped. The symptom is lower-than-expected bandwidth with no obvious error. Prevention: verify MTU on every interface in the path with ip link show and ping -M do -s 8972 <remote>.
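The payload size 8972 in the ping command is the 9000-byte MTU minus the 20-byte IPv4 header and the 8-byte ICMP header. The sketch below derives that value for any MTU and builds the probe command as an argument list. It assumes IPv4 without IP options and the Linux iputils `ping`; the `-c 3` packet count is an addition for illustration, and the function names and example address are hypothetical.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def icmp_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def mtu_probe_command(remote: str, mtu: int = 9000) -> list:
    """Build a ping command that fails if any hop cannot carry `mtu` bytes."""
    # -M do sets the don't-fragment bit so an undersized hop reports an error
    return ["ping", "-M", "do", "-c", "3",
            "-s", str(icmp_payload_for_mtu(mtu)), remote]


print(icmp_payload_for_mtu(9000))
print(" ".join(mtu_probe_command("10.20.0.12")))
```

The list form can be passed directly to `subprocess.run` when the check is run across all node pairs.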
PFC storm on RoCE: Improperly configured Priority Flow Control causes backpressure to cascade through the fabric, eventually blocking all traffic. Prevention: configure ECN so that congestion marking slows senders before PFC pause thresholds are reached; configure per-priority queuing so that PFC applies only to the RoCE traffic class; deploy fabric monitoring that detects PFC pause frame storms before they escalate.
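One way to detect a developing storm is to watch how fast a port's PFC pause-frame counter grows. The sketch below assumes cumulative counter samples taken at a fixed polling interval and flags a port when the pause rate stays above a threshold for several consecutive intervals. The threshold, the interval, and the function names are hypothetical and need tuning per fabric; counter wrap-around is not handled.

```python
def pause_rates(samples: list, interval_s: float) -> list:
    """Pause frames per second between consecutive cumulative counter samples."""
    return [(later - earlier) / interval_s
            for earlier, later in zip(samples, samples[1:])]


def storm_suspected(samples: list, interval_s: float,
                    rate_threshold: float, sustained_intervals: int) -> bool:
    """True if the pause rate exceeds the threshold for enough intervals in a row."""
    run = 0
    for rate in pause_rates(samples, interval_s):
        run = run + 1 if rate > rate_threshold else 0
        if run >= sustained_intervals:
            return True
    return False


# Hypothetical counter readings polled every 10 seconds on one port.
counter = [0, 100, 250, 60250, 140250, 240250]
print(pause_rates(counter, 10))
print(storm_suspected(counter, 10, rate_threshold=1000, sustained_intervals=3))
```

Requiring a sustained run avoids alerting on the short pause bursts that a healthy lossless fabric produces under normal congestion.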
HPC network design is a force multiplier: a well-designed fabric makes every compute investment pay off; a poorly designed one limits every application to a fraction of its potential. For HPC network architecture design, configuration, and commissioning, contact the Mevasis team.