
HPC Network Design: Fat-Tree Topology, VLAN Segmentation, and InfiniBand vs RoCE

HPC network design technical guide: why HPC networking is a separate discipline, fat-tree topology with three layers, four VLAN segments, InfiniBand vs. RoCE v2 decision criteria, Mevasis 5-phase methodology, and common problems including oversubscription, MTU mismatch, and PFC storms.

No matter how powerful the compute nodes, an HPC cluster’s true performance is bounded by its network. MPI-parallel applications synchronize across nodes thousands of times per second; AI training collectives transfer gigabytes between GPUs with every backward pass. Network design for HPC is not the same discipline as enterprise network design — the performance requirements and failure modes are fundamentally different.

Why HPC Networking Is a Separate Discipline

Enterprise networks are designed for:

  • Availability: Any-to-any connectivity that survives component failures
  • Throughput: High aggregate bandwidth for many independent flows
  • Cost efficiency: Best bandwidth per dollar

HPC networks are designed for:

  • Latency: Message delivery between any two nodes in the low single-digit microseconds or less
  • Bandwidth per node: Full line-rate access to the fabric for every node simultaneously
  • Non-blocking: Two nodes can communicate at full speed regardless of what other nodes are doing

These requirements conflict with cost efficiency. In a non-blocking fabric, every node-facing port on a leaf switch must be matched by an uplink port of equal speed, so roughly half of each leaf switch's ports connect no nodes, and additional switch tiers are needed above them. That is expensive by enterprise standards, but necessary for tightly coupled MPI performance.

Fat-Tree: Why This Topology?

Fat-tree is the standard HPC network topology for clusters from tens to thousands of nodes. Classical tree topologies have a bandwidth bottleneck at the root. In its non-blocking form, fat-tree solves this by making the aggregate uplink bandwidth at each tier equal to the aggregate downlink bandwidth, so every node has equal access to full fabric bandwidth regardless of where it sits in the topology.

Three-tier fat-tree structure:

Tier 1: Leaf (Top-of-Rack) switches. Collect compute nodes within each rack. Each compute node connects to a leaf switch with a high-speed port (for example 100 GbE or 200 Gb/s InfiniBand). In a non-blocking design, leaf switches have equal numbers of downlinks (to compute) and uplinks (to aggregation).

Tier 2 — Aggregation switches: Connect leaf switches to each other and to core switches. The oversubscription ratio at this tier is a critical design parameter:

  • 1:1 oversubscription = non-blocking (maximum cost, maximum performance)
  • 2:1 = 50% of peak bandwidth between leaf groups (typical for most HPC)
  • 4:1 = 25% (acceptable only for loosely-coupled workloads)
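
The ratios above reduce to simple arithmetic on a switch's port budget. The following is a minimal Python sketch; the function names and the port counts and speeds in the example are illustrative assumptions, not figures from a specific product.

```python
def oversubscription_ratio(downlinks, downlink_gbps, uplinks, uplink_gbps):
    """Return downlink-to-uplink bandwidth ratio (1.0 means non-blocking)."""
    if uplinks <= 0 or uplink_gbps <= 0:
        raise ValueError("a switch needs at least one uplink with non-zero speed")
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)


def cross_group_fraction(ratio):
    """Worst-case share of line rate per node when all nodes send off-switch."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return min(1.0, 1.0 / ratio)


# Hypothetical port splits, all ports at 200 Gb/s.
non_blocking = oversubscription_ratio(20, 200, 20, 200)  # 1:1
typical = oversubscription_ratio(24, 200, 12, 200)       # 2:1
tapered = oversubscription_ratio(32, 200, 8, 200)        # 4:1

for label, ratio in (("1:1", non_blocking), ("2:1", typical), ("4:1", tapered)):
    print(label, f"{cross_group_fraction(ratio):.0%} of line rate across groups")
```

The fraction is a worst case: it applies only when every node behind the switch sends traffic upward at the same time, which is why loosely coupled workloads tolerate higher ratios.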

Tier 3 — Core switches: Connect all aggregation switches. High port density and low latency are the priority here. For InfiniBand, these are typically director-class switches (NVIDIA Quantum-2 NDR).

                    [Core]           [Core]
                   /     \          /     \
          [Aggr-1]         [Aggr-2]         [Aggr-3]
          /  |  \          /  |  \          /  |  \
      [L1] [L2] [L3]  [L4] [L5] [L6]  [L7] [L8] [L9]
      |||   |||  |||   |||  |||  |||   |||  |||  |||
     nodes nodes nodes ...

The diagram is simplified: in practice each leaf switch connects to multiple aggregation switches (not just one), which provides multiple paths between any two nodes. ECMP (Equal-Cost Multi-Path) routing on Ethernet or adaptive routing on InfiniBand spreads traffic across all of those paths.
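
For rough sizing, the textbook three-tier fat-tree built entirely from identical k-port switches has a closed form. The sketch below assumes that idealized construction (k pods, half of each switch's ports facing down and half facing up); real deployments often deviate from it with mixed switch models, director-class cores, or tapered uplinks, so treat the numbers as an upper bound rather than a design.

```python
def fat_tree_capacity(k):
    """Sizes of a non-blocking three-tier fat-tree built from k-port switches."""
    if k < 2 or k % 2:
        raise ValueError("k must be an even port count >= 2")
    half = k // 2
    return {
        "pods": k,
        "leaf_switches": k * half,         # k/2 leaf switches per pod
        "aggregation_switches": k * half,  # k/2 aggregation switches per pod
        "core_switches": half * half,      # (k/2)^2
        "max_nodes": k * half * half,      # k^3 / 4
    }


# Example: hypothetical 40-port switches.
for name, value in fat_tree_capacity(40).items():
    print(f"{name}: {value}")
```

The cubic growth in node count is what lets fat-tree scale to thousands of nodes, and the switch counts make the cost of staying non-blocking explicit.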

VLAN Segmentation: Four Network Segments

A single physical fabric serves multiple logical purposes, separated by VLAN:

Management VLAN (VLAN 10): IPMI/BMC out-of-band management, PXE boot traffic, DHCP. Must remain functional even when compute nodes are failing. Accessible only from dedicated admin hosts. If a compute node floods the management network with traffic, other nodes must still be manageable via IPMI.

Compute/MPI VLAN (VLAN 20): All MPI inter-process communication. This is the performance-critical segment. L2 flat (no routing between compute nodes) to minimize hop count and latency. Jumbo frames (MTU 9000) mandatory. Never shared with other traffic types.

Storage VLAN (VLAN 30): BeeGFS or Lustre parallel filesystem I/O. Reachable from compute nodes but kept separate from both MPI and management traffic. Compute nodes access the storage servers via this segment. Storage traffic peaks during checkpoint writes from running jobs, so it must not be allowed to compete with MPI traffic.

User/External VLAN (VLAN 40): User SSH access via login nodes. Filtered by firewall. Never allows direct access to compute or management VLANs.
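
One way to keep the four segments consistent is to hold the plan as data and check it before any switch configuration is generated. The sketch below is an illustration only: the VLAN IDs and the compute MTU come from the text above, while the `Segment` type, the role names, and the MTU values for the other three segments are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str
    vlan_id: int
    mtu: int
    reachable_from: frozenset  # host roles allowed to reach this segment


# MTUs other than the compute segment's 9000 are assumed for illustration.
PLAN = [
    Segment("management", 10, 1500, frozenset({"admin"})),
    Segment("compute", 20, 9000, frozenset({"compute"})),
    Segment("storage", 30, 9000, frozenset({"compute", "storage"})),
    Segment("user", 40, 1500, frozenset({"login", "external"})),
]


def validate(plan):
    """Return a list of rule violations; an empty list means the plan passes."""
    problems = []
    ids = [s.vlan_id for s in plan]
    if len(ids) != len(set(ids)):
        problems.append("duplicate VLAN IDs")
    by_name = {s.name: s for s in plan}
    if by_name["compute"].mtu != 9000:
        problems.append("compute VLAN must use MTU 9000")
    if by_name["management"].reachable_from != {"admin"}:
        problems.append("management VLAN must be reachable only from admin hosts")
    for name in ("compute", "management"):
        if by_name[name].reachable_from & {"login", "external"}:
            problems.append(f"{name} VLAN is exposed to user/external hosts")
    return problems


print(validate(PLAN) or "plan is consistent")
```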

InfiniBand vs. RoCE v2: Decision Framework

The choice between InfiniBand and RoCE v2 for the compute network is one of the most important HPC infrastructure decisions.

InfiniBand (HDR 200 Gb/s or NDR 400 Gb/s):

  • Native RDMA — MPI and NCCL communicate without OS involvement
  • 1–2 µs latency (HDR200), < 1 µs (NDR400)
  • Purpose-built fabric with separate InfiniBand switches
  • Higher cost per port but lower total overhead for tight workloads
  • Subnet Manager (OpenSM or hardware SM) required for fabric management

RoCE v2 (RDMA over Converged Ethernet):

  • RDMA over standard 25/100 GbE infrastructure
  • 3–5 µs latency (well-tuned), much higher if misconfigured
  • Reuses existing Ethernet switch investment
  • Requires PFC (Priority Flow Control) and ECN (Explicit Congestion Notification)
  • A PFC storm on an improperly configured RoCE fabric can take down the entire network

Decision guide:

Scenario                                 Recommendation
---------------------------------------  ------------------------------------
New deployment, MPI-heavy simulation     InfiniBand HDR or NDR
New deployment, large GPU AI cluster     InfiniBand NDR
Existing 100 GbE, tight budget           RoCE v2 (with careful configuration)
Loosely-coupled workloads, bursting      RoCE v2 or standard Ethernet
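
The decision guide can also be read as a short chain of questions asked in order. The Python sketch below encodes one such ordering; the function name, its three boolean inputs, and the precedence between the questions are simplifying assumptions, and the guide's "existing 100 GbE, tight budget" row is approximated by "not a new deployment".

```python
def recommend_fabric(tightly_coupled, new_deployment, large_gpu_cluster=False):
    """Map the decision guide onto three yes/no questions (illustrative only)."""
    if not tightly_coupled:
        # Loosely-coupled or bursting workloads tolerate higher latency.
        return "RoCE v2 or standard Ethernet"
    if not new_deployment:
        # Reuse the existing Ethernet investment.
        return "RoCE v2 (with careful configuration)"
    if large_gpu_cluster:
        return "InfiniBand NDR"
    return "InfiniBand HDR or NDR"


print(recommend_fabric(tightly_coupled=True, new_deployment=True))
print(recommend_fabric(tightly_coupled=True, new_deployment=True, large_gpu_cluster=True))
print(recommend_fabric(tightly_coupled=True, new_deployment=False))
print(recommend_fabric(tightly_coupled=False, new_deployment=True))
```

A real decision also weighs budget, operational skills, and vendor support, none of which fit into three booleans.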

Mevasis 5-Phase Network Deployment Methodology

Phase 1 — Workload profiling: Measure current or model projected network traffic patterns. MPI-heavy simulation and AI all-reduce collectives have very different traffic profiles. This determines topology, port count, and speed selection.

Phase 2 — Design documentation: Produce a formal design document: physical topology diagram, VLAN assignment, IP addressing scheme, routing policy. Identify all single points of failure and evaluate redundancy options vs. cost.

Phase 3 — Infrastructure as Code configuration: Configure switches using Ansible playbooks (Mellanox/NVIDIA Onyx, Arista EOS, Cumulus Linux). Configuration as code ensures repeatability and makes change review possible before applying to production.

Phase 4 — Acceptance testing: Run ib_write_bw, ib_read_lat, iperf3, nuttcp, and nccl-tests to measure actual bandwidth and latency on every link. Compare against design targets. Any port that fails to reach 90% of theoretical bandwidth is investigated before sign-off.
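
The 90% sign-off rule is easy to apply mechanically once per-link results have been collected. The following is a minimal sketch, assuming the measurements have already been parsed into a dictionary of Gb/s values; the link names and numbers are made up for illustration and are not output from any of the tools named above.

```python
def failing_links(measured_gbps, line_rate_gbps, threshold=0.90):
    """Return {link: fraction_of_line_rate} for links below the threshold."""
    if line_rate_gbps <= 0:
        raise ValueError("line rate must be positive")
    failures = {}
    for link, bandwidth in measured_gbps.items():
        fraction = bandwidth / line_rate_gbps
        if fraction < threshold:
            failures[link] = fraction
    return failures


# Hypothetical results for a 200 Gb/s fabric.
results = {"leaf1:p1": 196.0, "leaf1:p2": 150.0, "leaf2:p1": 188.5}
for link, fraction in failing_links(results, line_rate_gbps=200.0).items():
    print(f"{link}: {fraction:.0%} of line rate, investigate before sign-off")
```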

Phase 5 — Production monitoring: Deploy Prometheus with SNMP Exporter for switch metrics, integrate with Grafana for dashboards. Configure alerts for: port link-down events, error counters above threshold, bandwidth utilization approaching saturation.
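
The saturation alert rests on one calculation: converting two readings of a port's octet counter into a utilization figure. The sketch below shows that arithmetic in Python, assuming 64-bit octet counters sampled a known number of seconds apart; the 85% alert threshold is an arbitrary example value, and in a Prometheus deployment the same calculation would normally be expressed as an alerting rule rather than in application code.

```python
COUNTER64_MODULUS = 2**64


def link_utilization(octets_before, octets_after, interval_s, speed_bps):
    """Fraction of link capacity used between two octet-counter samples."""
    if interval_s <= 0 or speed_bps <= 0:
        raise ValueError("interval and link speed must be positive")
    # Modulo arithmetic tolerates a single counter wrap between samples.
    delta_octets = (octets_after - octets_before) % COUNTER64_MODULUS
    return delta_octets * 8 / (interval_s * speed_bps)


def approaching_saturation(utilization, threshold=0.85):
    """True when utilization is at or above the (assumed) alert threshold."""
    return utilization >= threshold


# 100 Gb/s port, samples 10 s apart, 100 GB transferred in between.
u = link_utilization(0, 100_000_000_000, 10, 100_000_000_000)
print(f"utilization {u:.0%}, alert: {approaching_saturation(u)}")
```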

Common Problems and Solutions

Oversubscription surprise: When an aggregation switch has more downlink bandwidth than uplink bandwidth, jobs that communicate across racks experience dramatically lower bandwidth than intra-rack jobs. This creates non-obvious performance variability that manifests as jobs on certain node combinations being much slower. Prevention: document the oversubscription ratio at design time; verify during acceptance testing.

MTU mismatch: If any device in the MPI path has MTU 1500 when compute nodes expect MTU 9000, jumbo frames are silently fragmented or dropped. The symptom is lower-than-expected bandwidth with no obvious error. Prevention: verify MTU on every interface in the path with ip link show and ping -M do -s 8972 <remote>.
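
The 8972 in the ping command is the 9000-byte MTU minus the 20-byte IPv4 header (without options) and the 8-byte ICMP header. A small Python sketch that derives the payload size for any MTU and builds the probe command follows; the helper names and the host name in the example are hypothetical, and the flags are those of Linux iputils ping.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def ping_payload_for_mtu(mtu):
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload < 0:
        raise ValueError("MTU too small for an IPv4 ICMP echo")
    return payload


def mtu_probe_command(remote, mtu=9000, count=3):
    """Build a ping command that fails if any hop cannot pass `mtu` bytes."""
    # -M do forbids fragmentation, so an undersized hop produces an error
    # instead of a silently fragmented packet.
    return ["ping", "-M", "do", "-c", str(count),
            "-s", str(ping_payload_for_mtu(mtu)), remote]


print(" ".join(mtu_probe_command("node042")))  # "node042" is a placeholder host
```

The returned list can be passed to `subprocess.run` and looped over every node pair that matters, which turns the manual check into a repeatable acceptance test.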

PFC storm on RoCE: Improperly configured Priority Flow Control causes backpressure to cascade through the fabric, eventually blocking all traffic. Prevention: enable ECN before enabling PFC; configure per-priority queuing; deploy fabric monitoring that detects PFC pause frame storms before they escalate.
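
Detecting a developing storm comes down to watching how fast each port's pause-frame counter grows. The sketch below is one simple way to flag suspicious ports from two counter samples; how the counters are collected is vendor-specific and not shown, and the threshold of 1000 pause frames per second is a placeholder that would need tuning against a healthy baseline for the fabric.

```python
def pause_storm_ports(samples, interval_s, max_pause_per_s=1000.0):
    """Flag ports whose pause-frame rate exceeds the threshold.

    samples maps a port name to (count_before, count_after), two readings
    of that port's pause-frame counter taken interval_s seconds apart.
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    flagged = {}
    for port, (before, after) in samples.items():
        rate = (after - before) / interval_s
        if rate > max_pause_per_s:
            flagged[port] = rate
    return flagged


# Hypothetical counter readings taken 10 s apart.
readings = {"swp1": (1_200, 1_450), "swp2": (5_000, 95_000)}
for port, rate in pause_storm_ports(readings, interval_s=10).items():
    print(f"{port}: {rate:.0f} pause frames/s, possible PFC storm")
```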


HPC network design is a force multiplier: a well-designed fabric makes every compute investment pay off; a poorly designed one limits every application to a fraction of its potential. For HPC network architecture design, configuration, and commissioning, contact the Mevasis team.