HPC Storage Technical Guide: Three-Tier Architecture, BeeGFS vs Lustre, and Troubleshooting

HPC storage architecture technical guide: 3-tier design (NVMe scratch, parallel filesystem, capacity archive), BeeGFS vs Lustre decision framework, network infrastructure, common issues (metadata bottleneck, NUMA mismatch, stripe misconfiguration), IOR and mdtest benchmarks, and best practices.

HPC storage is where most cluster performance problems originate. Compute nodes that spend 40% of their time waiting for data deliver at most 60% of their purchased capacity. Unlike compute hardware, where the performance gap between generations is evolutionary, storage architecture decisions have binary consequences: a well-designed storage system transparently feeds compute nodes at full speed, while a poorly designed one becomes the permanent bottleneck no matter how many CPU cores are added.

Three-Tier Storage Architecture

Production HPC storage is not a single system — it is three tiers optimized for different access patterns and cost profiles:

Tier 1 — Hot Scratch (NVMe SSD Parallel Filesystem)

  • Purpose: Active job I/O — input data being read, output being written, temporary files
  • Technology: NVMe SSDs in parallel filesystem configuration (BeeGFS or Lustre)
  • Performance target: Aggregate sequential throughput matching compute demand (10–500 GB/s)
  • Capacity: Sized to hold concurrent active workloads × 3 (input + output + temp)
  • Retention: Not backed up; data deleted 30 days after last access
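
The 30-day retention rule for scratch could be enforced with a periodic scan along these lines. This is a minimal sketch, assuming the filesystem actually records access times (parallel filesystems are often mounted with atime updates relaxed or disabled, in which case modification time may be the better proxy). The helper name `list_purge_candidates` is hypothetical and not part of any BeeGFS or Lustre tooling.

```shell
# List regular files under a scratch root whose last access is older than
# a cutoff in days. Prints candidates only; nothing is deleted.
list_purge_candidates() (
    root=$1
    days=${2:-30}
    # -atime +N matches files last accessed more than N whole days ago
    # -xdev keeps the scan on the scratch filesystem itself
    find "$root" -xdev -type f -atime +"$days" -print
)
```

A cautious rollout would write the list to a file first, for example `list_purge_candidates /mnt/beegfs/scratch 30 > purge_list.txt`, and review it before any deletion step is wired in.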

Tier 2 — Warm Project Storage (HDD Parallel Filesystem)

  • Purpose: User project directories, group data, intermediate results
  • Technology: HDD RAID in parallel filesystem or NFS
  • Performance target: 1–50 GB/s aggregate (sufficient for data prep and result retrieval)
  • Capacity: 500 TB – several PB depending on user base
  • Retention: Backed up; quotas enforced per user/group

Tier 3 — Cold Archive

  • Purpose: Published datasets, raw experimental data, long-term retention
  • Technology: Tape (LTO-9) or object storage (Ceph, MinIO, cloud S3)
  • Performance target: 0.5–5 GB/s (infrequent access)
  • Capacity: Potentially unlimited (cost scales linearly)
  • Retention: Multi-year or indefinite

Compute Nodes
     │
     ├─── NVMe Scratch ──────── BeeGFS (NVMe tier) ──── 10–500 GB/s
     │
     ├─── Project Storage ───── BeeGFS/Lustre (HDD) ─── 1–50 GB/s
     │
     └─── Archive access ─────── Object store / Tape ─── 0.5–5 GB/s

BeeGFS vs Lustre: Decision Framework

Both are mature parallel filesystems used in production HPC environments. The right choice depends on scale, operational capabilities, and support requirements.

Criterion              | BeeGFS                          | Lustre
Setup complexity       | Lower (hours to days)           | Higher (days to weeks)
Operational complexity | Lower                           | Higher
Max proven scale       | Multi-PB                        | Multi-10 PB (Top500 systems)
HA / mirroring         | Buddy Mirroring                 | Failover pairs; FLR file mirroring
Small file performance | Good                            | Variable
Metadata scalability   | Multiple meta servers           | Multiple MDTs (DNE); DoM for small files
Management tools       | beegfs-ctl                      | lctl, lfs
Enterprise support     | ThinkParQ                       | DDN (Whamcloud), HPE (Cray)
License                | Non-commercial free, commercial | GPL (open source) + commercial support
Community              | Active                          | Very active (OpenSFS)

Choose BeeGFS when:

  • Cluster size is up to ~500 nodes
  • Operations team is small and values simplicity
  • Budget for enterprise support is limited
  • Need to deploy quickly

Choose Lustre when:

  • Cluster exceeds 500 nodes
  • Capacity will eventually reach tens of petabytes
  • Organization has Lustre expertise or budget for training
  • Integration with top-tier HPC software ecosystem is required

Network Infrastructure

The storage network should be isolated from the MPI/compute network so that storage and MPI traffic do not contend for the same links:

# BeeGFS on 25 GbE storage network
# /etc/beegfs/beegfs-storage.conf
connNetFilterFile = /etc/beegfs/conn-filter.conf

# conn-filter.conf: restrict BeeGFS to storage NIC only
192.168.10.0/24    # storage VLAN

# Verify storage traffic uses correct interface
beegfs-net
# Should show only storage interface IP addresses
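
When auditing many nodes, it can help to check mechanically that an interface address falls inside the storage subnet listed in the filter file. The following is a small sketch in plain POSIX shell arithmetic; `ip_to_int` and `in_subnet` are hypothetical helper names, and the addresses in the usage note are illustrative only.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeed (exit 0) if IPv4 address $1 lies inside CIDR block $2.
in_subnet() (
    addr=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")
    bits=${2#*/}
    # Build the netmask from the prefix length, e.g. /24 -> 0xFFFFFF00
    mask=$(( (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF ))
    [ $(( $addr & $mask )) -eq $(( $net & $mask )) ]
)
```

For example, `in_subnet 192.168.10.17 192.168.10.0/24 && echo "on storage VLAN"` succeeds, while an address on 192.168.11.0/24 would not. Octets with leading zeros are not handled.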

For high-performance scratch, InfiniBand (HDR or NDR) as the storage network dramatically improves small-block IOPS and reduces latency:

# BeeGFS over RDMA (RDMA-capable network only)
# /etc/beegfs/beegfs-client.conf
connRDMABufSize = 8192
connRDMABufNum = 70
connUseRDMA = true

Common Problems and Solutions

Metadata Bottleneck

Symptom: Commands like ls -la, find, and small file creation/deletion are slow even when storage bandwidth is adequate. Creating 100,000 small files takes minutes.

Cause: All metadata operations (file creation, permission checks, directory listings) go through the metadata server. A single metadata server becomes the bottleneck for metadata-intensive workloads.
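
A quick way to confirm the symptom from a single client is to time a burst of empty-file creations. The sketch below is single-threaded with one-second timer resolution, so it gives only a rough indication and does not replace mdtest. The function name `create_rate` is hypothetical.

```shell
# Create N empty files in a temporary subdirectory of $1, report the
# approximate creation rate, then clean up.
create_rate() (
    dir=$1
    n=${2:-1000}
    work="$dir/create_probe.$$"
    mkdir -p "$work" || exit 1
    start=$(date +%s)
    i=0
    while [ "$i" -lt "$n" ]; do
        : > "$work/f$i"          # create an empty file
        i=$((i + 1))
    done
    end=$(date +%s)
    elapsed=$((end - start))
    # Avoid division by zero when the run finishes within one second
    [ "$elapsed" -gt 0 ] || elapsed=1
    echo "$n files in ${elapsed}s: $((n / elapsed)) creates/sec"
    rm -rf "$work"
)
```

Running `create_rate /mnt/beegfs/scratch 100000` on the parallel filesystem and again on a local disk gives a simple before/after comparison when metadata servers are added.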

Solution (BeeGFS):

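# Add a second metadata server: initialize the meta target on the new host,
# then start the service so it registers with the management daemon.
# (Target path, node ID 2, and management host "mgmt01" are placeholders.)
/opt/beegfs/sbin/beegfs-setup-meta -p /data/beegfs/meta -s 2 -m mgmt01
systemctl start beegfs-meta

# Confirm that both metadata servers are registered
beegfs-ctl --listnodes --nodetype=meta

# Existing directories stay on their original metadata server; only newly
# created directories are spread across the servers.

# For directories with extreme small-file workloads:
# Use subdirectory structure (1M files per directory max)
mkdir -p /mnt/beegfs/scratch/job_${SLURM_JOB_ID}/{input,output,temp}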

Solution (Lustre): Enable Distributed Namespace Environment (DNE) to spread the namespace across multiple MDTs, or use Data-on-MDT (DoM) to store small-file data on the MDT alongside its metadata.
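
On either filesystem, keeping individual directories small helps. One way to do that is to hash each file name into a fixed set of bucket subdirectories. The sketch below uses the POSIX `cksum` CRC; `bucket_for` is a hypothetical helper and the default of 256 buckets is an arbitrary choice.

```shell
# Map a file name to one of N bucket subdirectories (zero-padded to three
# digits) using the CRC of the name. The same name always maps to the
# same bucket.
bucket_for() (
    name=$1
    buckets=${2:-256}
    crc=$(printf '%s' "$name" | cksum | cut -d ' ' -f 1)
    printf '%03d\n' $((crc % buckets))
)
```

A job script could then write to `output/$(bucket_for "$file")/` instead of a single flat `output/` directory, creating the bucket with `mkdir -p` on first use.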

NUMA Mismatch on Storage Servers

Symptom: Storage throughput is significantly below theoretical NVMe bandwidth despite low CPU utilization.

Cause: Each NVMe controller is attached to a specific NUMA domain. If the storage software (BeeGFS storage daemon, Lustre OSS threads) runs on CPUs in a remote NUMA domain, all NVMe traffic crosses the NUMA interconnect at reduced bandwidth.

Solution:

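# Identify which NUMA node each NVMe controller is attached to
# (prints path:node; -1 means the platform reports no NUMA affinity)
grep . /sys/class/nvme/nvme*/device/numa_node

# Bind the storage daemon's CPUs and memory to the matching node (node 1 here)
numactl --cpunodebind=1 --membind=1 /usr/sbin/beegfs-storage

# Or use a systemd service override
# /etc/systemd/system/beegfs-storage.service.d/override.conf
# Keep the binary path and arguments from the packaged unit
# (see: systemctl cat beegfs-storage) and prefix them with numactl.
[Service]
ExecStart=
ExecStart=/usr/bin/numactl --cpunodebind=1 --membind=1 /usr/sbin/beegfs-storage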
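
Before hard-coding a node number, it is worth confirming that all NVMe controllers on the server really sit on the same NUMA node. The sketch below reads the controller-to-node mapping from sysfs. The function names are hypothetical, and the sysfs root is a parameter only so that the logic can be exercised against a test directory.

```shell
# Print "<controller> <numa_node>" for every NVMe controller found under a
# sysfs root (default /sys).
nvme_numa_map() (
    sysroot=${1:-/sys}
    for f in "$sysroot"/class/nvme/nvme*/device/numa_node; do
        [ -r "$f" ] || continue
        ctrl=${f%/device/numa_node}
        echo "${ctrl##*/} $(cat "$f")"
    done
)

# Succeed if every NVMe controller is attached to the expected node $1.
all_on_node() (
    want=$1
    sysroot=${2:-/sys}
    nvme_numa_map "$sysroot" |
        awk -v want="$want" '$2 != want { bad = 1 } END { exit bad + 0 }'
)
```

If `all_on_node 1` fails, the drives are split across sockets, and a single daemon bound to one node will still reach some of them remotely. In that case, running one storage daemon instance per NUMA node is a common layout. The daemon's current affinity can be checked with `taskset -cp "$(pidof beegfs-storage)"`.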

Stripe Misconfiguration

Symptom: Single-client write bandwidth is good, but aggregate bandwidth from many clients does not scale as expected.

Cause: If numtargets (stripe width) is set to 1, every chunk of a file lands on a single storage target. When many clients write to the same shared file, they all queue on that one target and gain no parallelism.

Solution:

# Verify stripe settings
beegfs-ctl --getentryinfo /mnt/beegfs/scratch/my_dir

# Set stripe for large-file parallel I/O (applies to files created afterwards)
# Stripe widely enough to spread shared-file writers across targets;
# numtargets cannot exceed the number of storage targets available
beegfs-ctl --setpattern \
  --chunksize=1m \
  --numtargets=8 \
  /mnt/beegfs/scratch/large_files

# For many-client small-file workloads, use one target per file: small files
# gain nothing from striping, and each extra target adds per-file overhead
beegfs-ctl --setpattern \
  --chunksize=128k \
  --numtargets=1 \
  /mnt/beegfs/scratch/small_files
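
To see why stripe width matters, it helps to work out where a given byte of a file lands. Assuming a plain round-robin (RAID0-style) layout, chunk number `offset / chunksize` goes to target index `chunk mod numtargets` within the file's target list. The helper name `stripe_target` is hypothetical.

```shell
# Report which stripe target (0-based index into the file's target list)
# holds a given byte offset under round-robin striping.
#   $1 = byte offset, $2 = chunk size in bytes, $3 = number of targets
stripe_target() (
    offset=$1
    chunk=$2
    numtargets=$3
    echo $(( ($offset / $chunk) % $numtargets ))
)
```

With 1 MiB chunks and 8 targets, offset 3670016 (3.5 MiB) falls on target index 3, and offset 8388608 (8 MiB) wraps back to index 0. With `numtargets=1`, every offset maps to index 0, which is the single-target case described above.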

Benchmarking with IOR and mdtest

Run before production deployment and after any storage configuration changes:

# IOR: sequential large-file throughput (64 processes, one file per process)
#   -t 1m -b 4g -s 1   1 MiB transfers, 4 GiB block per process, 1 segment
#   -F                 file-per-process
#   -w -r              write, then read back
#   -k                 keep the test files after the run
mpirun -np 64 --hostfile hostfile \
  ./ior \
  -t 1m -b 4g -s 1 \
  -F \
  -w -r \
  -k \
  -o /mnt/beegfs/scratch/ior_test/data

# IOR: shared single file (tests shared-file striping and locking)
#   -C                 shift task order on read-back so each process reads
#                      data written from another node (defeats client caching)
#   -Q 1               per-node task offset used for that shift
mpirun -np 64 --hostfile hostfile \
  ./ior \
  -t 4m -b 32g -s 1 \
  -C -Q 1 \
  -w -r \
  -o /mnt/beegfs/scratch/ior_shared_test

# mdtest: metadata performance (small file creation/stat/delete)
#   -n 10000           10,000 files per process
#   -i 3               3 iterations
mpirun -np 64 --hostfile hostfile \
  ./mdtest \
  -n 10000 \
  -i 3 \
  -d /mnt/beegfs/scratch/mdtest_dir
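
Before launching these runs, it is worth working out how much data and how many files they will create, both to confirm that scratch has room and to judge whether the data set is large enough to get past client caches. The arithmetic is simple enough to sketch; the function names are hypothetical.

```shell
# Total data written by an IOR run, in GiB:
#   processes x block size per process (GiB) x segments
ior_total_gib() (
    tasks=$1
    block_gib=$2
    segments=${3:-1}
    echo $(( $tasks * $block_gib * $segments ))
)

# Total files created per mdtest iteration: processes x files per process
mdtest_total_files() (
    echo $(( $1 * $2 ))
)
```

For the commands above, `ior_total_gib 64 4` gives 256 GiB for the file-per-process run, `ior_total_gib 64 32` gives 2048 GiB for the shared-file run, and `mdtest_total_files 64 10000` gives 640,000 files per iteration. Common guidance is to keep the IOR total larger than the combined RAM of the client nodes so that page cache does not inflate read results.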

Interpreting results:

Result                                                                  | Interpretation
Aggregate write/read within 10% of design target                        | Pass
Single-client bandwidth near theoretical peak, aggregate does not scale | Stripe config or network issue
Aggregate bandwidth scales but with high variance                       | Network congestion or storage target imbalance
mdtest rate < 50,000 creates/sec per server                             | Metadata bottleneck
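
The pass criterion in the first row is easy to automate in an acceptance script. The sketch below assumes the measured and target throughput are supplied as whole numbers in the same unit (for example MiB/s) and treats "within 10%" as "at least 90% of target". The name `meets_target` is hypothetical.

```shell
# Succeed if measured throughput is no more than tol percent below target.
#   $1 = measured, $2 = design target, $3 = tolerance in percent (default 10)
meets_target() (
    measured=$1
    target=$2
    tol=${3:-10}
    # Integer comparison: measured / target >= (100 - tol) / 100
    [ $(( $measured * 100 )) -ge $(( $target * (100 - $tol) )) ]
)
```

For a 100 GB/s design target, `meets_target 95 100` and `meets_target 90 100` pass, while `meets_target 89 100` fails.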

Best Practices

  • Separate storage NICs from compute NICs. Storage and MPI traffic competing on the same interface causes latency spikes at inopportune moments (checkpoint writes during computation).
  • Monitor storage target fill rates separately. If one storage target fills faster than others (uneven distribution), I/O performance will degrade even when aggregate capacity is available. Use beegfs-df or beegfs-ctl --listtargets --nodetype=storage --spaceinfo --longnodes to monitor.
  • Set per-application stripe policies. Don’t use a single global stripe setting. Genomics pipelines (many small files) need different configuration than CFD simulations (few large files).
  • Enable quotas before users arrive. A user who fills the shared scratch filesystem stops all other users’ jobs. Quotas should be default, not a remediation after a first incident.
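
The fill-rate check in the second point lends itself to a small alerting script. The sketch below assumes the per-target usage has already been reduced to lines of the form "target_id used_percent"; how those two columns are extracted depends on the monitoring tool's output format and is not shown. The function name `fill_spread` and the 10-point default limit are illustrative choices.

```shell
# Read "target_id used_percent" lines on stdin, print the spread between the
# fullest and emptiest target, and exit 1 if it exceeds the limit ($1).
fill_spread() {
    awk -v limit="${1:-10}" '
        NR == 1        { min = $2 + 0; max = $2 + 0 }
        $2 + 0 < min   { min = $2 + 0 }
        $2 + 0 > max   { max = $2 + 0 }
        END {
            if (NR == 0) exit 2          # no input at all
            spread = max - min
            printf "min=%d%% max=%d%% spread=%d%%\n", min, max, spread
            exit (spread > limit + 0 ? 1 : 0)
        }'
}
```

Fed three targets at 60%, 62%, and 65% used, the function reports a spread of 5 points and exits 0. Targets at 40% and 90% would trip the default limit and exit 1, which a cron job or monitoring check can turn into an alert.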

HPC storage architecture is a long-term commitment. The choices made at initial deployment determine the performance ceiling and operational burden for the next 5–7 years. Contact Mevasis for HPC storage design, BeeGFS and Lustre deployment, and performance tuning services.