HPC Storage Technical Guide: Three-Tier Architecture, BeeGFS vs Lustre, and Troubleshooting
HPC storage is where most cluster performance problems originate. Compute nodes that spend 40% of their time waiting for data deliver only 60% of their purchased capacity. Unlike compute hardware, where the performance gap between generations is evolutionary, storage architecture decisions have binary consequences: a well-designed storage system feeds compute nodes at full speed, while a poorly designed one remains the bottleneck no matter how many CPU cores are added.
Three-Tier Storage Architecture
Production HPC storage is not a single system — it is three tiers optimized for different access patterns and cost profiles:
Tier 1 — Hot Scratch (NVMe SSD Parallel Filesystem)
- Purpose: Active job I/O — input data being read, output being written, temporary files
- Technology: NVMe SSDs in parallel filesystem configuration (BeeGFS or Lustre)
- Performance target: Aggregate sequential throughput matching compute demand (10–500 GB/s)
- Capacity: Sized to hold concurrent active workloads × 3 (input + output + temp)
- Retention: Not backed up; data deleted 30 days after last access
Tier 2 — Warm Project Storage (HDD Parallel Filesystem)
- Purpose: User project directories, group data, intermediate results
- Technology: HDD RAID in parallel filesystem or NFS
- Performance target: 1–50 GB/s aggregate (sufficient for data prep and result retrieval)
- Capacity: 500 TB – several PB depending on user base
- Retention: Backed up; quotas enforced per user/group
Tier 3 — Cold Archive
- Purpose: Published datasets, raw experimental data, long-term retention
- Technology: Tape (LTO-9) or object storage (Ceph, MinIO, cloud S3)
- Performance target: 0.5–5 GB/s (infrequent access)
- Capacity: Effectively unbounded (cost scales linearly with capacity)
- Retention: Multi-year or indefinite
Compute Nodes
│
├─── NVMe Scratch ──────── BeeGFS (NVMe tier) ──── 10–500 GB/s
│
├─── Project Storage ───── BeeGFS/Lustre (HDD) ─── 1–50 GB/s
│
└─── Archive access ─────── Object store / Tape ─── 0.5–5 GB/s
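The Tier 1 sizing rule above (concurrent active workloads × 3) can be sketched as a small helper. The function name scratch_capacity_tb and the example working-set sizes are illustrative assumptions, not figures from a real deployment.

```shell
# Sketch: Tier 1 capacity = sum of concurrent working sets x 3 (input + output + temp).
# scratch_capacity_tb is a hypothetical helper; arguments are per-job working sets in TB.
scratch_capacity_tb() {
    total=0
    for ws in "$@"; do
        total=$((total + ws))
    done
    echo $((total * 3))
}

# Example: three concurrent jobs with 20, 50, and 30 TB working sets (made-up numbers)
scratch_capacity_tb 20 50 30   # prints 300
```

The result is a floor rather than a purchase size; free-space headroom and growth would be added on top.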
BeeGFS vs Lustre: Decision Framework
Both are mature parallel filesystems used in production HPC environments. The right choice depends on scale, operational capabilities, and support requirements.
| Criterion | BeeGFS | Lustre |
|---|---|---|
| Setup complexity | Lower (hours to days) | Higher (days to weeks) |
| Operational complexity | Lower | Higher |
| Max proven scale | Multi-PB | Tens to hundreds of PB (Top500 systems) |
| HA / mirroring | Buddy Mirroring | Failover pairs on shared storage; File Level Redundancy (FLR) for mirrored files |
| Small file performance | Good | Variable |
| Metadata scalability | Multiple meta servers | Multiple MDTs via DNE; Data-on-MDT (DoM) for small files |
| Management tools | beegfs-ctl | lctl, lfs |
| Enterprise support | ThinkParQ | DDN (Whamcloud), HPE (Cray) |
| License | Free to use under the BeeGFS EULA (client module GPLv2); enterprise features require a support contract | GPLv2 (open source) + commercial support |
| Community | Active | Very active (OpenSFS) |
Choose BeeGFS when:
- Cluster size is up to ~500 nodes
- Operations team is small and values simplicity
- Budget for enterprise support is limited
- Deployment must be quick
Choose Lustre when:
- Cluster exceeds 500 nodes
- Storage is expected to grow to tens of petabytes
- Organization has Lustre expertise or budget for training
- Integration with top-tier HPC software ecosystem is required
Network Infrastructure
The storage network should be isolated from the MPI/compute network:
# BeeGFS on 25 GbE storage network
# /etc/beegfs/beegfs-storage.conf
connNetFilterFile = /etc/beegfs/conn-filter.conf
# /etc/beegfs/conn-filter.conf: restrict BeeGFS to the storage VLAN (one subnet per line)
192.168.10.0/24
# Verify storage traffic uses correct interface
beegfs-net
# Should show only storage interface IP addresses
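The check described above (every address BeeGFS uses should fall inside the storage subnet) can be automated with a small subnet test. This is a minimal sketch in POSIX shell; the helper names ip_to_int and in_subnet and the sample addresses are assumptions for illustration, and the addresses would in practice be copied from the beegfs-net output.

```shell
# ip_to_int: convert a dotted-quad IPv4 address to a 32-bit integer
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# in_subnet IP NETWORK PREFIXLEN: exit 0 if IP lies inside NETWORK/PREFIXLEN
in_subnet() {
    ip=$(ip_to_int "$1")
    net=$(ip_to_int "$2")
    mask=$(( (4294967295 << (32 - $3)) & 4294967295 ))
    [ $((ip & mask)) -eq $((net & mask)) ]
}

# Example with made-up addresses: one on the storage VLAN, one not
for addr in 192.168.10.21 10.0.0.5; do
    if in_subnet "$addr" 192.168.10.0 24; then
        echo "$addr: on storage VLAN"
    else
        echo "$addr: NOT on storage VLAN"
    fi
done
```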
For high-performance scratch, InfiniBand (HDR or NDR) as the storage network dramatically improves small-block IOPS and reduces latency:
# BeeGFS over RDMA (RDMA-capable network only)
# /etc/beegfs/beegfs-client.conf
connRDMABufSize = 8192
connRDMABufNum = 70
connUseRDMA = true
Common Problems and Solutions
Metadata Bottleneck
Symptom: Commands like ls -la, find, and small file creation/deletion are slow even when storage bandwidth is adequate. Creating 100,000 small files takes minutes.
Cause: All metadata operations (file creation, permission checks, directory listings) go through the metadata server. A single metadata server becomes the bottleneck for metadata-intensive workloads.
Solution (BeeGFS):
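# Add a second metadata server: install the beegfs-meta package on the new host,
# then initialize its metadata directory (path, ID 2, and mgmtd-host are placeholders)
/opt/beegfs/sbin/beegfs-setup-meta -p /data/beegfs/meta -s 2 -m mgmtd-host
systemctl start beegfs-meta
# Confirm that both metadata servers are registered
beegfs-ctl --listnodes --nodetype=meta
# BeeGFS assigns each new directory to a metadata server at creation time;
# existing directories are not rebalanced automatically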
# For directories with extreme small-file workloads:
# Use subdirectory structure (1M files per directory max)
mkdir -p /mnt/beegfs/scratch/job_${SLURM_JOB_ID}/{input,output,temp}
Solution (Lustre): Enable the Distributed Namespace Environment (DNE) to spread the namespace across multiple MDTs, or use Data-on-MDT (DoM) to store small-file data on the MDT alongside its metadata.
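Before adding metadata servers, it can help to confirm the symptom from a single client. The sketch below times empty-file creation in a directory; create_rate and the PROBE_DIR variable are hypothetical names. A one-client shell loop mostly measures per-operation latency rather than server capacity, so treat the number as a rough indicator and use mdtest for real measurements.

```shell
# create_rate DIR COUNT: create COUNT empty files in DIR and print files per second.
# Timing uses whole seconds, so small COUNT values give coarse results.
create_rate() {
    dir=$1
    count=$2
    mkdir -p "$dir" || return 1
    start=$(date +%s)
    i=0
    while [ "$i" -lt "$count" ]; do
        : > "$dir/f_$i"
        i=$((i + 1))
    done
    end=$(date +%s)
    elapsed=$((end - start))
    if [ "$elapsed" -eq 0 ]; then
        elapsed=1      # avoid division by zero on very fast runs
    fi
    echo $((count / elapsed))
}

# Runs only when PROBE_DIR is set, e.g. PROBE_DIR=/mnt/beegfs/scratch/md_probe
if [ -n "${PROBE_DIR:-}" ]; then
    create_rate "$PROBE_DIR" 10000
fi
```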
NUMA Mismatch on Storage Servers
Symptom: Storage throughput is significantly below theoretical NVMe bandwidth despite low CPU utilization.
Cause: Each NVMe controller attaches to the PCIe root complex of one NUMA node. If the storage software (BeeGFS storage daemon, Lustre OSS threads) runs on CPUs in a different NUMA node, all NVMe traffic crosses the inter-socket interconnect at reduced bandwidth and higher latency.
Solution:
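# Identify which NUMA node each NVMe controller is attached to
grep -H . /sys/class/nvme/nvme*/device/numa_node
# Bind the storage daemon's CPUs and memory to the matching NUMA node (node 1 here)
numactl --cpunodebind=1 --membind=1 /opt/beegfs/sbin/beegfs-storage cfgFile=/etc/beegfs/beegfs-storage.conf
# Or use a systemd service override, keeping the arguments from the packaged unit
# (check them with: systemctl cat beegfs-storage)
# /etc/systemd/system/beegfs-storage.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/numactl --cpunodebind=1 --membind=1 /opt/beegfs/sbin/beegfs-storage cfgFile=/etc/beegfs/beegfs-storage.conf runDaemonized=false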
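On servers with several NVMe devices, the NUMA lookup can be scripted so the binding is derived rather than hard-coded. This is a sketch that assumes the standard Linux sysfs layout (/sys/class/nvme/nvmeN/device/numa_node); numa_node_of is a hypothetical helper name, and treating -1 as node 0 is a simplifying assumption.

```shell
# numa_node_of SYSFS_DEVICE_DIR: print the NUMA node recorded for a PCIe device.
# The kernel reports -1 when the platform exposes no NUMA affinity; node 0 is assumed then.
numa_node_of() {
    node_file="$1/numa_node"
    [ -r "$node_file" ] || return 1
    node=$(cat "$node_file")
    if [ "$node" -lt 0 ]; then
        node=0
    fi
    echo "$node"
}

# Print a suggested numactl prefix for each NVMe controller found on this host
for ctrl in /sys/class/nvme/nvme*; do
    [ -d "$ctrl" ] || continue
    n=$(numa_node_of "$ctrl/device") || continue
    echo "$ctrl -> numactl --cpunodebind=$n --membind=$n"
done
```

If the devices behind one storage daemon span both NUMA nodes, a single binding cannot be local to all of them; running one daemon instance per node is one option to evaluate.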
Stripe Misconfiguration
Symptom: Single-client write bandwidth is good, but aggregate bandwidth from many clients does not scale as expected.
Cause: If numtargets (stripe width) is set to 1, each file lives entirely on one storage target. When many clients write to the same shared file, all of their writes land on that single target, so there is no parallelism.
Solution:
# Verify stripe settings
beegfs-ctl --getentryinfo /mnt/beegfs/scratch/my_dir
# Set optimal stripe for large-file parallel I/O
# numtargets should approach the number of parallel writers, but cannot exceed the number of storage targets
beegfs-ctl --setpattern \
--chunksize=1m \
--numtargets=8 \
/mnt/beegfs/scratch/large_files
# For many-client small-file workloads, use a single target: a file smaller than one chunk
# occupies one target anyway, so wider striping only adds overhead
beegfs-ctl --setpattern \
--chunksize=128k \
--numtargets=1 \
/mnt/beegfs/scratch/small_files
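To see why numtargets controls parallelism, it helps to map byte offsets to targets. The sketch below assumes plain round-robin (RAID0-style) striping, where chunk k of a file goes to target k mod numtargets; target_for_offset is a hypothetical helper, and the index it prints is the position within the file's stripe set, not a BeeGFS target ID.

```shell
# target_for_offset OFFSET_BYTES CHUNK_BYTES NUM_TARGETS
# Prints the zero-based stripe-set index of the target holding the byte at OFFSET_BYTES.
target_for_offset() {
    echo $(( ($1 / $2) % $3 ))
}

chunk=$((1024 * 1024))   # matches --chunksize=1m
for mib in 0 1 7 8 9; do
    off=$((mib * chunk))
    echo "offset ${mib} MiB -> target $(target_for_offset "$off" "$chunk" 8)"
done
```

With 8 targets the offsets cycle through indices 0 to 7 and wrap at 8 MiB; with numtargets=1 every offset maps to index 0, which is the single-target hotspot described above.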
Benchmarking with IOR and mdtest
Run before production deployment and after any storage configuration changes:
# IOR: sequential large-file throughput (64 processes, one file per process)
# -F: file-per-process; -w -r: write, then read; -k: keep the files after the run
mpirun -np 64 --hostfile hostfile \
    ./ior \
    -t 1m -b 4g -s 1 \
    -F \
    -w -r \
    -k \
    -o /mnt/beegfs/scratch/ior_test/data
# IOR: shared single file (tests shared-file contention and locking)
# -C -Q 1: shift ranks by one node for the read phase so reads are not served
# from the page cache of the node that wrote the data
mpirun -np 64 --hostfile hostfile \
    ./ior \
    -t 4m -b 32g -s 1 \
    -C -Q 1 \
    -w -r \
    -o /mnt/beegfs/scratch/ior_shared_test
# mdtest: metadata performance (small file creation/stat/delete)
# -n 10000: 10,000 items per process; -F: files only; -i 3: three iterations
mpirun -np 64 --hostfile hostfile \
    ./mdtest \
    -n 10000 \
    -F \
    -i 3 \
    -d /mnt/beegfs/scratch/mdtest_dir
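The IOR block sizes above determine how much data each run moves. If the total fits in the combined RAM of the client nodes, read results may partly reflect the page cache rather than the filesystem. The sketch below computes the total for the file-per-process run; ior_total_gib and exceeds_cache are hypothetical helpers, and the node count and RAM size are assumed values.

```shell
# ior_total_gib NUM_PROCS BLOCK_GIB SEGMENTS: total data moved by one IOR run, in GiB
ior_total_gib() {
    echo $(( $1 * $2 * $3 ))
}

# exceeds_cache TOTAL_GIB NUM_NODES RAM_PER_NODE_GIB:
# exit 0 if the data set is larger than the combined RAM of the client nodes
exceeds_cache() {
    [ "$1" -gt $(( $2 * $3 )) ]
}

total=$(ior_total_gib 64 4 1)    # -np 64, -b 4g, -s 1 gives 256 GiB
# Assumed client pool: 4 nodes with 256 GiB RAM each
if exceeds_cache "$total" 4 256; then
    echo "${total} GiB exceeds client RAM; caching effects should be small"
else
    echo "${total} GiB fits in client RAM; increase -b or -s, or rely on -C for reads"
fi
```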
Interpreting results:
| Result | Interpretation |
|---|---|
| Aggregate write/read within 10% of design target | Pass |
| Single-client bandwidth near theoretical peak, aggregate does not scale | Stripe config or network issue |
| Aggregate bandwidth scales but with high variance | Network congestion or storage target imbalance |
| mdtest rate < 50,000 creates/sec per server | Metadata bottleneck |
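Two of the thresholds in the table can be encoded as simple checks for use in an acceptance script. This is a sketch under the table's own numbers (10% tolerance, 50,000 creates/sec per server); meets_target and metadata_bound are hypothetical helper names, and the inputs are whole numbers because POSIX shell arithmetic is integer-only.

```shell
# meets_target MEASURED TARGET [TOLERANCE_PCT]:
# exit 0 if MEASURED is at least TARGET minus the tolerance (default 10%)
meets_target() {
    tol=${3:-10}
    [ $(( $1 * 100 )) -ge $(( $2 * (100 - tol) )) ]
}

# metadata_bound CREATES_PER_SEC NUM_META_SERVERS:
# exit 0 if the per-server create rate is below 50,000/s
metadata_bound() {
    [ $(( $1 / $2 )) -lt 50000 ]
}

# Example with made-up measurements: 46 GB/s against a 50 GB/s design target
if meets_target 46 50; then
    echo "bandwidth: pass"
else
    echo "bandwidth: below target"
fi
# 60,000 creates/s across 2 metadata servers
if metadata_bound 60000 2; then
    echo "metadata: likely bottleneck"
fi
```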
Best Practices
- Separate storage NICs from compute NICs. Storage and MPI traffic competing on the same interface causes latency spikes at inopportune moments (checkpoint writes during computation).
- Monitor storage target fill rates separately. If one storage target fills faster than the others (uneven distribution), I/O performance degrades even when aggregate capacity is available. Use beegfs-df or beegfs-ctl --listtargets --longnodes --spaceinfo to monitor.
- Set per-application stripe policies. Don’t use a single global stripe setting. Genomics pipelines (many small files) need a different configuration than CFD simulations (few large files).
- Enable quotas before users arrive. A user who fills the shared scratch filesystem stops all other users’ jobs. Quotas should be default, not a remediation after a first incident.
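The fill-rate check in the list above can be reduced to a single number for alerting: the spread between the fullest and emptiest target. The sketch below reads a simplified two-column format (target ID, used percent) that is an assumption for illustration, not the native output of any BeeGFS tool; fill_spread is a hypothetical helper.

```shell
# fill_spread: read "target_id used_percent" lines on stdin,
# print the difference between the highest and lowest used percent.
fill_spread() {
    awk 'NR == 1 { min = $2; max = $2 }
         { if ($2 < min) min = $2; if ($2 > max) max = $2 }
         END { if (NR > 0) print max - min }'
}

# Example with made-up fill levels: target 103 is filling much faster than the rest
printf '101 62\n102 64\n103 91\n' | fill_spread   # prints 29
```

What spread counts as a problem depends on the site; a threshold would need to be chosen from observed behavior rather than taken from this sketch.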
HPC storage architecture is a long-term commitment. The choices made at initial deployment determine the performance ceiling and operational burden for the next 5–7 years.