HPC Storage Guide: BeeGFS, Lustre, NVMe Scratch, Parallel Filesystem

HPC storage is where most cluster performance problems originate. Compute nodes that spend 40% of their time waiting for data are delivering 60% of their purchased capacity. Unlike compute hardware where the performance gap between generations is evolutionary, storage architecture decisions have binary consequences: a well-designed storage system transparently feeds compute nodes at full speed; a poorly designed one becomes the permanent bottleneck regardless of how many more CPU cores are added.

Three-Tier Storage Architecture

Production HPC storage is not a single system — it is three tiers optimized for different access patterns and cost profiles:

Tier 1 — Hot Scratch (NVMe SSD Parallel Filesystem)

Purpose: Active job I/O — input data being read, output being written, temporary files
Technology: NVMe SSDs in parallel filesystem configuration (BeeGFS or Lustre)
Performance target: Aggregate sequential throughput matching compute demand (10–500 GB/s)
Capacity: Sized to hold concurrent active workloads × 3 (input + output + temp)
Retention: Not backed up; data deleted 30 days after last access
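
The capacity rule above is simple arithmetic, and a small sketch makes the inputs explicit. The function name and the example figures (40 concurrent jobs, 2.5 TB of input per job) are illustrative assumptions, not measurements from a real cluster.

```python
def scratch_capacity_tb(concurrent_jobs, dataset_tb_per_job, multiplier=3.0):
    """Estimate Tier 1 scratch capacity in TB.

    multiplier=3.0 encodes the rule of thumb above: room for input,
    output, and temporary data for every concurrently active job.
    """
    if concurrent_jobs < 0 or dataset_tb_per_job < 0:
        raise ValueError("inputs must be non-negative")
    return concurrent_jobs * dataset_tb_per_job * multiplier


if __name__ == "__main__":
    # Hypothetical workload: 40 concurrent jobs, 2.5 TB of input each
    print(scratch_capacity_tb(40, 2.5))  # 300.0
```

Treat the result as a floor rather than a purchase figure: it leaves no headroom for uneven fill across storage targets.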

Tier 2 — Warm Project Storage (HDD Parallel Filesystem)

Purpose: User project directories, group data, intermediate results
Technology: HDD RAID in parallel filesystem or NFS
Performance target: 1–50 GB/s aggregate (sufficient for data prep and result retrieval)
Capacity: 500 TB – several PB depending on user base
Retention: Backed up; quotas enforced per user/group

Tier 3 — Cold Archive

Purpose: Published datasets, raw experimental data, long-term retention
Technology: Tape (LTO-9) or object storage (Ceph, MinIO, cloud S3)
Performance target: 0.5–5 GB/s (infrequent access)
Capacity: Potentially unlimited (cost scales linearly with capacity)
Retention: Multi-year or indefinite

Compute Nodes
     │
     ├─── NVMe Scratch ──────── BeeGFS (NVMe tier) ──── 10–500 GB/s
     │
     ├─── Project Storage ───── BeeGFS/Lustre (HDD) ─── 1–50 GB/s
     │
     └─── Archive access ─────── Object store / Tape ─── 0.5–5 GB/s

BeeGFS vs Lustre: Decision Framework

Both are mature parallel filesystems used in production HPC environments. The right choice depends on scale, operational capabilities, and support requirements.

Criterion	BeeGFS	Lustre
Setup complexity	Lower (hours to days)	Higher (days to weeks)
Operational complexity	Lower	Higher
Max proven scale	Multi-PB	Tens to hundreds of PB (Top500 systems)
HA / mirroring	Buddy Mirroring	Failover pairs on shared storage; File Level Redundancy (FLR) mirroring
Small file performance	Good	Variable
Metadata scalability	Multiple meta servers	Multiple MDTs via DNE; Data-on-MDT (DoM) for small files
Management tools	beegfs-ctl	lctl, lfs
Enterprise support	ThinkParQ	DDN (Whamcloud), HPE (Cray)
License	Client GPLv2, server under BeeGFS EULA (free to use; enterprise features need a support contract)	GPLv2 (open source) + commercial support
Community	Active	Very active (OpenSFS)

Choose BeeGFS when:

Cluster size is up to ~500 nodes
Operations team is small and values simplicity
Budget for enterprise support is limited
Need to deploy quickly

Choose Lustre when:

Cluster exceeds 500 nodes
Capacity will eventually reach tens of petabytes
Organization has Lustre expertise or budget for training
Integration with top-tier HPC software ecosystem is required
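
As a rough illustration, the two checklists above can be encoded as a simple tally. This is only a sketch of the rules of thumb in this guide; the function name, the parameter names, and the equal weighting of the criteria are assumptions, and a real decision would weigh them by site priorities.

```python
def score_filesystems(nodes, small_ops_team, limited_support_budget,
                      deploy_quickly, large_capacity_growth,
                      lustre_expertise, needs_hpc_ecosystem):
    """Count how many checklist items point to each filesystem."""
    beegfs = sum([nodes <= 500, small_ops_team,
                  limited_support_budget, deploy_quickly])
    lustre = sum([nodes > 500, large_capacity_growth,
                  lustre_expertise, needs_hpc_ecosystem])
    return {"BeeGFS": beegfs, "Lustre": lustre}


if __name__ == "__main__":
    # Hypothetical site: 120 nodes, small team, tight support budget
    print(score_filesystems(120, True, True, False, False, False, False))
    # {'BeeGFS': 3, 'Lustre': 0}
```

A close score is a signal to prototype both systems rather than a recommendation in itself.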

Network Infrastructure

The storage network must be isolated from the MPI/compute network:

# BeeGFS on 25 GbE storage network
# /etc/beegfs/beegfs-storage.conf
connNetFilterFile = /etc/beegfs/conn-filter.conf

# conn-filter.conf: allow BeeGFS connections only within the storage subnet
# (storage VLAN; one subnet in CIDR notation per line)
192.168.10.0/24

# Verify storage traffic uses the correct interface (run on a client)
beegfs-net
# Should show only storage interface IP addresses
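
When checking many nodes, the "only storage addresses" test can be automated. The sketch below assumes you have already collected the peer addresses (for example by copying them from the beegfs-net output); it does not parse that output itself, and the function name is hypothetical.

```python
import ipaddress


def outside_storage_subnet(addresses, storage_cidr="192.168.10.0/24"):
    """Return the addresses that do not belong to the storage subnet."""
    network = ipaddress.ip_network(storage_cidr)
    return [a for a in addresses if ipaddress.ip_address(a) not in network]


if __name__ == "__main__":
    # Hypothetical addresses: one on the storage VLAN, one on another network
    print(outside_storage_subnet(["192.168.10.21", "10.0.0.21"]))
    # ['10.0.0.21']
```

An empty list means every connection stays on the storage VLAN.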

For high-performance scratch, InfiniBand (HDR or NDR) as the storage network dramatically improves small-block IOPS and reduces latency:

# BeeGFS over RDMA (RDMA-capable network only)
# /etc/beegfs/beegfs-client.conf
connRDMABufSize = 8192
connRDMABufNum = 70
connUseRDMA = true
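
Larger RDMA buffers cost memory on every connection, so it helps to estimate the footprint before raising these values. The sketch below assumes the per-connection rule stated in the comments of the BeeGFS configuration files (buffer size × buffer count × 2); verify that against the documentation for your BeeGFS version. The function names are illustrative.

```python
def rdma_bytes_per_connection(buf_size=8192, buf_num=70):
    """Approximate RDMA buffer memory for one connection, in bytes.

    Assumes: connRDMABufSize * connRDMABufNum * 2 (check your version's docs).
    """
    return buf_size * buf_num * 2


def rdma_total_mib(connections, buf_size=8192, buf_num=70):
    """Approximate RDMA buffer memory for many connections, in MiB."""
    return connections * rdma_bytes_per_connection(buf_size, buf_num) / 2**20


if __name__ == "__main__":
    print(rdma_bytes_per_connection())  # 1146880
    print(rdma_total_mib(1000))         # 1093.75
```

With the values shown above, a server holding 1,000 RDMA connections would need roughly 1.1 GiB for buffers alone under this assumption.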

Common Problems and Solutions

Metadata Bottleneck

Symptom: Commands like ls -la, find, and small file creation/deletion are slow even when storage bandwidth is adequate. Creating 100,000 small files takes minutes.

Cause: All metadata operations (file creation, permission checks, directory listings) go through the metadata server. A single metadata server becomes the bottleneck for metadata-intensive workloads.

Solution (BeeGFS):

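# Add a second metadata server (BeeGFS 7.x setup tooling; run on the new host).
# /data/beegfs/meta and <mgmtd-host> are placeholders for your own values.
# The service registers itself with the management daemon on first start.
/opt/beegfs/sbin/beegfs-setup-meta -p /data/beegfs/meta -s 2 -m <mgmtd-host>
systemctl start beegfs-meta

# Confirm the new server is registered
beegfs-ctl --listnodes --nodetype=meta

# Existing directories stay on their current metadata server; only newly
# created directories are spread across servers. Check the owner with:
beegfs-ctl --getentryinfo /mnt/beegfs/scratch

# For directories with extreme small-file workloads:
# Use subdirectory structure (1M files per directory max)
mkdir -p /mnt/beegfs/scratch/job_${SLURM_JOB_ID}/{input,output,temp}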

Solution (Lustre): Enable the Distributed Namespace Environment (DNE) with multiple MDTs, or use Data-on-MDT (DoM) to store small-file data on the MDT alongside its metadata.
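
On either filesystem, applications can also reduce metadata pressure by spreading files over many subdirectories instead of one flat directory. The following is a minimal sketch of hash-based sharding; the function name, the fan-out of 256, and the example path are assumptions chosen for illustration.

```python
import hashlib
from pathlib import PurePosixPath


def sharded_path(root, filename, fanout=256):
    """Place filename in one of `fanout` numbered subdirectories of root.

    The bucket is derived from a hash of the name, so the same file
    always maps to the same subdirectory.
    """
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % fanout
    return PurePosixPath(root) / f"{bucket:03d}" / filename


if __name__ == "__main__":
    # Hypothetical job directory and file name
    print(sharded_path("/mnt/beegfs/scratch/job_1/output", "sample_000123.dat"))
```

With a fan-out of 256, ten million files average about 39,000 entries per directory, well under the per-directory limit suggested above.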

NUMA Mismatch on Storage Servers

Symptom: Storage throughput is significantly below theoretical NVMe bandwidth despite low CPU utilization.

Cause: Each NVMe controller is attached to one specific NUMA domain. If the storage software (BeeGFS storage daemon, Lustre OSS) is bound to CPUs in the remote NUMA domain, all NVMe traffic crosses the NUMA interconnect at reduced bandwidth.

Solution:

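# Identify which NUMA node each NVMe controller is attached to
# (grep prints the file name next to each value; -1 means none is reported)
grep . /sys/block/nvme*/device/numa_node

# Bind the storage daemon's CPUs and memory to the matching NUMA node
# (node 1 here). Binary path and arguments follow the BeeGFS 7.x packages;
# copy the exact ones from: systemctl cat beegfs-storage
numactl --cpunodebind=1 --membind=1 /opt/beegfs/sbin/beegfs-storage \
  cfgFile=/etc/beegfs/beegfs-storage.conf runDaemonized=false

# Or use a systemd service override
# /etc/systemd/system/beegfs-storage.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/numactl --cpunodebind=1 --membind=1 /opt/beegfs/sbin/beegfs-storage cfgFile=/etc/beegfs/beegfs-storage.conf runDaemonized=false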
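
On servers with many drives, the sysfs check can be scripted. The sketch below reads the same numa_node files as the command above and lists the devices that are not on the node the daemon is bound to. Both function names are hypothetical, and the sysfs layout is assumed to match a recent Linux kernel.

```python
import glob
import os


def nvme_numa_nodes(sys_block="/sys/block"):
    """Map each NVMe block device name to the NUMA node reported in sysfs."""
    nodes = {}
    pattern = os.path.join(sys_block, "nvme*", "device", "numa_node")
    for path in sorted(glob.glob(pattern)):
        device = path.split(os.sep)[-3]  # .../<device>/device/numa_node
        with open(path) as handle:
            nodes[device] = int(handle.read().strip())
    return nodes


def devices_off_node(nodes, daemon_node):
    """Return devices whose NUMA node differs from the daemon's node."""
    return sorted(dev for dev, node in nodes.items() if node != daemon_node)


if __name__ == "__main__":
    # Assumes the storage daemon is bound to NUMA node 1
    print(devices_off_node(nvme_numa_nodes(), daemon_node=1))
```

If drives are split across both NUMA nodes, a single binding cannot be local to all of them; one storage service instance per node is a common way around that.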

Stripe Misconfiguration

Symptom: Single-client write bandwidth is good, but aggregate bandwidth from many clients does not scale as expected.

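Cause: numtargets (stripe width) sets how many storage targets each file is spread across. If it is set to 1, every write to a given file lands on a single storage target, so many clients writing to one shared file get no parallelism.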

Solution:

# Verify stripe settings
beegfs-ctl --getentryinfo /mnt/beegfs/scratch/my_dir

# Set optimal stripe for large-file parallel I/O
# A wider stripe spreads concurrent writers to one file across more targets;
# numtargets cannot usefully exceed the number of storage targets available
beegfs-ctl --setpattern \
  --chunksize=1m \
  --numtargets=8 \
  /mnt/beegfs/scratch/large_files

# For many-client small-file workloads, use a stripe of 1: a small file fits
# in one chunk, so a wider stripe adds overhead without adding bandwidth
beegfs-ctl --setpattern \
  --chunksize=128k \
  --numtargets=1 \
  /mnt/beegfs/scratch/small_files
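
To see why stripe width matters, it helps to model how file offsets map to targets. The sketch below assumes plain round-robin striping by chunk, which is a simplification: the filesystem chooses the actual targets, and the indices here are positions within a file's stripe set rather than real target IDs.

```python
MIB = 1024 * 1024


def stripe_index(offset, chunk_size, num_targets):
    """Position in the stripe set that holds the byte at `offset`."""
    return (offset // chunk_size) % num_targets


def targets_used(file_size, chunk_size, num_targets):
    """Number of targets that hold at least one chunk of the file."""
    chunks = -(-file_size // chunk_size)  # ceiling division
    return min(chunks, num_targets)


if __name__ == "__main__":
    print(stripe_index(9 * MIB, 1 * MIB, 8))     # 1
    print(targets_used(1024 * MIB, 1 * MIB, 8))  # 8
    print(targets_used(64 * 1024, 128 * 1024, 1))  # 1
```

Under this model a 1 GiB file with 1 MiB chunks engages all 8 targets, while a 64 KiB file never touches more than one, whatever numtargets is set to.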

Benchmarking with IOR and mdtest

Run before production deployment and after any storage configuration changes:

# IOR: sequential large-file throughput (64 clients, 1 file per client)
#   -F      file-per-process
#   -w -r   write, then read back
#   -k      keep the test files after the run
# Comments sit above the command: text after a trailing backslash breaks
# the line continuation.
mpirun -np 64 --hostfile hostfile \
  ./ior \
  -t 1m -b 4g -s 1 \
  -F \
  -w -r \
  -k \
  -o /mnt/beegfs/scratch/ior_test/data

# IOR: shared single file (tests striping and shared-file contention)
#   -C -Q 1  shift ranks by one node for the read phase so reads are not
#            served from the writer's client cache (task reordering, not
#            collective I/O)
mpirun -np 64 --hostfile hostfile \
  ./ior \
  -t 4m -b 32g -s 1 \
  -C -Q 1 \
  -w -r \
  -o /mnt/beegfs/scratch/ior_shared_test

# mdtest: metadata performance (small file creation/stat/delete)
#   -n 10000  10,000 items per process
#   -i 3      3 iterations
mpirun -np 64 --hostfile hostfile \
  ./mdtest \
  -n 10000 \
  -i 3 \
  -d /mnt/beegfs/scratch/mdtest_dir

Interpreting results:

Result	Interpretation
Aggregate write/read within 10% of design target	Pass
Single-client bandwidth near theoretical peak, aggregate does not scale	Stripe config or network issue
Aggregate bandwidth scales but with high variance	Network congestion or storage target imbalance
mdtest rate < 50,000 creates/sec per server	Metadata bottleneck
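
The table can be turned into an automated check that runs after each benchmark. The sketch below assumes the figures have already been extracted from the IOR and mdtest summaries. The function name is hypothetical, and the 15% variance limit is an assumption, since the table only says "high variance".

```python
from statistics import mean, pstdev

CREATE_RATE_FLOOR = 50_000  # creates/sec per metadata server, from the table


def interpret(aggregate_gbs, target_gbs, single_client_near_peak=True,
              run_samples_gbs=None, creates_per_sec_per_server=None,
              variance_limit=0.15):
    """Return the table's interpretations that apply to one benchmark run."""
    findings = []
    if aggregate_gbs >= 0.9 * target_gbs:
        findings.append("pass")
    elif single_client_near_peak:
        findings.append("stripe config or network issue")
    else:
        findings.append("below target")
    if run_samples_gbs and len(run_samples_gbs) > 1:
        # Coefficient of variation across repeated runs
        spread = pstdev(run_samples_gbs) / mean(run_samples_gbs)
        if spread > variance_limit:
            findings.append("network congestion or storage target imbalance")
    if (creates_per_sec_per_server is not None
            and creates_per_sec_per_server < CREATE_RATE_FLOOR):
        findings.append("metadata bottleneck")
    return findings


if __name__ == "__main__":
    # Hypothetical results against a 100 GB/s design target
    print(interpret(95, 100))  # ['pass']
    print(interpret(60, 100, run_samples_gbs=[100, 60, 140],
                    creates_per_sec_per_server=20_000))
```

Storing the returned findings alongside each benchmark run makes regressions after configuration changes easy to spot.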

Best Practices

Separate storage NICs from compute NICs. Storage and MPI traffic competing on the same interface causes latency spikes at inopportune moments (checkpoint writes during computation).
Monitor storage target fill rates separately. If one storage target fills faster than others (uneven distribution), I/O performance will degrade even when aggregate capacity is available. Use beegfs-df or beegfs-ctl --listtargets --nodetype=storage --spaceinfo --longnodes to monitor.
Set per-application stripe policies. Don’t use a single global stripe setting. Genomics pipelines (many small files) need different configuration than CFD simulations (few large files).
Enable quotas before users arrive. A user who fills the shared scratch filesystem stops all other users’ jobs. Quotas should be default, not a remediation after a first incident.
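
Fill-rate monitoring lends itself to a simple automated check. The sketch below flags targets that are noticeably fuller than the average. It assumes the per-target usage percentages have already been collected from your monitoring tool; the function name, the target IDs, and the 10-point tolerance are illustrative.

```python
def overfull_targets(used_pct, tolerance_pct=10.0):
    """Return IDs of targets more than tolerance_pct points above the mean fill."""
    if not used_pct:
        return []
    average = sum(used_pct.values()) / len(used_pct)
    return sorted(target for target, pct in used_pct.items()
                  if pct - average > tolerance_pct)


if __name__ == "__main__":
    # Hypothetical fill levels (percent used) for three storage targets
    print(overfull_targets({"101": 40.0, "102": 42.0, "103": 71.0}))
    # ['103']
```

Running a check like this from cron and alerting on a non-empty result catches imbalance before a single full target starts failing writes.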

HPC storage architecture is a long-term commitment. The choices made at initial deployment determine the performance ceiling and operational burden for the next 5–7 years. Contact Mevasis for HPC storage design, BeeGFS and Lustre deployment, and performance tuning services.
