HPC Cluster Best Practices: Operations, Security, Storage, Monitoring

Running an HPC cluster well requires more than correct initial configuration. The gap between a cluster that works and one that runs efficiently, securely, and reliably for years is filled by operational discipline. These 20 practices, drawn from production HPC environments, are among the most effective steps an HPC team can take to improve cluster quality.

Category 1: Hardware Planning

1. Start with a workload analysis, not a hardware catalog. Before specifying nodes, quantify your actual workload: application types, typical job sizes, peak vs. average demand, I/O intensity. Hardware purchased for the wrong workload profile performs disappointingly for its entire useful life.
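
As a rough illustration of practice 1, the sketch below summarizes a handful of job records by size. The records, the 16-core cutoff for a "small" job, and the function name are all hypothetical; in practice the input would come from your scheduler's accounting data.

```python
from statistics import mean

# Hypothetical accounting records: (cores requested, runtime in hours).
jobs = [(4, 0.5), (64, 12.0), (4, 1.0), (256, 48.0), (16, 2.0), (4, 0.25)]

def summarize(jobs, small_cutoff=16):
    """Compare how many jobs are small with how many core-hours they consume."""
    core_hours = [cores * hours for cores, hours in jobs]
    small = [ch for (cores, _), ch in zip(jobs, core_hours) if cores <= small_cutoff]
    return {
        "small_job_fraction": len(small) / len(jobs),
        "small_core_hour_fraction": sum(small) / sum(core_hours),
        "mean_cores": mean(cores for cores, _ in jobs),
    }

summary = summarize(jobs)
print(summary)
```

In this made-up sample, two thirds of the jobs are small, yet they account for well under 1% of the core-hours. That kind of split is what should drive node sizing, rather than the job count alone.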

2. Buy homogeneous hardware within each generation. Identical CPU models, memory capacity, and NIC types within a partition eliminate a whole class of debugging problems. When nodes differ in subtle ways (different BIOS versions, different DIMM speeds), diagnosing performance anomalies is significantly harder.

3. Size storage before compute. I/O bottlenecks are often more limiting than compute capacity. Compute nodes that cannot read their input data fast enough sit idle waiting on I/O and can run at a fraction of their potential (for example, 40% efficiency), regardless of CPU count. Storage bandwidth is also harder to expand after the fact than compute.
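
A back-of-the-envelope sketch shows how a figure like 40% can arise. It assumes, for simplicity, that I/O and compute do not overlap; the job size and bandwidth numbers are invented for illustration.

```python
def efficiency(compute_s, data_gb, bandwidth_gb_s):
    """Fraction of wall time spent computing, assuming I/O and compute do not overlap."""
    io_s = data_gb / bandwidth_gb_s
    return compute_s / (compute_s + io_s)

# Hypothetical job: 600 s of pure compute on 900 GB of input data.
slow = efficiency(compute_s=600, data_gb=900, bandwidth_gb_s=1.0)
fast = efficiency(compute_s=600, data_gb=900, bandwidth_gb_s=10.0)
print(f"1 GB/s: {slow:.0%}, 10 GB/s: {fast:.0%}")
```

Adding CPU cores shortens only the compute term, so under these assumptions the slow-storage case cannot be fixed by buying more nodes.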

4. Plan for one upgrade cycle ahead. Network switches, storage controllers, and management hardware often have 7–10 year useful lives. Choose infrastructure that can accept the next generation of compute nodes when they arrive.

Category 2: Storage

5. Never back up scratch storage. Scratch (job temporary I/O) is transient by definition. Backing it up wastes backup capacity and creates a false sense of data protection. Invest backup resources in home directories and project storage where data has lasting value.

6. Set and enforce storage quotas. Without quotas, a single user or process can fill the shared filesystem and stop all jobs. Use per-user and per-group quotas on home and project storage. Alert at 80% of quota, hard-limit at 100%.
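
The 80% alert and 100% hard limit from practice 6 can be sketched as a small classification function. The user names and the 1 TiB quota are hypothetical, and a real deployment would rely on the filesystem's own quota enforcement rather than a script.

```python
def quota_status(used_bytes, quota_bytes, warn_fraction=0.80):
    """Classify usage against a quota: 'ok', 'warn' (>= 80%), or 'blocked' (>= 100%)."""
    if quota_bytes <= 0:
        raise ValueError("quota_bytes must be positive")
    fraction = used_bytes / quota_bytes
    if fraction >= 1.0:
        return "blocked"
    if fraction >= warn_fraction:
        return "warn"
    return "ok"

TIB = 1024 ** 4
# Hypothetical usage figures for three users sharing the same 1 TiB quota.
usage = {"alice": 0.5 * TIB, "bob": 0.85 * TIB, "carol": 1.0 * TIB}
report = {user: quota_status(used, 1 * TIB) for user, used in usage.items()}
print(report)
```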

7. Separate scratch, project, and archive storage. Different storage tiers have different performance, cost, and reliability requirements. Mixing them on the same filesystem creates contention and makes tiered data management impossible.

8. Run IOR benchmarks before production. After installing parallel storage, run IOR to verify that aggregate bandwidth meets the design target. Discovering that storage is 3× slower than expected after users have started using the cluster is avoidable.
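
One way to make the check in practice 8 repeatable is to parse the benchmark summary and compare it with the design target. The sketch below assumes summary lines of the form "Max Write: ... MiB/sec", which may vary between IOR versions; the sample text, the targets, and the 90% tolerance are all illustrative.

```python
import re

# Illustrative text in the style of an IOR summary; real output may differ by version.
SAMPLE = """\
Max Write: 41250.10 MiB/sec (43253.86 MB/sec)
Max Read:  58900.75 MiB/sec (61761.91 MB/sec)
"""

def parse_max_bandwidth(text):
    """Extract {'write': MiB/s, 'read': MiB/s} from 'Max Write/Read' summary lines."""
    pattern = re.compile(r"^Max (Write|Read):\s+([\d.]+) MiB/sec", re.MULTILINE)
    return {op.lower(): float(value) for op, value in pattern.findall(text)}

def meets_target(measured, target_mib_s, tolerance=0.90):
    """Pass when each measured figure reaches at least `tolerance` of its target."""
    return {op: bw >= tolerance * target_mib_s[op] for op, bw in measured.items()}

measured = parse_max_bandwidth(SAMPLE)
# Hypothetical design targets in MiB/s.
result = meets_target(measured, {"write": 50000.0, "read": 60000.0})
print(measured, result)
```

Here the read figure passes and the write figure does not, which is the kind of result worth resolving before users arrive.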

Category 3: Job Scheduler

9. Design partitions to match your workload profile. Create separate partitions for short interactive jobs, long batch jobs, GPU jobs, and high-memory jobs. Generic one-size-fits-all queue designs lead to priority inversion where 5-minute test jobs wait behind 7-day simulations.

10. Enable and configure fairshare. Fairshare prevents resource monopolization by heavy users while ensuring light users get priority when they need it. Without fairshare, the cluster effectively rewards whoever submits the most jobs, regardless of how much of the cluster they have already consumed.
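
To show the idea behind fairshare, the sketch below uses one simple decay formula, F = 2^(-U/S), where U is an account's share of recent usage and S is its allocated share. This form is an assumption chosen for illustration; production schedulers add decay windows, damping factors, and account hierarchies, so consult your scheduler's documentation for the exact calculation.

```python
def fairshare_factor(usage_fraction, share_fraction):
    """Simplified fairshare factor: 0.5 at exactly fair usage, higher when under-using."""
    if share_fraction <= 0:
        raise ValueError("share_fraction must be positive")
    return 2 ** (-usage_fraction / share_fraction)

# Hypothetical accounts, each allocated 25% of the cluster.
heavy = fairshare_factor(usage_fraction=0.60, share_fraction=0.25)
on_target = fairshare_factor(usage_fraction=0.25, share_fraction=0.25)
light = fairshare_factor(usage_fraction=0.05, share_fraction=0.25)
print(round(heavy, 3), on_target, round(light, 3))
```

The heavy account's factor falls well below 0.5 and the light account's rises above it, so the light user's next job is prioritized without any manual intervention.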

11. Configure cgroup-based resource isolation. Without cgroup enforcement (in Slurm, ConstrainRAMSpace=yes and ConstrainCores=yes in cgroup.conf, with the cgroup task plugin enabled), a job that requests 64 GB but tries to use 256 GB can exhaust the node's memory and take down other users’ jobs along with it. Cgroups enforce job-level resource limits at the kernel level.
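
A configuration audit for practice 11 can be as simple as the sketch below, which checks that both settings named above are present and set to yes. The sample file content is illustrative, the parser is deliberately minimal, and keys are compared case-insensitively as a simplifying assumption.

```python
REQUIRED = {"constrainramspace": "yes", "constraincores": "yes"}

def parse_conf(text):
    """Parse 'Key=Value' lines, ignoring blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip().lower()] = value.strip().lower()
    return settings

def missing_constraints(text):
    """Return the required keys that are absent or not set to 'yes'."""
    settings = parse_conf(text)
    return sorted(k for k, v in REQUIRED.items() if settings.get(k) != v)

# Illustrative file content with one constraint left disabled.
SAMPLE_CGROUP_CONF = """\
# cgroup.conf (illustrative)
ConstrainCores=yes
ConstrainRAMSpace=no
"""
problems = missing_constraints(SAMPLE_CGROUP_CONF)
print(problems)
```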

Category 4: Network

12. Never mix MPI traffic with management or storage traffic. The InfiniBand or RoCE fabric used for MPI must be isolated from management and storage networks. Mixed traffic causes latency variability that destroys tightly coupled MPI performance. Physical separation is ideal; logical isolation (VLANs on Ethernet, partitions on InfiniBand) is an acceptable minimum.

13. Verify InfiniBand link speed after installation. Run ibstat and verify each port shows “Active” state at the expected rate (4× HDR = 200 Gb/s per port). Partially seated cables or mismatched QSFP cables and transceivers can make a link negotiate down to a lower speed or width, silently reducing bandwidth by 2–4× without obvious errors.
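
Checking every port by eye does not scale, so the verification in practice 13 is worth scripting. The sketch below parses text in the style of ibstat output and reports ports that are not Active at the expected rate. The sample excerpt and device names are made up, and the field layout is assumed from typical ibstat output, so verify it against your own tool version.

```python
import re

# Illustrative excerpt in the style of `ibstat` output; device names are made up.
SAMPLE = """\
CA 'mlx5_0'
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
CA 'mlx5_1'
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
"""

def check_ports(text, expected_rate=200):
    """Return (ca, port, state, rate) for ports not Active at expected_rate (Gb/s)."""
    bad, ca, port, state = [], None, None, None
    for line in text.splitlines():
        line = line.strip()
        if m := re.match(r"CA '(.+)'", line):
            ca = m.group(1)
        elif m := re.match(r"Port (\d+):", line):
            port = int(m.group(1))
        elif m := re.match(r"State: (\w+)", line):
            state = m.group(1)
        elif m := re.match(r"Rate: (\d+)", line):
            rate = int(m.group(1))
            if state != "Active" or rate != expected_rate:
                bad.append((ca, port, state, rate))
    return bad

degraded = check_ports(SAMPLE)
print(degraded)
```

The second adapter in the sample is Active but running at half the expected rate, which is exactly the silent degradation described above.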

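14. Configure jumbo frames on the compute network. On Ethernet-based fabrics (including RoCE), MTU 9000 reduces CPU overhead for large MPI messages; native InfiniBand has its own MTU setting (up to 4096 bytes), which should likewise be set consistently. Verify all network devices in the MPI path (HCAs, switches, NICs) have matching MTU. A single device with MTU 1500 in the path causes fragmentation or dropped packets and silently reduces performance.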
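
The MTU check in practice 14 amounts to comparing every device in the path against one expected value. The sketch below does that over a hypothetical inventory; on Linux hosts the interface value can be read from /sys/class/net/<interface>/mtu, while switch values have to come from the switches themselves.

```python
def mtu_mismatches(mtus, expected=9000):
    """Return the devices whose MTU differs from the expected value."""
    return {device: mtu for device, mtu in mtus.items() if mtu != expected}

# Hypothetical inventory of every hop between two compute nodes.
path = {
    "node01:eth2": 9000,
    "leaf-sw1": 9000,
    "spine-sw1": 1500,
    "node02:eth2": 9000,
}
mismatched = mtu_mismatches(path)
print(mismatched)
```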

Category 5: Monitoring

15. Deploy monitoring before the first production job. Prometheus, DCGM Exporter, SLURM Exporter, and Grafana should be running before users arrive. Debugging performance problems without historical metrics is 10× harder than with them.

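16. Monitor GPU temperature and ECC errors proactively. GPU failures often announce themselves with rising temperature and increasing ECC error rates days before a hard failure. Alerting on a rising count of correctable (single-bit) ECC errors enables proactive replacement before a job-killing failure. Double-bit errors are uncorrectable and usually mean a job has already been affected, so treat them as a signal to drain the node immediately.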
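
The alerting logic in practice 16 can be sketched as a comparison between the oldest and newest samples in a time window. The thresholds, field names, and sample values below are hypothetical; real values would come from your GPU metrics exporter, and sensible limits depend on the GPU model.

```python
def gpu_alerts(samples, temp_limit_c=85, sbe_increase_limit=10):
    """Evaluate samples ordered oldest to newest.

    Each sample has 'temp_c' plus cumulative ECC counters 'sbe_total'
    (single-bit, correctable) and 'dbe_total' (double-bit, uncorrectable).
    """
    first, last = samples[0], samples[-1]
    alerts = []
    if last["temp_c"] >= temp_limit_c:
        alerts.append("temperature")
    if last["sbe_total"] - first["sbe_total"] >= sbe_increase_limit:
        alerts.append("correctable_ecc_rising")
    if last["dbe_total"] > first["dbe_total"]:
        alerts.append("uncorrectable_ecc")
    return alerts

# Hypothetical readings for one GPU over a monitoring window.
window = [
    {"temp_c": 70, "sbe_total": 2, "dbe_total": 0},
    {"temp_c": 78, "sbe_total": 9, "dbe_total": 0},
    {"temp_c": 86, "sbe_total": 31, "dbe_total": 0},
]
alerts = gpu_alerts(window)
print(alerts)
```

In this sample both early-warning alerts fire while the uncorrectable counter is still zero, which is the window in which a replacement can be scheduled.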

17. Track queue wait time as an SLA metric. Job wait time is the most user-visible indicator of cluster health. Track the average in Grafana, alongside the median or a high percentile because a few very long waits can skew the average, and set alerts when it exceeds targets. Rising wait time signals that capacity is insufficient or that a partition policy is misconfigured.
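
Wait time is simply start time minus submit time, so the metric in practice 17 is easy to compute from accounting records. The sketch below uses hypothetical timestamps and reports the median and maximum next to the mean.

```python
from datetime import datetime
from statistics import mean, median

FMT = "%Y-%m-%dT%H:%M:%S"

def wait_minutes(submit, start):
    """Minutes a job spent queued between submission and start."""
    delta = datetime.strptime(start, FMT) - datetime.strptime(submit, FMT)
    return delta.total_seconds() / 60

# Hypothetical (submit, start) pairs, e.g. exported from scheduler accounting.
jobs = [
    ("2024-03-01T09:00:00", "2024-03-01T09:02:00"),
    ("2024-03-01T09:05:00", "2024-03-01T09:35:00"),
    ("2024-03-01T10:00:00", "2024-03-01T14:00:00"),
    ("2024-03-01T10:10:00", "2024-03-01T10:15:00"),
]
waits = [wait_minutes(submit, start) for submit, start in jobs]
stats = {"mean": mean(waits), "median": median(waits), "max": max(waits)}
print(stats)
```

In this sample a single four-hour wait pulls the mean above an hour while the median stays under 20 minutes, so the choice of statistic matters when setting an alert threshold.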

Category 6: Security

18. Segment the cluster network. Management, compute, storage, and external access should be four separate VLANs. Direct external access to compute or management nodes bypasses security controls. Users access the cluster via login nodes only. Management (IPMI/BMC) is accessible only from the admin network.

19. Enforce SSH key authentication and disable password authentication. Brute-force attacks against password authentication are routine. Set PasswordAuthentication no in sshd_config on all cluster nodes and require SSH keys or certificates.
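
Practice 19 is easy to audit across nodes with a small configuration check. The sketch below is a simplified parser: it takes the first occurrence of each keyword, which is how sshd resolves repeated keywords, but it ignores Match blocks, Include directives, and other authentication methods such as keyboard-interactive. The sample fragments are illustrative.

```python
def effective_sshd_options(text):
    """Return the first value seen for each keyword, with keywords lower-cased."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            options.setdefault(parts[0].lower(), parts[1].strip().lower())
    return options

def password_auth_disabled(text):
    """True when passwords are explicitly disabled and public keys are not disabled."""
    opts = effective_sshd_options(text)
    return (opts.get("passwordauthentication") == "no"
            and opts.get("pubkeyauthentication", "yes") == "yes")

# Illustrative fragments, not a complete sshd_config.
HARDENED = "# illustrative\nPubkeyAuthentication yes\nPasswordAuthentication no\n"
DEFAULTS = "# illustrative\n#PasswordAuthentication yes\n"
print(password_auth_disabled(HARDENED), password_auth_disabled(DEFAULTS))
```

The second fragment fails the check because a commented-out line leaves the setting at its default, so the check requires an explicit "no".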

Category 7: Software Environment

20. Use environment modules or Lmod to manage software. Multiple users running different versions of the same software (Python 3.9 vs. 3.11, OpenMPI 4.0 vs. 4.1) coexist without conflict only with a module system. Install Lmod with a hierarchical module naming convention to prevent incompatible module combinations.

Operational Maturity Model

These practices can be viewed as a maturity progression:

Level 1 (Basic): Cluster runs, jobs execute, no monitoring.

Level 2 (Managed): Monitoring deployed, quotas enforced, backups running, SSH key authentication.

Level 3 (Optimized): Fairshare tuned to workload, cgroup isolation enabled, partition design matches job profile, IB performance verified.

Level 4 (Excellent): SLA metrics tracked and met, proactive GPU health monitoring, disaster recovery (DR) tested quarterly, user onboarding documentation current, capacity planning process in place.

Most research HPC clusters operate at Level 1–2. The difference between Level 2 and Level 4 is not hardware; it is operational discipline applied consistently.

Building and maintaining an excellent HPC cluster is an ongoing engineering discipline, not a one-time installation. Contact Mevasis for HPC cluster assessment, operations support, and managed HPC services.
