
HPC Frequently Asked Questions: Architecture, SLURM, Storage, Security, and Cloud

A comprehensive HPC FAQ covering how HPC differs from enterprise servers, buy vs. rent, the SLURM scheduler, InfiniBand networking, storage design, security, the installation process, maintenance, and cloud vs. on-premise.

This FAQ addresses the most common questions organizations ask when first approaching HPC infrastructure — from fundamental definitions to procurement decisions and operational concerns.

What Is HPC?

Q: What exactly is HPC, and how is it different from an enterprise server?

HPC (High Performance Computing) refers to computing systems designed for sustained, parallel workloads that exceed the capacity of any single server. The distinction from enterprise servers is architectural:

  • Enterprise servers optimize for transaction throughput, high availability, and mixed workloads. A typical 2-socket server handles web requests, database queries, and virtualization efficiently.
  • HPC clusters optimize for peak floating-point performance, memory bandwidth, and inter-node communication. A 100-node HPC cluster can run a single simulation across all nodes simultaneously.

Key differences:

| Aspect | Enterprise Server | HPC Cluster |
|---|---|---|
| Scaling model | Vertical (bigger server) | Horizontal (more nodes) |
| Interconnect | 10/25 GbE | InfiniBand (100–400 Gb/s) |
| Storage | SAN / NAS | Parallel filesystem (BeeGFS/Lustre) |
| Job management | VM scheduler | HPC job scheduler (SLURM) |
| Optimization target | Throughput/availability | Peak compute/bandwidth |

Q: What kinds of problems require HPC?

HPC is necessary when:

  • The computation cannot fit on a single server (model too large, simulation too complex)
  • The computation is time-sensitive and must complete in hours, not weeks
  • Multiple researchers need to run simultaneous workloads on shared infrastructure
  • The dataset is too large to move (analysis must come to the data)

Typical domains: computational fluid dynamics (CFD), molecular dynamics, finite element analysis (FEA), machine learning / AI training, genomics/bioinformatics, seismic processing, weather modeling, financial Monte Carlo simulation.


Buy vs. Rent

Q: Should we buy our own HPC cluster or use cloud HPC?

Both approaches have valid use cases. Key decision factors:

| Factor | On-Premise | Cloud HPC |
|---|---|---|
| Capital requirement | High upfront | Minimal |
| Unit cost at scale | 3–5× cheaper/flop | Convenience premium |
| Data sovereignty | Full control | Data leaves premises |
| Peak flexibility | Fixed capacity | Elastic |
| Latency-sensitive MPI | InfiniBand available | Cloud network may limit |
| Total utilization | Must plan for peaks | Pay for peaks only |

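Rule of thumb: if your utilization will consistently exceed 60–70%, on-premise is more cost-effective. Below 40%, cloud or leased HPC is usually cheaper once total cost of ownership (hardware, power, cooling, staff) is counted in full. Between those thresholds, the other factors in the table tend to decide.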

A hybrid approach (on-premise baseline + cloud burst for peaks) is increasingly common and often the best fit.
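
The utilization rule of thumb can be checked against your own numbers. The Python sketch below is a rough illustration only: the annual cost per node and the cloud hourly rate are hypothetical placeholders, not quotes from any vendor, and the model ignores data transfer fees, discounts, and staffing differences.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def cost_per_used_node_hour(annual_cost_per_node: float, utilization: float) -> float:
    """Effective on-premise cost of one node-hour that actually runs a job."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return annual_cost_per_node / (HOURS_PER_YEAR * utilization)


def break_even_utilization(annual_cost_per_node: float, cloud_rate_per_hour: float) -> float:
    """Utilization at which on-premise and cloud cost the same per used node-hour."""
    return annual_cost_per_node / (HOURS_PER_YEAR * cloud_rate_per_hour)


# Hypothetical figures, for illustration only
ANNUAL_COST_PER_NODE = 13_140.0  # amortized hardware + power + cooling + staff
CLOUD_RATE_PER_HOUR = 3.00       # on-demand price of a comparable instance

threshold = break_even_utilization(ANNUAL_COST_PER_NODE, CLOUD_RATE_PER_HOUR)
print(f"Break-even utilization: {threshold:.0%}")

for utilization in (0.3, 0.5, 0.7):
    on_prem = cost_per_used_node_hour(ANNUAL_COST_PER_NODE, utilization)
    cheaper = "on-premise" if on_prem < CLOUD_RATE_PER_HOUR else "cloud (or tie)"
    print(f"{utilization:.0%} utilized: on-premise ${on_prem:.2f}/h "
          f"vs cloud ${CLOUD_RATE_PER_HOUR:.2f}/h -> {cheaper}")
```

With these placeholder inputs the break-even point lands at 50% utilization, which is why the answer flips between the 40% and 60–70% bands. Substitute your own amortized costs before drawing conclusions.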


SLURM and Job Scheduling

Q: What is SLURM, and why do so many HPC clusters use it?

SLURM (Simple Linux Utility for Resource Management) is an open-source HPC workload manager. It handles:

  • Accepting job submissions from users
  • Queuing jobs and deciding scheduling order based on priority, fairshare, and resources
  • Allocating nodes to jobs and launching applications on them
  • Tracking resource usage for billing and capacity planning

SLURM is commonly cited as running on more than 60% of the Top500 supercomputers. The reasons: it is open source (no license cost), highly configurable, scalable to millions of cores, and supported by an enormous ecosystem of integrations.

Q: A job has been pending in the queue for 4 hours. Why?

Start by asking SLURM why the job is pending, then match the reported reason against the common causes below:

    # See why a specific job is pending
    squeue -j <jobid> --start
    scontrol show job <jobid> | grep Reason

| SLURM Pending Reason | Meaning |
|---|---|
| Resources | Requested resources not currently available |
| Priority | Resources available but higher-priority job is first in line |
| QOSMaxCpuPerUserLimit | User has hit their CPU quota |
| PartitionNodeLimit | Requesting more nodes than partition allows |
| InvalidQOS | QOS not assigned to user’s account |
| Dependency | Job depends on another job that hasn’t finished |
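
If you triage pending jobs often, the lookup can be scripted. The Python sketch below assumes the `Reason=` field appears in `scontrol show job` output in the usual `key=value` form. The sample line, the function names, and the dictionary are illustrative, not part of SLURM itself.

```python
import re
from typing import Optional

# Explanations copied from the table above; keys are lowercased so the lookup
# tolerates differences in capitalization.
PENDING_REASONS = {
    "resources": "Requested resources not currently available",
    "priority": "Resources available but higher-priority job is first in line",
    "qosmaxcpuperuserlimit": "User has hit their CPU quota",
    "partitionnodelimit": "Requesting more nodes than partition allows",
    "invalidqos": "QOS not assigned to user's account",
    "dependency": "Job depends on another job that hasn't finished",
}


def pending_reason(scontrol_output: str) -> Optional[str]:
    """Extract the value of the Reason= field, or None if it is absent."""
    match = re.search(r"\bReason=(\S+)", scontrol_output)
    return match.group(1) if match else None


def explain(scontrol_output: str) -> str:
    """Return a one-line explanation for the pending reason in the output."""
    reason = pending_reason(scontrol_output)
    if reason is None:
        return "No Reason field found"
    meaning = PENDING_REASONS.get(
        reason.lower(), "not in the table; check the SLURM documentation"
    )
    return f"{reason}: {meaning}"


# Hypothetical sample line in the style of `scontrol show job <jobid>` output
SAMPLE = "   JobState=PENDING Reason=Resources Dependency=(null)"
print(explain(SAMPLE))
```

On a real cluster you would feed this the text captured from `scontrol show job <jobid>`, for example via `subprocess.run(..., capture_output=True, text=True)`.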

Network Technologies

Q: Why do HPC clusters use InfiniBand instead of Ethernet?

Three reasons: latency, bandwidth, and CPU overhead.

Latency: InfiniBand delivers 1–2 µs end-to-end latency. Standard TCP/IP over Ethernet typically delivers 50–200 µs, depending on the NICs, switches, and kernel network stack. For MPI applications that synchronize thousands of times per second, this difference is decisive.

Bandwidth: InfiniBand HDR delivers 200 Gb/s per port; NDR delivers 400 Gb/s. Standard 100 GbE can compete on bandwidth per dollar, but not on latency.

CPU overhead (RDMA): InfiniBand uses RDMA (Remote Direct Memory Access), which bypasses the operating system and CPU for data transfers. Standard TCP/IP Ethernet requires the CPU to process every packet, consuming cores that could otherwise do computation.
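
To see why the latency gap matters, consider a deliberately simplified model in which every iteration of a solver does a fixed amount of computation and then waits for one synchronization. The Python sketch below uses the latency figures quoted above. The 200 µs of compute per iteration is a hypothetical value, and the model ignores bandwidth, overlap of communication with computation, and the internals of MPI collectives.

```python
def parallel_efficiency(compute_us: float, latency_us: float) -> float:
    """Share of each iteration spent computing when every iteration ends in one sync."""
    if compute_us <= 0 or latency_us < 0:
        raise ValueError("compute_us must be positive and latency_us non-negative")
    return compute_us / (compute_us + latency_us)


# Hypothetical workload: 200 microseconds of compute between synchronizations,
# i.e. up to 5,000 iterations per second with a zero-latency network.
COMPUTE_US = 200.0

# Latency figures taken from the text above
NETWORKS = {
    "InfiniBand (2 us)": 2.0,
    "Ethernet, best case (50 us)": 50.0,
    "Ethernet, worst case (200 us)": 200.0,
}

for name, latency_us in NETWORKS.items():
    efficiency = parallel_efficiency(COMPUTE_US, latency_us)
    print(f"{name:<32} {efficiency:.1%} of time spent computing")
```

Under these assumptions InfiniBand keeps about 99% of each iteration productive, while Ethernet drops to between 80% and 50%. Applications that synchronize less often are proportionally less sensitive.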

Q: What is RoCE, and when should we use it instead of InfiniBand?

RoCE (RDMA over Converged Ethernet) brings RDMA capability to Ethernet infrastructure. Use RoCE when:

  • You have existing 25/100 GbE infrastructure you want to reuse
  • Budget for InfiniBand is unavailable
  • Workloads are loosely coupled (not tight MPI)

RoCE requires careful configuration of PFC (Priority Flow Control) and ECN (Explicit Congestion Notification). Poorly configured RoCE can perform worse than standard TCP Ethernet. Correctly configured RoCE delivers performance close to InfiniBand at lower cost.


Storage Design

Q: What is a parallel filesystem, and do I need one?

A parallel filesystem (BeeGFS, Lustre, GPFS) allows multiple compute nodes to read and write the same files simultaneously at high speed. Each node communicates directly with multiple storage servers, achieving aggregate bandwidth that scales with the number of servers.

You need a parallel filesystem if:

  • Multiple nodes need to read shared input data simultaneously
  • Jobs write output to shared directories
  • Aggregate I/O bandwidth exceeds what a single NAS can deliver (typically >1 GB/s)

NFS is sufficient for home directories and small data sets. It becomes a bottleneck when many nodes simultaneously access the same files.
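
A quick way to apply the bandwidth criterion is to estimate aggregate demand and compare it with the roughly 1 GB/s ceiling mentioned above. The Python sketch below is a back-of-the-envelope check. The per-node I/O rates are hypothetical and should be replaced with measurements from your own applications.

```python
def aggregate_demand_gb_s(nodes: int, per_node_mb_s: float) -> float:
    """Total I/O demand in GB/s if every node reads or writes at the same time."""
    return nodes * per_node_mb_s / 1000.0


def needs_parallel_filesystem(
    nodes: int, per_node_mb_s: float, single_nas_limit_gb_s: float = 1.0
) -> bool:
    """True when simultaneous I/O demand exceeds what a single NAS can deliver."""
    return aggregate_demand_gb_s(nodes, per_node_mb_s) > single_nas_limit_gb_s


# Hypothetical clusters: (node count, sustained MB/s per node during I/O phases)
for nodes, per_node_mb_s in ((8, 100.0), (100, 50.0)):
    demand = aggregate_demand_gb_s(nodes, per_node_mb_s)
    verdict = "parallel filesystem" if needs_parallel_filesystem(nodes, per_node_mb_s) else "NAS may suffice"
    print(f"{nodes} nodes x {per_node_mb_s:.0f} MB/s = {demand:.1f} GB/s -> {verdict}")
```

The check covers only the bandwidth criterion. Shared input data and shared output directories can justify a parallel filesystem even when the aggregate figure looks modest.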

Q: How much scratch storage do we need?

Rule of thumb: scratch capacity = (average job input data size × number of concurrent jobs) × 3.

The 3× factor accounts for input data, output data, and temporary files generated during the run. For example, if a typical job processes 10 TB of input data and you want to run 10 jobs concurrently, minimum scratch = 10 TB × 10 × 3 = 300 TB.
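
The same rule of thumb as a small Python helper, so you can try it with your own job sizes. The function name and parameters are illustrative, and the 3× multiplier is the heuristic from the text, not a measured value.

```python
def scratch_capacity_tb(
    avg_input_tb: float, concurrent_jobs: int, overhead_factor: float = 3.0
) -> float:
    """Minimum scratch capacity in TB: input size x concurrent jobs x overhead factor."""
    if avg_input_tb < 0 or concurrent_jobs < 0 or overhead_factor <= 0:
        raise ValueError("sizes and job counts must be non-negative, factor positive")
    return avg_input_tb * concurrent_jobs * overhead_factor


# Worked example from the text: 10 TB of input per job, 10 concurrent jobs
print(f"Minimum scratch: {scratch_capacity_tb(10, 10):.0f} TB")
```

If your jobs generate unusually large intermediate files, raise `overhead_factor` rather than trusting the default.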


Security

Q: What are the biggest security risks for HPC clusters?

  1. Shared compute with untrusted users: Unlike cloud VMs, which are separated by a hypervisor, HPC jobs often share nodes and the same operating system kernel. Malicious users can potentially interfere with co-located jobs. Mitigate with cgroup isolation and restricted /proc access.

  2. Outbound data exfiltration: HPC clusters often contain sensitive research data. Egress filtering on the cluster network prevents unauthorized data transfer.

  3. Weak authentication: SSH password authentication is a routine target of brute-force attacks. Require SSH key authentication or certificates for all cluster access.

  4. Supply chain via untrusted containers: Users pulling containers from Docker Hub may inadvertently run malicious images. Implement an approved registry with image scanning.


Installation and Maintenance

Q: How long does it take to deploy an HPC cluster?

For a typical 50–200 node research cluster:

| Phase | Duration |
|---|---|
| Architecture design and procurement | 4–8 weeks |
| Hardware delivery and racking | 2–4 weeks |
| OS installation and software configuration | 1–2 weeks |
| Acceptance testing and benchmark | 1 week |
| Pilot user testing | 2 weeks |
| Total | 10–17 weeks |

Custom GPU clusters, complex storage, or enterprise integration (LDAP, billing) can extend the timeline.
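
The total in the table is simply the sum of the per-phase ranges, which assumes the phases run strictly one after another. The Python sketch below reproduces that arithmetic so you can adjust individual phases for your own project. The dictionary layout is illustrative.

```python
# (minimum weeks, maximum weeks) per phase, taken from the table above
PHASES_WEEKS = {
    "Architecture design and procurement": (4, 8),
    "Hardware delivery and racking": (2, 4),
    "OS installation and software configuration": (1, 2),
    "Acceptance testing and benchmark": (1, 1),
    "Pilot user testing": (2, 2),
}


def total_range(phases: dict) -> tuple:
    """Sum the best-case and worst-case durations, assuming sequential phases."""
    best = sum(low for low, _ in phases.values())
    worst = sum(high for _, high in phases.values())
    return best, worst


best, worst = total_range(PHASES_WEEKS)
print(f"Total: {best}-{worst} weeks")
```

To model one of the extensions mentioned above, add an entry (for example, a hypothetical "LDAP and billing integration" phase) and re-run the sum.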

Q: What ongoing maintenance does an HPC cluster require?

Monthly:

  • Security patch application (OS, SLURM, drivers)
  • Capacity utilization review
  • SLURM fairshare weight adjustment based on actual usage

Quarterly:

  • Hardware inspection (dust accumulation, cable checks)
  • Backup restore test
  • User account audit

Annually:

  • InfiniBand cable stress testing
  • Benchmark re-run vs. baseline (detect performance degradation)
  • DR failover test
  • Capacity planning review

Cloud vs. On-Premise

Q: Cloud providers offer “HPC” instances. Are they equivalent to on-premise HPC?

Cloud HPC instances offer genuine HPC capability but with differences:

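| Aspect | Cloud HPC | On-Premise HPC |
|---|---|---|
| Low-latency interconnect | Available on select instances (e.g., AWS EFA, Azure HDR InfiniBand) | Standard (InfiniBand) |
| Startup time | 3–10 minutes per node | Already running |
| Performance consistency | Variable (shared infra) | Consistent |
| Data locality | Transfer costs / latency | Local |
| Cost at sustained load | 3–5× on-premise | Lower |
| Elasticity | Elastic (subject to quotas and instance availability) | Fixed |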

Cloud HPC is excellent for: burst demand, temporary projects, proof-of-concept, workloads with infrequent peaks. On-premise is better for: sustained high utilization, data-intensive workloads, latency-sensitive MPI, regulatory requirements.


Have a question not answered here? Contact the Mevasis team for a direct technical consultation.