HPC FAQ: Everything You Need to Know About HPC Infrastructure

This FAQ addresses the most common questions organizations ask when first approaching HPC infrastructure — from fundamental definitions to procurement decisions and operational concerns.

What Is HPC?

Q: What exactly is HPC, and how is it different from an enterprise server?

HPC (High Performance Computing) refers to computing systems designed for sustained, parallel workloads that exceed the capacity of any single server. The distinction from enterprise servers is architectural:

Enterprise servers optimize for transaction throughput, high availability, and mixed workloads. A typical 2-socket server handles web requests, database queries, and virtualization efficiently.
HPC clusters optimize for peak floating-point performance, memory bandwidth, and inter-node communication. A 100-node HPC cluster can run a single simulation across all nodes simultaneously.

Key differences:

Aspect	Enterprise Server	HPC Cluster
Scaling model	Vertical (bigger server)	Horizontal (more nodes)
Interconnect	10/25 GbE	InfiniBand (100–400 Gb/s)
Storage	SAN / NAS	Parallel filesystem (BeeGFS/Lustre)
Job management	VM scheduler	HPC job scheduler (SLURM)
Optimization target	Throughput/availability	Peak compute/bandwidth

Q: What kinds of problems require HPC?

HPC is necessary when:

The computation cannot fit on a single server (model too large, simulation too complex)
The computation is time-sensitive and must complete in hours, not weeks
Multiple researchers need to run simultaneous workloads on shared infrastructure
The dataset is too large to move (analysis must come to the data)

Typical domains: computational fluid dynamics (CFD), molecular dynamics, finite element analysis (FEA), machine learning / AI training, genomics/bioinformatics, seismic processing, weather modeling, financial Monte Carlo simulation.

Buy vs. Rent

Q: Should we buy our own HPC cluster or use cloud HPC?

Both approaches have valid use cases. Key decision factors:

Factor	On-Premise	Cloud HPC
Capital requirement	High upfront	Minimal
Unit cost at scale	3–5× cheaper per FLOP at sustained load	Convenience premium
Data sovereignty	Full control	Data leaves premises
Peak flexibility	Fixed capacity	Elastic
Latency-sensitive MPI	InfiniBand available	Cloud network may limit
Total utilization	Pay for capacity even when idle; must size for peaks	Pay only for what you use

Rule of thumb: If your utilization will exceed 60–70% consistently, on-premise is more cost-effective. Below 40%, cloud or leased HPC is usually cheaper when total cost of ownership is calculated honestly.

Hybrid approach (on-premise baseline + cloud burst for peaks) is increasingly common and often optimal.
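
The utilization rule of thumb can be made concrete with a small break-even calculation. The sketch below is illustrative only: the function names, the annual TCO figure, and the cloud price are hypothetical placeholders, not vendor quotes. It assumes on-premise cost is fixed per node per year regardless of use, while cloud cost scales with the node-hours actually consumed.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def break_even_utilization(onprem_annual_tco_per_node, cloud_price_per_node_hour):
    """Utilization (0..1) above which owning a node costs less than renting one."""
    cloud_cost_at_full_use = cloud_price_per_node_hour * HOURS_PER_YEAR
    return onprem_annual_tco_per_node / cloud_cost_at_full_use


def cheaper_option(utilization, onprem_annual_tco_per_node, cloud_price_per_node_hour):
    """Return 'on-premise' or 'cloud' for an expected utilization between 0 and 1."""
    threshold = break_even_utilization(
        onprem_annual_tco_per_node, cloud_price_per_node_hour
    )
    return "on-premise" if utilization > threshold else "cloud"


if __name__ == "__main__":
    # Hypothetical inputs: 35,040 all-in TCO per node-year, 8.00 per cloud node-hour.
    print(break_even_utilization(35040, 8.0))
    print(cheaper_option(0.70, 35040, 8.0))
    print(cheaper_option(0.30, 35040, 8.0))
```

The break-even point moves with every input, so the TCO figure should include power, cooling, staff, and hardware refresh before it is compared with a cloud price.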

SLURM and Job Scheduling

Q: What is SLURM, and why do so many HPC clusters use it?

SLURM (Simple Linux Utility for Resource Management) is an open-source HPC workload manager. It handles:

Accepting job submissions from users
Queuing jobs and deciding scheduling order based on priority, fairshare, and resources
Allocating nodes to jobs and launching applications on them
Tracking resource usage for billing and capacity planning

SLURM runs on roughly 60% of the Top500 supercomputers. The main reasons: it is open source (no license cost), highly configurable, scalable to millions of cores, and supported by a large ecosystem of integrations.
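
To make the submission step concrete, the sketch below renders a minimal batch script of the kind a user would hand to `sbatch`. The `#SBATCH` options shown (`--job-name`, `--nodes`, `--ntasks-per-node`, `--time`) are standard SLURM options, but the helper name `render_batch_script`, the job name, and the application command are hypothetical.

```python
def render_batch_script(job_name, nodes, ntasks_per_node, time_limit, command):
    """Build the text of a minimal SLURM batch script."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={ntasks_per_node}",
        f"#SBATCH --time={time_limit}",  # wall-clock limit, HH:MM:SS
        "",
        f"srun {command}",  # srun launches the tasks on the allocated nodes
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical job: 4 nodes, 32 MPI ranks per node, 2-hour limit.
    print(render_batch_script("cfd_run", 4, 32, "02:00:00", "./my_simulation"))
```

Saving the output to a file and passing it to `sbatch` would queue the job; real scripts usually also name a partition and an account, which are site-specific.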

Q: A job has been pending in the queue for 4 hours. Why?

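Start by asking SLURM why the job is pending, then match the reason against the common causes below:

# Show the scheduler's estimated start time for the job
squeue -j <jobid> --start
# Show the pending reason recorded for the job
scontrol show job <jobid> | grep Reason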

SLURM Pending Reason	Meaning
`Resources`	Requested resources not currently available
`Priority`	One or more higher-priority jobs are ahead of it in the queue
`QOSMaxCpuPerUserLimit`	User has hit the per-user CPU limit of their QOS
`PartitionNodeLimit`	Requesting more nodes than partition allows
`InvalidQOS`	QOS not assigned to user’s account
`Dependency`	Job depends on another job that hasn’t finished
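
The lookup in the table above can be automated. The sketch below pulls the `Reason=` field out of `scontrol show job` output and attaches the matching explanation. The helper names and the sample output line are illustrative assumptions; the lookup is case-insensitive because reason spellings should be checked against the SLURM documentation for your version.

```python
import re

# Explanations from the table above, keyed by lower-cased reason name.
REASON_HINTS = {
    "resources": "Requested resources not currently available",
    "priority": "Higher-priority jobs are ahead in the queue",
    "qosmaxcpuperuserlimit": "User has hit their CPU quota",
    "partitionnodelimit": "Requesting more nodes than partition allows",
    "invalidqos": "QOS not assigned to user's account",
    "dependency": "Job depends on another job that hasn't finished",
}


def pending_reason(scontrol_output):
    """Return the value of the Reason= field, or None if it is absent."""
    match = re.search(r"\bReason=(\S+)", scontrol_output)
    return match.group(1) if match else None


def explain(scontrol_output):
    """Return 'Reason: explanation' for the job described by scontrol_output."""
    reason = pending_reason(scontrol_output)
    if reason is None:
        return "No Reason field found"
    hint = REASON_HINTS.get(reason.lower(), "not in this table; see the SLURM docs")
    return f"{reason}: {hint}"


if __name__ == "__main__":
    # Illustrative fragment of `scontrol show job <jobid>` output.
    sample = "   JobState=PENDING Reason=Priority Dependency=(null)"
    print(explain(sample))
```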

Network Technologies

Q: Why do HPC clusters use InfiniBand instead of Ethernet?

Three reasons: latency, bandwidth, and CPU overhead.

Latency: InfiniBand delivers roughly 1–2 µs end-to-end latency. TCP/IP over standard Ethernet is typically in the 50–200 µs range. For MPI applications that synchronize thousands of times per second, this difference is decisive.

Bandwidth: InfiniBand HDR delivers 200 Gb/s per port; NDR delivers 400 Gb/s. Standard 100 GbE is cost-competitive on bandwidth but not on latency.

CPU overhead (RDMA): InfiniBand uses RDMA (Remote Direct Memory Access), which bypasses the operating system and CPU for data transfers. Standard TCP/IP Ethernet requires the CPU to process every packet, consuming cores that could otherwise do computation.
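
A quick calculation shows why the latency gap matters for tightly synchronized codes. The sketch below is a deliberately simplified model: it assumes every synchronization costs exactly one network latency and that none of that wait overlaps with computation. The function name and the synchronization rate of 5,000 per second are assumptions chosen for illustration.

```python
def sync_overhead_fraction(syncs_per_second, latency_us):
    """Fraction of each wall-clock second spent waiting on network latency."""
    return syncs_per_second * latency_us * 1e-6


if __name__ == "__main__":
    syncs = 5000  # hypothetical MPI synchronizations per second
    for label, latency_us in [("InfiniBand", 2), ("Ethernet, low end", 50), ("Ethernet, high end", 200)]:
        fraction = sync_overhead_fraction(syncs, latency_us)
        print(f"{label}: {fraction:.0%} of wall-clock time spent waiting")
```

Under these assumptions the application waits about 1% of the time at 2 µs, about 25% at 50 µs, and is fully latency-bound at 200 µs.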

Q: What is RoCE, and when should we use it instead of InfiniBand?

RoCE (RDMA over Converged Ethernet) brings RDMA capability to Ethernet infrastructure. Use RoCE when:

You have existing 25/100 GbE infrastructure you want to reuse
Budget for InfiniBand is unavailable
Workloads are loosely coupled (not tightly coupled MPI)

RoCE requires careful configuration of PFC (Priority Flow Control) and ECN (Explicit Congestion Notification), because RDMA performance degrades sharply when the network drops packets. Poorly configured RoCE can perform worse than standard TCP Ethernet. Correctly configured RoCE delivers performance close to InfiniBand at lower cost.

Storage Design

Q: What is a parallel filesystem, and do I need one?

A parallel filesystem (BeeGFS, Lustre, GPFS) allows multiple compute nodes to read and write the same files simultaneously at high speed. Each node communicates directly with multiple storage servers, achieving aggregate bandwidth that scales with the number of servers.

You need a parallel filesystem if:

Multiple nodes need to read shared input data simultaneously
Jobs write output to shared directories
Aggregate I/O bandwidth exceeds what a single NAS can deliver (typically >1 GB/s)

NFS is sufficient for home directories and small data sets. It becomes a bottleneck when many nodes simultaneously access the same files.
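
Because aggregate bandwidth scales with the number of storage servers, a first-pass server count follows from a simple division. The sketch below assumes ideal linear scaling, which real deployments only approximate; the function name and the bandwidth figures are hypothetical.

```python
import math


def storage_servers_needed(required_gb_per_s, per_server_gb_per_s):
    """Minimum storage servers to reach a target aggregate bandwidth,
    assuming each server contributes its full bandwidth (ideal scaling)."""
    return math.ceil(required_gb_per_s / per_server_gb_per_s)


if __name__ == "__main__":
    # Hypothetical target: 20 GB/s aggregate, 3 GB/s sustained per storage server.
    print(storage_servers_needed(20, 3))  # 7
```

Treat the result as a lower bound, since metadata load, network limits, and redundancy all push the real number up.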

Q: How much scratch storage do we need?

Rule of thumb: scratch capacity = (average job input data size × number of concurrent jobs) × 3.

The 3× factor accounts for input data, output data, and temporary files generated during the run, assuming each is roughly the size of the input. If your average job processes 10 TB of input data and you want 10 concurrent jobs, minimum scratch = 10 TB × 10 × 3 = 300 TB.
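
The rule of thumb translates directly into a one-line calculation. The function name below is illustrative, and the default factor of 3 is the one from the rule above; adjust it if your jobs produce much more or much less output and temporary data than they read.

```python
def scratch_capacity_tb(avg_input_tb, concurrent_jobs, factor=3):
    """Minimum scratch capacity in TB: input size x concurrent jobs x factor."""
    return avg_input_tb * concurrent_jobs * factor


if __name__ == "__main__":
    print(scratch_capacity_tb(10, 10))  # 300
```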

Security

Q: What are the biggest security risks for HPC clusters?

Shared compute with untrusted users: Unlike cloud VMs, which are separated by a hypervisor, HPC jobs from different users often run on the same node under the same OS kernel. A malicious user can potentially observe or interfere with co-located jobs. Mitigate with cgroup isolation and restricted /proc access.
Outbound data exfiltration: HPC clusters often contain sensitive research data. Egress filtering on the cluster network prevents unauthorized data transfer.
Weak authentication: SSH password authentication is routinely brute-forced. Require SSH key authentication or certificates for all cluster access.
Supply chain via untrusted containers: Users pulling containers from Docker Hub may inadvertently run malicious images. Implement an approved registry with image scanning.
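
The authentication item is easy to audit automatically. The sketch below checks whether the text of an `sshd_config` file disables password logins. It is a simplified check under stated assumptions: it looks only at the global `PasswordAuthentication` directive, stops at the first `Match` block, and does not follow `Include` files or evaluate other authentication methods. The function name is hypothetical.

```python
def password_auth_disabled(sshd_config_text):
    """Return True if the first global PasswordAuthentication directive is 'no'.

    sshd uses the first value it obtains for a keyword, and OpenSSH allows
    password authentication when the directive is absent.
    """
    for raw_line in sshd_config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.replace("=", " ").split()
        keyword = parts[0].lower()  # keywords are case-insensitive
        if keyword == "match":
            break  # conditional sections are not evaluated in this sketch
        if keyword == "passwordauthentication" and len(parts) > 1:
            return parts[1].lower() == "no"
    return False


if __name__ == "__main__":
    sample = "# cluster login node\nPubkeyAuthentication yes\nPasswordAuthentication no\n"
    print(password_auth_disabled(sample))  # True
```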

Installation and Maintenance

Q: How long does it take to deploy an HPC cluster?

For a typical 50–200 node research cluster:

Phase	Duration
Architecture design and procurement	4–8 weeks
Hardware delivery and racking	2–4 weeks
OS installation and software configuration	1–2 weeks
Acceptance testing and benchmark	1 week
Pilot user testing	2 weeks
Total	10–17 weeks

Custom GPU clusters, complex storage, or enterprise integration (LDAP, billing) can extend the timeline.
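
The total in the table is the sum of the phase ranges, which assumes the phases run strictly one after another. The sketch below reproduces that arithmetic so the estimate can be re-run with your own durations; the variable and function names are illustrative.

```python
# (phase, minimum weeks, maximum weeks), taken from the table above
PHASES = [
    ("Architecture design and procurement", 4, 8),
    ("Hardware delivery and racking", 2, 4),
    ("OS installation and software configuration", 1, 2),
    ("Acceptance testing and benchmark", 1, 1),
    ("Pilot user testing", 2, 2),
]


def total_weeks(phases):
    """Return (minimum, maximum) total duration for sequential phases."""
    shortest = sum(low for _, low, _ in phases)
    longest = sum(high for _, _, high in phases)
    return shortest, longest


if __name__ == "__main__":
    print(total_weeks(PHASES))  # (10, 17)
```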

Q: What ongoing maintenance does an HPC cluster require?

Monthly:

Security patch application (OS, SLURM, drivers)
Capacity utilization review
SLURM fairshare weight adjustment based on actual usage

Quarterly:

Hardware inspection (dust accumulation, cable checks)
Backup restore test
User account audit

Annually:

InfiniBand cable stress testing
Benchmark re-run vs. baseline (detect performance degradation)
DR failover test
Capacity planning review
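
The annual benchmark comparison can be reduced to a simple threshold check. The sketch below assumes a higher-is-better metric (for example, benchmark throughput) and flags a run that falls more than a chosen tolerance below the baseline. The function names, the 5% tolerance, and the sample numbers are assumptions for illustration, not recommended values.

```python
def performance_drop(baseline, current):
    """Relative drop versus baseline for a higher-is-better metric (0.10 = 10% lower)."""
    return (baseline - current) / baseline


def is_degraded(baseline, current, tolerance=0.05):
    """True if the current result is more than `tolerance` below the baseline."""
    return performance_drop(baseline, current) > tolerance


if __name__ == "__main__":
    # Hypothetical benchmark scores recorded at acceptance and one year later.
    print(is_degraded(baseline=1000.0, current=920.0))  # True
    print(is_degraded(baseline=1000.0, current=980.0))  # False
```

Recording the baseline during acceptance testing, on the same benchmark configuration, is what makes the later comparison meaningful.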

Cloud vs. On-Premise

Q: Cloud providers offer “HPC” instances. Are they equivalent to on-premise HPC?

Cloud HPC instances offer genuine HPC capability but with differences:

Aspect	Cloud HPC	On-Premise HPC
Low-latency interconnect	Provider-dependent (e.g., AWS EFA, InfiniBand HDR on Azure)	InfiniBand standard
Startup time	3–10 minutes per node	Already running
Performance consistency	Variable (shared infra)	Consistent
Data locality	Transfer costs / latency	Local
Cost at sustained load	3–5× on-premise	Lower
Elasticity	High (subject to quotas and regional capacity)	Fixed

Cloud HPC is excellent for: burst demand, temporary projects, proof-of-concept, workloads with infrequent peaks. On-premise is better for: sustained high utilization, data-intensive workloads, latency-sensitive MPI, regulatory requirements.

Have a question not answered here? Contact the Mevasis team for a direct technical consultation.
