What Is HPC? High Performance Computing Explained
What is HPC (High Performance Computing)? FLOPS scale, parallel computing, cluster nodes, networking, job schedulers, application domains, Amdahl's law, and cloud vs. on-premise comparison.
High Performance Computing (HPC) refers to computing systems and practices that aggregate substantial computational resources to solve problems that a standard workstation or server cannot handle in a practical timeframe. HPC systems range from small departmental clusters with a few dozen nodes to the world’s largest supercomputers operating at exascale (10¹⁸ floating-point operations per second).
Measuring Computational Scale: FLOPS
The standard performance metric for HPC systems is FLOPS — Floating-Point Operations Per Second:
| Scale | Notation | Example |
|---|---|---|
| GigaFLOPS | 10⁹ | High-end gaming PC (CPU) |
| TeraFLOPS | 10¹² | NVIDIA A100 GPU |
| PetaFLOPS | 10¹⁵ | World-class national supercomputer |
| ExaFLOPS | 10¹⁸ | Frontier (Oak Ridge), Aurora (Argonne) |
A 256-node cluster with dual 96-core AMD EPYC processors achieves approximately 3–5 PetaFLOPS of double-precision (FP64) compute. The world’s fastest supercomputer (Frontier at Oak Ridge) achieves over 1 ExaFLOPS — roughly 200 times more.
Parallel Computing: The Core Concept
HPC is fundamentally about parallelism — dividing a large computation into smaller parts that execute simultaneously on many processors. Three types of parallelism are used:
Data parallelism: The same operation is applied to different portions of the data simultaneously. Example: 1 billion particles in a molecular dynamics simulation, divided into 1 million groups of 1000 particles each, with each group processed by a different processor.
Task parallelism: Different operations (which may be independent) execute simultaneously. Example: genome sequencing pipeline with read alignment, variant calling, and quality filtering running concurrently.
Pipeline parallelism: A sequence of operations where each stage processes a stream of data while passing results to the next stage. Example: assembly line model for neural network training.
HPC Cluster Components
A modern HPC cluster consists of several distinct node types:
Login nodes: The entry point for users. Researchers SSH into login nodes, prepare data, write job scripts, and submit jobs to the scheduler. Login nodes must never be used for production computation.
Compute nodes: Where the actual scientific computation runs. Typically homogeneous within a partition — same CPU model, same memory, same network interface. A 128-node cluster has 128 compute nodes that together form a shared computing pool.
GPU nodes: Specialized compute nodes with NVIDIA or AMD GPUs for accelerated workloads (AI training, molecular dynamics, CFD with GPU solvers). May have 4–8 GPUs per node connected via NVLink.
Management node: Runs the cluster control software (SLURM controller, monitoring, authentication). Not involved in computation directly.
Storage nodes: Run the parallel filesystem (BeeGFS, Lustre). Provide shared storage that all compute nodes access simultaneously.
High-Speed Interconnects
Compute nodes communicate via specialized high-speed networks:
InfiniBand: The dominant HPC interconnect. HDR InfiniBand delivers 200 Gb/s per port at ~1 µs latency. NDR InfiniBand delivers 400 Gb/s. The latency advantage (100× lower than Ethernet) is critical for MPI parallel applications that synchronize thousands of times per second.
RDMA (Remote Direct Memory Access): InfiniBand enables RDMA, where a node writes directly to another node’s memory without involving the remote CPU or OS. This eliminates protocol overhead and is essential for tight MPI communication.
Ethernet: Used for management, storage (in some configurations), and cloud HPC. 25 GbE or 100 GbE is common. RoCE (RDMA over Converged Ethernet) brings RDMA capability to Ethernet infrastructure.
Job Schedulers: SLURM
HPC clusters use job schedulers to manage which jobs run on which nodes when. The dominant scheduler is SLURM (Simple Linux Utility for Resource Management), used on over 60% of the Top500 supercomputers.
A simple SLURM job script:
#!/bin/bash
#SBATCH --job-name=climate_model
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=64
#SBATCH --mem=256G
#SBATCH --time=48:00:00
#SBATCH --partition=compute
module load openmpi/4.1.6
module load openfoam/v2312
cd $SLURM_SUBMIT_DIR
mpirun -np 2048 climateFoam -parallel
The user submits this script with sbatch. SLURM queues the job, waits until 32 nodes with 64 cores each (2048 total MPI processes) are available, allocates them, starts the simulation, and releases the nodes when it completes.
Application Domains
HPC enables computations that would take decades on a laptop to complete in hours:
Computational Fluid Dynamics (CFD): Simulating airflow over aircraft wings, blood flow through heart valves, weather patterns over entire continents.
Molecular Dynamics (MD): Simulating the physical movement of proteins, DNA, and drug molecules at atomic scale. Fundamental to drug discovery and materials science.
Finite Element Analysis (FEA): Stress analysis of bridges, crash simulation for automotive safety, seismic response of buildings.
Machine Learning / AI: Training large language models (LLaMA, GPT-4 class) and computer vision systems. The current dominant workload on GPU clusters.
Genomics / Bioinformatics: Whole genome sequencing, variant calling, transcriptomics analysis. Scale of computation grows faster than sequencing throughput.
Seismic Processing: Subsurface imaging for oil and gas exploration and earthquake hazard assessment.
Climate and Weather Modeling: Numerical weather prediction, decadal climate simulations, hurricane tracking.
Amdahl’s Law: The Limits of Parallelism
Amdahl’s Law states that the speedup achievable by parallelizing a program is limited by its sequential fraction:
Speedup = 1 / ((1 - P) + P/N)
Where:
P = fraction of code that can be parallelized
N = number of processors
Example: If 95% of a simulation can be parallelized (P = 0.95):
At N = 100: Speedup = 1 / (0.05 + 0.95/100) = 16.8×
At N = 1000: Speedup = 1 / (0.05 + 0.95/1000) = 19.8×
At N → ∞: Speedup approaches 20× (limited by the 5% serial fraction)
This has profound implications: a code with 5% sequential code cannot achieve more than 20× speedup regardless of how many processors are used. Code optimization (reducing the serial fraction) is often more impactful than buying more hardware.
Cloud HPC vs. On-Premise
Both approaches have their place:
| Aspect | On-Premise HPC | Cloud HPC |
|---|---|---|
| Upfront cost | High (CapEx) | Low |
| Long-term cost | 3–5× cheaper at sustained use | Convenience premium |
| Performance | Full control, InfiniBand native | Good, network may limit |
| Data control | Full control | Requires trust in provider |
| Elasticity | Fixed capacity | Scales to thousands of nodes instantly |
| Setup time | Weeks to months | Hours |
The economically optimal answer for most organizations with predictable demand > 60% utilization is on-premise for the baseline and cloud for occasional bursts.
HPC is the infrastructure that enables discovery at the frontier of science and engineering. Whether you are planning your first cluster or expanding existing infrastructure, the fundamental choices — processor architecture, interconnect, storage, scheduler — determine the return on every computing dollar spent for the next 5 years. Contact Mevasis to discuss your HPC requirements.