What Is HPC? High Performance Computing Explained

High Performance Computing (HPC) refers to computing systems and practices that aggregate substantial computational resources to solve problems that a standard workstation or server cannot handle in a practical timeframe. HPC systems range from small departmental clusters with a few dozen nodes to the world’s largest supercomputers operating at exascale (10¹⁸ floating-point operations per second).

Measuring Computational Scale: FLOPS

The standard performance metric for HPC systems is FLOPS — Floating-Point Operations Per Second:

Scale	Notation	Example
GigaFLOPS	10⁹	High-end gaming PC (CPU)
TeraFLOPS	10¹²	NVIDIA A100 GPU
PetaFLOPS	10¹⁵	World-class national supercomputer
ExaFLOPS	10¹⁸	Frontier (Oak Ridge), Aurora (Argonne)

A 256-node cluster with dual 96-core AMD EPYC processors achieves approximately 3–5 PetaFLOPS of double-precision (FP64) compute. The world’s fastest supercomputer (Frontier at Oak Ridge) achieves over 1 ExaFLOPS — roughly 200 times more.

Parallel Computing: The Core Concept

HPC is fundamentally about parallelism — dividing a large computation into smaller parts that execute simultaneously on many processors. Three types of parallelism are used:

Data parallelism: The same operation is applied to different portions of the data simultaneously. Example: 1 billion particles in a molecular dynamics simulation, divided into 1 million groups of 1000 particles each, with each group processed by a different processor.

Task parallelism: Different operations (which may be independent) execute simultaneously. Example: genome sequencing pipeline with read alignment, variant calling, and quality filtering running concurrently.

Pipeline parallelism: A sequence of operations where each stage processes a stream of data while passing results to the next stage. Example: assembly line model for neural network training.

HPC Cluster Components

A modern HPC cluster consists of several distinct node types:

Login nodes: The entry point for users. Researchers SSH into login nodes, prepare data, write job scripts, and submit jobs to the scheduler. Login nodes must never be used for production computation.

Compute nodes: Where the actual scientific computation runs. Typically homogeneous within a partition — same CPU model, same memory, same network interface. A 128-node cluster has 128 compute nodes that together form a shared computing pool.

GPU nodes: Specialized compute nodes with NVIDIA or AMD GPUs for accelerated workloads (AI training, molecular dynamics, CFD with GPU solvers). May have 4–8 GPUs per node connected via NVLink.

Management node: Runs the cluster control software (SLURM controller, monitoring, authentication). Not involved in computation directly.

Storage nodes: Run the parallel filesystem (BeeGFS, Lustre). Provide shared storage that all compute nodes access simultaneously.

High-Speed Interconnects

Compute nodes communicate via specialized high-speed networks:

InfiniBand: The dominant HPC interconnect. HDR InfiniBand delivers 200 Gb/s per port at ~1 µs latency. NDR InfiniBand delivers 400 Gb/s. The latency advantage (100× lower than Ethernet) is critical for MPI parallel applications that synchronize thousands of times per second.

RDMA (Remote Direct Memory Access): InfiniBand enables RDMA, where a node writes directly to another node’s memory without involving the remote CPU or OS. This eliminates protocol overhead and is essential for tight MPI communication.

Ethernet: Used for management, storage (in some configurations), and cloud HPC. 25 GbE or 100 GbE is common. RoCE (RDMA over Converged Ethernet) brings RDMA capability to Ethernet infrastructure.

Job Schedulers: SLURM

HPC clusters use job schedulers to manage which jobs run on which nodes when. The dominant scheduler is SLURM (Simple Linux Utility for Resource Management), used on over 60% of the Top500 supercomputers.

A simple SLURM job script:

#!/bin/bash
#SBATCH --job-name=climate_model
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=64
#SBATCH --mem=256G
#SBATCH --time=48:00:00
#SBATCH --partition=compute

module load openmpi/4.1.6
module load openfoam/v2312

cd $SLURM_SUBMIT_DIR
mpirun -np 2048 climateFoam -parallel

The user submits this script with sbatch. SLURM queues the job, waits until 32 nodes with 64 cores each (2048 total MPI processes) are available, allocates them, starts the simulation, and releases the nodes when it completes.

Application Domains

HPC enables computations that would take decades on a laptop to complete in hours:

Computational Fluid Dynamics (CFD): Simulating airflow over aircraft wings, blood flow through heart valves, weather patterns over entire continents.

Molecular Dynamics (MD): Simulating the physical movement of proteins, DNA, and drug molecules at atomic scale. Fundamental to drug discovery and materials science.

Finite Element Analysis (FEA): Stress analysis of bridges, crash simulation for automotive safety, seismic response of buildings.

Machine Learning / AI: Training large language models (LLaMA, GPT-4 class) and computer vision systems. The current dominant workload on GPU clusters.

Genomics / Bioinformatics: Whole genome sequencing, variant calling, transcriptomics analysis. Scale of computation grows faster than sequencing throughput.

Seismic Processing: Subsurface imaging for oil and gas exploration and earthquake hazard assessment.

Climate and Weather Modeling: Numerical weather prediction, decadal climate simulations, hurricane tracking.

Amdahl’s Law: The Limits of Parallelism

Amdahl’s Law states that the speedup achievable by parallelizing a program is limited by its sequential fraction:

Speedup = 1 / ((1 - P) + P/N)

Where:
  P = fraction of code that can be parallelized
  N = number of processors

Example: If 95% of a simulation can be parallelized (P = 0.95):
  At N = 100:  Speedup = 1 / (0.05 + 0.95/100) = 16.8×
  At N = 1000: Speedup = 1 / (0.05 + 0.95/1000) = 19.8×
  At N → ∞:   Speedup approaches 20× (limited by the 5% serial fraction)

This has profound implications: a code with 5% sequential code cannot achieve more than 20× speedup regardless of how many processors are used. Code optimization (reducing the serial fraction) is often more impactful than buying more hardware.

Cloud HPC vs. On-Premise

Both approaches have their place:

Aspect	On-Premise HPC	Cloud HPC
Upfront cost	High (CapEx)	Low
Long-term cost	3–5× cheaper at sustained use	Convenience premium
Performance	Full control, InfiniBand native	Good, network may limit
Data control	Full control	Requires trust in provider
Elasticity	Fixed capacity	Scales to thousands of nodes instantly
Setup time	Weeks to months	Hours

The economically optimal answer for most organizations with predictable demand > 60% utilization is on-premise for the baseline and cloud for occasional bursts.

HPC is the infrastructure that enables discovery at the frontier of science and engineering. Whether you are planning your first cluster or expanding existing infrastructure, the fundamental choices — processor architecture, interconnect, storage, scheduler — determine the return on every computing dollar spent for the next 5 years. Contact Mevasis to discuss your HPC requirements.