HPC Energy Efficiency: DVFS, Power Management, Cooling, and PUE Optimization
HPC energy efficiency strategies: energy cost impact analysis, DVFS with cpupower, nvidia-smi power limits, SLURM prolog/epilog for automatic CPU governor switching, energy-aware scheduling, dark-hours strategy, PUE explanation, hot/cold aisle containment, liquid cooling, and IPMI monitoring.
A 512-node HPC cluster with 64-core nodes drawing 500W each consumes 256 kW at full load. At $0.10/kWh, 24/7 operation, that is $224,256 per year in electricity costs — before cooling overhead. Energy efficiency is not just an environmental consideration; it directly affects the economic viability of HPC infrastructure and the amount of computing that can be run within a given power budget.
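The cost figure above can be reproduced with a few lines of shell. This is a minimal sketch that assumes constant full load all year; the function name `annual_energy_cost` is made up for illustration.

```shell
#!/bin/sh
# Annual electricity cost for a cluster running at constant load.
# Arguments: node count, watts per node, price per kWh in dollars.
annual_energy_cost() {
    awk -v n="$1" -v w="$2" -v p="$3" 'BEGIN {
        kw  = n * w / 1000         # cluster draw in kW
        kwh = kw * 24 * 365        # energy per year in kWh
        printf "%.0f\n", kwh * p   # yearly cost in dollars
    }'
}

annual_energy_cost 512 500 0.10   # prints 224256
```

Changing the third argument shows how sensitive the budget is to the tariff: every additional cent per kWh adds about $22,400 per year for this cluster.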
Energy Cost Analysis
Understanding where power goes is the first step:
| Component | Typical Power per Node | 64-Node Cluster Total |
|---|---|---|
| CPU (dual socket, full load) | 250–700 W | 16–45 kW |
| GPU (H100, per card × 4) | 700 W × 4 = 2,800 W | 179 kW |
| Memory (DDR5, 512 GB) | ~40 W | 2.5 kW |
| NIC/HCA (2× HDR IB) | ~25 W | 1.6 kW |
| Storage (NVMe × 4) | ~25 W | 1.6 kW |
| Idle overhead | ~30% | varies |
For GPU clusters, the GPU is the dominant power consumer (60–80% of node power at full load). For CPU-only clusters, the CPUs account for 50–70%.
DVFS: Dynamic Voltage and Frequency Scaling
DVFS reduces CPU power by operating at lower voltage and frequency when full performance is not needed. On Linux, the cpupower tool manages CPU frequency governors:
# Check available governors
cpupower frequency-info | grep -E "governor|frequency"
# Set to performance mode for compute jobs (maximum frequency)
cpupower frequency-set -g performance
# Set to powersave mode for idle nodes (lowest frequency with acpi-cpufreq; load-driven scaling with intel_pstate)
cpupower frequency-set -g powersave
# Check current frequency across all cores
grep "cpu MHz" /proc/cpuinfo | awk '{sum += $4; n++} END {print "avg:", sum/n, "MHz"}'
Compared with the performance governor, the powersave governor can cut idle power by 20–40%, with a 0–5% performance impact on memory-bandwidth-bound or I/O-bound workloads, which do not run at peak CPU frequency anyway.
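To confirm which governor each core is actually using, the kernel's cpufreq sysfs files can be read directly. The sketch below is an illustration only: `governor_summary` is a hypothetical helper, and its optional argument (an alternative sysfs root) exists purely so the function can be tried against a test directory.

```shell
#!/bin/sh
# Print how many cores use each governor, for example "64 performance".
# Optional argument: sysfs root (hypothetical parameter, handy for testing).
governor_summary() {
    root=${1:-/sys/devices/system/cpu}
    # cpu[0-9]* matches cpu0, cpu1, ... but not the cpufreq or cpuidle directories
    cat "$root"/cpu[0-9]*/cpufreq/scaling_governor 2>/dev/null \
        | sort | uniq -c | awk '{print $1, $2}'
}

governor_summary
```

On a node without cpufreq support the function prints nothing, which is itself a useful signal that governor switching will have no effect there.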
GPU Power Management with nvidia-smi
NVIDIA GPUs support power limits via nvidia-smi:
# Check current power limit and consumption
nvidia-smi --query-gpu=name,power.limit,power.draw --format=csv
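# Set power limit to 80% of TDP for thermal-limited environments (requires root)
# H100 TDP: 700W -> 80% = 560W
# Without -i, the limit applies to every GPU in the system
nvidia-smi -pl 560
# To set GPUs individually, loop over the indices nvidia-smi reports
for i in $(nvidia-smi --query-gpu=index --format=csv,noheader); do
nvidia-smi -i "$i" -pl 560
done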
Setting GPU power limits to 80% typically reduces throughput by 5–10% for memory-bandwidth-bound workloads while reducing power by 20% and temperature by 8–10°C. For thermally constrained installations, this trade-off is often worth it.
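The trade-off can be expressed as throughput per watt. The following sketch works through the arithmetic for the figures quoted above; the function names are hypothetical, and it assumes the workload really draws 20% less power once the cap is applied.

```shell
#!/bin/sh
# Power limit in watts for a given percentage of TDP.
gpu_power_limit() {   # args: tdp_watts percent
    awk -v tdp="$1" -v pct="$2" 'BEGIN { printf "%.0f\n", tdp * pct / 100 }'
}

# Relative change in throughput per watt, in percent.
perf_per_watt_gain() {   # args: throughput_loss_percent power_reduction_percent
    awk -v loss="$1" -v cut="$2" 'BEGIN {
        printf "%.2f\n", ((100 - loss) / (100 - cut) - 1) * 100
    }'
}

gpu_power_limit 700 80      # prints 560
perf_per_watt_gain 10 20    # prints 12.50
perf_per_watt_gain 5 20     # prints 18.75
```

Under these assumptions, a 5–10% throughput loss at 20% lower power still yields roughly 12–19% more work per watt.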
SLURM Prolog/Epilog for Automatic Power Management
Automate CPU frequency management with SLURM job hooks:
#!/bin/bash
# /usr/local/sbin/slurm-job-prolog.sh
# Runs as root on compute node when a job starts
# Switch to performance governor when a job starts
cpupower frequency-set -g performance > /dev/null 2>&1
# Enable Turbo Boost (intel_pstate) or frequency boost (acpi-cpufreq)
if [ -w /sys/devices/system/cpu/intel_pstate/no_turbo ]; then
echo 0 > /sys/devices/system/cpu/intel_pstate/no_turbo
elif [ -w /sys/devices/system/cpu/cpufreq/boost ]; then
echo 1 > /sys/devices/system/cpu/cpufreq/boost
fi
exit 0
#!/bin/bash
# /usr/local/sbin/slurm-job-epilog.sh
# Runs as root on compute node after a job ends
# Return to powersave when the node becomes idle
# (assumes one job per node; on shared nodes other jobs may still be running)
cpupower frequency-set -g powersave > /dev/null 2>&1
exit 0
# slurm.conf
Prolog=/usr/local/sbin/slurm-job-prolog.sh
Epilog=/usr/local/sbin/slurm-job-epilog.sh
PrologFlags=Alloc
This ensures nodes run at full speed during jobs and conserve power between jobs without any user involvement.
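The hooks above discard cpupower's output, so a governor the cpufreq driver does not offer would go unnoticed. One way to make the switch verifiable is to write to sysfs directly and count the cores that accepted the change. This is a sketch under assumptions: `apply_governor` is a hypothetical helper, and its second argument (an alternative sysfs root) is there only for testing.

```shell
#!/bin/sh
# Apply a governor to every core via sysfs, but only where the driver offers it.
# Args: governor name, optional sysfs root (hypothetical parameter for testing).
# Prints the number of cores that were switched.
apply_governor() {
    gov=$1
    root=${2:-/sys/devices/system/cpu}
    changed=0
    for dir in "$root"/cpu[0-9]*/cpufreq; do
        # Skip cores without cpufreq support (or an unmatched glob)
        [ -r "$dir/scaling_available_governors" ] || continue
        if grep -qw -- "$gov" "$dir/scaling_available_governors"; then
            echo "$gov" > "$dir/scaling_governor" && changed=$((changed + 1))
        fi
    done
    echo "$changed"
}

# Example use in a prolog (needs root):
#   switched=$(apply_governor performance)
#   [ "$switched" -gt 0 ] || logger "prolog: no core accepted the performance governor"
```

A count of zero tells the prolog that the node cannot honor the request, which is easier to alert on than a silently ignored cpupower call.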
Energy-Aware Scheduling
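SLURM’s --hint=nomultithread option does not switch Hyperthreading off in hardware; it places at most one task per physical core and leaves the sibling hardware threads idle. For codes that do not benefit from HT, this can reduce power consumption by 10–15%: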
# Submit job without hyperthreading
sbatch --hint=nomultithread my_simulation.sh
# For jobs that don't need all cores on a node, pack jobs
# to leave some nodes fully idle rather than all nodes partially loaded
# slurm.conf (core-level allocation lets several jobs share a node)
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
Node consolidation: Rather than spreading jobs across many partially loaded nodes, configure SLURM to pack jobs onto fewer nodes and power down the unused ones. This requires hardware support for remote power control, for example via IPMI:
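# SLURM power management with IPMI
SuspendProgram=/usr/local/sbin/suspend-node-ipmi.sh
ResumeProgram=/usr/local/sbin/resume-node-ipmi.sh
# Idle seconds before a node is suspended (example value; power saving stays off while SuspendTime is unset)
SuspendTime=1800
SuspendTimeout=90
ResumeTimeout=300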
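SLURM calls the suspend and resume programs with a hostlist expression (such as node[01-04]) as the first argument. A suspend script could look roughly like the sketch below. Several things here are assumptions: the `-ipmi` host suffix is borrowed from the monitoring example later in this article, `IPMI_USER` and `DRY_RUN` are hypothetical variables, and the BMC password is expected in the `IPMI_PASSWORD` environment variable, which ipmitool reads when given `-E`.

```shell
#!/bin/sh
# Send one chassis power action ("soft" or "on") to a node's BMC.
# With DRY_RUN set, only print the command that would be run.
ipmi_power() {   # args: action node
    if [ -n "${DRY_RUN:-}" ]; then
        echo "ipmitool -I lanplus -H ${2}-ipmi -U ${IPMI_USER:-admin} -E chassis power $1"
    else
        ipmitool -I lanplus -H "${2}-ipmi" -U "${IPMI_USER:-admin}" -E chassis power "$1"
    fi
}

# Expand the hostlist SLURM passes in and shut each node down gracefully.
suspend_nodes() {   # arg: hostlist expression, e.g. node[01-04]
    for node in $(scontrol show hostnames "$1"); do
        ipmi_power soft "$node"
    done
}

# A suspend script would end with:   suspend_nodes "$1"
# A resume script would instead call: ipmi_power on "$node"
```

Running with `DRY_RUN=1` first is a cheap way to check the host naming scheme before any node is actually powered off.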
Dark-Hours Strategy
If the workload profile allows, schedule energy-intensive batch workloads during off-peak electricity hours (often around 10 PM to 6 AM, depending on the tariff). SLURM’s reservation system enables this:
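# Create a recurring daily reservation for the off-peak batch window
# StartTime plus Duration defines the window, so EndTime is not given as well
# Users (or Accounts) must name who may use the reservation; adjust to your site
scontrol create reservation \
ReservationName=off-peak-batch \
StartTime=22:00:00 \
Duration=08:00:00 \
Nodes=ALL \
Flags=DAILY \
Users=ALL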
In markets with time-of-use electricity pricing, shifting 40% of compute to off-peak hours can reduce electricity cost by 25–30%.
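How much shifting saves depends on the gap between peak and off-peak prices. The sketch below assumes the whole load was previously billed at the peak rate, so the saving is the shifted share times the relative price difference. The function name and the off-peak prices are hypothetical.

```shell
#!/bin/sh
# Percent cost reduction from moving a share of energy use to off-peak hours.
# Assumes all energy was previously billed at the peak rate.
offpeak_savings_pct() {   # args: shifted_share_percent peak_price offpeak_price
    awk -v f="$1" -v peak="$2" -v off="$3" 'BEGIN {
        printf "%.1f\n", f * (1 - off / peak)
    }'
}

offpeak_savings_pct 40 0.10 0.03   # prints 28.0
offpeak_savings_pct 40 0.10 0.06   # prints 16.0
```

Under this model, a 25–30% saving from a 40% shift requires an off-peak price of roughly 25–37.5% of the peak price; with a smaller gap the saving shrinks accordingly.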
PUE: Power Usage Effectiveness
PUE measures data center efficiency:
PUE = Total Facility Power / IT Equipment Power
PUE = 1.0: Perfect (all power goes to IT)
PUE = 1.5: Typical datacenter (50% overhead for cooling, power)
PUE = 2.0: Inefficient (equal power for IT and overhead)
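Reducing PUE from 2.0 to 1.3 for a 1 MW IT load cuts total facility power from 2.0 MW to 1.3 MW, a saving of 700 kW. Put differently, the original 2.0 MW budget could power about 1.54 MW of IT load at PUE 1.3, roughly 54% more compute for the same electricity bill.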
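The PUE arithmetic is easy to script when comparing scenarios. In this sketch the function names are hypothetical; the extra compute that fits into a fixed facility budget is computed as old PUE divided by new PUE, minus one.

```shell
#!/bin/sh
# Total facility power in kW for a given IT load and PUE.
facility_kw() {   # args: it_load_kw pue
    awk -v it="$1" -v pue="$2" 'BEGIN { printf "%.0f\n", it * pue }'
}

# Additional IT load (in percent) that fits into the same facility
# power budget after a PUE improvement.
extra_compute_pct() {   # args: old_pue new_pue
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (old / new - 1) * 100 }'
}

facility_kw 1000 2.0        # prints 2000
facility_kw 1000 1.3        # prints 1300
extra_compute_pct 2.0 1.3   # prints 53.8
```

The difference between the first two results is the 700 kW saving; the third shows the compute headroom if the facility budget is held constant instead.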
Cooling Efficiency
Hot/cold aisle containment: Separate hot exhaust air from cold supply air with physical containment barriers. Prevents hot air recirculation, which forces CRAC units to over-cool. A properly contained aisle can reduce cooling energy by 20–30%.
[Cold aisle] [server fronts] → hot air → [Hot aisle] → CRAC return
cold air ← CRAC supply ←
Raised inlet temperature: ASHRAE recommends server inlet temperatures of 18–27°C and allows up to 32°C for class A1 or 35°C for class A2 equipment. Many sites cool to 18°C by default, wasting cooling capacity. Raising the cooling setpoint from 18°C to 25°C can reduce cooling energy by 30–40%.
Liquid cooling for GPU nodes: DGX H100 and HGX H100 configurations generate 1,000–4,000 W per rack unit. Air cooling at this density requires extremely high airflow, which is noisy and energy-intensive. Liquid cooling (rear-door heat exchangers or direct-to-chip cooling loops) removes heat close to the source and can use 40–60% less energy than equivalent air cooling.
Monitoring Power with IPMI
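# Read node power consumption via IPMI
ipmitool -H node01 -U admin -P password sdr type Power
# Log cluster-wide power to time series database
# sinfo -N prints a node once per partition, so deduplicate the list
for node in $(sinfo -N --noheader -o "%N" | sort -u); do
# The reading column looks like "182 Watts": take the number, not the unit
power=$(ipmitool -H ${node}-ipmi -U admin -P password sdr type Power 2>/dev/null \
| grep "Pwr Consumption" | awk '{print $(NF-1)}')
# Skip nodes that returned no reading
[ -n "$power" ] || continue
# One grouping key per node, so a push does not replace another node's metric
echo "hpc_node_power_watts{node=\"$node\"} $power" | \
curl -s --data-binary @- "http://pushgateway:9091/metrics/job/ipmi_power/instance/$node"
done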
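Sensor output varies between vendors, and a sensor may report "No Reading", so it can help to validate the value before pushing it. The sketch below is one possible parser, not a vendor-specific implementation: `sdr_watts` is a hypothetical helper, and the sample line only illustrates the pipe-separated layout that `ipmitool sdr type` typically prints.

```shell
#!/bin/sh
# Extract the wattage for one sensor from ipmitool SDR lines on stdin.
# Prints nothing if the sensor is missing or has no numeric reading.
sdr_watts() {   # arg: sensor name (defaults to the one used in the loop above)
    awk -F'|' -v name="${1:-Pwr Consumption}" '
        index($1, name) == 1 {
            split($NF, reading, " ")   # last column looks like " 182 Watts"
            if (reading[1] ~ /^[0-9]+(\.[0-9]+)?$/) print reading[1]
        }'
}

# Illustrative sample line, not captured from a real node
echo "Pwr Consumption  | 77h | ok  |  7.1 | 182 Watts" | sdr_watts   # prints 182
```

Because the function prints nothing for non-numeric readings, the caller can keep a simple emptiness check before pushing the metric.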
Integrating power metrics into Grafana alongside SLURM job metrics enables correlation between workload type and power consumption — invaluable for TCO modeling and energy budget planning.
Energy efficiency in HPC is achievable without sacrificing performance. DVFS, GPU power limits, SLURM automation, and cooling optimization together can reduce total energy consumption by 30–50% compared to an untuned cluster. Contact Mevasis for HPC energy efficiency assessment and implementation.