HPC Cluster Deployment Checklist: Installation to Production

A checklist is not a sign of distrust in expertise; it is how aviation, surgery, and nuclear power guard against catastrophic failures. HPC cluster deployment has enough independent steps across hardware, network, storage, and software that skipping even one can cause hours of debugging or force a complete rebuild. Use this checklist for every new cluster deployment and every major upgrade.

Phase 1: Pre-Installation Planning

Architecture and Design

Workload analysis completed: application types, job sizes, I/O patterns documented
Node count, memory, and storage sized to workload requirements
Network topology designed (fat-tree port count, InfiniBand vs. RoCE decision)
Storage tier design complete (scratch, project, archive capacities)
Management and HA strategy defined (slurmctld HA, storage mirroring)
IP address plan and VLAN assignment documented
Security requirements reviewed (compliance, data classification)
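
The IP address plan and VLAN assignment can be kept as a small script, so the documented plan is reproducible and easy to review. The following is a minimal sketch using Python's standard ipaddress module. The 10.20.0.0/16 supernet, the segment names, the VLAN IDs, and the /20 subnet size are all hypothetical values chosen for illustration, not recommendations from this checklist.

```python
import ipaddress

# Hypothetical supernet and VLAN IDs, chosen only for illustration.
SUPERNET = ipaddress.ip_network("10.20.0.0/16")
SEGMENTS = [("management", 10), ("compute", 20), ("storage", 30), ("bmc", 40)]


def build_ip_plan(supernet, segments, prefix=20):
    """Assign one equally sized subnet per segment, in the order given."""
    subnets = supernet.subnets(new_prefix=prefix)
    plan = {}
    for (name, vlan_id), subnet in zip(segments, subnets):
        plan[name] = {
            "vlan": vlan_id,
            "subnet": subnet,
            # First address after the network address, used as gateway by convention here.
            "gateway": subnet.network_address + 1,
            # Exclude the network and broadcast addresses.
            "usable_hosts": subnet.num_addresses - 2,
        }
    return plan


plan = build_ip_plan(SUPERNET, SEGMENTS)
for name, entry in plan.items():
    print(f"{name:<11} VLAN {entry['vlan']:<3} {entry['subnet']}  gw {entry['gateway']}")
```

Keeping the plan in version control alongside the SLURM configuration makes later address changes auditable.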

Physical Infrastructure

Rack space reserved and labeled
Power circuits provisioned and PDUs ordered
Network cabling plan reviewed with datacenter team
Cooling capacity verified for peak GPU/CPU thermal load
Physical access and delivery scheduling confirmed
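
The cooling check comes down to simple arithmetic: sum the peak power draw of all nodes, convert it to heat load, and compare it against the available cooling capacity with some headroom. The sketch below assumes a hypothetical inventory and a 20% headroom figure; substitute the vendor's measured or rated peak power for your hardware.

```python
WATTS_TO_BTU_PER_HOUR = 3.412  # 1 W of IT load is roughly 3.412 BTU/h of heat

# Hypothetical inventory: (node type, node count, peak watts per node).
INVENTORY = [
    ("cpu-node", 32, 800),
    ("gpu-node", 8, 6500),
    ("storage-node", 4, 1200),
]


def peak_heat_load(inventory):
    """Return (total_watts, total_btu_per_hour) at peak draw."""
    total_watts = sum(count * watts for _, count, watts in inventory)
    return total_watts, total_watts * WATTS_TO_BTU_PER_HOUR


def cooling_ok(inventory, cooling_capacity_kw, headroom=0.2):
    """True if cooling capacity covers peak load plus the given headroom."""
    total_watts, _ = peak_heat_load(inventory)
    required_kw = total_watts / 1000 * (1 + headroom)
    return cooling_capacity_kw >= required_kw


watts, btu = peak_heat_load(INVENTORY)
print(f"Peak load: {watts / 1000:.1f} kW ({btu:,.0f} BTU/h)")
print("100 kW cooling sufficient:", cooling_ok(INVENTORY, 100))
```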

Phase 2: Hardware Installation

Server Installation

Servers racked and mounted per rack diagram
Power cables connected and redundant PSU verified
Network cables connected per cabling plan and labeled
InfiniBand cables connected and transceivers verified
BIOS/UEFI updated to latest vendor-recommended version
BIOS settings optimized: NUMA, IOMMU, SR-IOV, C-states, Turbo as appropriate
IPMI/BMC configured with static IP and admin credentials
Remote console access verified on all nodes

Network Hardware

InfiniBand switches powered and cabled
Switch firmware updated to recommended version
Ethernet switches powered and uplinks verified
Management VLAN, compute VLAN, storage VLAN configured on switches
MTU 9000 (jumbo frames) enabled on compute and storage switch ports, with matching MTU on the attached host interfaces
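
Jumbo frames only help when the MTU matches end to end, so it is worth checking the host side as well as the switch ports. The sketch below parses the one-line-per-interface output of ip -o link show and reports interfaces whose MTU differs from the expected value. The sample output and interface names are illustrative, and how the output is collected from each node (SSH, a parallel shell, configuration management) is left open.

```python
import re

# Illustrative, shortened `ip -o link show` output; interface names are hypothetical.
SAMPLE = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
3: ens2f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP
"""

# Matches "<index>: <name>[@peer]: <FLAGS> mtu <value>" at the start of a line.
LINK_RE = re.compile(
    r"^\d+:\s+([^:@\s]+)(?:@\S+)?:\s+<[^>]*>\s+mtu\s+(\d+)", re.MULTILINE
)


def parse_mtus(ip_link_output):
    """Map interface name to its MTU."""
    return {name: int(mtu) for name, mtu in LINK_RE.findall(ip_link_output)}


def wrong_mtu(ip_link_output, interfaces, expected=9000):
    """Return {interface: actual_mtu} for interfaces not at the expected MTU.

    An interface missing from the output is reported with a value of None.
    """
    mtus = parse_mtus(ip_link_output)
    return {
        name: mtus.get(name) for name in interfaces if mtus.get(name) != expected
    }


print(wrong_mtu(SAMPLE, ["eno1", "ens2f0"]))
```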

Phase 3: Software Configuration

Base OS

Operating system installed on all nodes via PXE/network boot
OS version identical across all compute nodes
NTP/Chrony configured and synchronized (< 100 ms offset from source)
SSH configured: key authentication only, password authentication disabled, SSH host keys distributed
/etc/hosts or DNS entries for all cluster nodes verified
Firewall rules configured: only required ports open between network segments
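
The time synchronization item can be checked mechanically on nodes that run chrony. The sketch below parses the "System time" line from chronyc tracking output and compares the offset against the 100 ms limit from the checklist. The sample output, including the reference host name, is illustrative, and the exact output layout should be confirmed against the chrony version in use.

```python
import re

# Illustrative excerpt of `chronyc tracking` output; the reference host is hypothetical.
SAMPLE_TRACKING = """\
Reference ID    : C0A80001 (ntp1.example.internal)
Stratum         : 3
System time     : 0.000215000 seconds slow of NTP time
Last offset     : -0.000180000 seconds
Leap status     : Normal
"""

SYSTEM_TIME_RE = re.compile(
    r"^System time\s*:\s*([0-9.]+) seconds (fast|slow) of NTP time", re.MULTILINE
)


def system_offset_seconds(tracking_output):
    """Return the signed clock offset in seconds (positive means fast)."""
    match = SYSTEM_TIME_RE.search(tracking_output)
    if match is None:
        raise ValueError("no 'System time' line found in chronyc output")
    magnitude = float(match.group(1))
    return magnitude if match.group(2) == "fast" else -magnitude


def within_tolerance(tracking_output, limit_seconds=0.100):
    """True if the absolute offset is below the limit (100 ms by default)."""
    return abs(system_offset_seconds(tracking_output)) < limit_seconds


print("offset:", system_offset_seconds(SAMPLE_TRACKING), "s")
print("within 100 ms:", within_tolerance(SAMPLE_TRACKING))
```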

Authentication

LDAP or FreeIPA server deployed and functional
All cluster nodes joined to LDAP/FreeIPA domain
User creation, home directory provisioning tested
MUNGE installed and identical key distributed to all nodes
MUNGE service running on all nodes: munge -n | unmunge passes
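
A mismatched MUNGE key on a single node is a common cause of authentication failures, so comparing key digests across nodes is a cheap extra check alongside munge -n | unmunge. The sketch below assumes the SHA-256 digest of each node's key has already been collected (for example with sha256sum over the key file, run as root) and flags nodes that differ from the majority. The node names and key bytes are hypothetical.

```python
import hashlib
from collections import Counter


def key_digest(key_bytes):
    """SHA-256 hex digest of a key's raw bytes."""
    return hashlib.sha256(key_bytes).hexdigest()


def find_key_mismatches(digests):
    """Return the sorted nodes whose digest differs from the most common one.

    `digests` maps node name to hex digest. With an exact tie, whichever
    digest was seen first is treated as the reference.
    """
    if not digests:
        return []
    majority, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(node for node, digest in digests.items() if digest != majority)


# Hypothetical collected digests: node003 carries a stale key.
good = key_digest(b"example-cluster-key")
stale = key_digest(b"old-key-from-previous-install")
collected = {"node001": good, "node002": good, "node003": stale, "node004": good}

print("nodes with a different key:", find_key_mismatches(collected))
```

Comparing digests rather than the keys themselves avoids copying key material around during the check.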

SLURM

SLURM packages installed: slurmctld, slurmd, slurmdbd, munge
/etc/slurm/slurm.conf reviewed and validated by starting slurmctld in the foreground (slurmctld -D -v) and checking the log output for configuration errors
Partition definitions match intended workload profiles
slurmctld running on controller node
slurmd running on all compute nodes: sinfo shows all nodes IDLE
slurmdbd running and connected to controller
Test job submission: sbatch --wrap="hostname" runs and completes
GRES (GPU) configuration verified: sinfo -o "%N %G" shows correct GPU count
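
On a cluster with more than a handful of nodes, reading sinfo output by eye is error-prone. The sketch below parses node-oriented output, as produced by sinfo -N -h -o "%N %T", and lists every node that is not in the idle state. The sample output and node names are illustrative.

```python
# Illustrative output of: sinfo -N -h -o "%N %T"
# (node names are hypothetical; a trailing * marks a non-responding node)
SAMPLE_SINFO = """\
node001 idle
node002 idle
node003 down*
node004 drained
gpu001 idle
"""


def nodes_not_idle(sinfo_output):
    """Return {node: state} for every node whose state is not exactly 'idle'."""
    problems = {}
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        node, state = parts
        if state != "idle":
            problems[node] = state
    return problems


print(nodes_not_idle(SAMPLE_SINFO))
```

A node that belongs to several partitions appears once per partition in this output; the dictionary collapses those duplicates.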

Phase 4: Storage Verification

Parallel Filesystem (BeeGFS/Lustre)

Management service running and accessible (mgmtd for BeeGFS, MGS for Lustre)
All metadata servers registered (BeeGFS: beegfs-ctl --listnodes --nodetype=meta)
All storage servers registered (BeeGFS: beegfs-ctl --listnodes --nodetype=storage)
Filesystem mounted on all compute nodes: df -h /mnt/scratch shows expected size
Write permission test: all users can write to their scratch directory
Quota system configured and enforced (where applicable)
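
The write permission test is easiest to repeat when it is scripted. The sketch below writes a small probe file into a directory, reads it back, and removes it again. It assumes the script is run as the user whose access is being tested; the probe file name is arbitrary.

```python
import os
import tempfile
import uuid


def can_write(directory):
    """Create, read back, and remove a small probe file in `directory`."""
    probe = os.path.join(directory, f".write-probe-{uuid.uuid4().hex}")
    payload = b"scratch write test\n"
    try:
        with open(probe, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # push the data to the filesystem
        with open(probe, "rb") as handle:
            return handle.read() == payload
    except OSError:
        return False
    finally:
        try:
            os.remove(probe)
        except OSError:
            pass  # probe was never created or is already gone


# Demonstration against a temporary directory rather than a real scratch mount.
with tempfile.TemporaryDirectory() as demo_dir:
    print("writable:", can_write(demo_dir))
```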

I/O Benchmark

IOR sequential write benchmark run and results recorded
IOR sequential read benchmark run and results recorded
Aggregate bandwidth meets design specification (within 20% of the design target)
mdtest small-file metadata performance verified
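
The bandwidth acceptance criterion is a one-line calculation once the IOR results are recorded. The sketch below reads "within 20%" as "no more than 20% below the design target", which is an assumption about the intended meaning. The design and measured figures are hypothetical.

```python
def shortfall_percent(measured, design):
    """Percentage by which the measured value falls short of the design value."""
    return 100 * (design - measured) / design


def meets_spec(measured, design, tolerance=0.20):
    """True if measured bandwidth is at most `tolerance` below the design target."""
    return measured >= design * (1 - tolerance)


# Hypothetical design targets and IOR results, in GB/s.
results = {
    "sequential write": {"design": 40.0, "measured": 34.0},
    "sequential read": {"design": 50.0, "measured": 36.5},
}

for name, r in results.items():
    verdict = "PASS" if meets_spec(r["measured"], r["design"]) else "FAIL"
    gap = shortfall_percent(r["measured"], r["design"])
    print(f"{name}: {r['measured']} of {r['design']} GB/s ({gap:.1f}% short) {verdict}")
```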

Backup

Backup system configured for home directories and project storage
First backup run completed successfully
Restore test: random file restored from backup and verified
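
"Verified" in the restore test should mean a byte-for-byte comparison, not just that a file with the right name came back. A minimal sketch, assuming the original file is still available for comparison, hashes both copies with SHA-256 and compares the digests. The file paths in the demonstration are temporary files created only for illustration.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so that large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original_path, restored_path):
    """True if the restored file has the same content as the original."""
    return sha256_of(original_path) == sha256_of(restored_path)


# Demonstration with temporary files standing in for an original and its restore.
with tempfile.TemporaryDirectory() as demo_dir:
    original = os.path.join(demo_dir, "original.dat")
    restored = os.path.join(demo_dir, "restored.dat")
    for path in (original, restored):
        with open(path, "wb") as handle:
            handle.write(b"project data\n" * 1000)
    print("restore verified:", restore_matches(original, restored))
```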

Phase 5: Security and Access Control

Network Security

Management network accessible only from admin hosts
Compute nodes not directly accessible from external network
InfiniBand / compute network isolated from management and external
IPMI/BMC management interface on dedicated out-of-band network

Authentication and Authorization

Root SSH login disabled on all nodes
Where key-based root access is retained for administration, only authorized admin keys in root's ~/.ssh/authorized_keys
sudo configured with principle of least privilege
User home directory permissions correct (mode 700 or 750)
/etc/sudoers reviewed and non-default rules documented
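
The home directory permission check can be automated with a short scan. The sketch below walks the immediate subdirectories of a home root and reports any whose mode is not 700 or 750. It assumes all home directories sit directly under one parent directory; the user names in the demonstration are hypothetical.

```python
import os
import stat
import tempfile

ALLOWED_MODES = {0o700, 0o750}


def bad_home_dirs(home_root):
    """Return {directory name: octal mode} for home directories outside the allowed modes."""
    offenders = {}
    with os.scandir(home_root) as entries:
        for entry in entries:
            if not entry.is_dir(follow_symlinks=False):
                continue  # ignore files and symlinks
            mode = stat.S_IMODE(entry.stat(follow_symlinks=False).st_mode)
            if mode not in ALLOWED_MODES:
                offenders[entry.name] = oct(mode)
    return offenders


# Demonstration with a temporary directory standing in for /home.
with tempfile.TemporaryDirectory() as demo_root:
    for user, mode in (("alice", 0o700), ("bob", 0o755), ("carol", 0o750)):
        path = os.path.join(demo_root, user)
        os.mkdir(path)
        os.chmod(path, mode)  # set explicitly so the umask does not interfere
    print(bad_home_dirs(demo_root))
```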

Software Security

All nodes patched to current security errata
Unneeded services disabled (Bluetooth, avahi, cups, etc.)
SELinux or AppArmor policy reviewed; chosen mode (enforcing or permissive) documented
Auditd configured and logging to central syslog

Phase 6: Monitoring and Alerting

Metrics Collection

Node Exporter deployed on all compute nodes
SLURM Exporter deployed on controller node
DCGM Exporter deployed on all GPU nodes
Prometheus scraping all exporters (verify with curl http://prometheus:9090/api/v1/targets, or on the Targets page of the web UI)
Grafana dashboards: cluster overview, GPU detail, SLURM queue, storage I/O
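
Scrape health can be checked programmatically through the Prometheus HTTP API, whose /api/v1/targets endpoint returns JSON with one entry per active target and a health field. The sketch below parses such a response and lists the targets that are not up. The sample response is a trimmed, illustrative payload with hypothetical job and instance names; fetching it from the server is left out so the example stays self-contained.

```python
import json

# Trimmed, illustrative /api/v1/targets response; jobs and instances are hypothetical.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "node", "instance": "node001:9100"},
             "health": "up", "lastError": ""},
            {"labels": {"job": "node", "instance": "node002:9100"},
             "health": "down", "lastError": "connection refused"},
            {"labels": {"job": "dcgm", "instance": "gpu001:9400"},
             "health": "up", "lastError": ""},
        ],
        "droppedTargets": [],
    },
})


def unhealthy_targets(response_text):
    """Return (job, instance, last_error) for every active target that is not 'up'."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError(f"unexpected API status: {payload.get('status')!r}")
    return [
        (t["labels"].get("job"), t["labels"].get("instance"), t.get("lastError", ""))
        for t in payload["data"]["activeTargets"]
        if t.get("health") != "up"
    ]


for job, instance, error in unhealthy_targets(SAMPLE_RESPONSE):
    print(f"{job} {instance}: {error}")
```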

Alert Rules

Alert: node down (unreachable for > 5 minutes)
Alert: GPU temperature > 85°C for > 5 minutes
Alert: GPU ECC double-bit errors detected
Alert: storage target unavailable
Alert: queue pending time > threshold (per cluster SLA)
Alert: disk usage > 85% on management and storage nodes
Alert notification channels (email, Slack, PagerDuty) tested
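
Several of the rules above only fire when a condition holds for a sustained period, such as a temperature above 85°C for more than 5 minutes. The sketch below illustrates that logic in plain Python: the alert fires only if every sample since the start of the current breach exceeds the threshold and the breach has lasted at least the hold time. It is a simplified model for reasoning about thresholds, not how any particular alerting system is implemented, and the sample data is hypothetical.

```python
def sustained_breach(samples, threshold, hold_seconds):
    """True if the value has stayed above `threshold` for at least `hold_seconds`.

    `samples` is a list of (unix_timestamp, value) pairs sorted by time.
    A single sample at or below the threshold resets the breach window.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp
        else:
            breach_start = None
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= hold_seconds


# Hypothetical GPU temperature samples taken every 60 seconds.
steady_hot = list(zip(range(0, 420, 60), [80, 86, 87, 88, 86, 87, 90]))
brief_dip = list(zip(range(0, 420, 60), [86, 87, 80, 88, 89, 90, 91]))

print("steady hot fires:", sustained_breach(steady_hot, 85, 300))
print("brief dip fires:", sustained_breach(brief_dip, 85, 300))
```

The second series shows why the hold time matters: one cooler reading restarts the window, so a brief spike on either side of it does not page anyone.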

Phase 7: Operational Readiness

Documentation

Network diagram (physical and logical) current and stored in shared location
IPMI/BMC credentials in password manager (not in a spreadsheet)
SLURM configuration changes tracked in version control (Git)
Operating runbook: daily checks, common procedures, escalation contacts
User documentation: cluster access, job submission, software modules

Training and Handoff

System administrators trained on cluster procedures
First-line support staff know how to check SLURM queue and node status
User onboarding documentation provided to pilot users
Feedback mechanism established for user issues

Go-Live

Pilot test with 3–5 users and representative workloads for 2 weeks
No open P1 or P2 issues from pilot
Monitoring alerts tested end-to-end (fire alert, verify notification received)
Backup restore tested and documented
SLURM HA failover tested (if configured)
Sign-off from system owner before production announcement

A completed checklist does not guarantee a perfect cluster, but it dramatically reduces the probability of avoidable failures in the first production weeks. Adapt this template to your organization’s specific requirements and compliance obligations.

