
HPC Security: Network Segmentation, SSH Certificates, 2FA, LDAP, and Compliance

Comprehensive HPC security guide: 4-segment network design with iptables rules, SSH certificate authentication, Google Authenticator 2FA with PAM, LDAP integration, tiered encryption, container isolation, SLURM security, auditd monitoring, anomaly detection, and GDPR compliance.

HPC clusters present a unique security challenge: they are shared infrastructure running code from many users, often holding sensitive research data, and accessed from external networks by researchers who may not be security-aware. Unlike a corporate laptop environment, you cannot install endpoint agents on compute nodes without affecting job performance. Security must be designed into the architecture.

Network Segmentation: Four Segments

The foundation of HPC security is physical or VLAN-level network segmentation. Four segments are the minimum:

Segment        | VLAN | Traffic                        | Access
Management     | 10   | IPMI/BMC, PXE boot, admin SSH  | Admin hosts only
Compute/MPI    | 20   | MPI inter-process, RDMA        | Compute nodes only
Storage        | 30   | BeeGFS/Lustre I/O              | Compute + storage nodes
User/External  | 40   | User SSH to login nodes        | Filtered external access

Cross-segment traffic is denied by default. Specific exceptions (management to compute for SLURM, storage to compute for filesystem) are explicitly permitted by firewall rules.
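
The default-deny policy can be written down as data before it is translated into firewall rules. The following is a minimal Python sketch, not a real firewall: the segment names follow the table above, while the `ALLOWED_FLOWS` set and the `is_allowed` helper are illustrative assumptions (in particular, the compute-to-storage direction is assumed to be needed for filesystem I/O).

```python
# Segment names follow the table above; the allow-list itself is illustrative.
SEGMENTS = {"management", "compute", "storage", "user"}

# Explicitly permitted (source, destination) pairs. Everything else is denied.
ALLOWED_FLOWS = {
    ("management", "compute"),  # SLURM control traffic
    ("storage", "compute"),     # parallel filesystem I/O
    ("compute", "storage"),     # assumption: clients also initiate I/O requests
}


def is_allowed(src: str, dst: str) -> bool:
    """Return True if traffic from src to dst is permitted by the policy."""
    if src not in SEGMENTS or dst not in SEGMENTS:
        raise ValueError(f"unknown segment: {src!r} -> {dst!r}")
    if src == dst:
        return True  # traffic inside a segment is not filtered by this policy
    return (src, dst) in ALLOWED_FLOWS


print(is_allowed("management", "compute"))
print(is_allowed("user", "compute"))
```

Keeping the exceptions in one small set makes it easy to review them and to check that every firewall rule corresponds to a documented flow.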

iptables Rules for Login Node

# /etc/iptables/rules.v4 — login node firewall

*filter
:INPUT DROP [0:0]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [0:0]

# Allow established connections
-A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT

# Allow loopback
-A INPUT -i lo -j ACCEPT

# Rate-limit new SSH connections: drop a source on its 4th new connection
# within 60 seconds. These rules must come before the ACCEPT rules below,
# otherwise they are never evaluated.
# (iptables-restore accepts neither line continuations nor trailing comments.)
-A INPUT -p tcp --dport 22 -m conntrack --ctstate NEW -m recent --set --name ssh_ratelimit
-A INPUT -p tcp --dport 22 -m conntrack --ctstate NEW -m recent --update --seconds 60 --hitcount 4 --name ssh_ratelimit -j DROP

# Allow SSH from permitted source ranges only
# Internal networks
-A INPUT -p tcp --dport 22 -s 10.0.0.0/8 -j ACCEPT
# VPN range
-A INPUT -p tcp --dport 22 -s 192.168.1.0/24 -j ACCEPT

# Allow ICMP (ping) from monitoring
-A INPUT -p icmp --icmp-type echo-request -s 10.0.1.10 -j ACCEPT

# Log and drop everything else
-A INPUT -j LOG --log-prefix "DROPPED: " --log-level 6
-A INPUT -j DROP

COMMIT
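
Before loading a ruleset like this, it can help to check which client addresses the SSH allow rules would actually match. The sketch below uses Python's standard `ipaddress` module; the two ranges are copied from the rules above, and the helper name `ssh_source_allowed` is an assumption made for illustration. It models only the source-range match, not rate limiting or connection state.

```python
import ipaddress

# Source ranges copied from the SSH ACCEPT rules above.
ALLOWED_SSH_SOURCES = [
    ipaddress.ip_network("10.0.0.0/8"),      # internal
    ipaddress.ip_network("192.168.1.0/24"),  # VPN range
]


def ssh_source_allowed(addr: str) -> bool:
    """Return True if addr falls inside one of the permitted source ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in ALLOWED_SSH_SOURCES)


for candidate in ("10.20.30.40", "192.168.1.77", "192.168.2.1"):
    print(candidate, ssh_source_allowed(candidate))
```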

Block Compute Nodes from Internet

Compute nodes must not have direct internet access — both for security and to prevent accidental data exfiltration:

# On compute nodes: deny all outbound except loopback and cluster networks
iptables -A OUTPUT -o lo -j ACCEPT
iptables -A OUTPUT -d 10.0.0.0/8 -j ACCEPT
iptables -A OUTPUT -d 172.16.0.0/12 -j ACCEPT
iptables -A OUTPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
iptables -A OUTPUT -j REJECT

SSH Certificate Authentication

SSH public key authentication is good; SSH certificate authentication is better. Certificates enable time-limited access, centralized revocation, and audit trails without distributing authorized_keys files to every node.

# Generate the cluster CA key on a dedicated signing host.
# Keep the private key OFFLINE and protected; only cluster_ca.pub is copied to the nodes.
ssh-keygen -t ed25519 -f /etc/ssh/cluster_ca -C "HPC Cluster CA"

# sshd_config on all cluster nodes — trust cluster CA
TrustedUserCAKeys /etc/ssh/cluster_ca.pub
AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u
PasswordAuthentication no
ChallengeResponseAuthentication no
PubkeyAuthentication yes

# Create per-user authorized_principals file
# Only certificates with this principal can log in as that user
mkdir -p /etc/ssh/auth_principals
echo "login_users" > /etc/ssh/auth_principals/alice

# Issue user certificate (valid for 8 hours)
ssh-keygen -s /etc/ssh/cluster_ca \
  -I "alice@hpc" \
  -n login_users \
  -V "+8h" \
  -z 1 \
  ~/.ssh/id_ed25519.pub

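Users receive a certificate that expires after 8 hours, so a compromised credential stops working within a working day. Routine revocation is a matter of not issuing new certificates; to cut off access before expiry, add the certificate to a key revocation list (built with ssh-keygen -k) and reference it with RevokedKeys in sshd_config.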
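
Certificate issuance is usually scripted so that every certificate gets the same options. A minimal Python sketch follows, assuming it runs on the CA host; `build_sign_command` is a hypothetical helper, and the public key path under /home/alice is illustrative. It only assembles the ssh-keygen arguments shown above and does not sign anything itself.

```python
import shlex


def build_sign_command(ca_key, key_id, principals, validity, serial, pubkey):
    """Build the ssh-keygen argv that signs pubkey with the cluster CA."""
    return [
        "ssh-keygen",
        "-s", ca_key,                # CA private key
        "-I", key_id,                # key identity, recorded in sshd logs
        "-n", ",".join(principals),  # comma-separated principals
        "-V", validity,              # validity interval, e.g. "+8h"
        "-z", str(serial),           # certificate serial number
        pubkey,                      # user's public key to be signed
    ]


cmd = build_sign_command(
    "/etc/ssh/cluster_ca", "alice@hpc", ["login_users"], "+8h", 1,
    "/home/alice/.ssh/id_ed25519.pub",  # illustrative path
)
print(shlex.join(cmd))
# To sign for real, pass cmd to subprocess.run(cmd, check=True) on the CA host.
```

Passing the arguments as a list rather than a shell string avoids quoting problems when identities or principals contain unusual characters. Incrementing the serial for each certificate makes individual certificates easier to identify in logs and revocation lists.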

Two-Factor Authentication with Google Authenticator

For additional protection on login nodes, add TOTP-based 2FA via PAM:

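# Install Google Authenticator PAM module
apt-get install libpam-google-authenticator   # Debian/Ubuntu
yum install google-authenticator              # RHEL (requires the EPEL repository)

# User setup (run as each user)
# -t: time-based codes, -d: disallow code reuse, -f: write the file without prompting,
# -r 3 -R 30: at most 3 login attempts per 30 seconds, -W: keep the minimal code window
google-authenticator -t -d -f -r 3 -R 30 -W

# /etc/pam.d/sshd: add before other auth lines
auth required pam_google_authenticator.so nullok
# nullok: allow access without TOTP during the initial enrollment period;
# remove it once all users have enrolled.
# If a password prompt should not follow the TOTP prompt, disable the password
# stack in this file (for example "@include common-auth" on Debian/Ubuntu).

# /etc/ssh/sshd_config: require both key and TOTP
UsePAM yes
AuthenticationMethods publickey,keyboard-interactive
# Named KbdInteractiveAuthentication in OpenSSH 8.7 and later
ChallengeResponseAuthentication yes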

With this configuration, login requires both a valid SSH key (or certificate) AND a TOTP code from the authenticator app.
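
To make clear what the second factor checks, here is a minimal Python sketch of the TOTP calculation (RFC 6238 with HMAC-SHA1, 30-second steps, 6 digits). It is an illustration, not the PAM module's code: the function names are assumptions, and the verification window of one step on either side is an assumed setting.

```python
import base64
import hashlib
import hmac
import struct
import time


def totp(secret_b32, at=None, step=30, digits=6):
    """Compute the TOTP code for a base32 secret at Unix time `at`."""
    padded = secret_b32.upper() + "=" * (-len(secret_b32) % 8)
    key = base64.b32decode(padded)
    counter = int((time.time() if at is None else at) // step)
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation (RFC 4226)
    value = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


def verify(secret_b32, code, at=None, step=30, window=1):
    """Accept the code for the current step or `window` steps either side."""
    now = time.time() if at is None else at
    return any(
        hmac.compare_digest(totp(secret_b32, now + drift * step, step), code)
        for drift in range(-window, window + 1)
    )


# RFC 6238 test secret (ASCII "12345678901234567890"), base32-encoded.
secret = base64.b32encode(b"12345678901234567890").decode()
print(totp(secret, at=59))  # RFC 6238 test vector for T=59, truncated to 6 digits: 287082
```

Because the code depends only on the shared secret and the clock, login nodes need accurate time synchronization (NTP or chrony) for verification to succeed.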

LDAP Authentication

Centralized user management via LDAP (or FreeIPA, Active Directory) eliminates per-node account management:

# Install and configure sssd (System Security Services Daemon)
yum install sssd sssd-ldap

# /etc/sssd/sssd.conf (must be owned by root with mode 0600, or sssd will not start)
[sssd]
domains = hpc.example.com
services = nss, pam

[domain/hpc.example.com]
id_provider = ldap
auth_provider = ldap
ldap_uri = ldaps://ldap01.hpc.example.com
ldap_search_base = dc=hpc,dc=example,dc=com
ldap_tls_reqcert = demand
ldap_tls_cacert = /etc/ssl/certs/hpc-ca.pem
cache_credentials = true
ldap_default_bind_dn = cn=sssd-reader,dc=hpc,dc=example,dc=com
ldap_default_authtok = <service-account-password>

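SLURM has no LDAP setting of its own: slurmctld and slurmd resolve users and groups through NSS, so sssd must be running on every node that runs a SLURM daemon. Daemon authentication and accounting are configured separately: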

# slurm.conf
AuthType=auth/munge
AccountingStorageType=accounting_storage/slurmdbd

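Map LDAP groups to SLURM accounts so that adding a user to an LDAP group grants cluster access with the appropriate fairshare allocation. SLURM does not perform this mapping on its own; it typically takes a periodic sync job that creates the matching associations with sacctmgr.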
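
One way to keep LDAP groups and SLURM accounts aligned is a periodic job that compares the two and adds the missing associations. The Python sketch below shows only the comparison step under stated assumptions: `plan_sync` is a hypothetical helper, both inputs are plain dictionaries that a real job would fill from LDAP and from sacctmgr output, each LDAP group is assumed to share its name with an existing SLURM account, and the user and account names are made up.

```python
def plan_sync(ldap_groups, slurm_assocs):
    """Compare LDAP group membership with SLURM associations.

    ldap_groups:  {account_name: set of usernames} taken from LDAP
    slurm_assocs: {account_name: set of usernames} taken from sacctmgr
    Returns (commands, stale): sacctmgr argv lists for missing users, and
    (user, account) pairs that exist in SLURM but no longer in LDAP.
    """
    commands, stale = [], []
    for account in sorted(ldap_groups):
        current = slurm_assocs.get(account, set())
        for user in sorted(ldap_groups[account] - current):
            # -i applies the change without an interactive confirmation
            commands.append(
                ["sacctmgr", "-i", "add", "user", f"name={user}", f"account={account}"]
            )
        for user in sorted(current - ldap_groups[account]):
            stale.append((user, account))
    return commands, stale


# Illustrative data only.
ldap = {"physics": {"alice", "bob"}}
slurm = {"physics": {"bob", "carol"}}
commands, stale = plan_sync(ldap, slurm)
print(commands)
print(stale)
```

Stale associations are reported rather than removed automatically, so that an administrator can review them before revoking access.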

Encryption at Rest and in Transit

Data in transit:

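  • SSH (all user sessions): encrypted by default
  • InfiniBand management plane (subnet manager traffic): OpenSM has no TLS option; protect it with management keys (M_Key, SM_Key) and run the subnet manager only on trusted hosts
  • SLURM daemon communications: authenticated by MUNGE credentials; message payloads are not encrypted
  • RDMA traffic carrying classified data: MACsec (IEEE 802.1AE) is an Ethernet standard, so it can protect RoCE links but not native InfiniBand; on InfiniBand, encrypt at the application or storage layer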

Data at rest:

  • Home directories on NAS: use block-level or filesystem-level encryption (LUKS, ZFS native encryption)
  • Sensitive project data: per-directory encryption (fscrypt or gocryptfs), with filesystem ACLs for access control
  • Backups: encrypt before leaving the cluster boundary (GPG or rclone crypt)

# Create an rclone crypt remote on top of the existing hpc-archive remote
rclone config create hpc-encrypted-backup crypt \
  remote=hpc-archive: \
  filename_encryption=standard \
  directory_name_encryption=true \
  password='<strong-password>'

# Sync encrypted
rclone sync /project/sensitive hpc-encrypted-backup:

Container Isolation with Apptainer

Apptainer (formerly Singularity) runs containers as the calling user, so a process inside the container has no more privilege than the user who started it. Additional hardening:

# apptainer.conf
# Setuid mode is required for some functionality. Setting this to "no" makes
# Apptainer rely on unprivileged user namespaces instead (if the kernel supports them).
allow setuid = yes
max loop devices = 256
allow pid ns = yes
config passwd = yes
config group = yes
mount proc = yes
mount sys = yes
mount dev = minimal               # restrict device access
mount home = yes
mount tmp = yes
# Restrict network in containers
allow net networks = none         # deny network namespaces

SLURM Security Configuration

# slurm.conf — security-relevant settings

# Protect job information visibility
PrivateData=jobs,usage,users

# Require authentication for all SLURM communications
AuthType=auth/munge
# CryptoType is the older option name; recent SLURM releases use CredType=cred/munge
CryptoType=crypto/munge

# Restrict job submission
DisableRootJobs=YES          # prevent root from submitting jobs
EnforcePartLimits=ALL        # reject jobs that exceed the limits of any requested partition

# Audit all SLURM events
SlurmctldSyslogDebug=info
SlurmdSyslogDebug=info
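
Settings like these tend to drift between clusters, so it can be worth checking them automatically. The following Python sketch parses slurm.conf text and reports deviations from an expected baseline. The `EXPECTED` values mirror the settings above; the helper names are assumptions, and the parser is deliberately simple: it reads only the first Key=Value pair on each line, which is enough for global options but not for NodeName or PartitionName lines.

```python
# Baseline taken from the settings above; keys are compared case-insensitively.
EXPECTED = {
    "authtype": "auth/munge",
    "disablerootjobs": "yes",
    "enforcepartlimits": "all",
}


def parse_slurm_conf(text):
    """Parse Key=Value lines from slurm.conf text into a lowercase-keyed dict."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip()
    return settings


def check_settings(text, expected=EXPECTED):
    """Return a list of findings; an empty list means every check passed."""
    settings = parse_slurm_conf(text)
    findings = []
    for key, want in expected.items():
        got = settings.get(key)
        if got is None:
            findings.append(f"{key}: not set (expected {want})")
        elif got.lower() != want:
            findings.append(f"{key}: {got} (expected {want})")
    return findings


sample = """
AuthType=auth/munge
DisableRootJobs=NO   # should be YES
"""
for finding in check_settings(sample):
    print(finding)
```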

Auditd Monitoring

Linux auditd provides kernel-level audit logging:

# /etc/audit/rules.d/hpc-security.rules

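# Log every sshd execution (in most OpenSSH versions sshd re-executes itself
# for each incoming connection). Login results are also recorded by PAM as
# USER_AUTH and USER_LOGIN events.
-a always,exit -F arch=b64 -S execve -F path=/usr/sbin/sshd -k ssh_login

# Log setuid/setgid calls made in user login sessions (auid >= 1000);
# filtering on auid excludes daemons and keeps the log volume manageable
-a always,exit -F arch=b64 -S setuid -S setgid -F auid>=1000 -F auid!=4294967295 -k privilege_escalation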
-w /etc/passwd -p wa -k user_modification
-w /etc/shadow -p wa -k user_modification
-w /etc/group -p wa -k group_modification

# Log SLURM admin commands
-a always,exit -F arch=b64 -S execve -F path=/usr/bin/sacctmgr -k slurm_admin
-a always,exit -F arch=b64 -S execve -F path=/usr/bin/scontrol -k slurm_admin

# Log writes and attribute changes in sensitive configuration directories
-w /etc/slurm/ -p wa -k slurm_config
-w /etc/munge/ -p wa -k munge_config

Forward auditd logs to a centralized SIEM (Graylog, Splunk, Elastic SIEM) for correlation across cluster nodes.
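
Most SIEM pipelines want structured records rather than raw audit lines. As a rough sketch of that conversion, the Python below splits one audit record into a dictionary and serializes it as JSON. The function name and the sample line are illustrative assumptions, and real deployments usually rely on an existing shipper or audisp plugin instead of hand-written parsing.

```python
import json
import re

# Header of an audit record: type, then "audit(<epoch>.<ms>:<serial>):".
HEADER = re.compile(
    r"^type=(?P<type>\S+) msg=audit\((?P<ts>\d+\.\d+):(?P<serial>\d+)\): ?(?P<body>.*)$"
)
# key=value pairs; values are either double-quoted or a run of non-space characters.
FIELD = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_audit_line(line):
    """Parse one audit log line into a dict, or return None if it does not match."""
    m = HEADER.match(line.strip())
    if m is None:
        return None
    record = {
        "type": m["type"],
        "timestamp": float(m["ts"]),
        "serial": int(m["serial"]),
    }
    for key, value in FIELD.findall(m["body"]):
        record[key] = value.strip('"')
    return record


# Illustrative sample line, not taken from a real system.
sample = (
    'type=SYSCALL msg=audit(1700000000.123:456): arch=c000003e syscall=59 '
    'success=yes exit=0 uid=0 auid=1000 comm="scontrol" '
    'exe="/usr/bin/scontrol" key="slurm_admin"'
)
print(json.dumps(parse_audit_line(sample), sort_keys=True))
```

The `key` field carries the `-k` label from the audit rule, which makes it a convenient field to filter and alert on once the records reach the SIEM.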

Anomaly Detection

Configure automated alerts for suspicious activity:

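# Monitor for failed SSH attempts (excessive = brute force)
# The source address is the field after "from"; its position differs between
# valid-user and "invalid user" lines. On RHEL the log file is /var/log/secure.
grep "Failed password" /var/log/auth.log | \
  awk '{ip = ""; for (i = 1; i < NF; i++) if ($i == "from") ip = $(i + 1); if (ip != "") print ip}' | \
  sort | uniq -c | sort -rn | \
  awk '$1 > 10 {print "ALERT: Brute force from " $2, $1, "attempts"}'

# Alert on running SLURM jobs that hold an unusually large number of CPUs
# --allusers: include every user's jobs; --allocations: skip job steps, whose User field is empty
sacct --allusers --allocations --state=RUNNING --noheader --parsable2 \
  --format=JobID,User,NCPUS,Elapsed | \
  awk -F'|' '$3 > 1000 {print "ALERT: Large job by user " $2 " using " $3 " CPUs"}'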
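
The failed-login count can also be done in Python, which makes the threshold and the log format easier to test. This is a minimal sketch: the function name is an assumption, the sample lines are invented and use documentation address ranges, and the pattern covers only sshd "Failed password" messages, locating the address after "from" so that both valid-user and invalid-user lines are handled.

```python
import re
from collections import Counter

# Matches "Failed password for root from 1.2.3.4 port 22 ssh2" and
# "Failed password for invalid user bob from 1.2.3.4 port 22 ssh2".
FAILED = re.compile(r"Failed password for (?:invalid user )?\S+ from (\S+) port \d+")


def brute_force_sources(lines, threshold=10):
    """Return {source_address: failures} for sources above the threshold."""
    counts = Counter()
    for line in lines:
        m = FAILED.search(line)
        if m:
            counts[m.group(1)] += 1
    return {ip: n for ip, n in counts.items() if n > threshold}


# Invented sample log lines for illustration.
sample = (
    ["Jan 10 03:14:15 login1 sshd[4242]: Failed password for invalid user admin "
     "from 203.0.113.5 port 50122 ssh2"] * 6
    + ["Jan 10 03:14:20 login1 sshd[4243]: Failed password for root "
       "from 203.0.113.5 port 50130 ssh2"] * 5
    + ["Jan 10 09:00:01 login1 sshd[5000]: Failed password for alice "
       "from 198.51.100.7 port 40000 ssh2"] * 2
)
for ip, n in brute_force_sources(sample).items():
    print(f"ALERT: Brute force from {ip} {n} attempts")
```

In production this would read from the auth log or the journal on a schedule and send its alerts to the monitoring system rather than to standard output.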

HPC security requires a layered approach where no single control point is relied upon exclusively. Network segmentation, strong authentication, monitoring, and data protection work together to reduce risk. Contact Mevasis for HPC security assessment and hardening services.