Blog
Technical notes from our team on HPC, AI infrastructure and high-performance systems.
BeeGFS Technical Guide: Architecture, Installation, and Best Practices
BeeGFS parallel filesystem architecture, four-component design (management, metadata, storage, client), installation steps, troubleshooting, and production best practices for HPC clusters.
Cloud Bursting for HPC: Architecture, SLURM Configuration, and Cost Control
How to implement cloud bursting for HPC clusters: SLURM scheduler configuration, network connectivity options, spot/preemptible instances, and integration with AWS, Azure, and Google Cloud.
Container Platform Guide for HPC: Apptainer Architecture, Installation, and Best Practices
Why Apptainer (formerly Singularity) is the right container platform for HPC: its three-component architecture (SIF image, definition file, central registry), installation, configuration, and best practices.
CPU Cluster Technical Guide: AMD EPYC vs Intel Xeon, SLURM Configuration, and Workloads
Technical guide for CPU-based HPC clusters: AMD EPYC vs Intel Xeon comparison, suitable workloads (MPI simulations, CFD, Monte Carlo), SLURM configuration, InfiniBand, BeeGFS vs Lustre, NUMA optimization, and benchmark validation.
CUDA Programming Guide: GPU Architecture, Kernels, Memory, and Profiling
Comprehensive CUDA programming guide: GPU vs CPU architecture, streaming multiprocessors, warps, thread hierarchy, memory types (global, shared, constant), kernel writing, streams, debugging, and performance optimization.
GPU Cluster Technical Guide: Architecture, Parallelism Strategies, and Best Practices
GPU cluster technical guide: DGX H100 and HGX H100 architecture, data/model/pipeline/tensor parallelism, SLURM vs Kubernetes scheduling, network bottlenecks, GPU memory issues, thermal management, benchmarks, and best practices.
GPU Memory Management for HPC and AI: Hierarchy, Bottlenecks, and Optimization
GPU memory hierarchy for HPC and AI workloads: HBM3 vs GDDR6 comparison, detecting memory bottlenecks with ncu, mixed precision training, gradient accumulation, gradient checkpointing, Flash Attention, and ZeRO optimizer.
GPU Selection Guide for HPC and AI: H100, A100, L40S Comparison
How to choose the right GPU for HPC and AI workloads: NVIDIA H100 Hopper, A100 Ampere, and L40S Ada Lovelace comparison, use cases, multi-GPU NVLink considerations, TCO analysis, and decision framework.
HPC Autoscaling: Dynamic Node Management with SLURM and Cloud Platforms
How to configure SLURM autoscaling for HPC clusters using AWS ParallelCluster, Azure CycleCloud, and Google Cloud HPC Toolkit, including ResumeProgram, SuspendProgram, and cloud integration.
HPC Backup Strategy: Data Classification, Incremental Backup, Tape Archive, and Cloud
Comprehensive HPC backup strategy: why HPC backup differs from standard IT, data classification, rsync incremental backup scripts, BeeGFS Buddy Mirroring, LTO-9 tape archiving with Bacula/Bareos, rclone for object storage, 3-2-1 rule for HPC, and retention policies.