
AI Inference Cluster: High Availability and Low Latency
SLA-guaranteed inference infrastructure optimized for AI model serving in production environments.
Moving a trained model into production creates very different engineering requirements from the training phase. Instead of batch processing, millisecond-level responses; instead of a single GPU server, hundreds of simultaneous requests; instead of an experimental environment, a 99.9% uptime commitment. Production inference is a separate infrastructure discipline focused on high availability and low latency.
Why Does Inference Require Different Infrastructure from Training?
Model training aims to complete a single job running for weeks: high fault tolerance, throughput-oriented, latency-tolerant. Inference is the exact opposite.
- Low latency: Customer-facing applications typically define a P99 latency target of 20–100 ms. Exceeding this value directly degrades the user experience.
- High concurrency: Thousands of requests can reach the same model per second; infrastructure must handle these without queuing or resource contention.
- Scaling elasticity: Traffic fluctuates. Load that drops to a minimum at night can multiply within minutes during a campaign or crisis event.
- High availability: Service outages mean revenue loss and brand damage; SLA guarantees are mandatory.
- Model versioning: Zero-downtime deployment of new model versions (blue-green or canary deployment) has become a production requirement.
Typical Inference Workloads
Large Language Model (LLM) Serving
GPT-style autoregressive models (Llama, Mistral, Falcon series, and enterprise fine-tuned variants) represent the most challenging inference scenarios due to their per-token compute intensity. Response time depends on both prefill and decode phases; both phases must be optimized separately.
Critical software components:
- vLLM: Dramatically increases KV-cache efficiency with PagedAttention; provides OpenAI-compatible API
- TensorRT-LLM: Optimized version of NVIDIA’s LLM inference engine; INT8/FP8 quantization support on H100/A100
- SGLang: High throughput for complex multi-step prompt structures
Image and Video Processing
Object detection, segmentation, optical character recognition, and video analysis pipelines — especially in streaming scenarios — require critical management of simultaneous GPU threads.
Frameworks used: NVIDIA Triton Inference Server, ONNX Runtime, TensorRT, OpenVINO (for Intel CPU fallback)
Multimodal and Embedded Model Serving
Running audio-image-text models such as CLIP, LLaVA, and Whisper together on the same cluster requires special configuration for resource isolation and load balancing.
Enterprise Recommendation and Personalization Engines
Real-time recommendation systems in e-commerce, fintech, and media sectors target the combination of low latency (<10 ms) and very high QPS (queries per second). In these scenarios, models are small but throughput requirements are extremely high.
Reference Inference Cluster Architecture
Load Balancer (HAProxy / NGINX / Envoy)
│
├── Inference Node Group A — LLM (4 nodes)
│ └── 2× AMD EPYC 9654 + 8× NVIDIA H100 SXM5 80 GB
│ vLLM + TensorRT-LLM, FP8 inference
│
├── Inference Node Group B — Image/Video (4 nodes)
│ └── 2× Intel Xeon + 4× NVIDIA L40S 48 GB
│ NVIDIA Triton IS, TensorRT
│
├── Small Model / CPU Fallback Nodes (2 nodes)
│ └── 2× EPYC 9654 (ONNX Runtime, OpenVINO)
│ Low-priority or cost-oriented requests
│
├── Model Repository (NFS / S3-compatible)
│ └── Model weights, ONNX graphs, TensorRT engines
│
└── Monitoring and Management
└── Prometheus + Grafana, DCGM Exporter, NVIDIA MIG management
Network: 100 Gbit Ethernet (low-latency network between inference nodes); LACP connection from load balancer to nodes.
Performance Optimization: Layered Approach
Latency and throughput optimization in inference infrastructure is not achieved with a single adjustment; it is addressed in layers.
1. Model Layer
| Technique | Effect | Suitable Model Type |
|---|---|---|
| TensorRT conversion | 2–5× speedup, lower memory | CNN, ViT, small encoders |
| INT8 / FP8 quantization | 1.5–3× throughput increase | LLM, especially H100 FP8 |
| Model pruning | Variable; 20–50% latency improvement | Fine-tuned LLM, ResNet variants |
| Speculative decoding | 2–3× increase in LLM decode speed | Large autoregressive models |
| Continuous batching | Minimizes GPU idle time | All LLM scenarios |
2. System Layer
- NVIDIA MIG (Multi-Instance GPU): Divides a single physical GPU into multiple isolated partitions; increases GPU utilization efficiency for small model workloads.
- CUDA Stream parallelization: Executing multiple model inferences simultaneously on the GPU.
- Huge pages and NUMA alignment: Reduces latency in CPU–GPU data transfer.
- NVLink / NVSwitch: Low-latency communication for tensor parallelism in multi-GPU LLM scenarios.
3. Infrastructure Layer
- Pre-warmed model cache: All models kept ready in GPU memory to reduce model loading latency to zero.
- Weight sharing: Shared model weights across multiple workers (via Triton IS model ensemble) rather than memory copies.
- Connection pooling: Persistent connection management eliminating TCP connection setup cost between client and server.
High Availability: Zero Downtime Goal
The primary scenarios threatening availability in inference services are: hardware failure, software hangs, model updates, and planned maintenance. Each requires different countermeasures.
Hardware Failure Protection
- N+1 node configuration: If one node goes down, traffic is routed to remaining nodes
- GPU failure detection: Continuous health monitoring with NVIDIA DCGM; faulty GPU automatically taken offline
- Power redundancy: Dual PSU, UPS, generator backup
Model Update — Zero Downtime Deployment
Current model v1.0 → receiving all traffic
│
├── New model v1.1 loading (separate slot)
├── Health check passed
├── Canary: 5% traffic → v1.1
├── Canary: 20% traffic → v1.1
├── Full switch: 100% traffic → v1.1
└── v1.0 being drained
NVIDIA Triton Inference Server offers model versioning and A/B test support directly through its API.
SLA Levels
| Target | Parameter | Typical Value |
|---|---|---|
| Availability | Monthly uptime | ≥ 99.9% (≤ 43 minutes/month downtime) |
| Latency (P50) | Median response time | < 50 ms (small models) |
| Latency (P99) | Queue latency | < 200 ms (LLM tok/s loaded) |
| Throughput | Requests per second | Defined per workload |
| Model update | Downtime | Zero (canary/blue-green) |
Security and Data Sovereignty
In enterprise inference infrastructure, model weights, inference inputs, and outputs fall within the scope of sensitive commercial data.
- KVKK compliance: Inference requests and responses containing user data may create a legal obligation not to leave Turkish borders. Mevasis infrastructure is Turkey-located.
- Model weight protection: Custom fine-tuned models are protected with access control, network isolation, and encrypted storage.
- Network isolation: Inference API is accessible only from authorized sources; model repository and compute layer are isolated from the external network through internal network segmentation.
- Audit logging: Recording which model was called, when, and by whom — particularly mandatory for fintech and healthcare applications.
Example Configuration: LLM Serving (Llama Series)
# Triton model config example
name: "llama-70b-fp8"
backend: "tensorrtllm"
max_batch_size: 128
instance_group:
- kind: KIND_GPU
count: 1
gpus: [0, 1, 2, 3] # 4× H100, tensor parallel
dynamic_batching:
preferred_batch_size: [1, 4, 8, 16, 32]
max_queue_delay_microseconds: 5000
parameters:
max_tokens_in_paged_kv_cache: 131072
kv_cache_free_gpu_mem_fraction: 0.9
enable_chunked_context: true
executor_worker_path: "/opt/tritonserver/backends/tensorrtllm"
This configuration runs a 70B-parameter model with FP8 on 4× H100; continuous batching minimizes per-token latency while maximizing GPU utilization.
Monitoring: Inference-Specific Metrics
Beyond general system monitoring, critical metrics for inference services are:
- Request latency distribution (P50/P95/P99): Queue metrics, not averages, should be monitored
- Model ratio in GPU memory: How much of model weights are kept warm in GPU
- KV-cache fill rate: In LLM scenarios, throughput drops when cache fills
- Token generation rate (tokens/second): Primary indicator of LLM performance
- Queue depth: Number of requests waiting to be processed; trigger for scaling decisions
- Batch fill rate: Capacity efficiency; low value indicates GPU waste
Mevasis configures inference cluster monitoring with the Prometheus + Grafana and NVIDIA DCGM Exporter combination; automatic alerts and flexible scaling triggers are set up for critical metrics.
Mevasis Inference Cluster Services
Mevasis designs, installs, and manages production AI inference infrastructure end-to-end:
- GPU Cluster Rental: Inference-focused H100/L40S GPU clusters, monthly or project-based rental
- HPC Infrastructure Setup: Triton IS, vLLM, TensorRT-LLM installation and optimization
- Managed HPC Service: 24/7 monitoring, SLA guarantee, model update support
- HPC Consulting: Performance analysis and optimization of your existing inference infrastructure
Contact us for your production inference needs — workload analysis and sizing consulting is provided free of charge.
Frequently Asked Questions
Should I use vLLM or Triton TRT-LLM? The two have different strengths. vLLM is quick to set up, offers broad model support, and integrates easily into existing applications via OpenAI API compatibility. TensorRT-LLM clearly leads when maximum performance is targeted on NVIDIA hardware — especially with H100 FP8 — but requires a compilation step. Mevasis has hands-on experience with both scenarios.
Which LLM size can be run with how many GPUs? Depends on model size and precision. A 7B model runs on a single A100 80 GB in FP16. A 70B model requires tensor parallelism with 4× H100; 405B+ models require 8+ GPUs and model parallelism. For correct sizing, workload requirements (requests/second, max context length, latency target) should be evaluated together.
What does my inference cluster need for KVKK compliance? Every inference request and response containing user data (prompts, personal information) must be processed and stored in Turkey. Mevasis infrastructure is Turkey-located; access control, encrypted communication (TLS), and audit logging are provided as standard. Additional BDDK/KVKK requirements for fintech and healthcare applications are included in the evaluation.
Will there be downtime during model updates? When properly configured, zero downtime is achieved. With NVIDIA Triton’s model version management and canary deployment pipeline, the new model version is gradually rolled out while the current service remains active without interruption.
Is GPU required for our small models, or is CPU sufficient? BERT-based encoder models, small classification and NER models, distilled models — these run efficiently on Intel Xeon or AMD EPYC with ONNX Runtime in high-volume, latency-tolerant scenarios. The cost advantage of GPU becomes clear only for large models or very high QPS scenarios. Mevasis analyzes the CPU and GPU workload balance with a cost-focused approach and makes recommendations.