AI Inference Cluster: High Availability and Low Latency

Moving a trained model into production creates very different engineering requirements from the training phase. Instead of batch processing, millisecond-level responses; instead of a single GPU server, hundreds of simultaneous requests; instead of an experimental environment, a 99.9% uptime commitment. Production inference is a separate infrastructure discipline focused on high availability and low latency.

Why Does Inference Require Different Infrastructure from Training?

Model training aims to complete a single job that runs for weeks: it is throughput-oriented, latency-tolerant, and able to recover from faults by restarting from a checkpoint. Inference is the opposite on each count.

Low latency: Customer-facing applications typically define a P99 latency target of 20–100 ms. Exceeding this value directly degrades the user experience.
High concurrency: Thousands of requests per second can reach the same model; the infrastructure must absorb them without unbounded queuing or resource contention.
Scaling elasticity: Traffic fluctuates. Load that drops to a minimum at night can multiply within minutes during a campaign or crisis event.
High availability: Service outages mean revenue loss and brand damage; SLA guarantees are mandatory.
Model versioning: Zero-downtime deployment of new model versions (blue-green or canary deployment) has become a production requirement.

Typical Inference Workloads

Large Language Model (LLM) Serving

GPT-style autoregressive models (Llama, Mistral, Falcon series, and enterprise fine-tuned variants) represent the most challenging inference scenarios due to their per-token compute intensity. Response time depends on two phases: prefill, which processes the whole prompt in one pass, and decode, which generates the output one token at a time. The two phases stress the hardware differently and must be optimized separately.
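
The effect of the two phases on response time can be approximated with a rough latency model. The sketch below assumes that the first token arrives when prefill finishes and that every further token costs one fixed decode step; the function name and the sample numbers are illustrative assumptions, not measurements.

```python
def estimate_response_ms(ttft_ms: float, ms_per_output_token: float, output_tokens: int) -> float:
    """Rough end-to-end latency: prefill (time to first token) plus decode steps."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    # The first token is produced at the end of prefill;
    # each remaining token adds one decode step.
    return ttft_ms + ms_per_output_token * (output_tokens - 1)


# Illustrative numbers only: 150 ms prefill, 20 ms per decoded token, 101 output tokens.
print(estimate_response_ms(150.0, 20.0, 101))  # 2150.0
```

With these assumed numbers, decode accounts for 2,000 of the 2,150 ms, which is why long responses are usually limited by decode speed rather than prefill.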

Critical software components:

vLLM: Dramatically increases KV-cache efficiency with PagedAttention; provides an OpenAI-compatible API
TensorRT-LLM: NVIDIA's optimized LLM inference library; supports INT8 quantization on A100 and H100, and FP8 on H100
SGLang: High throughput for complex multi-step prompt structures

Image and Video Processing

Object detection, segmentation, optical character recognition, and video analysis pipelines, especially in streaming scenarios, require careful management of concurrent GPU workloads.

Frameworks used: NVIDIA Triton Inference Server, ONNX Runtime, TensorRT, OpenVINO (for Intel CPU fallback)

Multimodal and Embedding Model Serving

Running audio-image-text models such as CLIP, LLaVA, and Whisper together on the same cluster requires special configuration for resource isolation and load balancing.

Enterprise Recommendation and Personalization Engines

Real-time recommendation systems in e-commerce, fintech, and media sectors target the combination of low latency (<10 ms) and very high QPS (queries per second). In these scenarios, models are small but throughput requirements are extremely high.
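
The relationship between QPS, latency, and the concurrency a cluster must sustain follows from Little's law (average requests in flight = arrival rate × time in system). The sketch below is a sizing aid under that assumption; the function names, the 50,000 QPS figure, and the per-worker concurrency are hypothetical.

```python
import math


def in_flight_requests(qps: float, latency_ms: float) -> float:
    """Little's law: average concurrency = arrival rate x time in system."""
    return qps * latency_ms / 1000.0


def workers_needed(qps: float, latency_ms: float, concurrency_per_worker: int) -> int:
    """Workers required if each one handles a fixed number of concurrent requests."""
    return math.ceil(in_flight_requests(qps, latency_ms) / concurrency_per_worker)


# Hypothetical workload: 50,000 QPS at 8 ms per request, 16 concurrent requests per worker.
print(in_flight_requests(50_000, 8))   # 400.0
print(workers_needed(50_000, 8, 16))   # 25
```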

Reference Inference Cluster Architecture

Load Balancer (HAProxy / NGINX / Envoy)
│
├── Inference Node Group A — LLM (4 nodes)
│   └── 2× AMD EPYC 9654 + 8× NVIDIA H100 SXM5 80 GB
│       vLLM + TensorRT-LLM, FP8 inference
│
├── Inference Node Group B — Image/Video (4 nodes)
│   └── 2× Intel Xeon + 4× NVIDIA L40S 48 GB
│       NVIDIA Triton IS, TensorRT
│
├── Small Model / CPU Fallback Nodes (2 nodes)
│   └── 2× EPYC 9654 (ONNX Runtime, OpenVINO)
│       Low-priority or cost-oriented requests
│
├── Model Repository (NFS / S3-compatible)
│   └── Model weights, ONNX graphs, TensorRT engines
│
└── Monitoring and Management
    └── Prometheus + Grafana, DCGM Exporter, NVIDIA MIG management

Network: 100 Gbit/s Ethernet provides the low-latency network between inference nodes; LACP-bonded links connect the load balancer to the nodes.

Performance Optimization: Layered Approach

Latency and throughput optimization in inference infrastructure is not achieved with a single adjustment; it is addressed in layers.

1. Model Layer

Technique	Effect	Suitable Model Type
TensorRT conversion	2–5× speedup, lower memory	CNN, ViT, small encoders
INT8 / FP8 quantization	1.5–3× throughput increase	LLM, especially H100 FP8
Model pruning	Variable; 20–50% latency improvement	Fine-tuned LLM, ResNet variants
Speculative decoding	2–3× increase in LLM decode speed	Large autoregressive models
Continuous batching	Minimizes GPU idle time	All LLM scenarios
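
To illustrate what quantization does to model weights, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It shows the general idea only; it is not how TensorRT or TensorRT-LLM calibrate or store weights, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude to 127; guard against an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.03, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes)               # [50, -127, 3, 100]
print(max_error <= scale)  # True: rounding error stays within one quantization step
```

Each weight now occupies one byte instead of two (FP16), which is where the memory and bandwidth savings come from; the price is the bounded rounding error shown above.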

2. System Layer

NVIDIA MIG (Multi-Instance GPU): Divides a single physical GPU into multiple isolated partitions; increases GPU utilization efficiency for small model workloads.
CUDA stream parallelization: Runs multiple model inferences concurrently on the same GPU.
Huge pages and NUMA alignment: Reduces latency in CPU–GPU data transfer.
NVLink / NVSwitch: Low-latency communication for tensor parallelism in multi-GPU LLM scenarios.

3. Infrastructure Layer

Pre-warmed model cache: Models are loaded into GPU memory ahead of traffic, so no request pays the model loading cost.
Weight sharing: Multiple workers serve from a single copy of the model weights instead of each holding its own copy in memory; whether this is possible depends on the serving backend.
Connection pooling: Persistent connection management eliminating TCP connection setup cost between client and server.
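
The pre-warming idea can be sketched as a small cache that loads models before traffic arrives and evicts the least recently used one when capacity is exceeded. This is an illustration only: the class name, the loader callback, and the model names are hypothetical, and a real server would hold GPU-resident model objects rather than strings.

```python
from collections import OrderedDict
from typing import Any, Callable, Iterable


class WarmModelCache:
    """Keeps up to `capacity` loaded models resident; evicts the least recently used."""

    def __init__(self, loader: Callable[[str], Any], capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._loader = loader
        self._capacity = capacity
        self._models: "OrderedDict[str, Any]" = OrderedDict()

    def prewarm(self, names: Iterable[str]) -> None:
        """Load models ahead of traffic so the first request finds them ready."""
        for name in names:
            self.get(name)

    def get(self, name: str) -> Any:
        if name in self._models:
            self._models.move_to_end(name)    # mark as most recently used
            return self._models[name]
        model = self._loader(name)            # slow path: load on a cache miss
        self._models[name] = model
        if len(self._models) > self._capacity:
            self._models.popitem(last=False)  # evict the least recently used model
        return model


# Hypothetical loader that records every (slow) load.
loads = []

def fake_loader(name: str) -> str:
    loads.append(name)
    return f"model:{name}"

cache = WarmModelCache(fake_loader, capacity=2)
cache.prewarm(["ocr", "detector"])
cache.get("ocr")   # served from the cache, no new load
print(loads)       # ['ocr', 'detector']
```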

High Availability: Zero Downtime Goal

The primary scenarios threatening availability in inference services are: hardware failure, software hangs, model updates, and planned maintenance. Each requires different countermeasures.

Hardware Failure Protection

N+1 node configuration: If one node goes down, traffic is routed to remaining nodes
GPU failure detection: Continuous health monitoring with NVIDIA DCGM; faulty GPU automatically taken offline
Power redundancy: Dual PSU, UPS, generator backup
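
N+1 sizing can be checked with a short calculation: provision enough nodes for peak load, add one spare, and confirm that the surviving nodes are not overloaded after a failure. The function names and the throughput numbers below are hypothetical.

```python
import math


def nodes_required(peak_qps: float, qps_per_node: float, spare_nodes: int = 1) -> int:
    """Nodes needed to carry peak load, plus spare capacity (N+1 by default)."""
    return math.ceil(peak_qps / qps_per_node) + spare_nodes


def utilization_after_failure(peak_qps: float, qps_per_node: float,
                              total_nodes: int, failed: int = 1) -> float:
    """Load on the surviving nodes as a fraction of their combined capacity."""
    surviving = total_nodes - failed
    if surviving < 1:
        raise ValueError("no surviving nodes")
    return peak_qps / (surviving * qps_per_node)


# Hypothetical: 900 QPS at peak, 300 QPS per node.
total = nodes_required(900, 300)
print(total)                                        # 4
print(utilization_after_failure(900, 300, total))   # 1.0
```

A result of 1.0 means the cluster survives one node failure at peak with no headroom left; a stricter design would size for a lower post-failure utilization.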

Model Update — Zero Downtime Deployment

Current model v1.0 → receiving all traffic
  │
  ├── New model v1.1 loading (separate slot)
  ├── Health check passed
  ├── Canary: 5% traffic → v1.1
  ├── Canary: 20% traffic → v1.1
  ├── Full switch: 100% traffic → v1.1
  └── v1.0 being drained

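NVIDIA Triton Inference Server supports model versioning natively: several versions of a model can be loaded side by side and addressed individually through its API, which gives canary and A/B routing at the load balancer something to target.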
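
The staged traffic shift in the rollout above can be sketched as a deterministic, hash-based split. This is one common way to implement a canary, shown as an assumption rather than as the mechanism any particular load balancer uses; the function name and the request keys are hypothetical.

```python
import hashlib


def route_version(request_key: str, canary_percent: int,
                  stable: str = "v1.0", canary: str = "v1.1") -> str:
    """Send a fixed share of keys to the canary version, the same key always the same way."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100   # stable bucket in 0..99
    return canary if bucket < canary_percent else stable


keys = [f"user-{i}" for i in range(10_000)]
for percent in (5, 20, 100):
    share = sum(route_version(k, percent) == "v1.1" for k in keys) / len(keys)
    print(percent, round(share, 3))
```

Because the bucket depends only on the key, a user routed to v1.1 at 5% stays on v1.1 at 20% and 100%, which avoids users flipping between versions during the rollout.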

SLA Levels

Target	Parameter	Typical Value
Availability	Monthly uptime	≥ 99.9% (≤ 43 minutes/month downtime)
Latency (P50)	Median response time	< 50 ms (small models)
Latency (P99)	Tail latency	< 200 ms (LLM, under load)
Throughput	Requests per second	Defined per workload
Model update	Downtime	Zero (canary/blue-green)
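
The downtime figure in the availability row follows from simple arithmetic, assuming a 30-day month. The helper name below is hypothetical.

```python
def downtime_budget_minutes(availability_percent: float, days: int = 30) -> float:
    """Allowed downtime in minutes for a given availability target over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - availability_percent) / 100.0


for target in (99.0, 99.9, 99.99):
    print(target, round(downtime_budget_minutes(target), 1))
# 99.0 -> 432.0, 99.9 -> 43.2, 99.99 -> 4.3
```

Each additional nine cuts the budget by a factor of ten, which is why a 99.99% target usually rules out any manual recovery step.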

Security and Data Sovereignty

In enterprise inference infrastructure, model weights, inference inputs, and outputs fall within the scope of sensitive commercial data.

KVKK compliance: Under KVKK (Turkey's Personal Data Protection Law), inference requests and responses containing user data may be subject to a legal obligation to stay within Turkey's borders. Mevasis infrastructure is located in Turkey.
Model weight protection: Custom fine-tuned models are protected with access control, network isolation, and encrypted storage.
Network isolation: Inference API is accessible only from authorized sources; model repository and compute layer are isolated from the external network through internal network segmentation.
Audit logging: Records which model was called, when, and by whom; typically mandatory for fintech and healthcare applications.

Example Configuration: LLM Serving (Llama Series)

# Triton model config example (schematic, YAML-style notation; Triton's own config.pbtxt uses protobuf text format)
name: "llama-70b-fp8"
backend: "tensorrtllm"
max_batch_size: 128

instance_group:
  - kind: KIND_GPU
    count: 1
    gpus: [0, 1, 2, 3]   # 4× H100, tensor parallel

dynamic_batching:
  preferred_batch_size: [1, 4, 8, 16, 32]
  max_queue_delay_microseconds: 5000

parameters:
  max_tokens_in_paged_kv_cache: 131072
  kv_cache_free_gpu_mem_fraction: 0.9
  enable_chunked_context: true
  executor_worker_path: "/opt/tritonserver/backends/tensorrtllm"

This configuration serves a 70B-parameter model in FP8 across 4× H100. Batching keeps GPU utilization high, and the 5 ms queue delay limit (max_queue_delay_microseconds: 5000) caps how long a request waits for a batch to form.
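
The KV-cache budget in the configuration above can be sanity-checked with simple arithmetic. The sketch assumes the worst case in which every sequence occupies its full token allowance; the helper names are hypothetical and are not part of Triton or TensorRT-LLM, and the 8,192-token sequence length is an example value.

```python
def tokens_per_sequence(cache_tokens: int, batch_size: int) -> int:
    """Average KV-cache tokens available to each sequence at a full batch."""
    return cache_tokens // batch_size


def max_concurrent_sequences(cache_tokens: int, sequence_tokens: int) -> int:
    """How many sequences of a given length (prompt + output) fit in the cache."""
    return cache_tokens // sequence_tokens


CACHE_TOKENS = 131072   # max_tokens_in_paged_kv_cache from the config
MAX_BATCH = 128         # max_batch_size from the config

print(tokens_per_sequence(CACHE_TOKENS, MAX_BATCH))   # 1024
print(max_concurrent_sequences(CACHE_TOKENS, 8192))   # 16
```

At a full batch of 128, each sequence has about 1,024 tokens of cache on average, so long-context requests reduce the batch size that can actually be reached.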

Monitoring: Inference-Specific Metrics

Beyond general system monitoring, critical metrics for inference services are:

Request latency distribution (P50/P95/P99): Tail metrics, not averages, should be monitored
GPU memory residency: Share of model weights kept warm in GPU memory
KV-cache fill rate: In LLM scenarios, throughput drops when the cache fills
Token generation rate (tokens/second): Primary indicator of LLM performance
Queue depth: Number of requests waiting to be processed; trigger for scaling decisions
Batch fill rate: Capacity efficiency; low value indicates GPU waste

Mevasis configures inference cluster monitoring with Prometheus + Grafana and the NVIDIA DCGM Exporter; automatic alerts and elastic scaling triggers are set up for critical metrics.
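
A scaling trigger driven by queue depth could be sketched as follows. This is a generic illustration, not a description of the Mevasis setup; the function name, the per-replica queue target, and the replica limits are assumptions.

```python
import math


def desired_replicas(queue_depth: int, target_queue_per_replica: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Replica count that keeps the queue per replica at or below the target."""
    wanted = math.ceil(queue_depth / target_queue_per_replica)
    # Clamp to the allowed range so the cluster never scales to zero or past its limit.
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(0, 8, 2, 10))     # 2  (quiet period: stay at the minimum)
print(desired_replicas(40, 8, 2, 10))    # 5
print(desired_replicas(400, 8, 2, 10))   # 10 (capped at the maximum)
```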

Mevasis Inference Cluster Services

Mevasis designs, installs, and manages production AI inference infrastructure end-to-end:

GPU Cluster Rental: Inference-focused H100/L40S GPU clusters, monthly or project-based rental
HPC Infrastructure Setup: Triton IS, vLLM, TensorRT-LLM installation and optimization
Managed HPC Service: 24/7 monitoring, SLA guarantee, model update support
HPC Consulting: Performance analysis and optimization of your existing inference infrastructure

Frequently Asked Questions

Should I use vLLM or Triton TRT-LLM? The two have different strengths. vLLM is quick to set up, offers broad model support, and integrates easily into existing applications through its OpenAI-compatible API. TensorRT-LLM generally leads when maximum performance on NVIDIA hardware is the goal, especially with FP8 on H100, but it requires an engine compilation step. Mevasis has hands-on experience with both.

Which LLM size can be run with how many GPUs? It depends on model size and precision. A 7B model needs about 14 GB for weights in FP16 and runs on a single A100 80 GB. A 70B model needs about 140 GB in FP16 (about 70 GB in FP8), so it is typically served with tensor parallelism across 4× H100 to leave room for the KV cache. 405B+ models require 8 or more GPUs and model parallelism. For correct sizing, workload requirements (requests/second, maximum context length, latency target) should be evaluated together.
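
The sizing logic can be sketched as a back-of-the-envelope calculation from parameter count and precision. It counts model weights only and assumes that 70% of each GPU's memory is available for them, with the rest left for KV cache and activations; that fraction and the function names are assumptions, and real sizing depends on context length and batch size.

```python
import math

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Memory for weights alone: 1e9 params x bytes per param, expressed in GB."""
    return params_billion * BYTES_PER_PARAM[precision]


def min_gpus(params_billion: float, precision: str,
             gpu_memory_gb: float = 80.0, usable_fraction: float = 0.7) -> int:
    """Lower bound on GPU count if only `usable_fraction` of memory may hold weights."""
    usable_gb = gpu_memory_gb * usable_fraction
    return math.ceil(weight_memory_gb(params_billion, precision) / usable_gb)


for size, precision in [(7, "fp16"), (70, "fp16"), (70, "fp8"), (405, "fp8")]:
    print(size, precision, weight_memory_gb(size, precision), min_gpus(size, precision))
# 7 fp16 14.0 1
# 70 fp16 140.0 3
# 70 fp8 70.0 2
# 405 fp8 405.0 8
```

Tensor-parallel deployments usually use a GPU count that divides the model evenly, often a power of two, so a lower bound of 3 becomes 4 in practice.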

What does my inference cluster need for KVKK compliance? Inference requests and responses containing user data (prompts, personal information) may need to be processed and stored in Turkey under KVKK. Mevasis infrastructure is located in Turkey; access control, encrypted communication (TLS), and audit logging are provided as standard. Additional BDDK/KVKK requirements for fintech and healthcare applications are included in the evaluation.

Will there be downtime during model updates? When properly configured, zero downtime is achieved. With NVIDIA Triton’s model version management and canary deployment pipeline, the new model version is gradually rolled out while the current service remains active without interruption.

Is GPU required for our small models, or is CPU sufficient? CPU is often sufficient. BERT-based encoder models, small classification and NER models, and distilled models run efficiently on Intel Xeon or AMD EPYC with ONNX Runtime in high-volume, latency-tolerant scenarios. The cost advantage of GPU becomes clear only for large models or very high QPS scenarios. Mevasis analyzes the balance between CPU and GPU workloads with a cost focus and makes recommendations.