#large-scale-serving

13 posts

From Day 0 to Production SLAs: Serving GLM-5.2 on 24 NVIDIA B300 GPUs with vLLM

Jul 23, 2026·18 min read

How we took GLM-5.2-NVFP4 from 40 ms to 17 ms mean TPOT on 24 B300 GPUs with vLLM: P/D disaggregation, MTP speculative decoding, Model Runner V2, and the SLA-first trade-offs behind the final configuration.

Elastic Expert Parallelism in vLLM

May 14, 2026·11 min read

How Elastic Expert Parallelism lets vLLM scale Mixture-of-Experts serving up or down at runtime by changing the number of data-parallel workers, redistributing experts, and coordinating live topology changes without server restarts.

Serving Agentic Workloads at Scale with vLLM x Mooncake

May 6, 2026·10 min read

How vLLM integrates Mooncake Store as a distributed KV cache for agentic workloads, reusing shared prefixes across turns and instances to improve throughput, TTFT, end-to-end latency, and multi-GPU scaling.

Driving vLLM WideEP and Large-Scale Serving Toward Maturity on Blackwell (Part I)

Feb 3, 2026·10 min read

How vLLM improves WideEP and large-scale DeepSeek-style MoE serving on NVIDIA GB200 with NVFP4 and FP8 kernels, fusion, prefill/decode disaggregation, weight offloading, and reduced chunking overhead.

vLLM Large Scale Serving: DeepSeek @ 2.2k tok/s/H200 with Wide-EP

Dec 17, 2025·8 min read

How vLLM reaches 2.2k tokens per second per H200 for DeepSeek-style MoE serving with Wide-EP, async scheduling, dual-batch overlap, disaggregated serving, CUDA graphs, DeepGEMM, and expert load balancing.

Encoder Disaggregation for Scalable Multimodal Model Serving

Dec 15, 2025·9 min read

How vLLM's Encode-Prefill-Decode (EPD) disaggregation separates visual encoders from text prefill and decode, covering large multimodal model (LMM) serving, GPU resource scaling, multimodal interference, and pipelined execution.

vLLM Router: A High-Performance and Prefill/Decode Aware Load Balancer for Large-scale Serving

Dec 13, 2025·5 min read

What vLLM Router provides for large-scale serving: Rust-based state-aware load balancing, KV-cache affinity, prefill/decode disaggregation orchestration, Kubernetes discovery, retries, circuit breakers, and Prometheus metrics.

Streamlined multi-node serving with Ray symmetric-run

Nov 22, 2025·4 min read

How Ray symmetric-run simplifies multi-node vLLM serving by launching the same entrypoint on every Ray cluster node, matching HPC and parallel SSH workflows for distributed model deployments.

Inside vLLM: Anatomy of a High-Throughput LLM Inference System

Sep 5, 2025·41 min read

How vLLM's inference engine works, covering PagedAttention, continuous batching, prefix caching, speculative decoding, multi-GPU serving, scheduling, and benchmarking for high-throughput LLM workloads.

Accelerating RLHF with vLLM, Best Practice from OpenRLHF

Apr 23, 2025·5 min read

How OpenRLHF uses vLLM, Ray, ZeRO-3, AutoTP, Ray placement groups, and weight synchronization to accelerate PPO and RLHF sample generation for reasoning models with long chain-of-thought outputs.

Introducing AIBrix: A Scalable, Cost-Effective Control Plane for vLLM

Feb 21, 2025·4 min read

What AIBrix adds as a Kubernetes control plane for vLLM: LoRA management, LLM gateway routing, autoscaling, unified runtime, distributed inference, distributed KV cache, heterogeneous serving, and GPU failure detection.

Distributed Inference with vLLM

Feb 17, 2025·5 min read

A guide to distributed inference in vLLM, covering tensor parallelism, pipeline parallelism, multi-GPU and multi-node serving, KV cache challenges, speculative decoding, communication kernels, and control-plane design.

High Performance and Easy Deployment of vLLM in K8S with vLLM production-stack

Jan 21, 2025·4 min read

What vLLM production-stack adds for Kubernetes serving: prefix-aware routing, LMCache-backed KV cache sharing, autoscaling, observability, fault tolerance, and cluster deployment with higher throughput and lower latency.