
Model Runner V2: A Modular and Faster Core for vLLM
We are excited to announce Model Runner V2 (MRV2), a ground-up re-implementation of the vLLM model runner. MRV2 delivers a cleaner, more modular, and more efficient execution core—with no API...

EAGLE is the state-of-the-art method for speculative decoding in large language model (LLM) inference, but its autoregressive drafting creates a hidden bottleneck: the more tokens that you...

This article is adapted from a Red Hat hosted vLLM Office Hours session with Burkhard Ringlein from IBM Research, featuring a deep technical walkthrough of the vLLM Triton attention backend....

For a long time, enabling AMD support meant "porting", i.e., just making code run. That era is over.

Organizations and individuals running multiple custom AI models, especially recent Mixture of Experts (MoE) model families, can face the challenge of paying for idle GPU capacity when the...

DeepSeek-V3.2 (NVFP4 + TP2) has been run successfully and smoothly on GB300 (SM103 - Blackwell Ultra). Leveraging FP4 quantization, it achieves a single-GPU throughput of 7360 TGS (tokens / GPU /...

Building on our previous work achieving 2.2k tok/s/H200 decode throughput with wide-EP, the vLLM team has continued performance optimization efforts targeting NVIDIA's GB200 platform. This blog...

TL;DR: In collaboration with the open-source community, vLLM + NVIDIA has achieved significant performance milestones on the gpt-oss-120b model running on NVIDIA's Blackwell GPUs. Through deep...

In this post, we will describe the new KV cache offloading feature that was introduced in vLLM 0.11.0. We will focus on offloading to CPU memory (DRAM) and its benefits to improving overall...

We are thrilled to announce a major performance update for vLLM-Omni.

In v0.11.0, the last code from vLLM V0 engine was removed, marking the complete migration to the improved V1 engine architecture. This achievement would not have been possible without vLLM’s...

Introducing Shared Memory IPC Caching — a high-performance caching mechanism contributed by Cohere to the vLLM project. By bypassing redundant inter-process communication and keeping large...

We demonstrate an open-source bitwise consistent on-policy RL run with TorchTitan as the training engine and vLLM as the inference engine. Built on top of vLLM's recent work on batch-invariant...

The multi-model serving problem: You have two LLMs that each fit on your GPU, but not both at once. Traditional solutions force a bad tradeoff:

Over the past several months, we’ve been collaborating closely with NVIDIA to unlock the full potential of their latest NVIDIA Blackwell GPU architecture (B200/GB200) for large language model...

Fast large language model (LLM) inference today requires executing models as efficiently as possible across diverse hardware, workloads, and scale. Efficient execution requires heavily optimized...

We're thrilled to announce that vLLM now supports gpt-oss on NVIDIA Blackwell and Hopper GPUs, as well as AMD MI300x and MI355x GPUs. In this blog post, we’ll explore the efficient model...

This article explores how MiniMax-M1's hybrid architecture is efficiently supported in vLLM. We discuss the model's unique features, the challenges of efficient inference, and the technical...

We are thrilled to announce the alpha release of vLLM V1, a major upgrade to vLLM’s core architecture. Based on lessons we learned over the past 1.5 years of vLLM development, we revisited key...

- Structured decoding allows precise control over LLM output formats
- vLLM now supports both outlines and XGrammar backends for structured decoding
- Recent XGrammar integration brings up to 5x...

TL;DR: vLLM unlocks incredible performance on the AMD MI300X, achieving 1.5x higher throughput and 1.7x faster time-to-first-token (TTFT) than Text Generation Inference (TGI) for Llama 3.1 405B....

Speculative decoding in vLLM is a powerful technique that accelerates token generation by leveraging both small and large models in tandem. In this blog, we’ll break down speculative decoding in...
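The teaser above describes the small-model/large-model interplay; as a rough illustration of the idea, here is a toy greedy-acceptance sketch. The `draft_model` and `target_model` functions are trivial stand-ins invented for this example (not real LLMs or vLLM APIs), and real systems such as vLLM use a batched verification pass and a rejection-sampling acceptance rule rather than this exact-match check.

```python
def draft_model(prefix):
    # Hypothetical cheap model: predicts the next token as last + 1.
    return prefix[-1] + 1

def target_model(prefix):
    # Hypothetical expensive model: agrees with the draft except after token 3.
    return 99 if prefix[-1] == 3 else prefix[-1] + 1

def speculative_step(prefix, k=4):
    """Propose k draft tokens, then verify them with the target model.

    Accepts the longest run of draft tokens the target agrees with, plus
    one corrected token from the target, so every step makes progress.
    """
    # 1) Draft phase: the small model proposes k tokens autoregressively.
    drafts = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_model(ctx)
        drafts.append(t)
        ctx.append(t)

    # 2) Verify phase: the large model checks each proposed position.
    #    (In a real engine this is one batched forward pass, which is
    #    where the speedup over token-by-token decoding comes from.)
    accepted = []
    ctx = list(prefix)
    for t in drafts:
        expect = target_model(ctx)
        if expect != t:
            accepted.append(expect)  # target's correction replaces the bad draft
            break
        accepted.append(t)
        ctx.append(t)
    return list(prefix) + accepted

print(speculative_step([1], k=4))  # → [1, 2, 3, 99]
```

Here the target agrees with the first two draft tokens and rejects the third, so one verification pass emits three tokens instead of one.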

TL;DR: vLLM achieves 2.7x higher throughput and 5x faster TPOT (time per output token) on Llama 8B model, and 1.8x higher throughput and 2x less TPOT on Llama 70B model.

- vLLM matches DeepSpeed-FastGen's speed in common scenarios and surpasses it when handling longer outputs.
- DeepSpeed-FastGen only outperforms vLLM in scenarios with long prompts and short...

LLMs promise to fundamentally change how we use AI across all industries. However, actually serving these models is challenging and can be surprisingly slow even on expensive hardware. Today we...