Blog

Deep dives into inference engineering, performance breakthroughs, new model support, and the latest from the vLLM community.

Featured

Inside vLLM: Anatomy of a High-Throughput LLM Inference System

Sep 5, 2025·41 min read

In this post, I'll gradually introduce all of the core system components and advanced features that make up a modern high-throughput LLM inference system. In particular I'll be doing a breakdown...

EAGLE 3.1: Advancing Speculative Decoding Through Collaboration Between the EAGLE Team, vLLM, and TorchSpec

May 26, 2026·4 min read

The EAGLE series — including EAGLE 1, EAGLE 2, and EAGLE 3 — has become one of the most widely adopted and practically deployed families of speculative decoding algorithms across both research and...

vLLM x Novita AI: PegaFlow for Production-Grade External KV Cache

May 18, 2026·13 min read

TL;DR: In collaboration with Novita AI, PegaFlow integrates with vLLM as an external KV cache service for LLM inference, implemented as a standalone Rust process and connected through the external...

Elastic Expert Parallelism in vLLM

May 14, 2026·11 min read

Expert parallelism (EP) is a key technique for serving Mixture-of-Experts (MoE) models at high throughput. WideEP deployments (where EP spans many workers) maximize KV cache capacity, enabling...

Announcing VeRL-Omni: Easy, Fast, and Stable RL Training for Diffusion and Omni-Modality Models

May 14, 2026·7 min read

We are excited to announce the pre-release of VeRL-Omni, a general reinforcement learning (RL) post-training framework focused on multimodal generative models, built on top of verl and vllm-omni.

A First Comprehensive Study of TurboQuant: Accuracy and Performance

May 11, 2026·12 min read

TurboQuant, a method for KV-cache quantization, recently gained significant traction in the community due to the large advertised savings in GPU memory from very low bit-width quantization of a...

vLLM Tops the Artificial Analysis Leaderboard

May 11, 2026·15 min read

How vLLM built the leading deployments of DeepSeek V3.2, MiniMax-M2.5, and Qwen 3.5 397B.

Serving Agentic Workloads at Scale with vLLM x Mooncake

May 6, 2026·10 min read

TL;DR: Agentic workloads generate massive shared prefixes that are often recomputed across turns. By integrating Mooncake's distributed KV cache store into vLLM, we achieve 3.8x higher throughput,...

Run Highly Efficient Multimodal Agentic AI with NVIDIA Nemotron 3 Nano Omni Using vLLM

Apr 28, 2026·7 min read

We are excited to support the newly released NVIDIA Nemotron 3 Nano Omni model on vLLM.

DeepSeek V4 in vLLM: Efficient Long-context Attention

Apr 24, 2026·17 min read

A first-principles walkthrough of DeepSeek V4's long-context attention, and how we implemented it in vLLM.

The State of FP8 KV-Cache and Attention Quantization in vLLM

Apr 22, 2026·21 min read

Long-context LLM serving is increasingly memory-bound: for standard full-attention decoders, the KV cache often dominates GPU memory at 128k+ contexts, and each decode step must read a large...

Disaggregated Serving for Hybrid SSM Models in vLLM

Apr 21, 2026·15 min read

Hybrid architectures that interleave Mamba-style SSM layers with standard full-attention (FA) layers — such as NVIDIA Nemotron-H — are gaining traction as a way to combine the linear-time...

vLLM Korea Meetup 2026 Wrap-Up

Apr 14, 2026·7 min read

Hosted by the vLLM KR Community, with support from Rebellions, SqueezeBits, Red Hat APAC, and PyTorch Korea, the vLLM Korea Meetup 2026 was held in Seoul on April 2nd.

Next-Level Inference: Why Your Single-Node vLLM Setup Needs Prefill-Decode Disaggregation

Apr 7, 2026·22 min read

TL;DR: Prefill and decode fight over the same GPUs, causing ITL spikes under load. We show how to disaggregate them on a single 8-GPU MI300X node using AMD's MORI-IO connector — achieving 2.5x...

Announcing Gemma 4 on vLLM: Byte for byte, the most capable open models

Apr 2, 2026·3 min read

With the debut of Gemma 4, vLLM introduces immediate support for Google's most sophisticated open model lineup, spanning multiple hardware backends, with first-ever Day 0 support on Google TPUs,...

Extracting hidden states from vLLM

Mar 30, 2026·8 min read

PR #33736 (included in vllm>=v0.18.0) introduced a new hidden states extraction system to vLLM. This blog post explores the motivation, design, usage, and future direction of this feature, and its...

Model Runner V2: A Modular and Faster Core for vLLM

Mar 24, 2026·8 min read

We are excited to announce Model Runner V2 (MRV2), a ground-up re-implementation of the vLLM model runner. MRV2 delivers a cleaner, more modular, and more efficient execution core—with no API...

P-EAGLE: Faster LLM inference with Parallel Speculative Decoding in vLLM

Mar 13, 2026·12 min read

EAGLE is the state-of-the-art method for speculative decoding in large language model (LLM) inference, but its autoregressive drafting creates a hidden bottleneck: the more tokens that you...

Run Highly Efficient and Accurate Multi-Agent AI with NVIDIA Nemotron 3 Super Using vLLM

Mar 11, 2026·5 min read

We are excited to support the newly released NVIDIA Nemotron 3 Super model on vLLM.

vLLM Semantic Router v0.2 Athena: ClawOS, Model Refresh, and the System Brain

Mar 10, 2026·23 min read

Since v0.1 Iris, vLLM Semantic Router has made a large jump. In one release cycle, the project rebuilt its model stack, expanded routing into safety, semantic caching, memory, retrieval, and...

vLLM Triton Attention Backend Deep Dive

Mar 4, 2026·10 min read

This article is adapted from a Red Hat hosted vLLM Office Hours session with Burkhard Ringlein from IBM Research, featuring a deep technical walkthrough of the vLLM Triton attention backend....

Beyond Porting: How vLLM Orchestrates High-Performance Inference on AMD ROCm

Feb 27, 2026·19 min read

For a long time, enabling AMD support meant "porting"; i.e. just making code run. That era is over.

Efficiently serve dozens of fine-tuned models with vLLM on Amazon SageMaker AI and Amazon Bedrock

Feb 26, 2026·11 min read

Organizations and individuals running multiple custom AI models, especially recent Mixture of Experts (MoE) model families, can face the challenge of paying for idle GPU capacity when the...

DeepSeek-V3.2 on GB300: Performance Breakthrough

Feb 13, 2026·12 min read

DeepSeek-V3.2 (NVFP4 + TP2) has been successfully and smoothly run on GB300 (SM103 - Blackwell Ultra). Leveraging FP4 quantization, it achieves a single-GPU throughput of 7360 TGS (tokens / GPU /...

Driving vLLM WideEP and Large-Scale Serving Toward Maturity on Blackwell (Part I)

Feb 3, 2026·10 min read

Building on our previous work achieving 2.2k tok/s/H200 decode throughput with wide-EP, the vLLM team has continued performance optimization efforts targeting NVIDIA's GB200 platform. This blog...

GPT-OSS Performance Optimizations on NVIDIA Blackwell: Pushing the Pareto Frontier

Feb 1, 2026·8 min read

TL;DR: In collaboration with the open-source community, vLLM + NVIDIA has achieved significant performance milestones on the gpt-oss-120b model running on NVIDIA's Blackwell GPUs. Through deep...

Streaming Requests & Realtime API in vLLM

Jan 31, 2026·19 min read

Large language model inference has traditionally operated on a simple premise: the user submits a complete prompt (request), the model processes it, and returns a response (either streaming or at...

Building Mixture-of-Models on AMD GPUs with vLLM-SR

Jan 23, 2026·10 min read

We are working on building the System Level Intelligence for Mixture-of-Models (MoM), bringing Collective Intelligence into LLM systems.

Inside vLLM’s New KV Offloading Connector: Smarter Memory Transfer for Maximizing Inference Throughput

Jan 8, 2026·15 min read

In this post, we will describe the new KV cache offloading feature that was introduced in vLLM 0.11.0. We will focus on offloading to CPU memory (DRAM) and its benefits to improving overall...

vLLM Semantic Router v0.1 Iris: The First Major Release

Jan 5, 2026·9 min read

vLLM Semantic Router is the System Level Intelligence for Mixture-of-Models (MoM), bringing Collective Intelligence into LLM systems. It lives between users and models, capturing signals from...

Introducing vLLM Playground: A Modern Web Interface for Managing and Interacting with vLLM Servers

Jan 2, 2026·6 min read

As a passionate vLLM community member who wants to see vLLM thrive and reach even more developers, I'm excited to announce vLLM Playground – a modern, feature-rich web interface for managing and...

Announcing vllm.ai Website and Some Community Updates

Dec 27, 2025·3 min read

For a long time, vllm.ai simply redirected to the vLLM GitHub page. Thanks to our community, we now have a brand-new vllm.ai website, drawing inspiration from the PyTorch website.

vLLM-Omni Diffusion Cache Acceleration

Dec 19, 2025·4 min read

We are thrilled to announce a major performance update for vLLM-Omni.

vLLM Large Scale Serving: DeepSeek @ 2.2k tok/s/H200 with Wide-EP

Dec 17, 2025·8 min read

In v0.11.0, the last code from the vLLM V0 engine was removed, marking the complete migration to the improved V1 engine architecture. This achievement would not have been possible without vLLM’s...

AMD × vLLM Semantic Router: Building the System Intelligence Together

Dec 16, 2025·11 min read

Over the past several months, AMD and the vLLM SR Team have been collaborating to bring vLLM Semantic Router (VSR) to AMD GPUs—not just as a performance optimization, but as a fundamental shift in...

Run Highly Efficient and Accurate AI Agents with NVIDIA Nemotron 3 Nano on vLLM

Dec 15, 2025·5 min read

Jan 28th Update: NVIDIA just released their Nemotron 3 Nano model in NVFP4 precision. This model is supported by vLLM out of the box and it uses a new method called Quantization-Aware Distillation...

Encoder Disaggregation for Scalable Multimodal Model Serving

Dec 15, 2025·9 min read

Modern Large Multimodal Models (LMMs) introduce a unique serving-time bottleneck: before any text generation can begin, all images must be processed by a visual encoder (e.g., ViT). This encoder...

Token-Level Truth: Real-Time Hallucination Detection for Production LLMs

Dec 14, 2025·13 min read

Your LLM just called a tool, received accurate data, and still got the answer wrong. Welcome to the world of extrinsic hallucination—where models confidently ignore the ground truth sitting right...

Diving into speculative decoding training support for vLLM with Speculators v0.3.0

Dec 13, 2025·11 min read

Speculative decoding serves as an optimization to improve inference performance; however, training a unique draft model for each LLM can be difficult and time-consuming, while production-ready...

vLLM Router: A High-Performance and Prefill/Decode Aware Load Balancer for Large-scale Serving

Dec 13, 2025·5 min read

Efficiently managing request distribution across a fleet of model replicas is a critical requirement for large-scale, production vLLM deployments. Standard load balancers often fall short as they...

Advancing Low‑Bit Quantization for LLMs: AutoRound x LLM Compressor

Dec 9, 2025·5 min read

Achieve faster, more efficient LLM serving without sacrificing accuracy!

Tracing Hanging and Complicated GPU Kernels Down To The Source Code

Dec 3, 2025·13 min read

Several months ago, we published a blog post, "CUDA Core Dump: An Effective Tool to Debug Memory Access Issues and Beyond", introducing a powerful technique for debugging illegal memory access...

Announcing vLLM-Omni: Easy, Fast, and Cheap Omni-Modality Model Serving

Nov 30, 2025·4 min read

We are excited to announce the official release of vLLM-Omni, a major extension of the vLLM ecosystem designed to support the next generation of AI: omni-modality models.

Streamlined multi-node serving with Ray symmetric-run

Nov 22, 2025·4 min read

Ray now has a new command: ray symmetric-run. This command makes it possible to launch the same entrypoint command on every node in a Ray cluster, simplifying the workflow to spawn vLLM servers...

Building Clean, Maintainable vLLM Modifications Using the Plugin System

Nov 20, 2025·12 min read

Source: https://github.com/vllm-project/vllm-ascend

Docker Model Runner Integrates vLLM for High-Throughput Inferencing

Nov 19, 2025·6 min read

Today, we're excited to announce that Docker Model Runner now integrates the vLLM inference engine and safetensors models, unlocking high-throughput AI inference with the same Docker tooling you...

Signal-Decision Driven Architecture: Reshaping Semantic Routing at Scale

Nov 19, 2025·14 min read

The earlier versions of vLLM Semantic Router relied on classification-based routing, a straightforward approach where user queries are classified into one of 14 MMLU domain categories, and then...

Shared Memory IPC Caching: Accelerating Data Transfer in LLM Inference Systems

Nov 13, 2025·7 min read

Introducing Shared Memory IPC Caching — a high-performance caching mechanism contributed by Cohere to the vLLM project. By bypassing redundant inter-process communication and keeping large...

Fast and Affordable LLMs serving on Intel Arc Pro B-Series GPUs with vLLM

Nov 11, 2025·9 min read

Intel® Arc™ Pro B-Series GPUs deliver powerful AI capabilities with a focus on accessibility and exceptional price-to-performance ratios. Their large memory capacity and scalability...

No More Train-Inference Mismatch: Bitwise Consistent On-Policy Reinforcement Learning with vLLM and TorchTitan

Nov 10, 2025·6 min read

We demonstrate an open-source bitwise consistent on-policy RL run with TorchTitan as the training engine and vLLM as the inference engine. Built on top of vLLM's recent work on batch-invariant...

Run Multimodal Reasoning Agents with NVIDIA Nemotron on vLLM

Oct 31, 2025·4 min read

We are excited to release NVIDIA Nemotron Nano 2 VL, supported by vLLM. This open vision language model (VLM) is built for video understanding and document intelligence.

Chasing 100% Accuracy: A Deep Dive into Debugging Kimi K2's Tool-Calling on vLLM

Oct 28, 2025·11 min read

TL;DR: For best compatibility with vLLM, use Kimi K2 models whose chat templates were updated after commit 94a4053eb8863059dd8afc00937f054e1365abbd (Kimi-K2-0905) or commit...

From Monolithic to Modular: Scaling Semantic Routing with Extensible LoRA

Oct 27, 2025·8 min read

Semantic routing systems face a scaling challenge. When each classification request requires running multiple fine-tuned models independently, the computational cost grows linearly with the number...

Zero-Reload Model Switching with vLLM Sleep Mode

Oct 26, 2025·17 min read

The multi-model serving problem: You have two LLMs that each fit on your GPU, but not both at once. Traditional solutions force a bad tradeoff:

Now Serving NVIDIA Nemotron with vLLM

Oct 23, 2025·5 min read

Agentic AI systems, capable of reasoning, planning, and taking autonomous actions, are powering the next leap in developer applications. To build these systems, developers need tools that are...

No More Retokenization Drift: Returning Token IDs via the OpenAI Compatible API Matters in Agent RL

Oct 22, 2025·9 min read

TL;DR: Agents often call LLMs via OpenAI-compatible endpoints, which previously returned only string-based inputs and outputs. In agent RL, this can lead to inconsistencies between training and...

vLLM TPU: A New Unified Backend Supporting PyTorch and JAX on TPU

Oct 16, 2025·11 min read

vLLM TPU is now powered by tpu-inference, an expressive and powerful new hardware plugin unifying JAX and PyTorch under a single lowering path. It is not only faster than the previous generation...

SemiAnalysis InferenceMAX: vLLM and NVIDIA Accelerate Blackwell Inference

Oct 9, 2025·8 min read

Over the past several months, we’ve been collaborating closely with NVIDIA to unlock the full potential of their latest NVIDIA Blackwell GPU architecture (B200/GB200) for large language model...

DeepSeek-V3.2-Exp in vLLM: Fine-Grained Sparse Attention in Action

Sep 29, 2025·8 min read

We are excited to announce Day 0 support for DeepSeek-V3.2-Exp, featuring DeepSeek Sparse Attention (DSA) designed for long context tasks. In this post, we showcase how to use this model...

The First vLLM Meetup in Korea

Sep 16, 2025·5 min read

The first vLLM meetup in Korea was held on August 19, 2025, in Seoul, hosted by Rebellions and Red Hat with support from PyTorch Korea User Group and SqueezeBits.

vLLM Now Supports Qwen3-Next: Hybrid Architecture with Extreme Efficiency

Sep 11, 2025·3 min read

We’re excited to announce that vLLM now supports Qwen3-Next, the latest generation of foundation models from the Qwen team. Qwen3-Next introduces a hybrid architecture with extreme efficiency for...

vLLM Semantic Router: Next Phase in LLM inference

Sep 11, 2025·4 min read

Over the past year, hybrid reasoning and automatic routing have increasingly defined progress in large-model infrastructure—shifting the debate from raw scale to per-token efficiency, latency...

Serving Geospatial, Vision, and Beyond: Enabling Multimodal Output Processing in vLLM

Sep 5, 2025·10 min read

Until recently, generative AI infrastructure has been tightly coupled with autoregressive text generation models that produce output token-by-token, typically in the form of natural language. vLLM...

Introduction to torch.compile and How It Works with vLLM

Aug 20, 2025·14 min read

Fast large language model (LLM) inference today requires executing models as efficiently as possible across diverse hardware, workloads, and scale. Efficient execution requires heavily optimized...

GLM-4.5 Meets vLLM: Built for Intelligent Agents

Aug 19, 2025·4 min read

General Language Model (GLM) is a family of foundation models created by Zhipu.ai (now renamed to Z.ai). The GLM team has a long-term collaboration with the vLLM team, dating back to the early days of...

CUDA Core Dump: An Effective Tool to Debug Memory Access Issues and Beyond

Aug 11, 2025·16 min read

TL;DR: If you hit an "illegal memory access was encountered" error, you can enable CUDA core dump to debug the issue. Simply set the following environment variables and run your program again to...

vLLM Now Supports gpt-oss

Aug 5, 2025·5 min read

We're thrilled to announce that vLLM now supports gpt-oss on NVIDIA Blackwell and Hopper GPUs, as well as AMD MI300x and MI355x GPUs. In this blog post, we’ll explore the efficient model...

MiniMax-M1 Hybrid Architecture Meets vLLM: Long Context, Fast Inference

Jun 30, 2025·6 min read

This article explores how MiniMax-M1's hybrid architecture is efficiently supported in vLLM. We discuss the model's unique features, the challenges of efficient inference, and the technical...

Introducing vLLM Hardware Plugin, Best Practice from Ascend NPU

May 12, 2025·5 min read

Since December 2024, through the joint efforts of the vLLM community and the Ascend team on vLLM, we have completed the Hardware Pluggable RFC. This proposal allows hardware integration into vLLM...

Accelerating RLHF with vLLM, Best Practice from OpenRLHF

Apr 23, 2025·5 min read

As demand grows for training reasoning-capable large language models (LLMs), Reinforcement Learning from Human Feedback (RLHF) has emerged as a cornerstone technique. However, conventional RLHF...

Transformers modeling backend integration in vLLM

Apr 11, 2025·6 min read

The Hugging Face Transformers library offers a flexible, unified interface to a vast ecosystem of model architectures. From research to fine-tuning on custom datasets, Transformers is the go-to...

Llama 4 in vLLM

Apr 5, 2025·4 min read

We're excited to announce that vLLM now supports the Llama 4 herd of models: Scout (17B-16E) and Maverick (17B-128E). You can run these powerful long-context, natively multi-modal (up to 8-10...

PTPC-FP8: Boosting vLLM Performance on AMD ROCm

Feb 24, 2025·9 min read

TL;DR: vLLM on AMD ROCm now has better FP8 performance!

Introducing AIBrix: A Scalable, Cost-Effective Control Plane for vLLM

Feb 21, 2025·4 min read

Today, we are excited to announce vllm-project/aibrix: a batteries-included vLLM Kubernetes serving stack developed by ByteDance. Started in early 2024, AIBrix has been successfully deployed to...

Distributed Inference with vLLM

Feb 17, 2025·5 min read

Serving large models often leads to memory bottlenecks, such as the dreaded CUDA out of memory error. To tackle this, there are two main solutions:

Introducing vLLM Inference Provider in Llama Stack

Jan 27, 2025·8 min read

We are excited to announce that the vLLM inference provider is now available in Llama Stack through the collaboration between the Red Hat AI Engineering team and the Llama Stack team from Meta. This...

vLLM V1: A Major Upgrade to vLLM's Core Architecture

Jan 27, 2025·11 min read

We are thrilled to announce the alpha release of vLLM V1, a major upgrade to vLLM’s core architecture. Based on lessons we learned over the past 1.5 years of vLLM development, we revisited key...

High Performance and Easy Deployment of vLLM in K8S with vLLM production-stack

Jan 21, 2025·4 min read

- vLLM boasts the largest open-source community, but what does it take to transform vLLM from the best single-node LLM engine to a premier LLM serving system?
- Today, we release “vLLM...

Structured Decoding in vLLM: a gentle introduction

Jan 14, 2025·12 min read

- Structured decoding allows precise control over LLM output formats
- vLLM now supports both outlines and XGrammar backends for structured decoding
- Recent XGrammar integration brings up to 5x...

Installing and Developing vLLM with Ease

Jan 10, 2025·5 min read

The field of LLM inference is advancing at an unprecedented pace. With new models and features emerging weekly, the traditional software release pipeline often struggles to keep up. At vLLM, we...

vLLM 2024 Retrospective and 2025 Vision

Jan 10, 2025·11 min read

The vLLM community achieved remarkable growth in 2024, evolving from a specialized inference engine to become the de facto serving solution for the open-source AI ecosystem. This transformation is...

Serving LLMs on AMD MI300X: Best Practices

Oct 23, 2024·15 min read

TL;DR: vLLM unlocks incredible performance on the AMD MI300X, achieving 1.5x higher throughput and 1.7x faster time-to-first-token (TTFT) than Text Generation Inference (TGI) for Llama 3.1 405B....

How Speculative Decoding Boosts vLLM Performance by up to 2.8x

Oct 17, 2024·10 min read

Speculative decoding in vLLM is a powerful technique that accelerates token generation by leveraging both small and large models in tandem. In this blog, we’ll break down speculative decoding in...

vLLM v0.6.0: 2.7x Throughput Improvement and 5x Latency Reduction

Sep 5, 2024·12 min read

TL;DR: vLLM achieves 2.7x higher throughput and 5x faster TPOT (time per output token) on the Llama 8B model, and 1.8x higher throughput and 2x lower TPOT on the Llama 70B model.

vLLM’s Open Governance and Performance Roadmap

Jul 25, 2024·4 min read

We would like to share two updates to the vLLM community.

Announcing Llama 3.1 Support in vLLM

Jul 23, 2024·6 min read

Today, the vLLM team is excited to partner with Meta to announce support for the Llama 3.1 model series. Llama 3.1 comes with exciting new features with longer context length (up to 128K...

Notes on vLLM v.s. DeepSpeed-FastGen

Nov 14, 2023·4 min read

- vLLM matches DeepSpeed-FastGen's speed in common scenarios and surpasses it when handling longer outputs.
- DeepSpeed-FastGen only outperforms vLLM in scenarios with long prompts and short...

vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention

Jun 20, 2023·8 min read

LLMs promise to fundamentally change how we use AI across all industries. However, actually serving these models is challenging and can be surprisingly slow even on expensive hardware. Today we...