Blog

Deep dives into inference engineering, performance breakthroughs, new model support, and the latest from the vLLM community.

Featured

Inside vLLM: Anatomy of a High-Throughput LLM Inference System

Sep 5, 2025·41 min read

How vLLM's inference engine works, covering PagedAttention, continuous batching, prefix caching, speculative decoding, multi-GPU serving, scheduling, and benchmarking for high-throughput LLM workloads.

Beyond One Model: Fusion in vLLM Semantic Router

Jun 16, 2026·10 min read

How vLLM Semantic Router Fusion runs a panel of models, uses a judge to analyze agreement and gaps, and synthesizes one answer while preserving routing policy, traces, and OpenAI-compatible serving.

MiniMax M3 in vLLM: Day-0 Serving for 1M-Token Multimodal Reasoning

Jun 12, 2026·21 min read

How vLLM serves MiniMax M3 with MiniMax Sparse Attention, multimodal and reasoning parsers, MXFP8 weights, and long-context deployment recipes.

DiffusionGemma: The First Diffusion LLM (dLLM) Natively Supported in vLLM

Jun 10, 2026·6 min read

How vLLM supports DiffusionGemma, the first native diffusion language model in vLLM, using Model Runner V2 state hooks, iterative denoising, bidirectional attention, and reused speculative decoding paths.

Announcing vime: A Simple, Stable, and Efficient RL Framework for LLMs

Jun 9, 2026·6 min read

vime connects slime's training stack with vLLM rollouts to provide a simple, stable, and efficient RL post-training pipeline.

vLLM Semantic Router v0.3 Themis: From Signals to Stateful Production Routing

Jun 5, 2026·22 min read

What vLLM Semantic Router v0.3 Themis adds for production routing: canonical config, inspectable signal-decision-policy flows, safer operations, CLI/dashboard/Kubernetes alignment, and replayable routing behavior.

Announcing Day-0 Support for NVIDIA Nemotron 3 Ultra on vLLM

Jun 4, 2026·7 min read

How to serve NVIDIA Nemotron 3 Ultra with vLLM for long-running agentic reasoning, including BF16 and NVFP4 checkpoints, supported GPU configurations, OpenAI-compatible deployment, and NeMo RL integration.

Fast & Efficient LLM Inference with vLLM: A New Course with DeepLearning.AI

Jun 3, 2026·5 min read

What the DeepLearning.AI vLLM course teaches: optimizing, deploying, and benchmarking LLM inference with LLM Compressor quantization, GuideLLM, KV cache sizing, serving, and memory tradeoffs.

Session-Aware Agentic Routing: Continuity-Aware Model Selection for Long-Horizon LLM Agents

Jun 2, 2026·14 min read

How Session-Aware Agentic Routing in vLLM Semantic Router preserves long-horizon agent continuity with session memory, safe model-switch boundaries, prefix-cache-aware switch pricing, and replayable traces.

Accelerating vLLM-Omni Inference with AutoRound Quantization

Jun 2, 2026·10 min read

How AutoRound integrates with vLLM-Omni to serve W4A16 quantized multimodal, diffusion, image, and video models with smaller checkpoints, preserved quality, Intel XPU acceleration, and NVIDIA GPU support.

vLLM on the DGX Spark: Architecture, Configuration, and Local Evaluation

Jun 1, 2026·16 min read

How to run vLLM on NVIDIA DGX Spark and GB10 systems, including unified memory behavior, NVFP4 Nemotron-3-Super serving, Docker deployment, Prometheus metrics, and local evaluation results.

Accelerating Laguna XS.2 Inference with vLLM, Speculators, and LLM Compressor

May 28, 2026·3 min read

How Laguna XS.2 is served and optimized in vLLM using first-class model integration, a DFlash speculator trained with Speculators, and FP8, NVFP4, INT4, and INT8 checkpoints from LLM Compressor.

Native RL APIs in vLLM

May 28, 2026·12 min read

How vLLM native RL APIs standardize weight syncing and asynchronous RL serving with NCCL and CUDA IPC transfer backends, pause mode, and fixes for fragile DPEP and disaggregated rollout deployments.

Speculators v0.5.0: DFlash Support and Online Training

May 28, 2026·6 min read

What Speculators v0.5.0 adds for vLLM speculative decoding: DFlash block-diffusion draft models, unified online and offline training, native hidden-state extraction, and Gemma 4 latency results.

From Text to Multimodal Routing: Hardening Vision Signals in vLLM Semantic Router

May 28, 2026·12 min read

How vLLM Semantic Router hardens multimodal routing by turning visual evidence into trustworthy signals, debugging a Rust/Candle vision-encoder parity issue, and validating image signal correctness for production policy.

EAGLE 3.1: Advancing Speculative Decoding Through Collaboration Between the EAGLE Team, vLLM, and TorchSpec

May 26, 2026·4 min read

How EAGLE 3.1 improves speculative decoding robustness in vLLM with FC normalization, post-norm hidden-state feedback, TorchSpec training support, and config-driven compatibility with EAGLE 3 checkpoints.

vLLM x Novita AI: PegaFlow for Production-Grade External KV Cache

May 18, 2026·13 min read

How PegaFlow integrates with vLLM as an external KV cache service, using a Rust daemon, CUDA IPC, RDMA, SSD caching, and the external KV connector to improve startup, sharing, throughput, and cache lifecycle.

Elastic Expert Parallelism in vLLM

May 14, 2026·11 min read

How Elastic Expert Parallelism lets vLLM scale Mixture-of-Experts serving up or down at runtime by changing data-parallel workers, redistributing experts, and coordinating live topology changes without server restarts.

Announcing VeRL-Omni: Easy, Fast, and Stable RL Training for Diffusion and Omni-Modality Models

May 14, 2026·7 min read

How VeRL-Omni extends verl with vLLM-Omni for reinforcement learning post-training of diffusion and multimodal generative models, including efficient rollouts, reward inference, trainers, hardware support, and recipes.

A First Comprehensive Study of TurboQuant: Accuracy and Performance

May 11, 2026·12 min read

A vLLM study comparing TurboQuant KV-cache quantization with BF16 and FP8 across long-context and reasoning workloads, showing where 4-bit variants help, where accuracy drops, and why FP8 remains the default choice.

vLLM Tops the Artificial Analysis Leaderboard

May 11, 2026·15 min read

How vLLM achieved leading Artificial Analysis results for DeepSeek V3.2, MiniMax-M2.5, and Qwen 3.5 397B using open-source kernel fusion, speculative decoding, Blackwell optimizations, and model-specific serving work.

Serving Agentic Workloads at Scale with vLLM x Mooncake

May 6, 2026·10 min read

How vLLM integrates Mooncake Store as a distributed KV cache for agentic workloads, reusing shared prefixes across turns and instances to improve throughput, TTFT, end-to-end latency, and multi-GPU scaling.

Run Highly Efficient Multimodal Agentic AI with NVIDIA Nemotron 3 Nano Omni Using vLLM

Apr 28, 2026·7 min read

How to serve NVIDIA Nemotron 3 Nano Omni with vLLM for multimodal agentic AI, including BF16, FP8, and NVFP4 checkpoints, vision/audio/video inputs, supported GPUs, OpenAI-compatible APIs, and deployment recipes.

DeepSeek V4 in vLLM: Efficient Long-context Attention

Apr 24, 2026·17 min read

A first-principles walkthrough of DeepSeek V4's long-context attention, and how we implemented it in vLLM.

The State of FP8 KV-Cache and Attention Quantization in vLLM

Apr 22, 2026·21 min read

What vLLM FP8 KV-cache validation found across Hopper and Blackwell, covering attention quantization, Flash Attention 3 fixes, memory savings, decode speedups, and layers to skip.

Disaggregated Serving for Hybrid SSM Models in vLLM

Apr 21, 2026·15 min read

How vLLM extends NIXL prefill/decode disaggregation to hybrid SSM-attention models with dual descriptor views, physical-logical block bridging, and Mamba conv-state transfer support.

vLLM Korea Meetup 2026 Wrap-Up

Apr 14, 2026·7 min read

What the vLLM Korea Meetup 2026 covered: community growth, vLLM V1 updates, production stack adoption, accelerator integration, vllm-playground, and real-world LLM serving.

Next-Level Inference: Why Your Single-Node vLLM Setup Needs Prefill-Decode Disaggregation

Apr 7, 2026·22 min read

How single-node prefill/decode disaggregation in vLLM uses AMD MORI-IO on an 8-GPU MI300X node to transfer KV cache efficiently, stabilize ITL, and improve goodput.

Announcing Gemma 4 on vLLM: Byte for byte, the most capable open models

Apr 2, 2026·3 min read

How vLLM supports Google's Gemma 4 open models across NVIDIA, AMD, Intel, and TPU backends, with multimodal inputs, agentic workflows, long context, function calling, and deployment recipes.

Extracting hidden states from vLLM

Mar 30, 2026·8 min read

How vLLM extracts verifier hidden states through dummy draft models and KV Connector APIs for speculative decoding, enabling offline and online Speculators training without patching vLLM internals.

Model Runner V2: A Modular and Faster Core for vLLM

Mar 24, 2026·8 min read

How Model Runner V2 reworks vLLM's execution core with modular model logic, GPU-native input preparation, stable persistent batching, async-first scheduling, and no API changes.

P-EAGLE: Faster LLM inference with Parallel Speculative Decoding in vLLM

Mar 13, 2026·12 min read

How P-EAGLE brings parallel speculative decoding to vLLM by generating multiple draft tokens in one forward pass, with pre-trained drafter heads, config support, and B200 speedups over EAGLE-3.

Run Highly Efficient and Accurate Multi-Agent AI with NVIDIA Nemotron 3 Super Using vLLM

Mar 11, 2026·5 min read

How to serve NVIDIA Nemotron 3 Super with vLLM for multi-agent AI, including BF16, FP8, and NVFP4 checkpoints, 1M-token context, Thinking Budget, MTP, supported GPUs, and OpenAI-compatible deployment.

vLLM Semantic Router v0.2 Athena: ClawOS, Model Refresh, and the System Brain

Mar 10, 2026·23 min read

What vLLM Semantic Router v0.2 Athena adds: refreshed multilingual and multimodal routing models, ONNX and ROCm acceleration, safety and memory signals, long-context handling, and ClawOS orchestration.

vLLM Triton Attention Backend Deep Dive

Mar 4, 2026·10 min read

A technical walkthrough of the vLLM Triton attention backend, covering performance-portable paged attention kernels, backend selection, autotuning, CUDA graph behavior, benchmarks, and NVIDIA, AMD, and Intel support.

Beyond Porting: How vLLM Orchestrates High-Performance Inference on AMD ROCm

Feb 27, 2026·19 min read

How vLLM orchestrates high-performance inference on AMD ROCm with multiple attention backends, workload-aware prefill, extend, and decode routing, AITER primitives, MLA support, and MI300X-class benchmarks.

Efficiently serve dozens of fine-tuned models with vLLM on Amazon SageMaker AI and Amazon Bedrock

Feb 26, 2026·11 min read

How vLLM serves many fine-tuned MoE and dense models with Multi-LoRA, including fused MoE LoRA kernels, Triton compiler fixes, Split-K and CTA swizzling optimizations, and SageMaker AI and Bedrock tuning.

DeepSeek-V3.2 on GB300: Performance Breakthrough

Feb 13, 2026·12 min read

What DeepSeek-V3.2 and DeepSeek-R1 benchmark results show on NVIDIA GB300 with vLLM, covering NVFP4 quantization, TP and EP deployment, throughput, and reproducible setup details.

Driving vLLM WideEP and Large-Scale Serving Toward Maturity on Blackwell (Part I)

Feb 3, 2026·10 min read

How vLLM improves WideEP and large-scale DeepSeek-style MoE serving on NVIDIA GB200 with NVFP4 and FP8 kernels, fusion, prefill/decode disaggregation, weight offloading, and reduced chunking overhead.

GPT-OSS Performance Optimizations on NVIDIA Blackwell: Pushing the Pareto Frontier

Feb 1, 2026·8 min read

How vLLM and NVIDIA optimized GPT-OSS on Blackwell with FlashInfer, torch.compile fusion, FP8 KV cache, async scheduling, stream interval tuning, and deployment recipes that improve throughput and interactivity.

Streaming Requests & Realtime API in vLLM

Jan 31, 2026·19 min read

How vLLM supports streamable inputs and a Realtime WebSocket API for audio, video, robotics, and low-latency applications that need incremental input processing instead of complete prompts.

Building Mixture-of-Models on AMD GPUs with vLLM-SR

Jan 23, 2026·10 min read

How vLLM Semantic Router builds a Mixture-of-Models system on AMD MI300X and MI355X GPUs, routing across specialized models with signals, decisions, safety checks, semantic caching, and live MoM deployment.

Inside vLLM’s New KV Offloading Connector: Smarter Memory Transfer for Maximizing Inference Throughput

Jan 8, 2026·15 min read

How vLLM's asynchronous KV offloading connector stores KV cache in CPU memory to reduce recomputation, improve throughput under memory pressure, and support pluggable offload backends.

vLLM Semantic Router v0.1 Iris: The First Major Release

Jan 5, 2026·9 min read

What vLLM Semantic Router v0.1 Iris introduces: signal-decision plugin architecture, model selection, safety filtering, semantic caching, hallucination detection, LoRA-based routing models, and production-ready MoM routing.

Introducing vLLM Playground: A Modern Web Interface for Managing and Interacting with vLLM Servers

Jan 2, 2026·6 min read

How vLLM Playground provides a web UI for starting, configuring, testing, and monitoring vLLM servers across local macOS, Linux GPU or CPU, Kubernetes, and OpenShift environments.

Announcing vllm.ai Website and Some Community Updates

Dec 27, 2025·3 min read

What changed on the new vllm.ai website for vLLM users: installation guidance, events pages, Slack and X community channels, vLLM Daily updates, and a clearer project/community split.

vLLM-Omni Diffusion Cache Acceleration

Dec 19, 2025·4 min read

How vLLM-Omni speeds up diffusion model inference with Cache-DiT and TeaCache, reusing intermediate computations across timesteps to deliver 1.5x to 2x image generation speedups with minimal quality loss.

vLLM Large Scale Serving: DeepSeek @ 2.2k tok/s/H200 with Wide-EP

Dec 17, 2025·8 min read

How vLLM reaches 2.2k tokens per second per H200 for DeepSeek-style MoE serving with Wide-EP, async scheduling, dual-batch overlap, disaggregated serving, CUDA graphs, DeepGEMM, and expert load balancing.

AMD × vLLM Semantic Router: Building the System Intelligence Together

Dec 16, 2025·11 min read

How AMD and vLLM Semantic Router build GPU-accelerated Mixture-of-Models routing with signals, semantic caching, response storage, PII, jailbreak, and hallucination guardrails.

Run Highly Efficient and Accurate AI Agents with NVIDIA Nemotron 3 Nano on vLLM

Dec 15, 2025·5 min read

How to serve NVIDIA Nemotron 3 Nano with vLLM for efficient agentic AI, including BF16, FP8, and NVFP4 checkpoints, 1M-token context, hybrid MoE architecture, Thinking Budget, supported GPUs, and OpenAI-compatible deployment.

Encoder Disaggregation for Scalable Multimodal Model Serving

Dec 15, 2025·9 min read

How vLLM EPD separates visual encoders from text prefill and decode, covering LMM serving, GPU resource scaling, multimodal interference, and pipelined execution.

Token-Level Truth: Real-Time Hallucination Detection for Production LLMs

Dec 14, 2025·13 min read

How HaluGate adds token-level hallucination detection to vLLM Semantic Router by verifying assistant claims against tool outputs and grounding context in real time without LLM-as-judge overhead.

Diving into speculative decoding training support for vLLM with Speculators v0.3.0

Dec 13, 2025·11 min read

How Speculators v0.3.0 supports end-to-end Eagle3 draft model training for vLLM, including hidden-state data generation, MoE and non-MoE verifiers, offline workflows, and seamless speculative decoding serving.

vLLM Router: A High-Performance and Prefill/Decode Aware Load Balancer for Large-scale Serving

Dec 13, 2025·5 min read

What vLLM Router provides for large-scale serving: Rust-based state-aware load balancing, KV-cache affinity, prefill/decode disaggregation orchestration, Kubernetes discovery, retries, circuit breakers, and Prometheus metrics.

Advancing Low-Bit Quantization for LLMs: AutoRound x LLM Compressor

Dec 9, 2025·5 min read

How Intel AutoRound integrates with LLM Compressor to produce low-bit quantized checkpoints for vLLM, using tuning-based PTQ, W4A16 and related formats, compressed-tensors compatibility, and lightweight calibration.

Tracing Hanging and Complicated GPU Kernels Down To The Source Code

Dec 3, 2025·13 min read

How vLLM developers debug hanging and complex CUDA kernels by triggering GPU core dumps, identifying stuck kernels, and mapping failures back to source code lines for faster kernel debugging.

Announcing vLLM-Omni: Easy, Fast, and Cheap Omni-Modality Model Serving

Nov 30, 2025·4 min read

What vLLM-Omni adds to the vLLM ecosystem: omni-modality serving for text, image, video, and audio, diffusion and non-autoregressive generation support, disaggregated stages, OpenAI-compatible APIs, and pipelined execution.

Streamlined multi-node serving with Ray symmetric-run

Nov 22, 2025·4 min read

How Ray symmetric-run simplifies multi-node vLLM serving by launching the same entrypoint on every Ray cluster node, matching HPC and parallel SSH workflows for distributed model deployments.

Building Clean, Maintainable vLLM Modifications Using the Plugin System

Nov 20, 2025·12 min read

How the vLLM plugin system helps teams customize scheduling, KV-cache behavior, hardware integrations, and model execution without long-lived forks, monkey patches, or brittle internal modifications.

Docker Model Runner Integrates vLLM for High-Throughput Inferencing

Nov 19, 2025·6 min read

How Docker Model Runner integrates vLLM as an inference backend, letting developers run safetensors models with high-throughput serving, PagedAttention, streaming, and OpenAI-compatible APIs from Docker workflows.

Signal-Decision Driven Architecture: Reshaping Semantic Routing at Scale

Nov 19, 2025·14 min read

How vLLM Semantic Router replaces fixed domain classification with signal-decision architecture, combining multi-dimensional signals, AND/OR decision logic, model selection, and plugin orchestration for production routing.

Shared Memory IPC Caching: Accelerating Data Transfer in LLM Inference Systems

Nov 13, 2025·7 min read

How shared memory IPC caching in vLLM reduces redundant data transfers for multimodal and multi-process inference, improving prefill throughput and TTFT by sharing large inputs across coordinator and worker processes.

Fast and Affordable LLMs serving on Intel Arc Pro B-Series GPUs with vLLM

Nov 11, 2025·9 min read

How vLLM serves LLMs on Intel Arc Pro B-Series GPUs with MoE optimizations, persistent kernels, multi-GPU scaling, LoRA, speculative decoding, structured outputs, and mixed-precision recipes.

No More Train-Inference Mismatch: Bitwise Consistent On-Policy Reinforcement Learning with vLLM and TorchTitan

Nov 10, 2025·6 min read

How vLLM and TorchTitan demonstrate bitwise consistent on-policy RL by matching training and inference numerics, using batch-invariant kernels to reduce train-inference mismatch and stabilize reinforcement learning.

Run Multimodal Reasoning Agents with NVIDIA Nemotron on vLLM

Oct 31, 2025·4 min read

How to serve NVIDIA Nemotron Nano 2 VL with vLLM for multimodal reasoning agents, including video understanding, document intelligence, Efficient Video Sampling, 128K context, and OpenAI-compatible deployment.

Chasing 100% Accuracy: A Deep Dive into Debugging Kimi K2's Tool-Calling on vLLM

Oct 28, 2025·11 min read

How vLLM debugged Kimi K2 tool-calling accuracy, covering chat-template compatibility, add_generation_prompt handling, schema validation failures, benchmark fixes, and tool-use reliability.

From Monolithic to Modular: Scaling Semantic Routing with Extensible LoRA

Oct 27, 2025·8 min read

How vLLM Semantic Router refactors its Rust classification layer with modular model support, Qwen3-Embedding, EmbeddingGemma, LoRA-based multi-task classification, and concurrent routing execution.

Zero-Reload Model Switching with vLLM Sleep Mode

Oct 26, 2025·17 min read

How vLLM Sleep Mode enables fast model switching by hibernating weights to CPU RAM or discarding them while preserving process state, CUDA graphs, allocators, and kernel warmup to avoid full reloads.

Now Serving NVIDIA Nemotron with vLLM

Oct 23, 2025·5 min read

How vLLM serves NVIDIA Nemotron Nano 2 for agentic reasoning, including hybrid Transformer-Mamba architecture, thinking budget control, open weights and data, throughput benefits, and deployment commands.

No More Retokenization Drift: Returning Token IDs via the OpenAI Compatible API Matters in Agent RL

Oct 22, 2025·9 min read

How vLLM's OpenAI-compatible API can return prompt and response token IDs to prevent retokenization drift in agent reinforcement learning, preserving exact sampled sequences for stable on-policy updates.

vLLM TPU: A New Unified Backend Supporting PyTorch and JAX on TPU

Oct 16, 2025·11 min read

How the redesigned vLLM TPU backend uses tpu-inference, JAX-to-XLA lowering, Torchax, ragged paged attention, and unified PyTorch and JAX support to improve TPU performance and model coverage.

SemiAnalysis InferenceMAX: vLLM and NVIDIA Accelerate Blackwell Inference

Oct 9, 2025·8 min read

How vLLM and NVIDIA optimize Blackwell inference for SemiAnalysis InferenceMAX, improving gpt-oss 120B and Llama 3.3 70B throughput with FP4 kernels, scheduling work, and Pareto-frontier benchmarking.

DeepSeek-V3.2-Exp in vLLM: Fine-Grained Sparse Attention in Action

Sep 29, 2025·8 min read

How vLLM supports DeepSeek-V3.2-Exp with DeepSeek Sparse Attention, lightning indexer caches, separate prefill and decode paths, FlashMLA sparse attention, DeepGEMM kernels, and Blackwell deployment.

The First vLLM Meetup in Korea

Sep 16, 2025·5 min read

What the first vLLM Korea meetup covered: community adoption, llm-d, TPU integration, contribution workflows, hardware plugin architecture, Rebellions NPU work, and production inference lessons.

vLLM Now Supports Qwen3-Next: Hybrid Architecture with Extreme Efficiency

Sep 11, 2025·3 min read

How vLLM supports Qwen3-Next with hybrid attention, Gated DeltaNet, full attention, high-sparsity MoE, multi-token prediction, hybrid KV cache management, Triton kernels, and CUDA graphs.

vLLM Semantic Router: Next Phase in LLM inference

Sep 11, 2025·4 min read

How vLLM Semantic Router routes requests by intent, covering semantic classification, smart reasoning-path selection, Rust and Candle execution, and Kubernetes Envoy integration for efficient inference.

Serving Geospatial, Vision, and Beyond: Enabling Multimodal Output Processing in vLLM

Sep 5, 2025·10 min read

How vLLM expands beyond text generation to serve geospatial, vision, and other non-autoregressive models with pooling-model support, TerraTorch integration, raw tensor handling, and flexible IO processors.

Introduction to torch.compile and How It Works with vLLM

Aug 20, 2025·14 min read

How torch.compile works inside vLLM, including TorchDynamo graph capture, TorchInductor code generation, custom compiler passes, graph breaks, model compilation strategy, and performance optimization.

GLM-4.5 Meets vLLM: Built for Intelligent Agents

Aug 19, 2025·4 min read

How to run GLM-4.5 and GLM-4.5V with vLLM for intelligent agents, including hybrid reasoning modes, FP8 and BF16 serving, multimodal support, and NVIDIA Blackwell and Hopper deployment.

CUDA Core Dump: An Effective Tool to Debug Memory Access Issues and Beyond

Aug 11, 2025·16 min read

How to debug vLLM CUDA illegal memory access errors with CUDA core dumps, environment variables, cuda-gdb, and GPU state inspection when Python stack traces or CUDA_LAUNCH_BLOCKING are insufficient.

vLLM Now Supports gpt-oss

Aug 5, 2025·5 min read

How vLLM supports gpt-oss 20B and 120B on NVIDIA Blackwell, Hopper, and AMD GPUs, with MXFP4 MoE kernels, efficient attention, hybrid KV cache allocation, and built-in tool support.

MiniMax-M1 Hybrid Architecture Meets vLLM: Long Context, Fast Inference

Jun 30, 2025·6 min read

How vLLM serves MiniMax-M1's hybrid MoE architecture for long-context inference, covering model deployment, memory management, batched serving, backend optimizations, and Docker-based setup.

Introducing vLLM Hardware Plugin, Best Practice from Ascend NPU

May 12, 2025·5 min read

How vLLM hardware plugins decouple backend integrations from core vLLM, using Platform, Executor, Worker, ModelRunner, AttentionBackend, and Communicator hooks to support Ascend NPU and IBM Spyre.

Accelerating RLHF with vLLM, Best Practice from OpenRLHF

Apr 23, 2025·5 min read

How OpenRLHF uses vLLM, Ray, ZeRO-3, AutoTP, Ray placement groups, and weight synchronization to accelerate PPO and RLHF sample generation for reasoning models with long chain-of-thought outputs.

Transformers modeling backend integration in vLLM

Apr 11, 2025·6 min read

How vLLM integrates the Hugging Face Transformers modeling backend to serve more model architectures efficiently, including text and vision-language models through model_impl="transformers".

Llama 4 in vLLM

Apr 5, 2025·4 min read

How vLLM serves Meta Llama 4 Scout and Maverick multimodal MoE models with long-context support, tensor parallel deployment, H100 and H200 guidance, FP8 variants, and performance tips.

PTPC-FP8: Boosting vLLM Performance on AMD ROCm

Feb 24, 2025·9 min read

How PTPC-FP8 quantization improves vLLM performance on AMD ROCm by combining per-token activation scaling and per-channel weight scaling for near-BF16 accuracy with FP8 speed.

Introducing AIBrix: A Scalable, Cost-Effective Control Plane for vLLM

Feb 21, 2025·4 min read

What AIBrix adds as a Kubernetes control plane for vLLM: LoRA management, LLM gateway routing, autoscaling, unified runtime, distributed inference, distributed KV cache, heterogeneous serving, and GPU failure detection.

Distributed Inference with vLLM

Feb 17, 2025·5 min read

A guide to distributed inference in vLLM, covering tensor parallelism, pipeline parallelism, multi-GPU and multi-node serving, KV cache challenges, speculative decoding, communication kernels, and control-plane design.

Introducing vLLM Inference Provider in Llama Stack

Jan 27, 2025·8 min read

How Llama Stack integrates vLLM as an inference provider through remote and inline providers, enabling OpenAI-compatible vLLM serving for local and Kubernetes generative AI application deployments.

vLLM V1: A Major Upgrade to vLLM's Core Architecture

Jan 27, 2025·11 min read

What changed in vLLM V1: a re-architected engine with a simpler scheduler, near-zero-overhead prefix caching, cleaner tensor parallelism, multiprocessing API server, and default optimizations for higher-throughput serving.

High Performance and Easy Deployment of vLLM in K8S with vLLM production-stack

Jan 21, 2025·4 min read

What vLLM production-stack adds for Kubernetes serving: prefix-aware routing, LMCache-backed KV cache sharing, autoscaling, observability, fault tolerance, and cluster deployment with higher throughput and lower latency.

Structured Decoding in vLLM: a gentle introduction

Jan 14, 2025·12 min read

How structured decoding works in vLLM, covering JSON outputs, grammar-guided generation, outlines, XGrammar, TPOT improvements, constrained decoding, and agentic workflow use cases.

Installing and Developing vLLM with Ease

Jan 10, 2025·5 min read

How to install and develop vLLM using stable releases, nightly wheels, uv, source builds, Python and C++/CUDA workflows, and version tracking for production deployments.

vLLM 2024 Retrospective and 2025 Vision

Jan 10, 2025·11 min read

A vLLM 2024 retrospective and 2025 roadmap covering community growth, model and hardware support, production adoption, office hours, ecosystem partnerships, and the path toward universal open-source serving.

Serving LLMs on AMD MI300X: Best Practices

Oct 23, 2024·15 min read

Best practices for serving LLMs with vLLM on AMD MI300X, covering ROCm setup, Llama 3.1 70B and 405B benchmarks, chunked prefill, multi-step scheduling, prefix caching, graph capture, and AMD tuning.

How Speculative Decoding Boosts vLLM Performance by up to 2.8x

Oct 17, 2024·10 min read

How speculative decoding works in vLLM, covering EAGLE, Medusa, n-gram proposals, draft and target runners, scheduler and memory-manager changes, and continuous batching for lower token latency.

vLLM v0.6.0: 2.7x Throughput Improvement and 5x Latency Reduction

Sep 5, 2024·12 min read

What changed in vLLM v0.6.0 to improve throughput and latency, including API server isolation, reduced CPU overhead, multi-step scheduling, async execution, and benchmarks against earlier vLLM versions.

vLLM’s Open Governance and Performance Roadmap

Jul 25, 2024·4 min read

vLLM's open governance and performance roadmap, covering LF AI & Data incubation, public benchmarks, optimized kernels, async scheduling, API frontend overhead, torch.compile, disaggregated prefill, and community research.

Announcing Llama 3.1 Support in vLLM

Jul 23, 2024·6 min read

How vLLM supports Meta Llama 3.1 models, including 128K context, Llama 3.1 405B serving, chunked prefill, FP8 quantization, tensor and pipeline parallelism, CPU offloading, and early performance results.

Notes on vLLM v.s. DeepSpeed-FastGen

Nov 14, 2023·4 min read

A performance comparison of vLLM and DeepSpeed-FastGen, explaining when Dynamic SplitFuse helps, where vLLM is faster, and how memory allocation, output length, and workload shape affect throughput.

vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention

Jun 20, 2023·8 min read

What the original vLLM launch announced: PagedAttention for KV cache management, up to 24x throughput over Hugging Face Transformers, and lower-cost high-throughput LLM serving.