All Posts
82 articles in reverse chronological order.
2026 (26)
- A First Comprehensive Study of TurboQuant: Accuracy and Performance
Eldar Kurtić, Michael Goin, Alexandre Marques (Red Hat AI)
- vLLM Tops the Artificial Analysis Leaderboard
vLLM Team
- Serving Agentic Workloads at Scale with vLLM x Mooncake
Yifan Qiao, Trong Dao Le, Ao Shen, Zhewen Li, Bowen Wang
- Run Highly Efficient Multimodal Agentic AI with NVIDIA Nemotron 3 Nano Omni Using vLLM
NVIDIA Nemotron Team
- DeepSeek V4 in vLLM: Efficient Long-context Attention
vLLM Team
- The State of FP8 KV-Cache and Attention Quantization in vLLM
Jonas Kübler* (AWS), Eldar Kurtić* (Red Hat AI), Lucas Wilkinson (Red Hat AI), Matthew Bonanni (Red Hat AI), Michael Goin (Red Hat AI), Alexandre Marques (Red Hat AI), Kailash Budhathoki (AWS) (* Equal Contribution)
- Disaggregated Serving for Hybrid SSM Models in vLLM
Nicolò Lucchesi, Zhanqiu Hu (Red Hat), and the vLLM team
- vLLM Korea Meetup 2026 Wrap-Up
vLLM Team
- Next-Level Inference: Why Your Single-Node vLLM Setup Needs Prefill-Decode Disaggregation
AMD and Embedded LLM
- Announcing Gemma 4 on vLLM: Byte for byte, the most capable open models
Google Team
- Extracting hidden states from vLLM
Fynn Schmitt-Ulms
- Model Runner V2: A Modular and Faster Core for vLLM
vLLM Team
- P-EAGLE: Faster LLM inference with Parallel Speculative Decoding in vLLM
Amazon and NVIDIA Team
- Run Highly Efficient and Accurate Multi-Agent AI with NVIDIA Nemotron 3 Super Using vLLM
NVIDIA Nemotron Team
- vLLM Semantic Router v0.2 Athena: ClawOS, Model Refresh, and the System Brain
vLLM Semantic Router Team
- vLLM Triton Attention Backend Deep Dive
vLLM Team at IBM Research
- Beyond Porting: How vLLM Orchestrates High-Performance Inference on AMD ROCm
AMD and Embedded LLM
- Efficiently serve dozens of fine-tuned models with vLLM on Amazon SageMaker AI and Amazon Bedrock
Danielle Maddix Robinson, Florian Saupe, George Novack, Haipeng Li, Mani Kumar Adari, Xiang Song, Yu Gong (AWS AI Team)
- DeepSeek-V3.2 on GB300: Performance Breakthrough
The DaoCloud and vLLM Team
- Driving vLLM WideEP and Large-Scale Serving Toward Maturity on Blackwell (Part I)
Meta and NVIDIA Team
- GPT-OSS Performance Optimizations on NVIDIA Blackwell: Pushing the Pareto Frontier
The vLLM and NVIDIA Team
- Streaming Requests & Realtime API in vLLM
Meta, Mistral AI, and the vLLM Team
- Building Mixture-of-Models on AMD GPUs with vLLM-SR
The AMD and vLLM Semantic Router Team
- Inside vLLM’s New KV Offloading Connector: Smarter Memory Transfer for Maximizing Inference Throughput
Or Ozeri, Danny Harnik (vLLM Team at IBM Research)
- vLLM Semantic Router v0.1 Iris: The First Major Release
vLLM Semantic Router Team
- Introducing vLLM Playground: A Modern Web Interface for Managing and Interacting with vLLM Servers
micytao
2025 (51)
- Announcing vllm.ai Website and Some Community Updates
vLLM Team
- vLLM-Omni Diffusion Cache Acceleration
vLLM-Omni Team
- vLLM Large Scale Serving: DeepSeek @ 2.2k tok/s/H200 with Wide-EP
vLLM Team
- AMD × vLLM Semantic Router: Building the System Intelligence Together
The AMD and vLLM Semantic Router Team
- Run Highly Efficient and Accurate AI Agents with NVIDIA Nemotron 3 Nano on vLLM
NVIDIA Nemotron Team
- Encoder Disaggregation for Scalable Multimodal Model Serving
Multimodality Workstream @ vLLM
- Token-Level Truth: Real-Time Hallucination Detection for Production LLMs
vLLM Semantic Router Team
- Diving into speculative decoding training support for vLLM with Speculators v0.3.0
Fynn Schmitt-Ulms, Helen Zhao, Rahul Tuli and Dipika Sikka (Red Hat AI Model Optimization Team)
- vLLM Router: A High-Performance and Prefill/Decode Aware Load Balancer for Large-scale Serving
vLLM Team
- Advancing Low-Bit Quantization for LLMs: AutoRound x LLM Compressor
Intel Neural Compressor Team, Red Hat AI Model Optimization Team
- Tracing Hanging and Complicated GPU Kernels Down To The Source Code
Kaichao You (vLLM)
- Announcing vLLM-Omni: Easy, Fast, and Cheap Omni-Modality Model Serving
vLLM-Omni Team
- Streamlined multi-node serving with Ray symmetric-run
Richard Liaw (Anyscale/Ray), Kaichao You (vLLM)
- Building Clean, Maintainable vLLM Modifications Using the Plugin System
Dhruvil Bhatt (AWS SageMaker)
- Docker Model Runner Integrates vLLM for High-Throughput Inferencing
Docker Team
- Signal-Decision Driven Architecture: Reshaping Semantic Routing at Scale
vLLM Semantic Router Team
- Shared Memory IPC Caching: Accelerating Data Transfer in LLM Inference Systems
Donglu Wang (Cohere)
- Fast and Affordable LLMs serving on Intel Arc Pro B-Series GPUs with vLLM
Intel vLLM Team
- No More Train-Inference Mismatch: Bitwise Consistent On-Policy Reinforcement Learning with vLLM and TorchTitan
vLLM and TorchTitan Teams
- Run Multimodal Reasoning Agents with NVIDIA Nemotron on vLLM
NVIDIA Nemotron Team
- Chasing 100% Accuracy: A Deep Dive into Debugging Kimi K2's Tool-Calling on vLLM
Linian Wang (Peking University)
- From Monolithic to Modular: Scaling Semantic Routing with Extensible LoRA
Ivar Flakstad (Hugging Face), OneZero-Y, Huamin Chen (Red Hat), Xunzhuo Liu (Tencent)
- Zero-Reload Model Switching with vLLM Sleep Mode
Embedded LLM
- Now Serving NVIDIA Nemotron with vLLM
NVIDIA Nemotron Team
- No More Retokenization Drift: Returning Token IDs via the OpenAI Compatible API Matters in Agent RL
The Agent Lightning (AGL) Team
- vLLM TPU: A New Unified Backend Supporting PyTorch and JAX on TPU
Google Team
- SemiAnalysis InferenceMAX: vLLM and NVIDIA Accelerate Blackwell Inference
vLLM Team
- DeepSeek-V3.2-Exp in vLLM: Fine-Grained Sparse Attention in Action
vLLM Team
- The First vLLM Meetup in Korea
vLLM Team
- vLLM Now Supports Qwen3-Next: Hybrid Architecture with Extreme Efficiency
The vLLM Team
- vLLM Semantic Router: Next Phase in LLM inference
vLLM Semantic Router Team
- Inside vLLM: Anatomy of a High-Throughput LLM Inference System
Aleksa Gordic
- Serving Geospatial, Vision, and Beyond: Enabling Multimodal Output Processing in vLLM
Christian Pinto (IBM Research Europe - Dublin), Michele Gazzetti (IBM Research Europe - Dublin), Michael Johnston (IBM Research Europe - Dublin), Maximilien Philippe Marie de Bayser (IBM Research - Brazil)
- Introduction to torch.compile and How It Works with vLLM
Luka Govedič (Red Hat), Richard Zou (Meta), Addie Stevens (Red Hat), Kaichao You (Tsinghua University), Michael Goin (Red Hat), Saša Zelenović (Red Hat)
- GLM-4.5 Meets vLLM: Built for Intelligent Agents
Yuxuan Zhang
- CUDA Core Dump: An Effective Tool to Debug Memory Access Issues and Beyond
Kaichao You
- vLLM Now Supports gpt-oss
The vLLM Team
- MiniMax-M1 Hybrid Architecture Meets vLLM: Long Context, Fast Inference
MiniMax
- Introducing vLLM Hardware Plugin, Best Practice from Ascend NPU
The Ascend Team on vLLM
- Accelerating RLHF with vLLM, Best Practice from OpenRLHF
The OpenRLHF Team
- Transformers modeling backend integration in vLLM
The Hugging Face Team
- Llama 4 in vLLM
The vLLM Team
- PTPC-FP8: Boosting vLLM Performance on AMD ROCm
AMD and Embedded LLM
- Introducing AIBrix: A Scalable, Cost-Effective Control Plane for vLLM
AIBrix Team
- Distributed Inference with vLLM
vLLM Team
- Introducing vLLM Inference Provider in Llama Stack
Yuan Tang (Red Hat) and Ashwin Bharambe (Meta)
- vLLM V1: A Major Upgrade to vLLM's Core Architecture
vLLM Team
- High Performance and Easy Deployment of vLLM in K8S with vLLM production-stack
LMCache Team
- Structured Decoding in vLLM: a gentle introduction
Guest Post by BentoML and Red Hat
- Installing and Developing vLLM with Ease
vLLM Team
- vLLM 2024 Retrospective and 2025 Vision
vLLM Team
2024 (5)
- Serving LLMs on AMD MI300X: Best Practices
Guest Post by Embedded LLM and Hot Aisle Inc.
- How Speculative Decoding Boosts vLLM Performance by up to 2.8x
vLLM Team
- vLLM v0.6.0: 2.7x Throughput Improvement and 5x Latency Reduction
vLLM Team
- vLLM’s Open Governance and Performance Roadmap
vLLM Team
- Announcing Llama 3.1 Support in vLLM
vLLM Team