All Posts
82 articles in reverse chronological order.
2026 (26)
- A First Comprehensive Study of TurboQuant: Accuracy and Performance
Eldar Kurtić, Michael Goin, Alexandre Marques (Red Hat AI)
- vLLM Tops the Artificial Analysis Leaderboard
vLLM Team
- Serving Agentic Workloads at Scale with vLLM x Mooncake
Yifan Qiao, Trong Dao Le, Ao Shen, Zhewen Li, Bowen Wang
- Run Highly Efficient Multimodal Agentic AI with NVIDIA Nemotron 3 Nano Omni Using vLLM
NVIDIA Nemotron Team
- DeepSeek V4 in vLLM: Efficient Long-context Attention
vLLM Team
- The State of FP8 KV-Cache and Attention Quantization in vLLM
Jonas Kübler* (AWS), Eldar Kurtić* (Red Hat AI), Lucas Wilkinson (Red Hat AI), Matthew Bonanni (Red Hat AI), Michael Goin (Red Hat AI), Alexandre Marques (Red Hat AI), Kailash Budhathoki (AWS) (* Equal Contribution)
- Disaggregated Serving for Hybrid SSM Models in vLLM
Nicolò Lucchesi, Zhanqiu Hu (Red Hat), and the vLLM team
- vLLM Korea Meetup 2026 Wrap-Up
vLLM Team
- Next-Level Inference: Why Your Single-Node vLLM Setup Needs Prefill-Decode Disaggregation
AMD and Embedded LLM
- Announcing Gemma 4 on vLLM: Byte for byte, the most capable open models
Google Team
- Extracting hidden states from vLLM
Fynn Schmitt-Ulms
- Model Runner V2: A Modular and Faster Core for vLLM
vLLM Team
- P-EAGLE: Faster LLM inference with Parallel Speculative Decoding in vLLM
Amazon and NVIDIA Team
- Run Highly Efficient and Accurate Multi-Agent AI with NVIDIA Nemotron 3 Super Using vLLM
NVIDIA Nemotron Team
- vLLM Semantic Router v0.2 Athena: ClawOS, Model Refresh, and the System Brain
vLLM Semantic Router Team
- vLLM Triton Attention Backend Deep Dive
vLLM Team at IBM Research
- Beyond Porting: How vLLM Orchestrates High-Performance Inference on AMD ROCm
AMD and Embedded LLM
- Efficiently serve dozens of fine-tuned models with vLLM on Amazon SageMaker AI and Amazon Bedrock
Danielle Maddix Robinson, Florian Saupe, George Novack, Haipeng Li, Mani Kumar Adari, Xiang Song, Yu Gong (AWS AI Team)
- DeepSeek-V3.2 on GB300: Performance Breakthrough
The DaoCloud and vLLM Team
- Driving vLLM WideEP and Large-Scale Serving Toward Maturity on Blackwell (Part I)
Meta and NVIDIA Team
- GPT-OSS Performance Optimizations on NVIDIA Blackwell: Pushing the Pareto Frontier
The vLLM and NVIDIA Team
- Streaming Requests & Realtime API in vLLM
Meta, Mistral AI, and the vLLM Team
- Building Mixture-of-Models on AMD GPUs with vLLM-SR
The AMD and vLLM Semantic Router Team
- Inside vLLM’s New KV Offloading Connector: Smarter Memory Transfer for Maximizing Inference Throughput
Or Ozeri, Danny Harnik (vLLM Team at IBM Research)
- vLLM Semantic Router v0.1 Iris: The First Major Release
vLLM Semantic Router Team
- Introducing vLLM Playground: A Modern Web Interface for Managing and Interacting with vLLM Servers
micytao
2025 (51)
- Announcing vllm.ai Website and Some Community Updates
vLLM Team
- vLLM-Omni Diffusion Cache Acceleration
vLLM-Omni Team
- vLLM Large Scale Serving: DeepSeek @ 2.2k tok/s/H200 with Wide-EP
vLLM Team
- AMD × vLLM Semantic Router: Building the System Intelligence Together
The AMD and vLLM Semantic Router Team
- Run Highly Efficient and Accurate AI Agents with NVIDIA Nemotron 3 Nano on vLLM
NVIDIA Nemotron Team
- Encoder Disaggregation for Scalable Multimodal Model Serving
Multimodality Workstream @ vLLM
- Token-Level Truth: Real-Time Hallucination Detection for Production LLMs
vLLM Semantic Router Team
- Diving into speculative decoding training support for vLLM with Speculators v0.3.0
Fynn Schmitt-Ulms, Helen Zhao, Rahul Tuli and Dipika Sikka (Red Hat AI Model Optimization Team)
- vLLM Router: A High-Performance and Prefill/Decode Aware Load Balancer for Large-scale Serving
vLLM Team
- Advancing Low-Bit Quantization for LLMs: AutoRound x LLM Compressor
Intel Neural Compressor Team, Red Hat AI Model Optimization Team
- Tracing Hanging and Complicated GPU Kernels Down To The Source Code
Kaichao You (vLLM)
- Announcing vLLM-Omni: Easy, Fast, and Cheap Omni-Modality Model Serving
vLLM-Omni Team
- Streamlined multi-node serving with Ray symmetric-run
Richard Liaw (Anyscale/Ray), Kaichao You (vLLM)
- Building Clean, Maintainable vLLM Modifications Using the Plugin System
Dhruvil Bhatt (AWS SageMaker)
- Docker Model Runner Integrates vLLM for High-Throughput Inferencing
Docker Team
- Signal-Decision Driven Architecture: Reshaping Semantic Routing at Scale
vLLM Semantic Router Team
- Shared Memory IPC Caching: Accelerating Data Transfer in LLM Inference Systems
Donglu Wang (Cohere)
- Fast and Affordable LLMs serving on Intel Arc Pro B-Series GPUs with vLLM
Intel vLLM Team
- No More Train-Inference Mismatch: Bitwise Consistent On-Policy Reinforcement Learning with vLLM and TorchTitan
vLLM and TorchTitan Teams
- Run Multimodal Reasoning Agents with NVIDIA Nemotron on vLLM
NVIDIA Nemotron Team
- Chasing 100% Accuracy: A Deep Dive into Debugging Kimi K2's Tool-Calling on vLLM
Linian Wang (Peking University)
- From Monolithic to Modular: Scaling Semantic Routing with Extensible LoRA
Ivar Flakstad (Hugging Face), OneZero-Y, Huamin Chen (Red Hat), Xunzhuo Liu (Tencent)
- Zero-Reload Model Switching with vLLM Sleep Mode
Embedded LLM
- Now Serving NVIDIA Nemotron with vLLM
NVIDIA Nemotron Team
- No More Retokenization Drift: Returning Token IDs via the OpenAI Compatible API Matters in Agent RL
The Agent Lightning (AGL) Team
- vLLM TPU: A New Unified Backend Supporting PyTorch and JAX on TPU
Google Team
- SemiAnalysis InferenceMAX: vLLM and NVIDIA Accelerate Blackwell Inference
vLLM Team
- DeepSeek-V3.2-Exp in vLLM: Fine-Grained Sparse Attention in Action
vLLM Team
- The First vLLM Meetup in Korea
vLLM Team
- vLLM Now Supports Qwen3-Next: Hybrid Architecture with Extreme Efficiency
The vLLM Team
- vLLM Semantic Router: Next Phase in LLM inference
vLLM Semantic Router Team
- Inside vLLM: Anatomy of a High-Throughput LLM Inference System
Aleksa Gordic
- Serving Geospatial, Vision, and Beyond: Enabling Multimodal Output Processing in vLLM
Christian Pinto (IBM Research Europe - Dublin), Michele Gazzetti (IBM Research Europe - Dublin), Michael Johnston (IBM Research Europe - Dublin), Maximilien Philippe Marie de Bayser (IBM Research - Brazil)
- Introduction to torch.compile and How It Works with vLLM
Luka Govedič (Red Hat), Richard Zou (Meta), Addie Stevens (Red Hat), Kaichao You (Tsinghua University), Michael Goin (Red Hat), Saša Zelenović (Red Hat)
- GLM-4.5 Meets vLLM: Built for Intelligent Agents
Yuxuan Zhang
- CUDA Core Dump: An Effective Tool to Debug Memory Access Issues and Beyond
Kaichao You
- vLLM Now Supports gpt-oss
The vLLM Team
- MiniMax-M1 Hybrid Architecture Meets vLLM: Long Context, Fast Inference
MiniMax
- Introducing vLLM Hardware Plugin, Best Practice from Ascend NPU
The Ascend Team on vLLM
- Accelerating RLHF with vLLM, Best Practice from OpenRLHF
The OpenRLHF Team
- Transformers modeling backend integration in vLLM
The Hugging Face Team
- Llama 4 in vLLM
The vLLM Team
- PTPC-FP8: Boosting vLLM Performance on AMD ROCm
AMD and Embedded LLM
- Introducing AIBrix: A Scalable, Cost-Effective Control Plane for vLLM
AIBrix Team
- Distributed Inference with vLLM
vLLM Team
- Introducing vLLM Inference Provider in Llama Stack
Yuan Tang (Red Hat) and Ashwin Bharambe (Meta)
- vLLM V1: A Major Upgrade to vLLM's Core Architecture
vLLM Team
- High Performance and Easy Deployment of vLLM in K8S with vLLM production-stack
LMCache Team
- Structured Decoding in vLLM: a gentle introduction
Guest Post by BentoML and Red Hat
- Installing and Developing vLLM with Ease
vLLM Team
- vLLM 2024 Retrospective and 2025 Vision
vLLM Team
2024 (5)
- Serving LLMs on AMD MI300X: Best Practices
Guest Post by Embedded LLM and Hot Aisle Inc.
- How Speculative Decoding Boosts vLLM Performance by up to 2.8x
vLLM Team
- vLLM v0.6.0: 2.7x Throughput Improvement and 5x Latency Reduction
vLLM Team
- vLLM’s Open Governance and Performance Roadmap
vLLM Team
- Announcing Llama 3.1 Support in vLLM
vLLM Team