<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>vLLM Blog</title>
    <link>https://vllm.ai/blog</link>
    <description>Technical articles, release announcements, model guides, and community updates from the vLLM project.</description>
    <language>en-us</language>
    <lastBuildDate>Wed, 10 Jun 2026 08:46:07 GMT</lastBuildDate>
    <atom:link href="https://vllm.ai/blog/rss.xml" rel="self" type="application/rss+xml"/>
    <item>
      <title>Announcing vime: A Simple, Stable, and Efficient RL Framework for LLMs</title>
      <link>https://vllm.ai/blog/2026-06-09-announcing-vime</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-06-09-announcing-vime</guid>
      <pubDate>Tue, 09 Jun 2026 00:00:00 GMT</pubDate>
      <description>vime connects slime&apos;s training stack with vLLM rollouts to provide a simple, stable, and efficient RL post-training pipeline.</description>
      <category>reinforcement-learning</category>
      <category>ecosystem</category>
      <category>post-training</category>
      <dc:creator>vime Contributors and the vLLM Team</dc:creator>
    </item>
    <item>
      <title>vLLM Semantic Router v0.3 Themis: From Signals to Stateful Production Routing</title>
      <link>https://vllm.ai/blog/2026-06-05-v0.3-vllm-sr-themis-release</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-06-05-v0.3-vllm-sr-themis-release</guid>
      <pubDate>Fri, 05 Jun 2026 00:00:00 GMT</pubDate>
      <description>vLLM Semantic Router v0.3, codename Themis, is where semantic routing becomes stateful, observable, and production-ready for real AI traffic.</description>
      <category>ecosystem</category>
      <dc:creator>vLLM Semantic Router Team</dc:creator>
    </item>
    <item>
      <title>Announcing Day-0 Support for NVIDIA Nemotron 3 Ultra on vLLM</title>
      <link>https://vllm.ai/blog/2026-06-04-nemotron-3-ultra-vllm</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-06-04-nemotron-3-ultra-vllm</guid>
      <pubDate>Thu, 04 Jun 2026 00:00:00 GMT</pubDate>
      <description>We are excited to announce Day-0 Support for the newly released NVIDIA Nemotron 3 Ultra on vLLM.</description>
      <category>model-support</category>
      <dc:creator>NVIDIA Nemotron Team</dc:creator>
    </item>
    <item>
      <title>Fast &amp; Efficient LLM Inference with vLLM: A New Course with DeepLearning.AI</title>
      <link>https://vllm.ai/blog/2026-06-03-deeplearning-ai-vllm-course</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-06-03-deeplearning-ai-vllm-course</guid>
      <pubDate>Wed, 03 Jun 2026 00:00:00 GMT</pubDate>
      <description>We&apos;re excited to announce, with Red Hat and Andrew Ng&apos;s DeepLearning.AI, a hands-on course that walks through LLM fundamentals and the full optimize, deploy, and benchmark AI deployment lifecycle...</description>
      <category>community</category>
      <category>ecosystem</category>
      <category>learning</category>
      <dc:creator>Cedric Clyburn</dc:creator>
    </item>
    <item>
      <title>Session-Aware Agentic Routing: Continuity-Aware Model Selection for Long-Horizon LLM Agents</title>
      <link>https://vllm.ai/blog/2026-06-02-session-aware-agentic-routing</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-06-02-session-aware-agentic-routing</guid>
      <pubDate>Tue, 02 Jun 2026 00:00:00 GMT</pubDate>
      <description>Long-horizon LLM agents create a routing problem that single-turn prompt routers were not designed to solve. A router still needs to know which model is best for the current request, but it also...</description>
      <category>ecosystem</category>
      <category>performance</category>
      <category>agentic-routing</category>
      <dc:creator>Xunzhuo Liu, Bowei He, Huamin Chen, Haichen Zhang (AMD), Andy Luo (AMD), and the vLLM Semantic Router Team</dc:creator>
    </item>
    <item>
      <title>Accelerating vLLM-Omni Inference with AutoRound Quantization</title>
      <link>https://vllm.ai/blog/2026-06-02-vllm-omni-autoround</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-06-02-vllm-omni-autoround</guid>
      <pubDate>Tue, 02 Jun 2026 00:00:00 GMT</pubDate>
      <description>We are excited to announce that AutoRound — Intel&apos;s state-of-the-art post-training quantization (PTQ) algorithm — is now fully integrated into vLLM-Omni, enabling a streamlined quantize-once,...</description>
      <category>quantization</category>
      <category>multimodal</category>
      <category>vllm-omni</category>
      <category>hardware</category>
      <dc:creator>vLLM-Omni Community, Intel AutoRound Team</dc:creator>
    </item>
    <item>
      <title>vLLM on the DGX Spark: Architecture, Configuration, and Local Evaluation</title>
      <link>https://vllm.ai/blog/2026-06-01-vllm-dgx-spark</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-06-01-vllm-dgx-spark</guid>
      <pubDate>Mon, 01 Jun 2026 00:00:00 GMT</pubDate>
      <description>A technical deep dive on running vLLM on NVIDIA DGX Spark and GB10 systems, covering sm_121 architecture, unified memory behavior, NVFP4 model serving, Nemotron-3-Super configuration, Docker deployment, Prometheus metrics, and local evaluation results.</description>
      <category>dgx-spark</category>
      <category>nemotron</category>
      <category>hardware</category>
      <category>deployment</category>
      <category>computex</category>
      <dc:creator>Inferact</dc:creator>
    </item>
    <item>
      <title>Accelerating Laguna XS.2 Inference with vLLM, Speculators, and LLM Compressor</title>
      <link>https://vllm.ai/blog/2026-05-28-laguna-xs2-dflash-llm-compressor</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-05-28-laguna-xs2-dflash-llm-compressor</guid>
      <pubDate>Thu, 28 May 2026 00:00:00 GMT</pubDate>
      <description>As organizations increasingly adopt AI-powered development tools, the need for high-performance agentic models that deliver both accuracy and operational efficiency has become critical. Laguna...</description>
      <category>quantization</category>
      <category>speculative-decoding</category>
      <category>speculators</category>
      <category>llm-compressor</category>
      <category>dflash</category>
      <dc:creator>Megan Flynn, Dipika Sikka, Alexandre Marques</dc:creator>
    </item>
    <item>
      <title>Native RL APIs in vLLM</title>
      <link>https://vllm.ai/blog/2026-05-28-native-rl-apis</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-05-28-native-rl-apis</guid>
      <pubDate>Thu, 28 May 2026 00:00:00 GMT</pubDate>
      <description>As post-training workloads continue to scale, we&apos;ve seen widespread adoption of vLLM as the inference engine of choice. However, two issues repeatedly arise:</description>
      <category>reinforcement-learning</category>
      <category>async-rl</category>
      <dc:creator>Aaron Hao, Sumanth Hegde, Kyle Sayers, Kourosh Hakhamaneshi, and the vLLM team</dc:creator>
    </item>
    <item>
      <title>Speculators v0.5.0: DFlash Support and Online Training</title>
      <link>https://vllm.ai/blog/2026-05-28-speculators-v050</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-05-28-speculators-v050</guid>
      <pubDate>Thu, 28 May 2026 00:00:00 GMT</pubDate>
      <description>The v0.5.0 release brings significant architectural improvements to speculative decoding model training, introducing DFlash algorithm support, fully unified online training capabilities, and a...</description>
      <category>speculative-decoding</category>
      <category>ecosystem</category>
      <dc:creator>Fynn Schmitt-Ulms, Helen Zhao, Rahul Tuli and Dipika Sikka (Red Hat AI Model Optimization Team)</dc:creator>
    </item>
    <item>
      <title>From Text to Multimodal Routing: Hardening Vision Signals in vLLM Semantic Router</title>
      <link>https://vllm.ai/blog/2026-05-28-vllm-sr-vision-encoder-hardening</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-05-28-vllm-sr-vision-encoder-hardening</guid>
      <pubDate>Thu, 28 May 2026 00:00:00 GMT</pubDate>
      <description>Most routing systems start with a prompt and choose a model endpoint. vLLM Semantic Router (VSR) makes a different bet: before a request reaches the serving model, the system should extract...</description>
      <category>ecosystem</category>
      <category>performance</category>
      <dc:creator>David Shrader, Huamin Chen, Xunzhuo Liu, Bowei He, and the vLLM Semantic Router Team</dc:creator>
    </item>
    <item>
      <title>EAGLE 3.1: Advancing Speculative Decoding Through Collaboration Between the EAGLE Team, vLLM, and TorchSpec</title>
      <link>https://vllm.ai/blog/2026-05-26-eagle-3-1</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-05-26-eagle-3-1</guid>
      <pubDate>Tue, 26 May 2026 00:00:00 GMT</pubDate>
      <description>The EAGLE series — including EAGLE 1, EAGLE 2, and EAGLE 3 — has become one of the most widely adopted and practically deployed families of speculative decoding algorithms across both research and...</description>
      <category>speculative-decoding</category>
      <category>performance</category>
      <dc:creator>EAGLE Team, vLLM Team, and TorchSpec Team</dc:creator>
    </item>
    <item>
      <title>vLLM x Novita AI: PegaFlow for Production-Grade External KV Cache</title>
      <link>https://vllm.ai/blog/2026-05-18-pegaflow</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-05-18-pegaflow</guid>
      <pubDate>Mon, 18 May 2026 00:00:00 GMT</pubDate>
      <description>TL;DR: In collaboration with Novita AI, PegaFlow integrates with vLLM as an external KV cache service for LLM inference, implemented as a standalone Rust process and connected through the external...</description>
      <category>kv_cache</category>
      <category>disaggregation</category>
      <category>performance</category>
      <category>production-serving</category>
      <dc:creator>Novita AI and the vLLM Team</dc:creator>
    </item>
    <item>
      <title>Elastic Expert Parallelism in vLLM</title>
      <link>https://vllm.ai/blog/2026-05-14-elastic-expert-parallelism</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-05-14-elastic-expert-parallelism</guid>
      <pubDate>Thu, 14 May 2026 00:00:00 GMT</pubDate>
      <description>Expert parallelism (EP) is a key technique for serving Mixture-of-Experts (MoE) models at high throughput. WideEP deployments (where EP spans many workers) maximize KV cache capacity, enabling...</description>
      <category>large-scale-serving</category>
      <category>elastic-ep</category>
      <category>expert-parallelism</category>
      <category>moe</category>
      <category>fault-tolerance</category>
      <dc:creator>Itay Alroy (NVIDIA), Yongji Wu (Sky Computing), Rui Qiao (Anyscale), Tyler Michael Smith (Red Hat), Moein Khazraee (NVIDIA), Omri Kahalon (NVIDIA), Tzu-Ling Kan (NVIDIA), Ron Tourgeman (NVIDIA)</dc:creator>
    </item>
    <item>
      <title>Announcing VeRL-Omni: Easy, Fast, and Stable RL Training for Diffusion and Omni-Modality Models</title>
      <link>https://vllm.ai/blog/2026-05-14-verl-omni</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-05-14-verl-omni</guid>
      <pubDate>Thu, 14 May 2026 00:00:00 GMT</pubDate>
      <description>We are excited to announce the pre-release of VeRL-Omni, a general reinforcement learning (RL) post-training framework focused on multimodal generative models, built on top of verl and vllm-omni.</description>
      <category>multimodal</category>
      <category>rlhf</category>
      <category>ecosystem</category>
      <dc:creator>VeRL-Omni Team</dc:creator>
    </item>
    <item>
      <title>A First Comprehensive Study of TurboQuant: Accuracy and Performance</title>
      <link>https://vllm.ai/blog/2026-05-11-turboquant</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-05-11-turboquant</guid>
      <pubDate>Mon, 11 May 2026 00:00:00 GMT</pubDate>
      <description>TurboQuant, a method for KV-cache quantization, recently gained significant traction in the community due to the large advertised savings in GPU memory from very low bit-width quantization of a...</description>
      <category>quantization</category>
      <category>kv_cache</category>
      <category>turboquant</category>
      <dc:creator>Eldar Kurtić, Michael Goin, Alexandre Marques (Red Hat AI)</dc:creator>
    </item>
    <item>
      <title>vLLM Tops the Artificial Analysis Leaderboard</title>
      <link>https://vllm.ai/blog/2026-05-11-vllm-tops-artificial-analysis</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-05-11-vllm-tops-artificial-analysis</guid>
      <pubDate>Mon, 11 May 2026 00:00:00 GMT</pubDate>
      <description>How vLLM built the leading deployments of DeepSeek V3.2, MiniMax-M2.5, and Qwen 3.5 397B.</description>
      <category>performance</category>
      <category>benchmarking</category>
      <category>kernel-fusion</category>
      <category>speculative-decoding</category>
      <dc:creator>vLLM Team</dc:creator>
    </item>
    <item>
      <title>Serving Agentic Workloads at Scale with vLLM x Mooncake</title>
      <link>https://vllm.ai/blog/2026-05-06-mooncake-store</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-05-06-mooncake-store</guid>
      <pubDate>Wed, 06 May 2026 00:00:00 GMT</pubDate>
      <description>TL;DR: Agentic workloads generate massive shared prefixes that are often recomputed across turns. By integrating Mooncake&apos;s distributed KV cache store into vLLM, we achieve 3.8x higher throughput,...</description>
      <category>agentic</category>
      <category>kv_cache</category>
      <category>large-scale-serving</category>
      <category>disaggregation</category>
      <dc:creator>Yifan Qiao, Trong Dao Le, Ao Shen, Zhewen Li, Bowen Wang</dc:creator>
    </item>
    <item>
      <title>Run Highly Efficient Multimodal Agentic AI with NVIDIA Nemotron 3 Nano Omni Using vLLM</title>
      <link>https://vllm.ai/blog/2026-04-28-nemotron-omni</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-04-28-nemotron-omni</guid>
      <pubDate>Tue, 28 Apr 2026 00:00:00 GMT</pubDate>
      <description>We are excited to support the newly released NVIDIA Nemotron 3 Nano Omni model on vLLM.</description>
      <category>model-support</category>
      <dc:creator>NVIDIA Nemotron Team</dc:creator>
    </item>
    <item>
      <title>DeepSeek V4 in vLLM: Efficient Long-context Attention</title>
      <link>https://vllm.ai/blog/2026-04-24-deepseek-v4</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-04-24-deepseek-v4</guid>
      <pubDate>Fri, 24 Apr 2026 00:00:00 GMT</pubDate>
      <description>A first-principles walkthrough of DeepSeek V4&apos;s long-context attention, and how we implemented it in vLLM.</description>
      <category>model-support</category>
      <dc:creator>vLLM Team</dc:creator>
    </item>
    <item>
      <title>The State of FP8 KV-Cache and Attention Quantization in vLLM</title>
      <link>https://vllm.ai/blog/2026-04-22-fp8-kvcache</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-04-22-fp8-kvcache</guid>
      <pubDate>Wed, 22 Apr 2026 00:00:00 GMT</pubDate>
      <description>Long-context LLM serving is increasingly memory-bound: for standard full-attention decoders, the KV cache often dominates GPU memory at 128k+ contexts, and each decode step must read a large...</description>
      <category>quantization</category>
      <category>performance</category>
      <category>kv_cache</category>
      <category>fp8</category>
      <dc:creator>Jonas Kübler* (AWS), Eldar Kurtić* (Red Hat AI), Lucas Wilkinson (Red Hat AI), Matthew Bonanni (Red Hat AI), Michael Goin (Red Hat AI), Alexandre Marques (Red Hat AI), Kailash Budhathoki (AWS) (* Equal Contribution)</dc:creator>
    </item>
    <item>
      <title>Disaggregated Serving for Hybrid SSM Models in vLLM</title>
      <link>https://vllm.ai/blog/2026-04-21-hybrid-ssm-disagg</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-04-21-hybrid-ssm-disagg</guid>
      <pubDate>Tue, 21 Apr 2026 00:00:00 GMT</pubDate>
      <description>Hybrid architectures that interleave Mamba-style SSM layers with standard full-attention (FA) layers — such as NVIDIA Nemotron-H — are gaining traction as a way to combine the linear-time...</description>
      <category>disaggregation</category>
      <category>mamba</category>
      <dc:creator>Nicolò Lucchesi, Zhanqiu Hu (Red Hat), and the vLLM team</dc:creator>
    </item>
    <item>
      <title>vLLM Korea Meetup 2026 Wrap-Up</title>
      <link>https://vllm.ai/blog/2026-04-14-vllm-korea-meetup-2026</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-04-14-vllm-korea-meetup-2026</guid>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <description>Hosted by the vLLM KR Community, with support from Rebellions, SqueezeBits, Red Hat APAC, and PyTorch Korea, the vLLM Korea Meetup 2026 was held in Seoul on April 2nd.</description>
      <category>community</category>
      <dc:creator>vLLM Team</dc:creator>
    </item>
    <item>
      <title>Next-Level Inference: Why Your Single-Node vLLM Setup Needs Prefill-Decode Disaggregation</title>
      <link>https://vllm.ai/blog/2026-04-07-moriio-kv-connector</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-04-07-moriio-kv-connector</guid>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <description>TL;DR: Prefill and decode fight over the same GPUs, causing ITL spikes under load. We show how to disaggregate them on a single 8-GPU MI300X node using AMD&apos;s MORI-IO connector — achieving 2.5x...</description>
      <category>disaggregation</category>
      <dc:creator>AMD and Embedded LLM</dc:creator>
    </item>
    <item>
      <title>Announcing Gemma 4 on vLLM: Byte for byte, the most capable open models</title>
      <link>https://vllm.ai/blog/2026-04-02-gemma4</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-04-02-gemma4</guid>
      <pubDate>Thu, 02 Apr 2026 00:00:00 GMT</pubDate>
      <description>With the debut of Gemma 4, vLLM introduces immediate support for Google&apos;s most sophisticated open model lineup, spanning multiple hardware backends, with first-ever Day 0 support on Google TPUs,...</description>
      <category>model-support</category>
      <dc:creator>Google Team</dc:creator>
    </item>
    <item>
      <title>Extracting hidden states from vLLM</title>
      <link>https://vllm.ai/blog/2026-03-30-extract-hidden-states</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-03-30-extract-hidden-states</guid>
      <pubDate>Mon, 30 Mar 2026 00:00:00 GMT</pubDate>
      <description>PR #33736 (included in vllm&gt;=v0.18.0) introduced a new hidden states extraction system to vLLM. This blog post explores the motivation, design, usage, and future direction of this feature, and its...</description>
      <category>speculative-decoding</category>
      <dc:creator>Fynn Schmitt-Ulms</dc:creator>
    </item>
    <item>
      <title>Model Runner V2: A Modular and Faster Core for vLLM</title>
      <link>https://vllm.ai/blog/2026-03-24-mrv2</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-03-24-mrv2</guid>
      <pubDate>Tue, 24 Mar 2026 00:00:00 GMT</pubDate>
      <description>We are excited to announce Model Runner V2 (MRV2), a ground-up re-implementation of the vLLM model runner. MRV2 delivers a cleaner, more modular, and more efficient execution core—with no API...</description>
      <category>performance</category>
      <category>engineering</category>
      <dc:creator>vLLM Team</dc:creator>
    </item>
    <item>
      <title>P-EAGLE: Faster LLM inference with Parallel Speculative Decoding in vLLM</title>
      <link>https://vllm.ai/blog/2026-03-13-p-eagle</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-03-13-p-eagle</guid>
      <pubDate>Fri, 13 Mar 2026 00:00:00 GMT</pubDate>
      <description>EAGLE is the state-of-the-art method for speculative decoding in large language model (LLM) inference, but its autoregressive drafting creates a hidden bottleneck: the more tokens that you...</description>
      <category>performance</category>
      <category>speculative-decoding</category>
      <dc:creator>Amazon and NVIDIA Team</dc:creator>
    </item>
    <item>
      <title>Run Highly Efficient and Accurate Multi-Agent AI with NVIDIA Nemotron 3 Super Using vLLM</title>
      <link>https://vllm.ai/blog/2026-03-11-nemotron-3-super</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-03-11-nemotron-3-super</guid>
      <pubDate>Wed, 11 Mar 2026 00:00:00 GMT</pubDate>
      <description>We are excited to support the newly released NVIDIA Nemotron 3 Super model on vLLM.</description>
      <category>model-support</category>
      <dc:creator>NVIDIA Nemotron Team</dc:creator>
    </item>
    <item>
      <title>vLLM Semantic Router v0.2 Athena: ClawOS, Model Refresh, and the System Brain</title>
      <link>https://vllm.ai/blog/2026-03-10-v0.2-vllm-sr-athena-release</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-03-10-v0.2-vllm-sr-athena-release</guid>
      <pubDate>Tue, 10 Mar 2026 00:00:00 GMT</pubDate>
      <description>Since v0.1 Iris, vLLM Semantic Router has made a large jump. In one release cycle, the project rebuilt its model stack, expanded routing into safety, semantic caching, memory, retrieval, and...</description>
      <category>ecosystem</category>
      <dc:creator>vLLM Semantic Router Team</dc:creator>
    </item>
    <item>
      <title>vLLM Triton Attention Backend Deep Dive</title>
      <link>https://vllm.ai/blog/2026-03-04-vllm-triton-backend-deep-dive</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-03-04-vllm-triton-backend-deep-dive</guid>
      <pubDate>Wed, 04 Mar 2026 00:00:00 GMT</pubDate>
      <description>This article is adapted from a Red Hat hosted vLLM Office Hours session with Burkhard Ringlein from IBM Research, featuring a deep technical walkthrough of the vLLM Triton attention backend....</description>
      <category>performance</category>
      <category>triton</category>
      <category>attention</category>
      <dc:creator>vLLM Team at IBM Research</dc:creator>
    </item>
    <item>
      <title>Beyond Porting: How vLLM Orchestrates High-Performance Inference on AMD ROCm</title>
      <link>https://vllm.ai/blog/2026-02-27-rocm-attention-backend</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-02-27-rocm-attention-backend</guid>
      <pubDate>Fri, 27 Feb 2026 00:00:00 GMT</pubDate>
      <description>For a long time, enabling AMD support meant &quot;porting&quot;; i.e., just making code run. That era is over.</description>
      <category>performance</category>
      <category>hardware</category>
      <dc:creator>AMD and Embedded LLM</dc:creator>
    </item>
    <item>
      <title>Efficiently serve dozens of fine-tuned models with vLLM on Amazon SageMaker AI and Amazon Bedrock</title>
      <link>https://vllm.ai/blog/2026-02-26-multi-lora</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-02-26-multi-lora</guid>
      <pubDate>Thu, 26 Feb 2026 00:00:00 GMT</pubDate>
      <description>Organizations and individuals running multiple custom AI models, especially recent Mixture of Experts (MoE) model families, can face the challenge of paying for idle GPU capacity when the...</description>
      <category>performance</category>
      <dc:creator>Danielle Maddix Robinson, Florian Saupe, George Novack, Haipeng Li, Mani Kumar Adari, Xiang Song, Yu Gong (AWS AI Team)</dc:creator>
    </item>
    <item>
      <title>DeepSeek-V3.2 on GB300: Performance Breakthrough</title>
      <link>https://vllm.ai/blog/2026-02-13-gb300-deepseek</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-02-13-gb300-deepseek</guid>
      <pubDate>Fri, 13 Feb 2026 00:00:00 GMT</pubDate>
      <description>DeepSeek-V3.2 (NVFP4 + TP2) has been successfully and smoothly run on GB300 (SM103 - Blackwell Ultra). Leveraging FP4 quantization, it achieves a single-GPU throughput of 7360 TGS (tokens / GPU /...</description>
      <category>hardware</category>
      <category>quantization</category>
      <category>performance</category>
      <dc:creator>The DaoCloud and vLLM team</dc:creator>
    </item>
    <item>
      <title>Driving vLLM WideEP and Large-Scale Serving Toward Maturity on Blackwell (Part I)</title>
      <link>https://vllm.ai/blog/2026-02-03-dsr1-gb200-part1</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-02-03-dsr1-gb200-part1</guid>
      <pubDate>Tue, 03 Feb 2026 00:00:00 GMT</pubDate>
      <description>Building on our previous work achieving 2.2k tok/s/H200 decode throughput with wide-EP, the vLLM team has continued performance optimization efforts targeting NVIDIA&apos;s GB200 platform. This blog...</description>
      <category>large-scale-serving</category>
      <category>performance</category>
      <category>hardware</category>
      <dc:creator>Meta and NVIDIA Team</dc:creator>
    </item>
    <item>
      <title>GPT-OSS Performance Optimizations on NVIDIA Blackwell: Pushing the Pareto Frontier</title>
      <link>https://vllm.ai/blog/2026-02-01-gpt-oss-optimizations</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-02-01-gpt-oss-optimizations</guid>
      <pubDate>Sun, 01 Feb 2026 00:00:00 GMT</pubDate>
      <description>TL;DR: In collaboration with the open-source community, vLLM + NVIDIA has achieved significant performance milestones on the gpt-oss-120b model running on NVIDIA&apos;s Blackwell GPUs. Through deep...</description>
      <category>performance</category>
      <category>hardware</category>
      <dc:creator>The vLLM and NVIDIA team</dc:creator>
    </item>
    <item>
      <title>Streaming Requests &amp; Realtime API in vLLM</title>
      <link>https://vllm.ai/blog/2026-01-31-streaming-realtime</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-01-31-streaming-realtime</guid>
      <pubDate>Sat, 31 Jan 2026 00:00:00 GMT</pubDate>
      <description>Large language model inference has traditionally operated on a simple premise: the user submits a complete prompt (request), the model processes it, and returns a response (either streaming or at...</description>
      <category>multimodal</category>
      <dc:creator>Meta, Mistral AI as well as the vLLM team</dc:creator>
    </item>
    <item>
      <title>Building Mixture-of-Models on AMD GPUs with vLLM-SR</title>
      <link>https://vllm.ai/blog/2026-01-23-mom-on-amd-gpu</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-01-23-mom-on-amd-gpu</guid>
      <pubDate>Fri, 23 Jan 2026 00:00:00 GMT</pubDate>
      <description>We are working on building the System Level Intelligence for Mixture-of-Models (MoM), bringing Collective Intelligence into LLM systems.</description>
      <category>hardware</category>
      <category>ecosystem</category>
      <dc:creator>The AMD and vLLM Semantic Router Team</dc:creator>
    </item>
    <item>
      <title>Inside vLLM’s New KV Offloading Connector: Smarter Memory Transfer for Maximizing Inference Throughput</title>
      <link>https://vllm.ai/blog/2026-01-08-kv-offloading-connector</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-01-08-kv-offloading-connector</guid>
      <pubDate>Thu, 08 Jan 2026 00:00:00 GMT</pubDate>
      <description>In this post, we will describe the new KV cache offloading feature that was introduced in vLLM 0.11.0. We will focus on offloading to CPU memory (DRAM) and its benefits to improving overall...</description>
      <category>performance</category>
      <dc:creator>Or Ozeri, Danny Harnik (vLLM Team at IBM Research)</dc:creator>
    </item>
    <item>
      <title>vLLM Semantic Router v0.1 Iris: The First Major Release</title>
      <link>https://vllm.ai/blog/2026-01-05-vllm-sr-iris</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-01-05-vllm-sr-iris</guid>
      <pubDate>Mon, 05 Jan 2026 00:00:00 GMT</pubDate>
      <description>vLLM Semantic Router is the System Level Intelligence for Mixture-of-Models (MoM), bringing Collective Intelligence into LLM systems. It lives between users and models, capturing signals from...</description>
      <category>ecosystem</category>
      <dc:creator>vLLM Semantic Router Team</dc:creator>
    </item>
    <item>
      <title>Introducing vLLM Playground: A Modern Web Interface for Managing and Interacting with vLLM Servers</title>
      <link>https://vllm.ai/blog/2026-01-02-introducing-vllm-playground</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2026-01-02-introducing-vllm-playground</guid>
      <pubDate>Fri, 02 Jan 2026 00:00:00 GMT</pubDate>
      <description>As a passionate vLLM community member who wants to see vLLM thrive and reach even more developers, I&apos;m excited to announce vLLM Playground – a modern, feature-rich web interface for managing and...</description>
      <category>frontend</category>
      <category>ecosystem</category>
      <dc:creator>micytao</dc:creator>
    </item>
    <item>
      <title>Announcing vllm.ai Website and Some Community Updates</title>
      <link>https://vllm.ai/blog/2025-12-27-vllm-ai-website</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2025-12-27-vllm-ai-website</guid>
      <pubDate>Sat, 27 Dec 2025 00:00:00 GMT</pubDate>
      <description>For a long time, vllm.ai simply redirected to the vLLM GitHub page. Thanks to our community, we now have a brand-new vllm.ai website, drawing inspiration from the PyTorch website.</description>
      <category>community</category>
      <dc:creator>vLLM Team</dc:creator>
    </item>
    <item>
      <title>vLLM-Omni Diffusion Cache Acceleration</title>
      <link>https://vllm.ai/blog/2025-12-19-vllm-omni-diffusion-cache-acceleration</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2025-12-19-vllm-omni-diffusion-cache-acceleration</guid>
      <pubDate>Fri, 19 Dec 2025 00:00:00 GMT</pubDate>
      <description>We are thrilled to announce a major performance update for vLLM-Omni.</description>
      <category>multimodal</category>
      <category>performance</category>
      <category>ecosystem</category>
      <dc:creator>vLLM-Omni Team</dc:creator>
    </item>
    <item>
      <title>vLLM Large Scale Serving: DeepSeek @ 2.2k tok/s/H200 with Wide-EP</title>
      <link>https://vllm.ai/blog/2025-12-17-large-scale-serving</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2025-12-17-large-scale-serving</guid>
      <pubDate>Wed, 17 Dec 2025 00:00:00 GMT</pubDate>
      <description>In v0.11.0, the last code from vLLM V0 engine was removed, marking the complete migration to the improved V1 engine architecture. This achievement would not have been possible without vLLM’s...</description>
      <category>large-scale-serving</category>
      <category>performance</category>
      <dc:creator>vLLM Team</dc:creator>
    </item>
    <item>
      <title>AMD × vLLM Semantic Router: Building the System Intelligence Together</title>
      <link>https://vllm.ai/blog/2025-12-16-vllm-sr-amd</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2025-12-16-vllm-sr-amd</guid>
      <pubDate>Tue, 16 Dec 2025 00:00:00 GMT</pubDate>
      <description>Over the past several months, AMD and the vLLM SR Team have been collaborating to bring vLLM Semantic Router (VSR) to AMD GPUs—not just as a performance optimization, but as a fundamental shift in...</description>
      <category>hardware</category>
      <category>ecosystem</category>
      <dc:creator>The AMD and vLLM Semantic Router Team</dc:creator>
    </item>
    <item>
      <title>Run Highly Efficient and Accurate AI Agents with NVIDIA Nemotron 3 Nano on vLLM</title>
      <link>https://vllm.ai/blog/2025-12-15-run-nvidia-nemotron-3-nano</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2025-12-15-run-nvidia-nemotron-3-nano</guid>
      <pubDate>Mon, 15 Dec 2025 00:00:00 GMT</pubDate>
      <description>Jan 28th Update: NVIDIA just released their Nemotron 3 Nano model in NVFP4 precision. This model is supported by vLLM out of the box and it uses a new method called Quantization-Aware Distillation...</description>
      <category>model-support</category>
      <dc:creator>NVIDIA Nemotron Team</dc:creator>
    </item>
    <item>
      <title>Encoder Disaggregation for Scalable Multimodal Model Serving</title>
      <link>https://vllm.ai/blog/2025-12-15-vllm-epd</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2025-12-15-vllm-epd</guid>
      <pubDate>Mon, 15 Dec 2025 00:00:00 GMT</pubDate>
      <description>Modern Large Multimodal Models (LMMs) introduce a unique serving-time bottleneck: before any text generation can begin, all images must be processed by a visual encoder (e.g., ViT). This encoder...</description>
      <category>multimodal</category>
      <category>large-scale-serving</category>
      <dc:creator>Multimodality Workstream @ vLLM</dc:creator>
    </item>
    <item>
      <title>Token-Level Truth: Real-Time Hallucination Detection for Production LLMs</title>
      <link>https://vllm.ai/blog/2025-12-14-halugate</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2025-12-14-halugate</guid>
      <pubDate>Sun, 14 Dec 2025 00:00:00 GMT</pubDate>
      <description>Your LLM just called a tool, received accurate data, and still got the answer wrong. Welcome to the world of extrinsic hallucination—where models confidently ignore the ground truth sitting right...</description>
      <category>ecosystem</category>
      <dc:creator>vLLM Semantic Router Team</dc:creator>
    </item>
    <item>
      <title>Diving into speculative decoding training support for vLLM with Speculators v0.3.0</title>
      <link>https://vllm.ai/blog/2025-12-13-speculators-v030</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2025-12-13-speculators-v030</guid>
      <pubDate>Sat, 13 Dec 2025 00:00:00 GMT</pubDate>
      <description>Speculative decoding serves as an optimization to improve inference performance; however, training a unique draft model for each LLM can be difficult and time-consuming, while production-ready...</description>
      <category>speculative-decoding</category>
      <category>ecosystem</category>
      <dc:creator>Fynn Schmitt-Ulms, Helen Zhao, Rahul Tuli and Dipika Sikka (Red Hat AI Model Optimization Team)</dc:creator>
    </item>
    <item>
      <title>vLLM Router: A High-Performance and Prefill/Decode Aware Load Balancer for Large-scale Serving</title>
      <link>https://vllm.ai/blog/2025-12-13-vllm-router-release</link>
      <guid isPermaLink="true">https://vllm.ai/blog/2025-12-13-vllm-router-release</guid>
      <pubDate>Sat, 13 Dec 2025 00:00:00 GMT</pubDate>
      <description>Efficiently managing request distribution across a fleet of model replicas is a critical requirement for large-scale, production vLLM deployments. Standard load balancers often fall short as they...</description>
      <category>large-scale-serving</category>
      <dc:creator>vLLM Team</dc:creator>
    </item>
  </channel>
</rss>