#model-support

17 posts

Run Highly Efficient Multimodal Agentic AI with NVIDIA Nemotron 3 Nano Omni Using vLLM

Apr 28, 2026·7 min read

We are excited to support the newly released NVIDIA Nemotron 3 Nano Omni model on vLLM.

DeepSeek V4 in vLLM: Efficient Long-context Attention

Apr 24, 2026·17 min read

A first-principles walkthrough of DeepSeek V4's long-context attention, and how we implemented it in vLLM.

Announcing Gemma 4 on vLLM: Byte for byte, the most capable open models

Apr 2, 2026·3 min read

With the debut of Gemma 4, vLLM introduces immediate support for Google's most sophisticated open model lineup, spanning multiple hardware backends, with first-ever Day 0 support on Google TPUs,...

Run Highly Efficient and Accurate Multi-Agent AI with NVIDIA Nemotron 3 Super Using vLLM

Mar 11, 2026·5 min read

We are excited to support the newly released NVIDIA Nemotron 3 Super model on vLLM.

Run Highly Efficient and Accurate AI Agents with NVIDIA Nemotron 3 Nano on vLLM

Dec 15, 2025·5 min read

Jan 28th Update: NVIDIA just released their Nemotron 3 Nano model in NVFP4 precision. This model is supported by vLLM out of the box and it uses a new method called Quantization-Aware Distillation...

Run Multimodal Reasoning Agents with NVIDIA Nemotron on vLLM

Oct 31, 2025·4 min read

We are excited to release NVIDIA Nemotron Nano 2 VL, supported by vLLM. This open vision language model (VLM) is built for video understanding and document intelligence.

Chasing 100% Accuracy: A Deep Dive into Debugging Kimi K2's Tool-Calling on vLLM

Oct 28, 2025·11 min read

TL;DR: For best compatibility with vLLM, use Kimi K2 models whose chat templates were updated after commit 94a4053eb8863059dd8afc00937f054e1365abbd (Kimi-K2-0905) or commit...

Now Serving NVIDIA Nemotron with vLLM

Oct 23, 2025·5 min read

Agentic AI systems, capable of reasoning, planning, and taking autonomous actions, are powering the next leap in developer applications. To build these systems, developers need tools that are...

DeepSeek-V3.2-Exp in vLLM: Fine-Grained Sparse Attention in Action

Sep 29, 2025·8 min read

We are excited to announce Day 0 support for DeepSeek-V3.2-Exp, featuring DeepSeek Sparse Attention (DSA), designed for long-context tasks. In this post, we showcase how to use this model...

vLLM Now Supports Qwen3-Next: Hybrid Architecture with Extreme Efficiency

Sep 11, 2025·3 min read

We’re excited to announce that vLLM now supports Qwen3-Next, the latest generation of foundation models from the Qwen team. Qwen3-Next introduces a hybrid architecture with extreme efficiency for...

GLM-4.5 Meets vLLM: Built for Intelligent Agents

Aug 19, 2025·4 min read

General Language Model (GLM) is a family of foundation models created by Zhipu.ai (now renamed Z.ai). The GLM team has a long-term collaboration with the vLLM team, dating back to the early days of...

vLLM Now Supports gpt-oss

Aug 5, 2025·5 min read

We're thrilled to announce that vLLM now supports gpt-oss on NVIDIA Blackwell and Hopper GPUs, as well as AMD MI300x and MI355x GPUs. In this blog post, we’ll explore the efficient model...

MiniMax-M1 Hybrid Architecture Meets vLLM: Long Context, Fast Inference

Jun 30, 2025·6 min read

This article explores how MiniMax-M1's hybrid architecture is efficiently supported in vLLM. We discuss the model's unique features, the challenges of efficient inference, and the technical...

Transformers modeling backend integration in vLLM

Apr 11, 2025·6 min read

The Hugging Face Transformers library offers a flexible, unified interface to a vast ecosystem of model architectures. From research to fine-tuning on custom datasets, Transformers is the go-to...

Llama 4 in vLLM

Apr 5, 2025·4 min read

We're excited to announce that vLLM now supports the Llama 4 herd of models: Scout (17B-16E) and Maverick (17B-128E). You can run these powerful long-context, natively multi-modal (up to 8-10...

Introducing vLLM Inference Provider in Llama Stack

Jan 27, 2025·8 min read

We are excited to announce that the vLLM inference provider is now available in Llama Stack through a collaboration between the Red Hat AI Engineering team and the Llama Stack team from Meta. This...

Announcing Llama 3.1 Support in vLLM

Jul 23, 2024·6 min read

Today, the vLLM team is excited to partner with Meta to announce support for the Llama 3.1 model series. Llama 3.1 comes with exciting new features, including a longer context length (up to 128K...