vLLM Semantic Router is the System Level Intelligence for Mixture-of-Models (MoM), bringing Collective Intelligence into LLM systems. It lives between users and models, capturing signals from requests, responses, and context to make intelligent routing decisions—including model selection, safety filtering (jailbreak, PII), semantic caching, and hallucination detection. For more background, see our initial announcement blog post.
We are thrilled to announce the release of vLLM Semantic Router v0.1, codename Iris—our first major release that marks a transformative milestone for intelligent LLM routing. Since our experimental launch in September 2025, we've witnessed extraordinary community growth: over 600 Pull Requests merged, 300+ Issues addressed, and contributions from more than 50 outstanding engineers worldwide. As we kick off 2026, we're excited to deliver a production-ready semantic routing platform that has evolved dramatically from its origins.
Why Iris?
In Greek mythology, Iris (Ἶρις) served as the divine messenger who bridged the realms of gods and mortals, traveling on the arc of the rainbow to deliver messages across vast distances. This symbolism perfectly captures what vLLM Semantic Router v0.1 achieves: a bridge between users and diverse AI models, intelligently routing requests across different LLM providers and architectures.
1. Signal-Decision Driven Plugin Chain Architecture

Before: The early Semantic Router relied on a single-dimensional approach: classifying queries into one of 14 MMLU domain categories, with statically wired jailbreak, PII, and semantic-caching capabilities.
Now: We've introduced the Signal-Decision Driven Plugin Chain Architecture, a complete reimagining of semantic routing that scales from 14 fixed categories to unlimited intelligent routing decisions.
The new architecture extracts five types of signals from user queries:
Domain Signals: MMLU-trained classification with LoRA extensibility
Embedding Signals: Scalable semantic similarity using neural embeddings
Factual Signals: Fact-check classification for hallucination detection
Feedback Signals: User satisfaction/dissatisfaction indicators
Preference Signals: Personalization based on user-defined preferences
These signals serve as inputs to a flexible decision engine that combines them using AND/OR logic with priority-based selection. Previously static features like jailbreak detection, PII protection, and semantic caching are now configurable plugins that users can enable per-decision:
| Plugin | Purpose |
| --- | --- |
| `semantic-cache` | Cache similar queries for cost optimization |
| `jailbreak` | Detect prompt injection attacks |
| `pii` | Protect sensitive information |
| `hallucination` | Real-time hallucination detection |
| `system_prompt` | Inject custom instructions |
| `header_mutation` | Modify HTTP headers for metadata propagation |
This modular design enables unlimited extensibility—new signals, plugins, and model selection algorithms can be added without architectural changes. Learn more in our Signal-Decision Architecture blog post.
2. Modular LoRA Inference Kernel

In collaboration with the Hugging Face Candle team, we've completely refactored the router's inference kernel. The previous implementation loaded and ran multiple fine-tuned models independently, so computational cost grew linearly with the number of classification tasks.
The breakthrough: By adopting Low-Rank Adaptation (LoRA), we now share base model computation across all classification tasks:
| Approach | Workload | Scalability |
| --- | --- | --- |
| Before | N full model forward passes | O(n) |
| After | 1 base model pass + N lightweight LoRA adapters | O(1) + O(n×ε) |
Note: Here ε represents the relative cost of a LoRA adapter forward pass compared to the full base model—typically ε << 1, making the additional overhead negligible.
This architecture delivers significant latency reduction while enabling multi-task classification on the same input. See the full technical details in our Modular LoRA blog post.
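As a back-of-envelope illustration of the workload table above (a sketch, with ε chosen arbitrarily at 0.05, not a measured value):

```python
# Cost units are relative to one full base-model forward pass.

def workload_before(n_tasks: int) -> float:
    # N independent fine-tuned models: one full forward pass each -> O(n).
    return float(n_tasks)

def workload_after(n_tasks: int, epsilon: float = 0.05) -> float:
    # One shared base pass plus N lightweight LoRA adapter passes
    # -> O(1) + O(n * epsilon).
    return 1.0 + n_tasks * epsilon

for n in (1, 4, 8):
    print(n, workload_before(n), workload_after(n))
```

With eight classification tasks, the shared-base approach costs roughly 1.4 base-model passes instead of 8 under this toy model, which is where the latency reduction comes from.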
3. HaluGate: Three-Stage Hallucination Detection

Stage 1: HaluGate Sentinel – Binary classification determining if a query warrants factual verification (creative writing and code don't need fact-checking).
Stage 2: HaluGate Detector – Token-level detection identifying exactly which tokens in the response are unsupported by the provided context.
Stage 3: HaluGate Explainer – NLI-based classification explaining why each flagged span is problematic (CONTRADICTION vs NEUTRAL).
HaluGate integrates seamlessly with function-calling workflows—tool results serve as ground truth for verification. Detection results are propagated via HTTP headers, enabling downstream systems to implement custom policies. Dive deeper in our HaluGate blog post.
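The three-stage flow can be sketched as the toy pipeline below. The stage functions are crude stand-ins for the `mom-halugate-*` models (keyword gating, substring matching, and a one-rule label), purely to show how the stages compose:

```python
def sentinel_needs_check(query: str) -> bool:
    # Stage 1: binary gate. Toy heuristic: skip creative/code requests.
    return not any(k in query.lower() for k in ("write a poem", "implement"))

def detector_flag_spans(response: str, context: str) -> list[str]:
    # Stage 2: flag response tokens unsupported by the provided context.
    # Toy stand-in: a token is "supported" if it appears in the context.
    return [tok for tok in response.split() if tok.lower() not in context.lower()]

def explainer_label(span: str, context: str) -> str:
    # Stage 3: NLI-style label. The real model distinguishes CONTRADICTION
    # from NEUTRAL; this toy version labels anything supported as NEUTRAL.
    return "NEUTRAL" if span.lower() in context.lower() else "CONTRADICTION"

def halugate(query: str, response: str, context: str) -> dict:
    if not sentinel_needs_check(query):
        return {"checked": False, "flags": []}
    spans = detector_flag_spans(response, context)
    return {"checked": True,
            "flags": [(s, explainer_label(s, context)) for s in spans]}

result = halugate("What is the capital of France?",
                  "Paris is enormous",
                  "Paris is the capital of France.")
print(result)
```

In a function-calling workflow, the `context` argument would be the tool result serving as ground truth, and the `flags` would be propagated downstream via HTTP headers.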
4. UX Improvements: One-Command Installation
Local Development:
```bash
pip install vllm-sr
```
Get started in seconds with a single pip command. The package includes all core dependencies for quickstart.
Configuration: After installation, run vllm-sr init to generate the default config.yaml. Then configure your LLM backends in the providers section:
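A providers section might look roughly like the sketch below. The field names here are assumptions for illustration, not the actual schema; consult the `config.yaml` generated by `vllm-sr init` for the real structure:

```yaml
# Illustrative sketch only — field names are assumptions, not the real schema.
providers:
  - name: local-vllm
    base_url: http://localhost:8000/v1
    models:
      - qwen2.5-7b-instruct
  - name: openai
    base_url: https://api.openai.com/v1
    api_key_env: OPENAI_API_KEY
    models:
      - gpt-4o-mini
```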
Kubernetes: Production-ready Helm charts with sensible defaults and extensive customization options make it easy to deploy vLLM Semantic Router on Kubernetes.
Dashboard: A comprehensive web console for managing intelligent routing policies, model configurations, and an interactive chat playground for testing routing decisions in real-time. Visualize routing flows, monitor latency distributions, and fine-tune classification thresholds—all from an intuitive browser-based interface.
5. Ecosystem Integration
vLLM Semantic Router v0.1 integrates seamlessly with the broader AI infrastructure ecosystem:
Inference Frameworks:
vLLM Production Stack – Reference stack for production vLLM deployment with Helm charts, request routing, and KV cache offloading
NVIDIA Dynamo – Datacenter-scale distributed inference framework for multi-GPU, multi-node serving with disaggregated prefill/decode
llm-d – Kubernetes-native distributed inference stack for achieving SOTA performance across accelerators (NVIDIA, AMD, Google TPU, Intel XPU)
vLLM AIBrix – Open-source GenAI infrastructure building blocks for scalable LLM serving
API Gateways:
Envoy AI Gateway – Unified access to generative AI services built on Envoy Gateway with multi-provider support
Istio – Open-source service mesh for enterprise deployments with traffic management, security, and observability
6. MoM (Mixture of Models) Family
We're proud to introduce the MoM Family—a comprehensive suite of specialized models purpose-built for semantic routing:
| Model | Purpose |
| --- | --- |
| `mom-domain-classifier` | MMLU-based domain classification |
| `mom-pii-classifier` | PII detection and protection |
| `mom-jailbreak-classifier` | Prompt injection detection |
| `mom-halugate-sentinel` | Fact-check classification |
| `mom-halugate-detector` | Token-level hallucination detection |
| `mom-halugate-explainer` | NLI-based explanation |
| `mom-toolcall-sentinel` | Tool selection classification |
| `mom-toolcall-verifier` | Tool call verification |
| `mom-feedback-detector` | User feedback analysis |
| `mom-embedding-x` | Semantic embedding extraction |
All MoM models are specifically trained and optimized for vLLM Semantic Router, providing consistent performance across routing scenarios.
7. Responses API Support
We now support the OpenAI Responses API (/v1/responses) with in-memory conversation state management:
Stateful Conversations: Built-in state management with previous_response_id chaining
Multi-turn Context: Automatic context preservation across conversation turns
Routing Continuity: Intent classification history maintained across the conversation
This enables intelligent routing for modern agent frameworks and multi-turn applications.
8. Tool Selection
Intelligent tool management for agentic workflows:
Semantic Tool Filtering: Automatically filter irrelevant tools before sending to LLM
Context-Aware Selection: Consider conversation history and task requirements
Reduced Token Usage: Smaller tool catalogs mean faster inference and lower costs
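The idea behind semantic tool filtering can be sketched with a toy relevance score. Here a bag-of-words Jaccard overlap stands in for real neural embeddings, and the tool catalog is invented for illustration:

```python
def similarity(a: str, b: str) -> float:
    # Toy stand-in for embedding cosine similarity: Jaccard word overlap.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def filter_tools(query: str, tools: dict, top_k: int = 2) -> list[str]:
    """Keep only the top_k tools whose descriptions best match the query."""
    ranked = sorted(tools, key=lambda name: similarity(query, tools[name]),
                    reverse=True)
    return ranked[:top_k]

tools = {
    "get_weather": "get the current weather forecast for a city",
    "send_email": "send an email message to a recipient",
    "search_flights": "search for flights between two airports",
}
print(filter_tools("what is the weather forecast in Paris", tools, top_k=1))
```

Sending only the surviving tools to the LLM shrinks the tool catalog in the prompt, which is where the token and latency savings come from.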
Looking Ahead: v0.2 Roadmap
While v0.1 Iris establishes a solid foundation, we're already planning significant enhancements for v0.2:
Signal-Decision Architecture Enhancements
More Signal Types: Extract additional valuable signals from user queries
Improved Accuracy: Enhance existing signal computation precision
Signal Composer: Design a signal composition layer for complex signal extraction and improved performance
Model Selection Algorithms
Building on the Signal-Decision foundation, we're researching intelligent model selection algorithms:
Tool Completion: Auto-complete tool definitions and tool calls based on detected intent
Advanced Tool Filtering: More sophisticated relevance filtering
UX & Operations
Dashboard Enhancements: Improved visualization and management capabilities
Helm Chart Improvements: More configuration options and deployment patterns
Evaluation
Working with the RouterArena team on comprehensive router evaluation frameworks
Acknowledgments
vLLM Semantic Router v0.1 Iris represents a truly global collaboration. We gratefully acknowledge the contributions from organizations including Red Hat, IBM Research, AMD, Hugging Face, and many others.
We're proud to welcome our growing committer community:
And to the 50+ contributors who helped make this release possible—thank you!
Get Started
Ready to try vLLM Semantic Router v0.1 Iris?
```bash
pip install vllm-sr
```
Join the Community
We believe the future of intelligent routing is built together. Whether you're a company looking to integrate intelligent routing into your AI infrastructure, a researcher exploring new frontiers in semantic understanding, or an individual developer passionate about open-source AI—we welcome your participation.
Ways to contribute:
Organizations: Partner with us on integrations, sponsor development, or contribute engineering resources
Researchers: Collaborate on papers, propose new algorithms, or help benchmark performance
Developers: Submit PRs, report issues, improve documentation, or build community plugins
Community: Share use cases, write tutorials, translate docs, or help answer questions
Every contribution matters—from fixing a typo to architecting a new feature. Join us in shaping the next generation of semantic routing infrastructure.