vLLM Semantic Router v0.3 Themis: From Signals to Stateful Production Routing
vLLM Semantic Router v0.3, codename Themis, is where semantic routing becomes stateful, observable, and production-ready for real AI traffic.
The previous two releases set the stage. Iris made routing decisions composable. Athena rebuilt the model foundation and expanded the router into memory, safety, model selection, long-context signal handling, OpenClaw orchestration, and AMD ROCm deployment. Themis takes the next step: it makes those capabilities easier to operate, easier to inspect, and harder to misuse.
Since v0.2.0, the project has added more than 350 commits across router core, CLI, dashboard, DSL, Kubernetes, protocol compatibility, model selection, safety, replay, and release readiness. The largest value in v0.3 is not a single feature. It is the convergence of those pieces into one stable contract:
signals become projections, projections feed decisions, decisions choose algorithms, and algorithms select models.
That contract now shows up consistently in the router, the CLI, the dashboard, the DSL, the Helm chart, and the operator-oriented deployment surfaces.

Why Themis?
Themis represents order, rules, and judgment. That is the right symbol for this release.
Semantic routing is only useful in production if operators can answer basic questions:
- Which signals fired?
- Which decision matched?
- Which model-selection algorithm ran?
- Which model was selected?
- Which safety or replay plugin changed the path?
- Which config version produced this behavior?
- Can the same policy be deployed locally, through the dashboard, and in Kubernetes without becoming three different systems?
Themis is about making those answers explicit. v0.3 keeps the ambition of Athena, but puts stronger boundaries around the runtime, the API surface, and the operational workflow.

What's New in v0.3 Themis?
1. A Canonical v0.3 Configuration Contract
The most important Themis change is the new canonical config shape:

    version: v0.3
    listeners: []
    providers: {}
    routing: {}
    global: {}

Before v0.3, users could encounter overlapping layouts across local Docker, dashboard-generated config, Helm values, CRDs, examples, and older docs. Themis makes config.yaml the steady-state file and aligns the system around the same top-level architecture everywhere.
That cleanup also removes vllm-sr init. The new flow is simpler:
- use vllm-sr serve from an empty directory for dashboard-first setup
- author canonical config.yaml directly for YAML-first workflows
- migrate older files with vllm-sr config migrate --config old-config.yaml
- import supported provider inventories with vllm-sr config import
This is a breaking change, but it is the right kind of breaking change for a pre-1.0 router: fewer config dialects, clearer ownership, and a more durable public contract.
The config path is also stricter at the edges. v0.3 warns on unknown YAML fields, keeps canonical config loading covered by tests, aligns Python CLI models with modern Pydantic configuration, and gates classifier assets more explicitly. The goal is simple: typos and stale config shapes should be caught before they become silent routing drift.
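To make the unknown-field check concrete, here is a minimal sketch that assumes the config has already been parsed into a Python dict. The function name `find_unknown_top_level_keys` is hypothetical and this is not the router's actual loader; only the five top-level keys come from the canonical shape shown earlier.

```python
# Illustrative sketch only: not the router's real config loader.
# The key set mirrors the canonical v0.3 top-level shape from this post.
CANONICAL_TOP_LEVEL_KEYS = {"version", "listeners", "providers", "routing", "global"}


def find_unknown_top_level_keys(config: dict) -> list:
    """Return top-level keys that are not part of the canonical v0.3 shape."""
    return sorted(key for key in config if key not in CANONICAL_TOP_LEVEL_KEYS)


parsed = {
    "version": "v0.3",
    "listeners": [],
    "providers": {},
    "routng": {},  # deliberate typo to trigger a warning
    "global": {},
}

for key in find_unknown_top_level_keys(parsed):
    print(f"warning: unknown top-level field {key!r}")
```

A typo such as `routng` would otherwise leave the routing section silently empty, which is the kind of drift the stricter loading is meant to surface.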

2. Signal, Projection, Decision, Algorithm, Model
Themis makes the router's mental model more explicit:
| Layer | What it owns |
|---|---|
| Signal | Extract evidence from the request, response, tools, language, domain, context, modality, identity, or safety classifiers |
| Projection | Normalize raw evidence into policy-ready concepts such as verification, urgency, feedback, or balance |
| Decision | Match named routing policies with priority and explainable conditions |
| Algorithm | Choose among candidate models inside a matched decision |
| Model | Serve the request through the selected backend alias or provider |
This matters because v0.3 adds enough routing intelligence that implicit behavior is no longer acceptable. The router now has richer signal families, projection traces, advanced model-selection algorithms, and response-side plugins. Themis keeps those surfaces programmable without turning routing policy into hidden application code.
The current signal catalog is broad enough to describe not only the latest user prompt, but also safety posture, tool loops, user roles, multimodal intent, conversation shape, structured events, and replayable knowledge-base evidence:
| Signal family | What it captures | Typical use |
|---|---|---|
| authz | Role and subject bindings from user or group context | Premium/admin routing, policy-gated models |
| complexity | Reasoning difficulty from learned or composed signals | Escalate hard synthesis and multi-step reasoning |
| context | Estimated context-window demand | Long-context routing, cost and latency decisions |
| conversation | Message and tool-loop shape | Multi-turn, active tool use, developer messages, heavy non-user context |
| domain | Learned or configured domain labels | Business, law, health, computer-science routing |
| embedding | Semantic similarity against candidate anchors, including text/image/audio query modality | Support intent, clinical intent, multimodal request matching |
| event | Structured event metadata, severity, action codes, and temporal urgency | Incident, payment, audit, or operational event routing |
| fact_check | Whether a request needs factual verification | Escalate legal, medical, or factual claims |
| jailbreak | Prompt-injection and jailbreak evidence, including history-aware scanning | Safety routing and response-side guardrails |
| kb | Knowledge-base group or label matches | Privacy policy, containment, frontier reasoning, local standard routes |
| keyword | Literal, fuzzy, BM25, or n-gram keyword evidence | Fast route guards, urgent keywords, sensitive terms |
| language | Detected language with configurable confidence | Locale-aware routing and multilingual model choice |
| modality | AR, diffusion, or mixed text/image execution needs | Choose text-only, image-generation, or multimodal paths |
| pii | Sensitive entity policy, including history-aware scanning | Redaction, deny/allow decisions, privacy routes |
| preference | User style or behavior preference examples | Terse answers, detailed answers, domain-specific style |
| reask | Repeated or rephrased user turns | Detect likely dissatisfaction in prior turns |
| structure | Regex, count, sequence, or density features | Many questions, numbered workflows, format-heavy prompts |
| user_feedback | User says an answer was wrong or needs clarification | Recover from dissatisfaction or route to stronger models |
Projection outputs are referenced with type: projection, but they are derived routing surfaces rather than another raw signal family. That distinction matters: signals extract evidence, while projections turn evidence into named policy bands such as support_fast, support_balanced, or support_escalated.
The main v0.3 additions are not just more signal names. The release makes signals composable: conversation signals can detect agentic request shape; event signals can route operational payloads; embedding rules can query non-text modalities; and projection outputs can turn noisy evidence into policy-ready bands.
The dashboard topology view, the DSL editor, the compiler/decompiler, and runtime metrics were updated to understand these v0.3 surfaces instead of silently dropping or hiding them.
The policy-authoring surface is also stronger. The routing DSL gained conflict detection, SIGNAL_GROUP, TEST, and TIER authoring constructs, a natural-language-to-DSL pipeline, EMIT retention, and dynamic tool retrieval support. That matters for production teams because Themis policies are not just parsed YAML; they are reviewable routing programs with tests, retained outputs, and safer generation paths.

3. Session-Aware Agentic Routing
Themis includes the first production-ready version of Session-Aware Agentic Routing (SAAR).
Single-turn routing asks:
Which model should handle this prompt?
Agentic routing also has to ask:
Is it safe to switch models inside this session right now?
SAAR adds router-owned session memory, hard locks around tool loops, provider-state portability checks, idle and decision-drift reset boundaries, switch economics, and replayable diagnostics. It keeps the normal Semantic Router pipeline, but wraps model selection with session continuity rules.
This is especially important for coding agents and long-horizon tool loops. A tool result should usually return to the model that asked for the tool. A provider-managed continuation id should not be sent to a different physical backend. A long warm session should not throw away prefix locality just because the latest user message is short.
Themis makes those constraints part of the model-selection policy instead of asking every application to rediscover them.

The key design choice is that SAAR does not replace semantic routing. It adds a stateful guard around the last mile of model selection:
- conversation signals identify multi-turn shape, active tool use, developer messages, and heavy non-user context.
- session_aware selection evaluates whether a model switch is worth it after considering quality gap, switch margin, stay bias, prefix locality, and remaining-turn priors.
- Hard locks stop unsafe switches during active tool loops or provider-state continuations.
- Router-owned memory can retrieve and store route-local facts, preferences, and context without exposing a separate session-state DSL.
- Replay records preserve the reason a session stayed, switched, or reset.
Router memory is the durable complement to session-aware selection. The memory plugin can preserve facts, preferences, and retrieved context under user or session scope; session_aware can then avoid treating every turn as an isolated request. In practice, that means an agent can keep useful continuity without pinning every request to the most expensive model forever.
The reference policy shape is intentionally ordinary YAML:

    routing:
      signals:
        conversation:
          - name: active_tool_use
            feature:
              type: count
              source:
                type: assistant_tool_cycle
            predicate:
              gte: 1
      decisions:
        - name: agentic_session_route
          rules:
            operator: AND
            conditions:
              - type: conversation
                name: active_tool_use
          algorithm:
            type: session_aware
            session_aware:
              base_method: hybrid
              tool_loop_hard_lock: true
              context_portability_hard_lock: true
              prefix_cache_weight: 0.20
              handoff_penalty_weight: 1.0
          plugins:
            - type: memory
              configuration:
                enabled: true
                retrieval_limit: 6
                auto_store: true
                hybrid_search: true

That is the part of Themis that matters most for agentic workloads: the router can now reason about continuity, not only classification.
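The stay-or-switch reasoning can be sketched in a few lines of Python. This is a conceptual illustration under stated assumptions, not the router's implementation: the `SessionState` fields, the `choose_model` function, and the scoring formula (switch margin plus a prefix-locality penalty) are all hypothetical. Only the two hard-lock ideas and the `prefix_cache_weight` parameter name come from the text above.

```python
from dataclasses import dataclass


@dataclass
class SessionState:
    current_model: str
    tool_loop_active: bool       # an assistant tool call is still awaiting its result
    provider_state_pinned: bool  # a provider-managed continuation id is in use
    prefix_locality: float       # 0..1, hypothetical estimate of warm prefix reuse


def choose_model(state: SessionState, candidate: str, quality_gap: float,
                 switch_margin: float = 0.10, prefix_cache_weight: float = 0.20,
                 tool_loop_hard_lock: bool = True,
                 context_portability_hard_lock: bool = True) -> tuple:
    """Return (model, reason). quality_gap is candidate quality minus current quality."""
    # Hard locks win before any scoring happens.
    if tool_loop_hard_lock and state.tool_loop_active:
        return state.current_model, "hard_lock:tool_loop"
    if context_portability_hard_lock and state.provider_state_pinned:
        return state.current_model, "hard_lock:provider_state"
    # Assumed scoring rule: a warm prefix cache raises the bar for switching.
    required_gain = switch_margin + prefix_cache_weight * state.prefix_locality
    if quality_gap > required_gain:
        return candidate, "switch"
    return state.current_model, "stay"


in_tool_loop = SessionState("small", True, False, 0.8)
idle_session = SessionState("small", False, False, 0.8)
print(choose_model(in_tool_loop, "large", quality_gap=0.5))
print(choose_model(idle_session, "large", quality_gap=0.5))
```

Returning a reason string alongside the model mirrors the idea that replay records should preserve why a session stayed, switched, or reset.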
4. Projections Turn Evidence Into Policy
Signals are raw evidence. Projections are where Themis turns that evidence into named, stable policy concepts.
Without projections, a complex policy has to repeat low-level signal details across many decisions: exact embedding rule names, complexity thresholds, context boundaries, and knowledge-base scores. With projections, the router can compute the raw evidence once, derive a reusable output such as support_fast or support_escalated, and let decisions route on that derived concept.
Themis supports three core projection patterns:
- partitions choose one winner from an exclusive family, such as competing support intents.
- scores combine declared signals or knowledge-base metrics into a continuous value.
- mappings turn those values into policy bands through calibrated thresholds.
For policies that need more than one derived output, v0.3 also adds multi_emit projection mappings. That lets a single projection step emit multiple named routing concepts while still preserving traceability in replay.
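As a rough illustration of the partition pattern, the sketch below picks a single winner from an exclusive family. It assumes matched signals arrive as a name-to-confidence mapping and that the highest confidence wins; both the function name `pick_partition_winner` and that tie-breaking rule are assumptions for illustration, not documented router behavior.

```python
def pick_partition_winner(confidences: dict, members: list, default: str) -> str:
    """Exclusive partition: keep the highest-confidence member that fired.

    confidences holds only signals that matched, keyed by signal name.
    If no member fired, fall back to the configured default.
    """
    fired = {name: confidences[name] for name in members if name in confidences}
    if not fired:
        return default
    return max(fired, key=fired.get)


members = ["technical_support", "account_management"]
matched = {"technical_support": 0.81, "account_management": 0.77}
print(pick_partition_winner(matched, members, default="technical_support"))
```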

A compact example looks like this:

    routing:
      signals:
        embeddings:
          - name: technical_support
            threshold: 0.75
            aggregation_method: max
            candidates:
              - installation guide
              - troubleshooting steps
          - name: account_management
            threshold: 0.72
            aggregation_method: any
            candidates:
              - password reset
              - billing information
        context:
          - name: long_context
            min_tokens: 32K
            max_tokens: 256K
      projections:
        partitions:
          - name: support_intents
            semantics: exclusive
            members:
              - technical_support
              - account_management
            default: technical_support
        scores:
          - name: request_difficulty
            method: weighted_sum
            inputs:
              - type: embedding
                name: technical_support
                weight: 0.18
                value_source: confidence
              - type: context
                name: long_context
                weight: 0.18
        mappings:
          - name: request_band
            source: request_difficulty
            method: threshold_bands
            outputs:
              - name: support_fast
                lte: 0.20
              - name: support_escalated
                gte: 0.45
      decisions:
        - name: escalated_support_route
          rules:
            operator: AND
            conditions:
              - type: projection
                name: support_escalated

Projection traces are also stored with replay records, so the dashboard can explain not only which signal fired, but also which derived policy band caused the final route.
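The score and mapping steps can be approximated in plain Python. This is a sketch under assumptions: the function names are hypothetical, signals that did not fire are assumed to contribute zero, and the weights in the usage lines are made-up values chosen so that every band is reachable. Only the band names and the `lte`/`gte` keys follow the YAML above.

```python
def weighted_sum(inputs: list, evidence: dict) -> float:
    """inputs: (signal_name, weight) pairs; evidence: signal_name -> value in [0, 1].

    Signals missing from evidence are treated as 0 (an assumption).
    """
    return sum(weight * evidence.get(name, 0.0) for name, weight in inputs)


def threshold_band(score: float, bands: list):
    """Return the first band whose lte/gte bounds admit the score, else None."""
    for band in bands:
        if "lte" in band and score > band["lte"]:
            continue
        if "gte" in band and score < band["gte"]:
            continue
        return band["name"]
    return None


# Hypothetical weights for illustration only.
inputs = [("technical_support", 0.5), ("long_context", 0.5)]
bands = [
    {"name": "support_fast", "lte": 0.20},
    {"name": "support_escalated", "gte": 0.45},
]

hard = weighted_sum(inputs, {"technical_support": 0.9, "long_context": 1.0})
easy = weighted_sum(inputs, {"technical_support": 0.3})
print(threshold_band(hard, bands), threshold_band(easy, bands))
```

Scores that fall between the two thresholds match no band in this sketch, so a decision that routes on `support_escalated` would simply not fire for them.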
5. Protocol Compatibility Becomes a Release Surface
v0.3 expands the router's compatibility boundary beyond basic OpenAI Chat Completions.
The protocol work in this cycle includes:
- native Anthropic /v1/messages ingress through an internal request envelope
- Anthropic streaming with OpenAI SSE translation
- custom Anthropic upstream routing and tool-calling support
- outbound Anthropic response emission for non-streaming paths
- protocol detection from request path headers
- session-id mirroring and header pass-through controls
- response headers that explain when protocol translation is lossy
- Responses API tool-trace fidelity and OpenAI SDK-aligned message handling
- OpenAI reasoning-effort mutation fixes
- identity-encoded upstream responses to avoid transparent decompression surprises
- stronger Responses API state and persistence paths
The goal is not to make every provider look identical. The goal is to make translation explicit, observable, and safe enough that a logical routing model such as auto can sit in front of multiple provider protocols without surprising operators.
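A minimal sketch of path-based protocol detection, assuming the decision is made from the request path alone. The function name and the returned labels are hypothetical; the three paths are the public Anthropic Messages, OpenAI Chat Completions, and OpenAI Responses endpoints. The router's real detection may also consult headers.

```python
def detect_protocol(path: str) -> str:
    """Map a request path to a protocol label (labels are illustrative)."""
    # Ignore any query string and trailing slash before matching.
    route = path.split("?", 1)[0].rstrip("/")
    if route.endswith("/v1/messages"):
        return "anthropic_messages"
    if route.endswith("/v1/chat/completions"):
        return "openai_chat"
    if route.endswith("/v1/responses"):
        return "openai_responses"
    return "unknown"


for example in ("/v1/messages", "/v1/chat/completions?stream=true", "/healthz"):
    print(example, "->", detect_protocol(example))
```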
6. The Dashboard Becomes an Operator Console
The Themis dashboard is more than a config editor.
The v0.3 cycle tightens the first-run setup flow, topology graph, replay-backed insights, logs, status pages, evaluation flows, auth behavior, and model inventory surfaces. Operators can import a profile, validate it, activate it, send test prompts, inspect signal paths, read router logs, and verify replay records without leaving the dashboard.

Notable dashboard improvements include:
- built-in routing modes and missing-model completion
- topology dry-run paths that show matched signals, projections, decisions, and models
- router replay and aggregate insights through the dashboard proxy
- natural-language DSL builder and evaluation-flow fixes
- file attachments in the playground
- auth fail-closed behavior when the auth service cannot initialize
- policy version lifecycle with shadow, activate, and revert states
- safer logs and URL redaction for user-supplied fetch/open-web requests
- UTF-8-safe display handling for multilingual content
- slimmer production route shell and smaller backend runtime dependencies
- dashboard-aware model list and status surfaces
The result is a better local and remote operator workflow: setup mode for first run, topology for policy inspection, logs/status for operations, and insights for real traffic.
7. CLI and Deployment Are More Predictable
Themis also strengthens vllm-sr as the supported operating interface.
The CLI now has clearer runtime boundaries and more useful commands:

    vllm-sr serve
    vllm-sr serve --algorithm latency_aware
    vllm-sr serve --algorithm session_aware
    vllm-sr serve --platform amd
    vllm-sr serve --platform nvidia
    vllm-sr chat
    vllm-sr eval
    vllm-sr model list
    vllm-sr config migrate --config old-config.yaml

Local vllm-sr serve remains a Docker-based workflow on Linux, macOS, and WSL2. AMD ROCm remains the release-validated GPU path, while --platform nvidia adds local NVIDIA Docker passthrough ergonomics for users who already have the NVIDIA container runtime configured. Native Windows Docker serving is now rejected with an explicit support message rather than failing later in less obvious ways.
The CLI also grows better inspection and smoke-test commands. vllm-sr model list surfaces configured model inventory, vllm-sr chat provides a one-shot completion path, vllm-sr eval exercises router evaluation endpoints, and VLLM_SR_DNS lets local containers join custom DNS environments when enterprise or lab networks require it.
On Kubernetes, v0.3 aligns Helm, release defaults, OpenShift deployment fixes, multiple IntelligentRoute reconcile behavior, CRD modality contracts, optional Gateway API HTTPRoute ingress, and AgentGateway installation guidance. For release operations, Themis also moves away from vague latest assumptions and toward explicit artifact contracts, upgrade and rollback documentation, and release checks.
8. Safety, Replay, Memory, and Retrieval Are More Trustworthy
Athena brought many of these capabilities into the router. Themis hardens them.
Key runtime fixes and improvements now fall into three groups:
Replay and observability
- router replay PostgreSQL insert correctness so dashboard insights do not silently stay empty
- projection traces stored with replay records for better explainability
- response-side jailbreak and replay path tightening
Storage and retrieval
- Qdrant vector search provider support
- Valkey cache, vector store, and memory backend support, including TLS and search-module prechecks
- Redis and Responses API storage defaults that better match real local and Kubernetes deployments
- hybrid cache rebuild preallocation reduction
- streaming Redis semantic-cache correctness and bounded streaming chunk memory behavior
- O(N) cache-LRU read paths replaced with a constant-time list-backed implementation
- BM25 and n-gram classification caching to avoid amplified work
- hybrid HNSW entry-point propagation fixes
- shared Milvus lifecycle handling across replay, cache, memory, and vector store paths
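To illustrate the constant-time LRU idea from the list above, here is a small Python sketch built on `collections.OrderedDict`, which keeps entries in a linked order so that touching and evicting are both O(1). It is a conceptual stand-in, not the router's Go implementation, and the class name is hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Least-recently-used cache with O(1) get, put, and eviction."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        # Mark as most recently used without scanning other entries.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            # Evict the least recently used entry from the front.
            self._items.popitem(last=False)


cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" becomes most recent
cache.put("c", 3)   # evicts "b"
```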
Runtime and security hardening
- history-aware PII and jailbreak signal scanning across prior user turns
- model switch gate fixes for previous-model population
- goroutine panic recovery in extproc background paths
- concurrency race fixes in selection randomness
- path traversal protection for config rollback versions
- dependency security updates across Python, Go, Rust, and frontend surfaces
This is the less flashy part of the release, but it is exactly what Themis is for: making the system safer under real traffic, long prompts, replay storage, and operator-driven config changes.
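One common way to guard a rollback-by-version lookup against path traversal is to resolve the requested path and confirm it still sits inside the versions directory. The sketch below shows that general technique; the function name and directory layout are hypothetical, and it is not taken from the router's code.

```python
from pathlib import Path


def resolve_version_path(versions_dir: str, version_name: str) -> Path:
    """Resolve a version file and reject anything outside versions_dir."""
    base = Path(versions_dir).resolve()
    candidate = (base / version_name).resolve()
    # After resolution, ".." segments and absolute names can escape the base.
    if candidate == base or base not in candidate.parents:
        raise ValueError(f"invalid config version: {version_name!r}")
    return candidate


print(resolve_version_path("/tmp/versions", "v12.yaml").name)
try:
    resolve_version_path("/tmp/versions", "../../etc/passwd")
except ValueError as err:
    print("rejected:", err)
```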
9. Long-Context Routing Gets Cheaper
Themis adds three important long-context controls.
First, context token estimation can now learn an online calibration ratio from observed response usage, so context-sensitive routing can improve when exact tokenization is unavailable. The fallback remains conservative, but the router can adapt to real traffic over time.
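One plausible shape for such a calibration loop is an exponential moving average over the observed characters-per-token ratio. The sketch below assumes that approach; the class name, the 4.0 starting ratio, and the smoothing factor are illustrative assumptions rather than the router's actual estimator.

```python
import math


class TokenEstimator:
    """Estimate token counts from text length and refine the ratio online."""

    def __init__(self, chars_per_token: float = 4.0, alpha: float = 0.1):
        self.chars_per_token = chars_per_token  # fallback ratio before any feedback
        self.alpha = alpha                      # smoothing factor for updates

    def estimate(self, text: str) -> int:
        # Round up so the estimate errs toward more context, not less.
        return math.ceil(len(text) / self.chars_per_token)

    def observe(self, text: str, actual_prompt_tokens: int) -> None:
        """Fold one observed usage report into the running ratio."""
        if not text or actual_prompt_tokens <= 0:
            return
        observed = len(text) / actual_prompt_tokens
        self.chars_per_token = (
            (1 - self.alpha) * self.chars_per_token + self.alpha * observed
        )


estimator = TokenEstimator(chars_per_token=4.0, alpha=0.5)
before = estimator.estimate("a" * 400)
estimator.observe("a" * 400, actual_prompt_tokens=200)
after = estimator.estimate("a" * 400)
print(before, after)
```

After one observation showing denser tokenization than the fallback assumed, the same text produces a higher estimate, which is the direction a context-sensitive route would need.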
Second, the native mmBERT embedding path now bounds memory without turning long inputs into a silent clipping problem. The #2007 native-binding fix for the long-input memory issue processes attention in query chunks instead of materializing one dense attention tensor for the whole sequence. That keeps the long-context signal available to the router while making the binding usable under larger prompts.
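The query-chunking idea can be shown with a small pure-Python sketch. It is a conceptual illustration, not the native binding: the helper names are hypothetical and plain lists stand in for tensors. Each call to `attend` builds a score matrix only for its block of queries, so chunking bounds that matrix to `chunk_size` rows while producing the same output as the dense computation.

```python
import math


def softmax(row: list) -> list:
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries: list, keys: list, values: list) -> list:
    """Dense attention for one block: softmax(Q K^T / sqrt(d)) V."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    # Score matrix for this block only: len(queries) x len(keys).
    scores = [[scale * sum(a * b for a, b in zip(q, k)) for k in keys]
              for q in queries]
    weights = [softmax(row) for row in scores]
    width = len(values[0])
    return [[sum(w * v[j] for w, v in zip(row, values)) for j in range(width)]
            for row in weights]


def chunked_attention(queries: list, keys: list, values: list, chunk_size: int) -> list:
    """Run attention over query chunks so at most chunk_size score rows exist at once."""
    out = []
    for start in range(0, len(queries), chunk_size):
        out.extend(attend(queries[start:start + chunk_size], keys, values))
    return out


queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [-1.0, 2.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

full = attend(queries, keys, values)
chunked = chunked_attention(queries, keys, values, chunk_size=2)
```

Because each query row is normalized independently, splitting along the query axis changes peak memory but not the result, which is why no input needs to be clipped.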

Third, prompt compression becomes a named profile surface for signal extraction:
| Profile | Intended use |
|---|---|
| default | Balanced compression for general routing |
| coding | Preserve code-like and implementation-heavy sentences |
| medical | Preserve clinically relevant detail |
| security | Preserve safety and policy evidence |
| multi_turn | Preserve conversational continuity |
The compression path is intentionally scoped to signal evaluation. The original user prompt still goes to the selected serving model unless a decision-owned plugin explicitly changes it. That separation keeps routing optimization from silently rewriting user intent.
10. Hardware Backend Paths Broaden
Themis broadens the router-owned model execution story beyond the default local path.
The broadened map separates four paths: NVIDIA CUDA and AMD ROCm for served vLLM backends, Intel OpenVINO for router-owned classifier and embedding inference, and CPU/local execution for development and smoke tests.
On Intel infrastructure, v0.3 adds an initial OpenVINO binding for Semantic Router. The new binding provides native C++ and Go integration for ModernBERT sequence classification, token classification, and embedding inference, with benchmark entrypoints that compare OpenVINO and Candle behavior for classifier and embedding workloads.
This is a backend and binding milestone, not a blanket production-parity claim. It gives contributors and hardware partners a concrete path to validate Semantic Router's internal classifier and embedding models on Intel OpenVINO while preserving the same routing contract used by the rest of Themis.

The AMD deployment path introduced in Athena also remains part of the v0.3 release contract.
The reference flow is still:

    vllm-sr serve --platform amd

For real AMD deployments, the project keeps the maintained deploy/recipes/balance.yaml profile, which exposes multiple served aliases through a ROCm vLLM backend and routes them through the same signal, projection, decision, and model-selection pipeline as the CPU/local path.
As part of release readiness, Themis was validated on an AMD ROCm stack with:
- a ROCm vLLM backend exposing the expected served aliases
- dashboard setup import, validate, and activate using the reference balance profile
- router health and Envoy OpenAI-compatible /v1/models
- topology dry-run for a coding/debug request
- direct Envoy chat completions for coding, math, and legal prompts
- dashboard proxy chat completions
- router replay list and aggregate insight APIs

That end-to-end path is important because Semantic Router is meant to be a control plane across heterogeneous inference stacks, not only a local development tool.
11. RouterArena SOTA Refresh
Themis also comes with an external leaderboard signal: in the RouterArena snapshot captured for this release update, vLLM-SR returned to #1 on the RouterArena leaderboard.
In that public RouterArena leaderboard snapshot, vLLM-SR is ranked first by weighted Arena Score with a score of 75.4, ahead of Sqwish Router, AgentForge Router, Nadir Router, and other published router baselines. The same snapshot reports 76.0 accuracy, $0.11 cost per 1K queries, and 73.1 robustness for vLLM-SR.

This is not a substitute for release testing, but it is a useful outside check on the project direction. Themis improves routing policy, cost-aware selection, protocol compatibility, and operational traceability while keeping the router competitive on independent router benchmarks.
What Changed Since v0.2?
At a high level, the v0.2 to v0.3 delta looks like this:
| Area | Themis value |
|---|---|
| API and config | Canonical v0.3 contract across local, dashboard, Helm, and operator paths |
| Router core | Richer signals, projections, response state, replay, safety, and selection algorithms |
| Model selection | Session-aware, multi-factor, latency-aware, RL-driven, hybrid, and other algorithm surfaces |
| Protocols | Stronger OpenAI and Anthropic compatibility with explicit translation behavior |
| Dashboard | Setup, topology, status, logs, insights, replay, auth, and model inventory hardening |
| CLI | Clearer serve modes, model inspection, chat/eval commands, config migration, platform boundaries |
| Deployment | AMD ROCm path, OpenVINO binding, NVIDIA local passthrough ergonomics, Helm/OpenShift/Gateway API fixes, release artifact contracts |
| Storage and retrieval | Valkey, Qdrant, Redis, Milvus, replay, cache, memory, and vector-store lifecycle hardening |
| Reliability | Chunked mmBERT attention, UTF-8-safe display handling, secure logging, streaming cache correctness, replay correctness, concurrency fixes |
That is the core Themis story: the router is more capable, but also more constrained in the right places.
Get Started
For macOS or Linux:

    curl -fsSL https://vllm-semantic-router.com/install.sh | bash

For manual installation:

    pip install vllm-sr==0.3.0
    vllm-sr serve

If the current directory does not contain config.yaml, vllm-sr serve starts the dashboard in setup mode. For YAML-first users, create a canonical v0.3 config directly or migrate an older file:

    vllm-sr config migrate --config old-config.yaml
    vllm-sr serve --config config.yaml

For AMD ROCm:

    vllm-sr serve --platform amd

For local NVIDIA Docker passthrough:

    vllm-sr serve --platform nvidia

For Kubernetes:

    helm install semantic-router oci://ghcr.io/vllm-project/charts/semantic-router

See the project resources:
- Documentation: vllm-semantic-router.com
- GitHub: vllm-project/semantic-router
- Reference AMD profile: deploy/recipes/balance.yaml
- Models: Hugging Face
Looking Ahead: v0.4 Hermes
The next release codename is Hermes.
Themis makes the contract stable enough to operate. Hermes should make the router faster to improve, easier to evaluate, and safer to adapt under real workloads. The core Hermes goal is a self-improving router. The loop is deliberate: run auto research for router performance at GPU scale, tune DSL recipes with router evaluation, then feed validated evidence back into the codebase and encoder-model fine-tuning. The highest-value work is:
- Self-improving router as the Hermes core goal: close the loop across GPU-scale performance research, DSL recipe tuning, and codebase plus encoder-model fine-tuning. Every generated change still has to be reviewable, replayable, versioned, and rollback-safe.
- SAAR as the agentic routing layer: continue tightening model-switch economics, tool-loop continuity, provider-state portability, replay diagnostics, and router memory integration.
- Evaluation as a release gate: build system-level and signal-level evaluation so every signal, projection, algorithm, plugin, and dashboard path can be replayed against representative traffic before release.
- CLI-first design: make sure every Semantic Router operation can close the loop through vllm-sr, including config authoring, migration, serving, inspection, evaluation, replay, policy lifecycle, dashboard import/export, and release smoke tests.
- Better router-owned models: improve accuracy and latency for the models the router itself uses, including embedding, classifier, multimodal, and safety signal models.
- More useful signals: add richer request, response, tool, modality, identity, freshness, latency, cost, and runtime-health signals without turning the DSL into application code.
- Operator debugging loop: make what-if routing, policy replay, evaluation-driven tuning, and trace comparison first-class dashboard workflows.

Acknowledgments
From v0.2.0 to v0.3.0, the Themis cycle includes more than 350 commits from 80+ contributor author identities. Thank you to everyone who reviewed code, improved docs, trained models, hardened tests, fixed release blockers, and pushed the router toward a more stable production shape.
We separately thank collaborators from research institutions and universities, including MBZUAI, McGill University, Mila, and Rice University, for contributions and collaboration across router evaluation, model research, and AI systems.
We also thank the broader vLLM, AMD, Intel, Meta, Red Hat, Microsoft, Google, IBM, NVIDIA, Hugging Face, NASA, Nutanix, DaoCloud, and open-source communities for continued collaboration across runtime systems, model serving, model research, and production AI infrastructure.
Welcome to Themis: from signals to stateful production routing.