vLLM Semantic Router v0.3 Themis: From Signals to Stateful Production Routing

June 5, 2026 · 22 min read

vLLM Semantic Router Team

vLLM Semantic Router v0.3, codename Themis, is where semantic routing becomes stateful, observable, and production-ready for real AI traffic.

The previous two releases set the stage. Iris made routing decisions composable. Athena rebuilt the model foundation and expanded the router into memory, safety, model selection, long-context signal handling, OpenClaw orchestration, and AMD ROCm deployment. Themis takes the next step: it makes those capabilities easier to operate, easier to inspect, and harder to misuse.

Since v0.2.0, the project has added more than 350 commits across router core, CLI, dashboard, DSL, Kubernetes, protocol compatibility, model selection, safety, replay, and release readiness. The largest value in v0.3 is not a single feature. It is the convergence of those pieces into one stable contract:

signals become projections, projections feed decisions, decisions choose algorithms, and algorithms select models.

That contract now shows up consistently in the router, the CLI, the dashboard, the DSL, the Helm chart, and the operator-oriented deployment surfaces.

Figure 1: Themis turns signals, policy, operators, and model backends into one inspectable routing control plane.

Why Themis?

Themis represents order, rules, and judgment. That is the right symbol for this release.

Semantic routing is only useful in production if operators can answer basic questions:

Which signals fired?
Which decision matched?
Which model-selection algorithm ran?
Which model was selected?
Which safety or replay plugin changed the path?
Which config version produced this behavior?
Can the same policy be deployed locally, through the dashboard, and in Kubernetes without becoming three different systems?

Themis is about making those answers explicit. v0.3 keeps the ambition of Athena, but puts stronger boundaries around the runtime, the API surface, and the operational workflow.

Figure 2: The release value is not one isolated feature. It is the connection between stable contracts, inspection, operations, serving, long context, and validation.

What's New in v0.3 Themis?

1. A Canonical v0.3 Configuration Contract

The most important Themis change is the new canonical config shape:

version: v0.3
listeners: []
providers: {}
routing: {}
global: {}

Before v0.3, users could encounter overlapping layouts across local Docker, dashboard-generated config, Helm values, CRDs, examples, and older docs. Themis makes config.yaml the steady-state file and aligns the system around the same top-level architecture everywhere.

That cleanup also removes vllm-sr init. The new flow is simpler:

use vllm-sr serve from an empty directory for dashboard-first setup
author canonical config.yaml directly for YAML-first workflows
migrate older files with vllm-sr config migrate --config old-config.yaml
import supported provider inventories with vllm-sr config import

This is a breaking change, but it is the right kind of breaking change for a pre-1.0 router: fewer config dialects, clearer ownership, and a more durable public contract.

The config path is also stricter at the edges. v0.3 warns on unknown YAML fields, keeps canonical config loading covered by tests, aligns Python CLI models with modern Pydantic configuration, and gates classifier assets more explicitly. The goal is simple: typos and stale config shapes should be caught before they become silent routing drift.

Figure 3: Local YAML, CLI, dashboard, and Kubernetes now converge on the same canonical v0.3 config shape.
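The unknown-field warning described above can be sketched in a few lines of Python. This is an illustration, not the router's validator: the function names and warning text are hypothetical, and only the five top-level keys come from the canonical shape shown earlier.

```python
# Illustrative sketch only: not the router's actual validator.
# Only the five top-level keys are taken from the canonical v0.3 shape.
CANONICAL_TOP_LEVEL_KEYS = {"version", "listeners", "providers", "routing", "global"}


def find_unknown_top_level_keys(config: dict) -> list:
    """Return top-level keys that are not part of the canonical v0.3 shape."""
    return sorted(key for key in config if key not in CANONICAL_TOP_LEVEL_KEYS)


def validate(config: dict) -> list:
    """Collect human-readable warnings instead of ignoring stale fields."""
    warnings = []
    if config.get("version") != "v0.3":
        warnings.append(f"unexpected version: {config.get('version')!r}")
    for key in find_unknown_top_level_keys(config):
        warnings.append(f"unknown top-level field: {key!r}")
    return warnings


if __name__ == "__main__":
    # "routng" is a deliberate typo that a lenient loader would silently drop.
    parsed = {"version": "v0.3", "listeners": [], "providers": {}, "routng": {}, "global": {}}
    for warning in validate(parsed):
        print("WARNING:", warning)
```

Surfacing the typo as a warning is the point: a misspelled routing block would otherwise load as an empty policy and the router would keep serving with no visible error.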

2. Signal, Projection, Decision, Algorithm, Model

Themis makes the router's mental model more explicit:

Layer	What it owns
Signal	Extract evidence from the request, response, tools, language, domain, context, modality, identity, or safety classifiers
Projection	Normalize raw evidence into policy-ready concepts such as verification, urgency, feedback, or balance
Decision	Match named routing policies with priority and explainable conditions
Algorithm	Choose among candidate models inside a matched decision
Model	Serve the request through the selected backend alias or provider

This matters because v0.3 adds enough routing intelligence that implicit behavior is no longer acceptable. The router now has richer signal families, projection traces, advanced model-selection algorithms, and response-side plugins. Themis keeps those surfaces programmable without turning routing policy into hidden application code.
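To make the layering concrete, here is a toy Python sketch of the five layers in the table. Every name in it (the `Decision` class, the signal strings, the cheapest-candidate algorithm) is hypothetical and chosen for illustration; the real router evaluates far richer signals and algorithms.

```python
# Illustrative sketch of the five-layer contract; all names are hypothetical.
from dataclasses import dataclass


@dataclass
class Decision:
    name: str
    priority: int
    required: set      # signal or projection names that must all be present
    candidates: list   # model aliases the algorithm may choose from


def extract_signals(prompt: str) -> set:
    """Signal layer: pull raw evidence out of the request."""
    signals = set()
    if "urgent" in prompt.lower():
        signals.add("keyword:urgent")
    if len(prompt.split()) > 50:
        signals.add("context:long")
    return signals


def project(signals: set) -> set:
    """Projection layer: normalize raw evidence into policy-ready names."""
    return {"projection:escalate"} if "keyword:urgent" in signals else {"projection:fast"}


def match_decision(evidence: set, decisions: list):
    """Decision layer: highest-priority decision whose conditions all hold."""
    matched = [d for d in decisions if d.required <= evidence]
    return max(matched, key=lambda d: d.priority, default=None)


def select_model(decision: Decision, cost: dict) -> str:
    """Algorithm layer: here, simply the cheapest candidate."""
    return min(decision.candidates, key=lambda m: cost[m])


def route(prompt: str, decisions: list, cost: dict) -> dict:
    signals = extract_signals(prompt)
    evidence = signals | project(signals)
    decision = match_decision(evidence, decisions)
    model = select_model(decision, cost) if decision else None
    # Returning each layer's output keeps the final route explainable.
    return {
        "signals": sorted(signals),
        "decision": decision.name if decision else None,
        "model": model,
    }
```

Because each layer returns a named output, the result can answer the operator questions listed earlier: which signals fired, which decision matched, and which model was selected.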

The current signal catalog is broad enough to describe not only the latest user prompt, but also safety posture, tool loops, user roles, multimodal intent, conversation shape, structured events, and replayable knowledge-base evidence:

Signal family	What it captures	Typical use
`authz`	Role and subject bindings from user or group context	Premium/admin routing, policy-gated models
`complexity`	Reasoning difficulty from learned or composed signals	Escalate hard synthesis and multi-step reasoning
`context`	Estimated context-window demand	Long-context routing, cost and latency decisions
`conversation`	Message and tool-loop shape	Multi-turn, active tool use, developer messages, heavy non-user context
`domain`	Learned or configured domain labels	Business, law, health, computer-science routing
`embedding`	Semantic similarity against candidate anchors, including text/image/audio query modality	Support intent, clinical intent, multimodal request matching
`event`	Structured event metadata, severity, action codes, and temporal urgency	Incident, payment, audit, or operational event routing
`fact_check`	Whether a request needs factual verification	Escalate legal, medical, or factual claims
`jailbreak`	Prompt-injection and jailbreak evidence, including history-aware scanning	Safety routing and response-side guardrails
`kb`	Knowledge-base group or label matches	Privacy policy, containment, frontier reasoning, local standard routes
`keyword`	Literal, fuzzy, BM25, or n-gram keyword evidence	Fast route guards, urgent keywords, sensitive terms
`language`	Detected language with configurable confidence	Locale-aware routing and multilingual model choice
`modality`	AR, diffusion, or mixed text/image execution needs	Choose text-only, image-generation, or multimodal paths
`pii`	Sensitive entity policy, including history-aware scanning	Redaction, deny/allow decisions, privacy routes
`preference`	User style or behavior preference examples	Terse answers, detailed answers, domain-specific style
`reask`	Repeated or rephrased user turns	Detect likely dissatisfaction in prior turns
`structure`	Regex, count, sequence, or density features	Many questions, numbered workflows, format-heavy prompts
`user_feedback`	User says an answer was wrong or needs clarification	Recover from dissatisfaction or route to stronger models

Projection outputs are referenced with type: projection, but they are derived routing surfaces rather than another raw signal family. That distinction matters: signals extract evidence, while projections turn evidence into named policy bands such as support_fast, support_balanced, or support_escalated.

The main v0.3 additions are not just more signal names. The release makes signals composable: conversation signals can detect agentic request shape; event signals can route operational payloads; embedding rules can query non-text modalities; and projection outputs can turn noisy evidence into policy-ready bands.

The dashboard topology view, the DSL editor, the compiler/decompiler, and runtime metrics were updated to understand these v0.3 surfaces instead of silently dropping or hiding them.

The policy-authoring surface is also stronger. The routing DSL gained conflict detection, SIGNAL_GROUP, TEST, and TIER authoring constructs, a natural-language-to-DSL pipeline, EMIT retention, and dynamic tool retrieval support. That matters for production teams because Themis policies are not just parsed YAML; they are reviewable routing programs with tests, retained outputs, and safer generation paths.

Figure 4: The routing contract is now visible as a pipeline from request evidence to signal, projection, decision, algorithm, model, and replay.

3. Session-Aware Agentic Routing

Themis includes the first production-ready version of Session-Aware Agentic Routing (SAAR).

Single-turn routing asks:

Which model should handle this prompt?

Agentic routing also has to ask:

Is it safe to switch models inside this session right now?

SAAR adds router-owned session memory, hard locks around tool loops, provider-state portability checks, idle and decision-drift reset boundaries, switch economics, and replayable diagnostics. It keeps the normal Semantic Router pipeline, but wraps model selection with session continuity rules.

This is especially important for coding agents and long-horizon tool loops. A tool result should usually return to the model that asked for the tool. A provider-managed continuation id should not be sent to a different physical backend. A long warm session should not throw away prefix locality just because the latest user message is short.

Themis makes those constraints part of the model-selection policy instead of asking every application to rediscover them.

Figure 5: SAAR keeps multi-turn agent sessions stable by combining router-owned session memory, hard locks, portability checks, switch economics, and replay diagnostics.

The key design choice is that SAAR does not replace semantic routing. It adds a stateful guard around the last mile of model selection:

conversation signals identify multi-turn shape, active tool use, developer messages, and heavy non-user context.
session_aware selection evaluates whether a model switch is worth it after considering quality gap, switch margin, stay bias, prefix locality, and remaining-turn priors.
Hard locks stop unsafe switches during active tool loops or provider-state continuations.
Router-owned memory can retrieve and store route-local facts, preferences, and context without exposing a separate session-state DSL.
Replay records preserve the reason a session stayed, switched, or reset.

Router memory is the durable complement to session-aware selection. The memory plugin can preserve facts, preferences, and retrieved context under user or session scope; session_aware can then avoid treating every turn as an isolated request. In practice, that means an agent can keep useful continuity without pinning every request to the most expensive model forever.

The reference policy shape is intentionally ordinary YAML:

routing:
  signals:
    conversation:
      - name: active_tool_use
        feature:
          type: count
          source:
            type: assistant_tool_cycle
        predicate:
          gte: 1
 
  decisions:
    - name: agentic_session_route
      rules:
        operator: AND
        conditions:
          - type: conversation
            name: active_tool_use
      algorithm:
        type: session_aware
        session_aware:
          base_method: hybrid
          tool_loop_hard_lock: true
          context_portability_hard_lock: true
          prefix_cache_weight: 0.20
          handoff_penalty_weight: 1.0
      plugins:
        - type: memory
          configuration:
            enabled: true
            retrieval_limit: 6
            auto_store: true
            hybrid_search: true

That is the part of Themis that matters most for agentic workloads: the router can now reason about continuity, not only classification.
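The switch evaluation described in this section can be sketched as a small Python function. This is a simplified illustration under stated assumptions: the scoring formula, the `SessionState` fields, and the reason strings are hypothetical, and only the parameter names `prefix_cache_weight` and `handoff_penalty_weight` and the two hard-lock concepts are taken from the YAML above.

```python
# Illustrative sketch only; the scoring formula and field names are assumptions.
from dataclasses import dataclass


@dataclass
class SessionState:
    current_model: str
    in_tool_loop: bool = False            # a tool call is awaiting its result
    has_provider_state: bool = False      # e.g. a provider-managed continuation id
    cached_prefix_fraction: float = 0.0   # share of the prompt already warm, 0..1


def should_switch(
    state: SessionState,
    candidate: str,
    quality_gap: float,                   # candidate quality minus current quality
    switch_margin: float = 0.05,
    prefix_cache_weight: float = 0.20,
    handoff_penalty: float = 0.0,
    handoff_penalty_weight: float = 1.0,
):
    """Return (switch, reason) so the outcome can be stored for replay."""
    if candidate == state.current_model:
        return False, "same_model"
    # Hard locks come first: no score can override them.
    if state.in_tool_loop:
        return False, "tool_loop_hard_lock"
    if state.has_provider_state:
        return False, "context_portability_hard_lock"
    # Switch economics: the gain must beat the cost of leaving a warm session.
    switch_cost = (
        prefix_cache_weight * state.cached_prefix_fraction
        + handoff_penalty_weight * handoff_penalty
    )
    if quality_gap - switch_cost > switch_margin:
        return True, "quality_gain_exceeds_switch_cost"
    return False, "stay_bias"
```

Returning a reason alongside the boolean mirrors the replay requirement: a record of why a session stayed or switched is what makes the behavior debuggable later.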

4. Projections Turn Evidence Into Policy

Signals are raw evidence. Projections are where Themis turns that evidence into named, stable policy concepts.

Without projections, a complex policy has to repeat low-level signal details across many decisions: exact embedding rule names, complexity thresholds, context boundaries, and knowledge-base scores. With projections, the router can compute the raw evidence once, derive a reusable output such as support_fast or support_escalated, and let decisions route on that derived concept.

Themis supports three core projection patterns:

partitions choose one winner from an exclusive family, such as competing support intents.
scores combine declared signals or knowledge-base metrics into a continuous value.
mappings turn those values into policy bands through calibrated thresholds.

For policies that need more than one derived output, v0.3 also adds multi_emit projection mappings. That lets a single projection step emit multiple named routing concepts while still preserving traceability in replay.

Figure 6: Projections transform noisy signal evidence into named outputs that decisions can reference directly.

A compact example looks like this:

routing:
  signals:
    embeddings:
      - name: technical_support
        threshold: 0.75
        aggregation_method: max
        candidates:
          - installation guide
          - troubleshooting steps
      - name: account_management
        threshold: 0.72
        aggregation_method: any
        candidates:
          - password reset
          - billing information
    context:
      - name: long_context
        min_tokens: 32K
        max_tokens: 256K
 
  projections:
    partitions:
      - name: support_intents
        semantics: exclusive
        members:
          - technical_support
          - account_management
        default: technical_support
    scores:
      - name: request_difficulty
        method: weighted_sum
        inputs:
          - type: embedding
            name: technical_support
            weight: 0.18
            value_source: confidence
          - type: context
            name: long_context
            weight: 0.18
    mappings:
      - name: request_band
        source: request_difficulty
        method: threshold_bands
        outputs:
          - name: support_fast
            lte: 0.20
          - name: support_escalated
            gte: 0.45
 
  decisions:
    - name: escalated_support_route
      rules:
        operator: AND
        conditions:
          - type: projection
            name: support_escalated

Projection traces are also stored with replay records, so the dashboard can explain not only which signal fired, but also which derived policy band caused the final route.
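For readers who prefer code to YAML, the weighted_sum and threshold_bands steps can be approximated in Python as follows. This is a sketch, not the router's implementation: it assumes each signal contributes a value between 0 and 1, treats missing signals as 0, and returns no band when a score falls between the declared thresholds. The function names are hypothetical.

```python
# Illustrative sketch only: approximates the projection steps, not the router's code.
def weighted_sum(inputs, evidence):
    """Combine signal values (assumed 0..1) using the declared weights."""
    return sum(item["weight"] * evidence.get(item["name"], 0.0) for item in inputs)


def threshold_band(score, outputs):
    """Return the first band whose lte/gte bounds contain the score, else None."""
    for band in outputs:
        if "lte" in band and score > band["lte"]:
            continue
        if "gte" in band and score < band["gte"]:
            continue
        return band["name"]
    return None


if __name__ == "__main__":
    # Weights here are chosen for illustration and differ from the YAML above.
    inputs = [
        {"name": "technical_support", "weight": 0.5},
        {"name": "long_context", "weight": 0.5},
    ]
    outputs = [
        {"name": "support_fast", "lte": 0.20},
        {"name": "support_escalated", "gte": 0.45},
    ]
    score = weighted_sum(inputs, {"technical_support": 0.9, "long_context": 1.0})
    print(score, threshold_band(score, outputs))
```

Decisions then reference only the band name, so retuning a weight or threshold does not require touching every decision that depends on it.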

5. Protocol Compatibility Becomes a Release Surface

v0.3 expands the router's compatibility boundary beyond basic OpenAI Chat Completions.

The protocol work in this cycle includes:

native Anthropic /v1/messages ingress through an internal request envelope
Anthropic streaming with OpenAI SSE translation
custom Anthropic upstream routing and tool-calling support
outbound Anthropic response emission for non-streaming paths
protocol detection from request path headers
session-id mirroring and header pass-through controls
response headers that explain when protocol translation is lossy
Responses API tool-trace fidelity and OpenAI SDK-aligned message handling
OpenAI reasoning-effort mutation fixes
identity-encoded upstream responses to avoid transparent decompression surprises
stronger Responses API state and persistence paths

The goal is not to make every provider look identical. The goal is to make translation explicit, observable, and safe enough that a logical routing model such as auto can sit in front of multiple provider protocols without surprising operators.
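As a rough illustration of explicit, observable translation, the Python sketch below classifies a request by its path and produces headers that report whether translation dropped anything. The function names, the protocol labels, and the `x-example-*` header names are hypothetical and are not the router's actual headers; only the endpoint paths are real API paths.

```python
# Illustrative sketch only; header and function names are hypothetical.
def detect_protocol(headers: dict) -> str:
    """Classify an inbound request by its HTTP/2 ':path' pseudo-header."""
    path = headers.get(":path", "").split("?", 1)[0].rstrip("/")
    if path.endswith("/v1/messages"):
        return "anthropic_messages"
    if path.endswith("/v1/responses"):
        return "openai_responses"
    if path.endswith("/v1/chat/completions"):
        return "openai_chat"
    return "unknown"


def translation_headers(ingress: str, upstream: str, dropped_fields: list) -> dict:
    """Describe the translation so clients can see when it was lossy."""
    if ingress == upstream:
        return {}
    return {
        "x-example-translated-from": ingress,
        "x-example-translated-to": upstream,
        "x-example-translation-lossy": "true" if dropped_fields else "false",
    }


if __name__ == "__main__":
    ingress = detect_protocol({":path": "/v1/messages"})
    print(translation_headers(ingress, "openai_chat", dropped_fields=["thinking"]))
```

The design choice worth noting is that lossy translation is reported rather than hidden, so a client can tell when a field had no equivalent on the other protocol.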

6. The Dashboard Becomes an Operator Console

The Themis dashboard is more than a config editor.

The v0.3 cycle tightens the first-run setup flow, topology graph, replay-backed insights, logs, status pages, evaluation flows, auth behavior, and model inventory surfaces. Operators can import a profile, validate it, activate it, send test prompts, inspect signal paths, read router logs, and verify replay records without leaving the dashboard.

Figure 7: The dashboard becomes a practical operator console for setup, topology inspection, logs, playground testing, replay, and model health.

Notable dashboard improvements include:

built-in routing modes and missing-model completion
topology dry-run paths that show matched signals, projections, decisions, and models
router replay and aggregate insights through the dashboard proxy
natural-language DSL builder and evaluation-flow fixes
file attachments in the playground
auth fail-closed behavior when the auth service cannot initialize
policy version lifecycle with shadow, activate, and revert states
safer logs and URL redaction for user-supplied fetch/open-web requests
UTF-8-safe display handling for multilingual content
slimmer production route shell and smaller backend runtime dependencies
dashboard-aware model list and status surfaces

The result is a better local and remote operator workflow: setup mode for first run, topology for policy inspection, logs/status for operations, and insights for real traffic.

7. CLI and Deployment Are More Predictable

Themis also strengthens vllm-sr as the supported operating interface.

The CLI now has clearer runtime boundaries and more useful commands:

vllm-sr serve
vllm-sr serve --algorithm latency_aware
vllm-sr serve --algorithm session_aware
vllm-sr serve --platform amd
vllm-sr serve --platform nvidia
vllm-sr chat
vllm-sr eval
vllm-sr model list
vllm-sr config migrate --config old-config.yaml

Local vllm-sr serve remains a Docker-based workflow on Linux, macOS, and WSL2. AMD ROCm remains the release-validated GPU path, while --platform nvidia adds local NVIDIA Docker passthrough ergonomics for users who already have the NVIDIA container runtime configured. Native Windows Docker serving is now rejected with an explicit support message rather than failing later in less obvious ways.

The CLI also grows better inspection and smoke-test commands. vllm-sr model list surfaces configured model inventory, vllm-sr chat provides a one-shot completion path, vllm-sr eval exercises router evaluation endpoints, and VLLM_SR_DNS lets local containers join custom DNS environments when enterprise or lab networks require it.

On Kubernetes, v0.3 aligns the Helm chart and release defaults, and adds OpenShift deployment fixes, reconcile handling for multiple IntelligentRoute resources, CRD modality contracts, optional Gateway API HTTPRoute ingress, and AgentGateway installation guidance. For release operations, Themis also moves away from implicit "latest" assumptions toward explicit artifact contracts, upgrade and rollback documentation, and release checks.

8. Safety, Replay, Memory, and Retrieval Are More Trustworthy

Athena brought many of these capabilities into the router. Themis hardens them.

Key runtime fixes and improvements now fall into three groups:

Replay and observability

router replay PostgreSQL insert correctness so dashboard insights do not silently stay empty
projection traces stored with replay records for better explainability
response-side jailbreak and replay path tightening

Storage and retrieval

Qdrant vector search provider support
Valkey cache, vector store, and memory backend support, including TLS and search-module prechecks
Redis and Responses API storage defaults that better match real local and Kubernetes deployments
reduced preallocation during hybrid cache rebuilds
streaming Redis semantic-cache correctness and bounded streaming chunk memory behavior
O(N) cache-LRU read paths replaced with a constant-time list-backed implementation
BM25 and n-gram classification caching to avoid amplified work
hybrid HNSW entry-point propagation fixes
shared Milvus lifecycle handling across replay, cache, memory, and vector store paths
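The LRU change in the list above follows a well-known pattern: pair a hash map with an ordered list so that both lookup and recency updates avoid scanning. The Python sketch below shows the general technique using the standard library's `OrderedDict`; it is not the router's implementation, and the class name is illustrative.

```python
# Illustrative sketch only: a constant-time LRU using an ordered hash map.
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, key, default=None):
        if key not in self._entries:
            return default
        # Mark as most recently used without scanning the other entries.
        self._entries.move_to_end(key)
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            # popitem(last=False) evicts the least recently used entry.
            self._entries.popitem(last=False)
```

A naive version that searches a plain list on every read costs time proportional to the cache size, which is what makes the read path expensive under heavy traffic.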

Runtime and security hardening

history-aware PII and jailbreak signal scanning across prior user turns
model switch gate fixes for previous-model population
goroutine panic recovery in extproc background paths
concurrency race fixes in selection randomness
path traversal protection for config rollback versions
dependency security updates across Python, Go, Rust, and frontend surfaces
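Path traversal protection for rollback versions, mentioned in the list above, usually comes down to resolving the requested name and checking that it stays inside the expected directory. The Python sketch below shows that general check; the directory layout and function name are hypothetical and the router's own logic may differ.

```python
# Illustrative sketch only; directory layout and names are hypothetical.
from pathlib import Path


def resolve_rollback_version(versions_dir: str, version: str) -> Path:
    """Resolve a version name to a path that must stay inside versions_dir."""
    base = Path(versions_dir).resolve()
    candidate = (base / version).resolve()
    # After resolving ".." segments, the candidate must sit strictly under base.
    if base not in candidate.parents:
        raise ValueError(f"rejected rollback version: {version!r}")
    return candidate


if __name__ == "__main__":
    print(resolve_rollback_version("/tmp/versions", "v12.yaml"))
    try:
        resolve_rollback_version("/tmp/versions", "../../etc/passwd")
    except ValueError as err:
        print(err)
```

Checking the resolved path, rather than searching the input string for "..", also rejects absolute paths and empty names.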

This is the less flashy part of the release, but it is exactly what Themis is for: making the system safer under real traffic, long prompts, replay storage, and operator-driven config changes.

9. Long-Context Routing Gets Cheaper

Themis adds three important long-context controls.

First, context token estimation can now learn an online calibration ratio from observed response usage, so context-sensitive routing can improve when exact tokenization is unavailable. The fallback remains conservative, but the router can adapt to real traffic over time.
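One simple way to picture online calibration is an exponentially smoothed characters-per-token ratio, as in the Python sketch below. The smoothing scheme, the default values, and the class name are assumptions made for illustration; the router's actual estimator is not described in this post.

```python
# Illustrative sketch only; the smoothing scheme and defaults are assumptions.
class TokenEstimator:
    def __init__(self, fallback_chars_per_token: float = 3.0, alpha: float = 0.1):
        # A low chars-per-token ratio overestimates tokens, which is the
        # conservative direction for context-window routing.
        self.chars_per_token = fallback_chars_per_token
        self.alpha = alpha

    def estimate(self, text: str) -> int:
        """Estimate the token count from character length alone."""
        return max(1, round(len(text) / self.chars_per_token))

    def observe(self, prompt: str, reported_prompt_tokens: int) -> None:
        """Blend in the ratio implied by usage reported in a response."""
        if reported_prompt_tokens <= 0 or not prompt:
            return
        observed = len(prompt) / reported_prompt_tokens
        self.chars_per_token += self.alpha * (observed - self.chars_per_token)


if __name__ == "__main__":
    estimator = TokenEstimator()
    print(estimator.estimate("x" * 300))
    estimator.observe("x" * 400, reported_prompt_tokens=100)
    print(estimator.estimate("x" * 300))
```

A small `alpha` keeps one unusual response from swinging the estimate, while repeated observations still pull the ratio toward what the serving backend actually reports.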

Second, the native mmBERT embedding path now bounds memory without turning long inputs into a silent clipping problem. The #2007 native-binding fix for the long-input memory issue processes attention in query chunks instead of materializing one dense attention tensor for the whole sequence. That keeps the long-context signal available to the router while making the binding usable under larger prompts.

Figure 8: The long-context path preserves the signal and bounds native memory by chunking mmBERT attention work.

Third, prompt compression becomes a named profile surface for signal extraction:

Profile	Intended use
`default`	Balanced compression for general routing
`coding`	Preserve code-like and implementation-heavy sentences
`medical`	Preserve clinically relevant detail
`security`	Preserve safety and policy evidence
`multi_turn`	Preserve conversational continuity

The compression path is intentionally scoped to signal evaluation. The original user prompt still goes to the selected serving model unless a decision-owned plugin explicitly changes it. That separation keeps routing optimization from silently rewriting user intent.

10. Hardware Backend Paths Broaden

Themis broadens the router-owned model execution story beyond the default local path.

The broadened map separates four paths: NVIDIA CUDA and AMD ROCm for served vLLM backends, Intel OpenVINO for router-owned classifier and embedding inference, and CPU/local execution for development and smoke tests.

On Intel infrastructure, v0.3 adds an initial OpenVINO binding for Semantic Router. The new binding provides native C++ and Go integration for ModernBERT sequence classification, token classification, and embedding inference, with benchmark entrypoints that compare OpenVINO and Candle behavior for classifier and embedding workloads.

This is a backend and binding milestone, not a blanket production-parity claim. It gives contributors and hardware partners a concrete path to validate Semantic Router's internal classifier and embedding models on Intel OpenVINO while preserving the same routing contract used by the rest of Themis.

Figure 9: Themis broadens the hardware backend map while keeping one routing control plane across NVIDIA CUDA, AMD ROCm, Intel OpenVINO, and CPU/local paths.

The AMD deployment path introduced in Athena also remains part of the v0.3 release contract.

The reference flow is still:

vllm-sr serve --platform amd

For real AMD deployments, the project keeps the maintained deploy/recipes/balance.yaml profile, which exposes multiple served aliases through a ROCm vLLM backend and routes them through the same signal, projection, decision, and model-selection pipeline as the CPU/local path.

As part of release readiness, Themis was validated on an AMD ROCm stack with:

a ROCm vLLM backend exposing the expected served aliases
dashboard setup import, validate, and activate using the reference balance profile
router health and Envoy OpenAI-compatible /v1/models
topology dry-run for a coding/debug request
direct Envoy chat completions for coding, math, and legal prompts
dashboard proxy chat completions
router replay list and aggregate insight APIs

Figure 10: The AMD release path validates serve, dashboard import, router health, model listing, ROCm backend serving, and routed requests as one flow.

That end-to-end path is important because Semantic Router is meant to be a control plane across heterogeneous inference stacks, not only a local development tool.

11. RouterArena SOTA Refresh

Themis also comes with an external leaderboard signal: in the RouterArena snapshot captured for this release update, vLLM-SR returned to #1 on the RouterArena leaderboard.

In that public RouterArena leaderboard snapshot, vLLM-SR is ranked first by weighted Arena Score with a score of 75.4, ahead of Sqwish Router, AgentForge Router, Nadir Router, and other published router baselines. The same snapshot reports 76.0 accuracy, $0.11 cost per 1K queries, and 73.1 robustness for vLLM-SR.

Figure 11: RouterArena leaderboard snapshot showing vLLM-SR back at #1 by weighted Arena Score.

This is not a substitute for release testing, but it is a useful outside check on the project direction. Themis improves routing policy, cost-aware selection, protocol compatibility, and operational traceability while keeping the router competitive on independent router benchmarks.

What Changed Since v0.2?

At a high level, the v0.2 to v0.3 delta looks like this:

Area	Themis value
API and config	Canonical v0.3 contract across local, dashboard, Helm, and operator paths
Router core	Richer signals, projections, response state, replay, safety, and selection algorithms
Model selection	Session-aware, multi-factor, latency-aware, RL-driven, hybrid, and other algorithm surfaces
Protocols	Stronger OpenAI and Anthropic compatibility with explicit translation behavior
Dashboard	Setup, topology, status, logs, insights, replay, auth, and model inventory hardening
CLI	Clearer serve modes, model inspection, chat/eval commands, config migration, platform boundaries
Deployment	AMD ROCm path, OpenVINO binding, NVIDIA local passthrough ergonomics, Helm/OpenShift/Gateway API fixes, release artifact contracts
Storage and retrieval	Valkey, Qdrant, Redis, Milvus, replay, cache, memory, and vector-store lifecycle hardening
Reliability	Chunked mmBERT attention, UTF-8-safe display handling, secure logging, streaming cache correctness, replay correctness, concurrency fixes

That is the core Themis story: the router is more capable, but also more constrained in the right places.

Get Started

For macOS or Linux:

curl -fsSL https://vllm-semantic-router.com/install.sh | bash

For manual installation:

pip install vllm-sr==0.3.0
vllm-sr serve

If the current directory does not contain config.yaml, vllm-sr serve starts the dashboard in setup mode. For YAML-first users, create a canonical v0.3 config directly or migrate an older file:

vllm-sr config migrate --config old-config.yaml
vllm-sr serve --config config.yaml

For AMD ROCm:

vllm-sr serve --platform amd

For local NVIDIA Docker passthrough:

vllm-sr serve --platform nvidia

For Kubernetes:

helm install semantic-router oci://ghcr.io/vllm-project/charts/semantic-router

See the project resources:

Documentation: vllm-semantic-router.com
GitHub: vllm-project/semantic-router
Reference AMD profile: deploy/recipes/balance.yaml
Models: Hugging Face

Looking Ahead: v0.4 Hermes

The next release codename is Hermes.

Themis makes the contract stable enough to operate. Hermes should make the router faster to improve, easier to evaluate, and safer to adapt under real workloads. The core Hermes goal is a self-improving router. The loop is deliberate: run auto research for router performance at GPU scale, tune DSL recipes with router evaluation, then feed validated evidence back into the codebase and encoder-model fine-tuning. The highest-value work is:

Self-improving router as the Hermes core goal: close the loop across GPU-scale performance research, DSL recipe tuning, and codebase plus encoder-model fine-tuning. Every generated change still has to be reviewable, replayable, versioned, and rollback-safe.
SAAR as the agentic routing layer: continue tightening model-switch economics, tool-loop continuity, provider-state portability, replay diagnostics, and router memory integration.
Evaluation as a release gate: build system-level and signal-level evaluation so every signal, projection, algorithm, plugin, and dashboard path can be replayed against representative traffic before release.
CLI-first design: make sure every Semantic Router operation can close the loop through vllm-sr, including config authoring, migration, serving, inspection, evaluation, replay, policy lifecycle, dashboard import/export, and release smoke tests.
Better router-owned models: improve accuracy and latency for the models the router itself uses, including embedding, classifier, multimodal, and safety signal models.
More useful signals: add richer request, response, tool, modality, identity, freshness, latency, cost, and runtime-health signals without turning the DSL into application code.
Operator debugging loop: make what-if routing, policy replay, evaluation-driven tuning, and trace comparison first-class dashboard workflows.

Figure 12: Hermes centers on a self-improving router that connects GPU-scale performance research, DSL recipe tuning, router evaluation, codebase updates, and encoder fine-tuning.

Acknowledgments

From v0.2.0 to v0.3.0, the Themis cycle includes more than 350 commits from 80+ contributor author identities. Thank you to everyone who reviewed code, improved docs, trained models, hardened tests, fixed release blockers, and pushed the router toward a more stable production shape.

We separately thank collaborators from research institutions and universities, including MBZUAI, McGill University, Mila, and Rice University, for contributions and collaboration across router evaluation, model research, and AI systems.

We also thank the broader vLLM, AMD, Intel, Meta, Red Hat, Microsoft, Google, IBM, NVIDIA, Hugging Face, NASA, Nutanix, DaoCloud, and open-source communities for continued collaboration across runtime systems, model serving, model research, and production AI infrastructure.

Welcome to Themis: from signals to stateful production routing.