#speculative-decoding

11 posts

From Day 0 to Production SLAs: Serving GLM-5.2 on 24 NVIDIA B300 GPUs with vLLM

Jul 23, 2026·18 min read

How we took GLM-5.2-NVFP4 from 40 ms to 17 ms mean TPOT on 24 B300 GPUs with vLLM: P/D disaggregation, MTP speculative decoding, Model Runner V2, and the SLA-first trade-offs behind the final configuration.

TML Inkling on vLLM: Day-0 Support with Optimized Performance

Jul 15, 2026·8 min read

vLLM brings day-0 support to TML Inkling, a 1T-parameter multimodal model, with MTP, long-context serving, parallelism, and up to 380 tokens per second per user on NVIDIA GB200 GPUs.

Accelerating Laguna XS.2 Inference with vLLM, Speculators, and LLM Compressor

May 28, 2026·3 min read

How Laguna XS.2 is served and optimized in vLLM using first-class model integration, a DFlash speculator trained with Speculators, and FP8, NVFP4, INT4, and INT8 checkpoints from LLM Compressor.

Speculators v0.5.0: DFlash Support and Online Training

May 28, 2026·6 min read

What Speculators v0.5.0 adds for vLLM speculative decoding: DFlash block-diffusion draft models, unified online and offline training, native hidden-state extraction, and Gemma 4 latency results.

EAGLE 3.1: Advancing Speculative Decoding Through Collaboration Between the EAGLE Team, vLLM, and TorchSpec

May 26, 2026·4 min read

How EAGLE 3.1 improves speculative decoding robustness in vLLM with FC normalization, post-norm hidden-state feedback, TorchSpec training support, and config-driven compatibility with EAGLE 3 checkpoints.

vLLM Tops the Artificial Analysis Leaderboard

May 11, 2026·15 min read

How vLLM achieved leading Artificial Analysis results for DeepSeek V3.2, MiniMax-M2.5, and Qwen 3.5 397B using open-source kernel fusion, speculative decoding, Blackwell optimizations, and model-specific serving work.

Extracting hidden states from vLLM

Mar 30, 2026·8 min read

How vLLM extracts verifier hidden states through dummy draft models and KV Connector APIs for speculative decoding, enabling offline and online Speculators training without patching vLLM internals.

P-EAGLE: Faster LLM inference with Parallel Speculative Decoding in vLLM

Mar 13, 2026·12 min read

How P-EAGLE brings parallel speculative decoding to vLLM by generating multiple draft tokens in one forward pass, with pre-trained drafter heads, config support, and B200 speedups over EAGLE-3.

Diving into speculative decoding training support for vLLM with Speculators v0.3.0

Dec 13, 2025·11 min read

How Speculators v0.3.0 supports end-to-end Eagle3 draft model training for vLLM, including hidden-state data generation, MoE and non-MoE verifiers, offline workflows, and serving the trained draft models with speculative decoding in vLLM.

Inside vLLM: Anatomy of a High-Throughput LLM Inference System

Sep 5, 2025·41 min read

How vLLM's inference engine works, covering PagedAttention, continuous batching, prefix caching, speculative decoding, multi-GPU serving, scheduling, and benchmarking for high-throughput LLM workloads.

How Speculative Decoding Boosts vLLM Performance by up to 2.8x

Oct 17, 2024·10 min read

How speculative decoding works in vLLM, covering EAGLE, Medusa, n-gram proposals, draft and target runners, scheduler and memory-manager changes, and continuous batching for lower token latency.