
Streaming Requests & Realtime API in vLLM
Large language model inference has traditionally operated on a simple premise: the user submits a complete prompt (request), the model processes it, and returns a response (either streaming or at...

We are thrilled to announce a major performance update for vLLM-Omni.

Modern Large Multimodal Models (LMMs) introduce a unique serving-time bottleneck: before any text generation can begin, all images must be processed by a visual encoder (e.g., ViT). This encoder...

We are excited to announce the official release of vLLM-Omni, a major extension of the vLLM ecosystem designed to support the next generation of AI: omni-modality models.

Introducing Shared Memory IPC Caching — a high-performance caching mechanism contributed by Cohere to the vLLM project. By bypassing redundant inter-process communication and keeping large...

We are excited to release NVIDIA Nemotron Nano 2 VL, supported by vLLM. This open vision language model (VLM) is built for video understanding and document intelligence.

Until recently, generative AI infrastructure has been tightly coupled with autoregressive text generation models that produce output token-by-token, typically in the form of natural language. vLLM...

General Language Model (GLM) is a family of foundation models created by Zhipu.ai (now renamed to Z.ai). The GLM team has a long-term collaboration with the vLLM team, dating back to the early days of...

We're excited to announce that vLLM now supports the Llama 4 herd of models: Scout (17B-16E) and Maverick (17B-128E). You can run these powerful long-context, natively multi-modal (up to 8-10...