#vllm-omni

2 posts

Experience and Lessons Learned from Serving Multi-Stage Qwen3-Omni in vLLM-Omni

Jul 1, 2026 · 12 min read

How vLLM-Omni serves and optimizes Qwen3-Omni with staged Thinker-Talker-Code2Wav execution, batching, CUDA Graphs, async chunk, async output, replicas, hot-path cleanup, and performance validation.

Accelerating vLLM-Omni Inference with AutoRound Quantization

Jun 2, 2026 · 10 min read

How AutoRound integrates with vLLM-Omni to serve W4A16 quantized multimodal, diffusion, image, and video models with smaller checkpoints, preserved quality, Intel XPU acceleration, and NVIDIA GPU support.