#moe

3 posts

vLLM × HPC-Ops: High-Performance Attention and MoE Backends from Tencent Hunyuan

Jul 6, 2026·12 min read

How HPC-Ops integrates Hopper-optimized attention and FP8 MoE backends into vLLM for Tencent Hunyuan Hy3, improving mixed-length decode, MoE latency, TTFT, and TPOT on NVIDIA H20.

MiniMax M3 in vLLM: Day-0 Serving for 1M-Token Multimodal Reasoning

Jun 12, 2026·21 min read

How vLLM serves MiniMax M3 with MiniMax Sparse Attention, multimodal and reasoning parsers, MXFP8 weights, and long-context deployment recipes.

Elastic Expert Parallelism in vLLM

May 14, 2026·11 min read

How Elastic Expert Parallelism lets vLLM scale Mixture-of-Experts serving up or down at runtime by adding or removing data-parallel workers, redistributing experts, and coordinating live topology changes without server restarts.