
Beyond Porting: How vLLM Orchestrates High-Performance Inference on AMD ROCm
For a long time, enabling AMD support meant "porting", i.e., just making code run. That era is over.
