Announcing VeRL-Omni: Easy, Fast, and Stable RL Training for Diffusion and Omni-Modality Models

May 14, 2026 · 7 min read

VeRL-Omni Team

We are excited to announce the pre-release of VeRL-Omni, a general reinforcement learning (RL) post-training framework for multimodal generative models, built on top of verl and vLLM-Omni.

Why VeRL-Omni?

RL has become a powerful method for aligning large generative models with human preferences and downstream task rewards. While the LLM RL stack has evolved rapidly over the past year, multimodal generative RL, which covers diffusion and omni-modality models for image/video/audio understanding and generation, still has three critical needs:

Diffusion and omni-modality extension: Extending verl's exceptional flexibility and performance to the world of multi-modal and non-autoregressive RL training, covering diffusion transformer backbones (Qwen-Image), mixed AR-DiT architectures (Qwen-Omni), and unified understanding & generation models (BAGEL, HunyuanImage3.0).
Heterogeneous rollout pipelines: Rollouts are denoising trajectories in a continuous latent space rather than token sequences, and a single rollout may invoke multiple heterogeneous model components and multi-stage pipelines (e.g., text encoder → DiT → VAE).
Complex workload scheduling: Orchestrating complex multi-modal RL training workflows, where reward functions are themselves multimodal models (VLM judges, OCR scorers, etc.) and multi-modal generation rollouts have higher memory peaks compared to text generation.

Key Features

Efficient multimodal rollout: We integrate vLLM-Omni for its high-throughput async serving for multimodal generation while maintaining accuracy on par with diffusers. VeRL-Omni works with vLLM-Omni to continuously optimize rollout efficiency via step-wise continuous batching, embedding caching, etc.
Flexible reward engine: Supports both rule-based and model-based rewards (e.g., VLM-as-judge for OCR). vLLM is integrated for efficient VLM and LLM reward model inference. Reward computation is overlapped with ongoing rollout and training to reduce end-to-end latency.
Modular training backends: Provides several trainers (DiffusersFSDP/Megatron/VeOmni) with built-in optimizations for diffusion and omni-modal models, allowing easy integration of different parallelism strategies (FSDP/USP/TP).
Broad hardware compatibility: Supports both NVIDIA GPUs and Ascend NPUs, allowing flexible deployment across diverse hardware backends.
E2E training recipes and benchmarks: Ships with end-to-end recipes and reference performance results, which reach high training throughput thanks to the features above.
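To make the idea of overlapping reward computation with rollout more concrete, here is a minimal sketch in plain Python. It is not VeRL-Omni's scheduler: `generate_batch`, `score_batch`, and `run_overlapped` are hypothetical stand-ins, and a single background thread plays the role of a dedicated reward worker. The point is only the ordering: rewards for batch k are computed while batch k+1 is being generated.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_batch(step):
    # Hypothetical stand-in for a rollout call; returns fake "images".
    return [f"image-{step}-{i}" for i in range(4)]


def score_batch(images):
    # Hypothetical stand-in for a reward-model call; scores by name length.
    return [len(name) / 10.0 for name in images]


def run_overlapped(num_steps):
    """Score batch k in the background while batch k+1 is being generated."""
    rewards_per_step = []
    with ThreadPoolExecutor(max_workers=1) as reward_pool:
        pending = None  # future holding the previous batch's rewards
        for step in range(num_steps):
            images = generate_batch(step)  # rollout for this step
            if pending is not None:
                # Collect the scores that were computed during this rollout.
                rewards_per_step.append(pending.result())
            # Hand the new batch to the reward worker and move on.
            pending = reward_pool.submit(score_batch, images)
        if pending is not None:
            rewards_per_step.append(pending.result())
    return rewards_per_step


if __name__ == "__main__":
    for step, rewards in enumerate(run_overlapped(3)):
        print(step, rewards)
```

In a real deployment the reward model would run on its own device or service, so the overlap hides model inference latency rather than a cheap Python function.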

Algorithm and Model Support

Model	Architecture	Modality	Algorithm	Status
Qwen-Image	DiT	Text → Image	FlowGRPO, MixGRPO, GRPO-Guard	Released
BAGEL	Unified understand + gen	Text + Image	FlowGRPO	PR ready
Qwen3-Omni-Thinker	AR	Text / Image / Video / Audio	GSPO	PR ready
Wan2.2	DiT	Text → Video	DanceGRPO	WIP
SD3.5	DiT	Text → Image	DPO	WIP
HunyuanImage-3.0	Unified understand + gen	Text + Image	MixGRPO, SRPO	Planned

Getting Started

Installation

Check out our Installation Doc for details.

Training diffusion models

Check out our examples directory for specific scripts to launch different RL algorithm trainers for image/audio/video understanding and generation tasks. You can track the training performance and results via wandb.

Demo: Qwen-Image FlowGRPO Post-training

In the flowgrpo example, we train Qwen-Image on an OCR reward task. The reward model, Qwen3-VL-8B-Instruct, scores generated images by reading the rendered text and comparing it against the dataset ground truth.
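As a rough illustration of the comparison step, the sketch below turns a recognized string and a ground-truth string into a score in [0, 1]. The exact scoring rule used in the example is not stated here, so the normalized edit distance, the case and whitespace normalization, and the function names `edit_distance`, `normalize`, and `ocr_reward` are all assumptions. The text-reading step (the VLM call) is left out entirely.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(text):
    # Assumption: case and whitespace differences should not be penalized.
    return " ".join(text.lower().split())


def ocr_reward(recognized, target):
    """Return a score in [0, 1]; 1.0 means the rendered text matches exactly."""
    rec, tgt = normalize(recognized), normalize(target)
    if not tgt:
        return 1.0 if not rec else 0.0
    dist = edit_distance(rec, tgt)
    return max(0.0, 1.0 - dist / len(tgt))


if __name__ == "__main__":
    print(ocr_reward("Hidden Trail", "Hidden Trail"))
    print(ocr_reward("Hiden Trail", "Hidden Trail"))
```

A graded score like this gives the policy partial credit for nearly correct text, which tends to be a more useful learning signal than an exact-match check early in training.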

Algorithm Review

FlowGRPO Demonstration

FlowGRPO is an online RL method for flow-matching models. It employs multi-step SDE sampling with a diffusion policy model to enable effective RL exploration, and adopts model-based rewards to assess generation quality. The training workflow mainly consists of four key stages:

Rollout Generation: The diffusion policy model generates sample rollouts, collecting trajectories of log probabilities and generated images.
Reward Model Scoring: The reward model scores each generated sample, allowing the computation of trajectory advantages.
Policy Optimization: The policy is updated using a clipped (PPO-style) FlowGRPO loss, optimizing for higher reward using the computed advantages.
Weight Synchronization: Periodically, the latest policy weights from the trainer are synchronized to the rollout workers, ensuring that generated samples reflect the most recent policy.
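The scoring and optimization stages above can be sketched for a single group of samples as follows. This is a simplified illustration under stated assumptions, not the framework's implementation: rewards are normalized within the group of samples that share a prompt, the loss is the standard PPO-style clipped surrogate applied to one denoising step, and the names `group_advantages` and `clipped_loss` as well as the `clip_range=0.2` default are illustrative choices.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of samples."""
    mu, sigma = mean(rewards), pstdev(rewards)
    # If every sample got the same reward, all advantages are zero.
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_loss(new_logp, old_logp, advantage, clip_range=0.2):
    """PPO-style clipped surrogate for one denoising step (to be minimized)."""
    ratio = math.exp(new_logp - old_logp)  # importance ratio new/old policy
    clipped_ratio = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
    # Take the more pessimistic of the clipped and unclipped objectives.
    return -min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    advantages = group_advantages([0.2, 0.9, 0.5, 0.4])
    losses = [clipped_loss(-1.0, -1.1, a) for a in advantages]
    print(advantages)
    print(sum(losses) / len(losses))
```

In practice the loss would be averaged over all sampled denoising steps and all samples in the batch, with the log probabilities coming from the SDE sampler during rollout and from the current policy during training.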

LoRA fine-tuning

The training throughput on NVIDIA H800 GPUs is as follows.

Mode	# GPUs (total)	Actor GPUs	Rollout GPUs	Async Reward GPUs	Throughput (images/GPU/s)	Time per Step (s)
FlowGRPO colocated training	4	4	4	0 (sync)	0.305	420
FlowGRPO w/ async reward	5	4	4	1 (async)	0.280	360

Moving the reward model to its own dedicated GPU reduces wall-clock time per step by ~14% by overlapping reward evaluation with policy training.
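The figures in the table can be checked with a few lines of arithmetic. The helper names below are ours, and the images-per-step value is only what the reported throughput, GPU count, and step time imply, not a number stated in the table.

```python
def step_time_reduction(baseline_s, new_s):
    """Fractional reduction in wall-clock time per step."""
    return (baseline_s - new_s) / baseline_s


def images_per_step(throughput_per_gpu, num_gpus, step_time_s):
    """Images processed in one step, implied by per-GPU throughput."""
    return throughput_per_gpu * num_gpus * step_time_s


reduction = step_time_reduction(420, 360)
colocated = images_per_step(0.305, 4, 420)
async_reward = images_per_step(0.280, 5, 360)

print(f"step time reduction: {reduction:.1%}")        # 14.3%
print(f"colocated images/step: {colocated:.0f}")      # 512
print(f"async reward images/step: {async_reward:.0f}")  # 504
```

The implied images per step are nearly identical in both modes, which suggests that the lower per-GPU throughput in the async setup comes from counting the fifth GPU in the denominator rather than from slower training.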

Full-model fine-tuning

We have also validated non-CFG full-model Qwen-Image OCR training on 4 × NVIDIA H200 GPUs, reaching 0.510 images/GPU/s at ~250 s/step.

As shown below, the text rendering quality of the generated images improves markedly within 120 training steps.

Prompt	Training Step 0	Training Step 120
A wooden trail marker in a dense forest with "Hidden Trail" carved into the wood, surrounded by moss and foliage.	Hidden Trail — step 0	Hidden Trail — step 120
A birthday card interior with "Make A Wish" in cursive handwriting, surrounded by sparkling candles and colorful confetti.	Make A Wish — step 0	Make A Wish — step 120

Below are reward and training curves from our reference runs. Both the rollout reward and the validation reward rise steadily and then level off during training.

Validation reward: rises stably from 0.7 to 0.95.
Rollout reward mean: rises from ~0.15 to ~0.9 (a low start is expected for non-CFG rollout).
critic/rewards/zero_std_ratio: climbs only after the reward saturates.
actor/pg_clipfrac: the clip ratio stays in a healthy range.

For a detailed overview of training metrics, please see our Training Metrics documentation.
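For readers unfamiliar with the two diagnostic metrics named above, the sketch below shows one plausible way to compute them. These definitions are our assumptions based on the metric names, not the framework's code: the zero-std ratio is taken as the fraction of prompt groups whose samples all received the same reward, and the clip fraction as the share of importance ratios falling outside the clipping interval.

```python
from statistics import pstdev


def zero_std_ratio(reward_groups, tol=1e-8):
    """Fraction of prompt groups whose samples all got (nearly) the same reward."""
    flat = [pstdev(group) <= tol for group in reward_groups]
    return sum(flat) / len(flat)


def clip_fraction(ratios, clip_range=0.2):
    """Fraction of importance ratios outside [1 - clip_range, 1 + clip_range]."""
    outside = [abs(r - 1.0) > clip_range for r in ratios]
    return sum(outside) / len(outside)


if __name__ == "__main__":
    groups = [[1.0, 1.0, 1.0], [0.2, 0.9, 0.5]]
    print(zero_std_ratio(groups))
    print(clip_fraction([1.0, 1.3, 0.7, 1.1]))
```

A group with identical rewards yields zero group-relative advantages and therefore no gradient signal, which is why a zero-std ratio that climbs only after the reward saturates is a healthy pattern.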

Future Roadmap

VeRL-Omni is actively evolving and currently in pre-release, with a stable core diffusion RL stack. Our roadmap is focused on expanding model and algorithm support, and pushing the boundaries of efficient multi-modal RL training.

Model Support Extension: Support a wide range of open-source diffusion and omni-modal models as they emerge, covering image/video/audio generation tasks and unified understanding & generation tasks.
Algorithm Support Extension: Integrate stable and advanced RL algorithms as they are proposed, such as DiffusionNFT.
Fully Asynchronous RL: End-to-end async pipelines across actor, rollout, and reward, beyond the current async-reward setup, in order to improve the training throughput and GPU/NPU utilization.
Co-optimization with vLLM-Omni: Generation rollout accounts for a large portion of training time. We expect to further accelerate multimodal rollout by closely integrating with vLLM-Omni, leveraging advanced techniques such as parallelism, quantization, batching, and optimized request scheduling.
Efficient Omni-modal Trainer: Besides DiffusersFSDPTrainer, we expect to release more highly-optimized trainer engines for omni-modality and diffusion models, based on Megatron-core and VeOmni.
Broader hardware support: Continuing to harden the Ascend NPU path and welcoming additional hardware backends through the hardware plugin system.

Join the Community

This is just the beginning for diffusion and omni-modal RL post-training. We are actively developing support for more architectures and algorithms, and invite the community to help shape the future of VeRL-Omni.

Code: github.com/verl-project/verl-omni
Docs: verl-omni.readthedocs.io
Contribution Guideline: see CONTRIBUTING.md
Weekly Meeting: Join us every Tuesday at 11:00 AM (GMT+8) to discuss the roadmap and features. Join here

Let's build the future of omni-modal RL together!