Command Palette
Search for a command to run...
InterleaveThinker: Reinforcing Agentic Interleaved Generation
InterleaveThinker: Reinforcing Agentic Interleaved Generation
Dian Zheng Harry Lee Manyuan Zhang Kaituo Feng Zoey Guo Ray Zhang Hongsheng Li
Abstract
Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot achieve interleaved generation (text-image sequence), which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance in this regard. In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence, instructing the image generator on the required execution at each step. Subsequently, we introduce a critic agent to evaluate the generator's outputs, identify samples that deviate from the planned instructions, and refine the instructions for regeneration. To implement this pipeline, we construct the Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k to perform a format cold-start. Then we develop Interleave-Critic-RL-13k to reinforce the step-wise instruction correction capability within a generation trajectory using GRPO. Since a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. Therefore, we propose accuracy reward and step-wise reward, allowing single-step RL to effectively guide the entire generation trajectory. The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly enhances the base model on reasoning-based benchmarks; for example, on 4-step FLUX.2-klein, we observe substantial gains on WISE and RISE.
One-sentence Summary
InterleaveThinker is a multi-agent pipeline that equips existing image generators with interleaved text-image sequence generation by coordinating a planner agent to structure stepwise instructions and a critic agent to evaluate outputs and refine subsequent prompts, with stepwise instruction correction within generation trajectories reinforced through GRPO to address the architectural constraints of prior unified multimodal models in visual narratives and embodied manipulation.
Key Contributions
- The paper introduces InterleaveThinker, a multi-agent pipeline that retrofits frozen image generators with interleaved text-image sequence generation without modifying their base architectures. A planner agent structures the execution steps while a critic agent evaluates outputs, identifies deviations, and refines prompts to ensure strict trajectory adherence.
- Training is enabled through three curated datasets, Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k for format cold-starting, and Interleave-Critic-RL-13k for reinforcement learning. A GRPO-based optimization with a dual-reward strategy comprising accuracy and step-wise rewards efficiently aligns long-horizon generation trajectories at reduced computational cost.
- Evaluated on off-the-shelf generators such as FLUX.2-klein, the framework surpasses open-source unified multimodal models on interleaved generation benchmarks and matches proprietary systems like Nano Banana and GPT-5. The approach also substantially improves reasoning performance on the WISE benchmark (0.47 to 0.73) and the RISE benchmark (13.3 to 28.9).
Introduction
Modern image generation models excel at single-image synthesis, yet practical applications like visual storytelling and embodied manipulation demand interleaved generation that seamlessly alternates text and image outputs. While Unified Multimodal Models attempt to support this workflow, they frequently exhibit visual over-reliance on intermediate states and suffer from compounding step-by-step errors during extended sequences. To address these limitations, the authors propose InterleaveThinker, a multi-agent framework that retrofits frozen image generators with robust sequential capabilities. The system utilizes a Planner agent to forecast complete instruction trajectories upfront, effectively bypassing premature visual dependency, while a Critic agent evaluates outputs and refines prompts to prevent error accumulation. By combining this architecture with a curated training dataset and a dual-reward reinforcement learning strategy, the authors achieve trajectory-level alignment that matches proprietary models and significantly enhances base model reasoning.
Dataset
- Dataset Composition and Sources: The authors generate roughly 40,000 text prompts through a top-down pipeline that starts with 8 broad domains, expands to 75 fine-grained subcategories, and leverages Gemini 2.5 Pro to build domain-specific vocabulary banks and instructional templates. Multi-agent trajectory generation combines Gemini 2.5 Pro and Nano Banana Pro, with FLUX.2-klein-9B added to balance visual quality and prevent critic bias. The final corpus also integrates existing open-source interleaved datasets to supplement planner training.
- Subset Details and Filtering: Interleave-Critic-SFT-112k contains 112,000 samples filtered for successful refinement trends, stable high scores, and low iteration score variance. Interleave-Critic-RL-13k holds 13,000 samples selected for high score variance to capture dynamic refinement processes, maintaining a strict 2:1 ratio with the SFT subset. Interleave-Planner-SFT-80k comprises 80,000 samples that bypass critic filtering entirely, preserving the original unfiltered trajectories for planner training.
- Training Splits and Processing: The pipeline decomposes full trajectories into independent step-wise segments to enable stable single-iteration optimization instead of computationally prohibitive end-to-end reinforcement learning. Each refinement step is scored from 0 to 10 for semantic alignment and visual quality using Gemini 2.5 Pro adapted from VIEScore. The authors apply targeted resampling to balance the binary judgment distribution for the critic, ensuring unbiased training across iteration-wise predictions.
- Metadata and Structural Processing: Planner training pairs are constructed by randomly truncating interleaved text-image sequences, where the preceding context serves as input and the subsequent text plan acts as the target output. Metadata explicitly tracks original user instructions, rewritten refinement prompts, and paired original versus generated images to support step-wise evaluation. The filtering pipeline discards steps exhibiting negative refinement trends or persistent low quality, retaining only those that demonstrate successful iterative improvement.
Experiment
The evaluation employs a multi-agent InterleaveThinker framework to validate performance across interleaved generation and reasoning-based editing benchmarks using both in-domain and generalization image models. Results demonstrate that the approach significantly outperforms existing open-source methods by effectively mitigating visual over-reliance and step-wise error accumulation while preserving textual fidelity and image quality. Ablation studies confirm that the dedicated planner-critic architecture, fine-tuned training stages, and closed-loop refinement process are essential for robust performance, as single-model or unfiltered alternatives consistently degrade results. Although the framework encounters limitations with out-of-domain concepts unknown to the base generator, it remains a highly generalizable and model-agnostic solution for complex multimodal tasks.