Abstract

As Large Multimodal Models (LMMs) scale up and reinforcement learning (RL) methods mature, LMMs have made notable progress in complex reasoning and decision making. Yet training still relies on static data and fixed recipes, making it difficult to diagnose capability blind spots or provide dynamic, targeted reinforcement. Motivated by findings that test-driven error exposure and feedback-based correction outperform repetitive practice, we propose Diagnostic-driven Progressive Evolution (DPE), a spiral loop where diagnosis steers data generation and reinforcement, and each iteration re-diagnoses the updated model to drive the next round of targeted improvement. DPE has two key components. First, multiple agents annotate and quality-control massive unlabeled multimodal data, using tools such as web search and image editing to produce diverse, realistic samples. Second, DPE attributes failures to specific weaknesses, dynamically adjusts the data mixture, and guides agents to generate weakness-focused data for targeted reinforcement. Experiments on Qwen3-VL-8B-Instruct and Qwen2.5-VL-7B-Instruct show stable, continual gains across eleven benchmarks, indicating that DPE is a scalable paradigm for continual LMM training under open task distributions. Our code, models, and data are publicly available at https://github.com/hongruijia/DPE.

One-sentence Summary

Hongrui Jia, Chaoya Jiang, et al. propose Diagnostic-driven Progressive Evolution (DPE), a self-improving loop that diagnoses LMM weaknesses and generates targeted multimodal data for reinforcement, outperforming static training across eleven benchmarks and enabling scalable, continual LMM evolution under open-ended tasks.

Key Contributions

DPE introduces a diagnostic-driven training loop for Large Multimodal Models that identifies capability blind spots and dynamically generates targeted, weakness-focused data using multi-agent tool-augmented annotation, overcoming limitations of static datasets and heuristic-based evolution.
Applied to Qwen3-VL-8B-Instruct and Qwen2.5-VL-7B-Instruct, DPE achieves stable, continual improvements across eleven multimodal reasoning benchmarks using only 1000 training examples per iteration, demonstrating efficiency and scalability under open task distributions.
Systematic analysis confirms that DPE’s diagnosis mechanism enhances training stability and mitigates long-tail performance degradation, offering a principled approach to continual model improvement without relying on expensive human annotations or fixed data recipes.

Introduction

The authors use diagnostic feedback to address key limitations in training Large Multimodal Models (LMMs): prior self-evolution methods rely on heuristic signals and static visual data, which leads to unstable training and poor long-tail performance. Existing frameworks lack interpretable failure attribution and struggle to generate diverse, targeted multimodal samples, causing models to plateau or regress on complex tasks such as math or OCR. Their main contribution is Diagnostic-driven Progressive Evolution (DPE), a closed-loop training paradigm that diagnoses model weaknesses, dynamically generates tailored multimodal data using multi-agent tool use, and reinforces improvements iteratively, yielding stable, broad gains across benchmarks with minimal data.

Method

The authors leverage Diagnostic-driven Progressive Evolution (DPE), a closed-loop training framework designed to enhance large multimodal models (LMMs) under conditions of scarce supervision and long-tail coverage gaps. Unlike prior self-evolution methods that rely on static image sets and heuristic signals, DPE iteratively executes diagnosis, targeted generation, and reinforcement-based updating. Each iteration explicitly controls both the category composition and question emphasis of the training data, aligning resources with the model’s current capability blind spots to mitigate instability and diminishing returns on long-tail skills.

At iteration $k$, the policy is denoted as $\pi_{\theta^{(k)}}$. The framework constructs a training set $\mathcal{T}^{(k)}$ and updates parameters to $\theta^{(k+1)}$ via reinforcement learning with verifiable rewards:

$$\theta^{(k+1)} = \mathcal{A}_{\mathrm{RL}}\Big(\theta^{(k)};\, \mathcal{T}^{(k)}\Big), \quad \mathcal{T}^{(k)} = \mathcal{A}_{\mathrm{gen}}\Big(\mathcal{R}^{(k)}\Big), \quad \mathcal{R}^{(k)} = \mathcal{A}_{\mathrm{diag}}\Big(\pi_{\theta^{(k)}}\Big),$$

where $\mathcal{A}_{\text{diag}}$, $\mathcal{A}_{\text{gen}}$, and $\mathcal{A}_{\text{RL}}$ represent the diagnosis, generation, and RL-update operators, respectively, and $\mathcal{R}^{(k)}$ is a structured diagnostic report.
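The composition of the three operators can be sketched as a plain loop. This is a minimal illustration rather than the authors' implementation: `run_dpe`, `diagnose`, `generate`, and `rl_update` are hypothetical names standing in for $\mathcal{A}_{\text{diag}}$, $\mathcal{A}_{\text{gen}}$, and $\mathcal{A}_{\text{RL}}$, and the toy stand-ins exist only to make the control flow runnable.

```python
from typing import Any, Callable, Dict, List


def run_dpe(theta0: Any,
            diagnose: Callable[[Any], Any],
            generate: Callable[[Any], List[Any]],
            rl_update: Callable[[Any, List[Any]], Any],
            num_iterations: int) -> Any:
    """Repeat: R = A_diag(pi_theta), T = A_gen(R), theta <- A_RL(theta; T)."""
    theta = theta0
    for _ in range(num_iterations):
        report = diagnose(theta)             # R^(k)
        train_set = generate(report)         # T^(k)
        theta = rl_update(theta, train_set)  # theta^(k+1)
    return theta


# Toy stand-ins (assumptions for illustration only): the "model" is a dict of
# per-category skill, diagnosis returns the weakest category, generation emits
# one sample for it, and the update raises that skill by one per sample.
def toy_diagnose(theta: Dict[str, int]) -> str:
    return min(theta, key=theta.get)


def toy_generate(report: str) -> List[str]:
    return [report]


def toy_update(theta: Dict[str, int], train_set: List[str]) -> Dict[str, int]:
    updated = dict(theta)
    for category in train_set:
        updated[category] += 1
    return updated


final = run_dpe({"ocr": 0, "math": 1}, toy_diagnose, toy_generate, toy_update, 3)
```

Because every round re-diagnoses the updated model, the loop shifts its attention once a previously weak category catches up.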

The diagnostic mechanism initiates each iteration by performing explicit failure attribution and capability decomposition. It maps multimodal reasoning into a 12-dimensional capability space $C = \{c_1, c_2, \ldots, c_K\}$ with $K = 12$, including categories such as geometry images, medical images, statistical charts, and natural scenes. From a diagnostic pool $\mathcal{D}_{\text{diag}}$, the system samples $N = 200$ instances $\{(I_n, q_n, a_n, c_n)\}_{n=1}^{N}$, and the model generates responses $\hat{y}_n \sim \pi_{\theta^{(k)}}(\cdot \mid I_n, q_n)$. Diagnostic agents score each response using a function $v(\cdot)$ that evaluates both reasoning steps and final results, producing a scalar correctness signal $z_n$. For each category $c$, the system computes counts and accuracy:

$$N_c = \sum_{n=1}^{N} \mathbb{I}[c_n = c], \qquad \mathrm{Acc}_c = \frac{1}{N_c} \sum_{n=1}^{N} \mathbb{I}[c_n = c] \cdot z_n.$$
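The counts and accuracies above reduce to a single pass over the scored diagnostic samples. The sketch below assumes each sample has already been reduced to a `(category, z)` pair; the function name `category_accuracy` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def category_accuracy(results: Iterable[Tuple[str, float]]) -> Dict[str, Tuple[int, float]]:
    """Return {category: (N_c, Acc_c)} from (c_n, z_n) pairs."""
    counts: Dict[str, int] = defaultdict(int)
    score_sums: Dict[str, float] = defaultdict(float)
    for category, z in results:
        counts[category] += 1        # N_c
        score_sums[category] += z    # sum of z_n over samples with c_n = c
    # Categories that never appear in the sample are omitted, so N_c > 0 here.
    return {c: (counts[c], score_sums[c] / counts[c]) for c in counts}


stats = category_accuracy([("chart", 1), ("chart", 0), ("ocr", 1)])
```

A category with no diagnostic samples has an undefined accuracy under this formula, so a caller would need to decide how to treat it.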

Beyond accuracy, agents analyze the error set $\mathcal{E}_c = \{n \mid c_n = c, \ z_n = 0\}$ to summarize recurring failure patterns $\mathcal{F}_c$, such as OCR misalignments or chart legend mismatches. These patterns are injected into the generation phase as executable prompts. The system then derives a category proportion vector $\alpha^{(k)}$ by assigning unnormalized weights $\tilde{\alpha}_c$ based on segmented accuracy ranges and normalizing:

$$\alpha_c^{(k)} = \frac{\tilde{\alpha}_c}{\sum_{c' \in C} \tilde{\alpha}_{c'}}.$$
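A minimal sketch of the weighting step follows. The text does not give the accuracy segments or their weights, so the buckets in `DEFAULT_BUCKETS` are purely illustrative assumptions, as is the function name `category_proportions`.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical segments: (upper accuracy bound, unnormalized weight).
# Lower accuracy receives a larger weight; the actual values are not stated
# in the text and are chosen here only for illustration.
DEFAULT_BUCKETS: Sequence[Tuple[float, float]] = ((0.4, 3.0), (0.7, 2.0), (1.0, 1.0))


def category_proportions(accuracy: Dict[str, float],
                         buckets: Sequence[Tuple[float, float]] = DEFAULT_BUCKETS
                         ) -> Dict[str, float]:
    """Map per-category accuracy to normalized proportions alpha_c."""
    raw = {}
    for category, acc in accuracy.items():
        # First bucket whose upper bound covers the accuracy; fall back to the last.
        raw[category] = next((w for upper, w in buckets if acc <= upper), buckets[-1][1])
    total = sum(raw.values())
    return {c: w / total for c, w in raw.items()}


alpha = category_proportions({"ocr": 0.3, "chart": 0.6, "scene": 0.9})
```

With these assumed buckets the weakest category (`ocr`) receives half of the data mixture, while the strongest still keeps a nonzero share.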

The final diagnostic report $\mathcal{R}^{(k)}$ includes $\alpha^{(k)}$, $\{\mathcal{F}_c^{(k)}\}$, and $\{\mathcal{H}_c^{(k)}\}$, where $\mathcal{H}_c^{(k)}$ provides actionable generation instructions such as enforcing stricter answer formats or longer reasoning chains.

The Multiple Agents Questioner System translates $\mathcal{R}^{(k)}$ into a training dataset $\mathcal{T}^{(k)} = \{(I_j, q_j, a_j, c_j)\}_{j=1}^{M}$ with controllable distribution and verifiable answers. Given a target budget $M$, the system enforces a hard category quota constraint: for each category $c$, $m_c = \left\lfloor M \cdot \alpha_c^{(k)} \right\rfloor$, and the final dataset must satisfy:

$$\sum_{(I, q, a, c) \in \mathcal{T}^{(k)}} \mathbb{I}[c = c'] = m_{c'}, \quad \forall c' \in C.$$
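The quota rule and the constraint check can be sketched as follows. The names `category_quotas` and `satisfies_quotas` are hypothetical, and the small tolerance added before flooring is an assumption to guard against floating-point error, not something the text specifies. Note that flooring can leave the quotas summing to slightly less than $M$; how any remainder is handled is not stated here.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def category_quotas(budget: int, proportions: Dict[str, float]) -> Dict[str, int]:
    """m_c = floor(M * alpha_c) for every category."""
    # The 1e-9 tolerance (an assumption) avoids cases like 100 * 0.29 = 28.999...
    return {c: math.floor(budget * a + 1e-9) for c, a in proportions.items()}


def satisfies_quotas(categories: Iterable[str], quotas: Dict[str, int]) -> bool:
    """Check that the dataset holds exactly m_c samples of each category."""
    counts = Counter(categories)
    exact = all(counts.get(c, 0) == m for c, m in quotas.items())
    no_extras = all(c in quotas for c in counts)
    return exact and no_extras


quotas = category_quotas(1000, {"ocr": 0.5, "chart": 0.25, "scene": 0.25})
thirds = category_quotas(10, {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3})
leftover = 10 - sum(thirds.values())  # budget not assigned after flooring
```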

The system comprises four agents: Planner, Image Selector, Question Generator, and Validation. The Planner Agent outputs a plan for each sample $j$ :

$$\mathrm{plan}_j = \big(c_j,\ \mathrm{req}_j^{I},\ \mathrm{req}_j^{Q},\ \mathrm{dir}_j\big),$$

where $c_j$ is the target category, $\mathrm{req}_j^I$ specifies image requirements, $\mathrm{req}_j^Q$ specifies question requirements, and $\mathrm{dir}_j$ targets weaknesses derived from $\mathcal{F}_{c_j}^{(k)}$ and $\mathcal{H}_{c_j}^{(k)}$. The Image Selector Agent retrieves or composes images $I_j$ from an external pool $\mathcal{P}_{\text{ext}}$ using a pipeline $\phi(\cdot)$ that includes search, filtering, and editing capabilities. The Question Generator Agent produces $(q_j, a_j)$ given $I_j$ and planning instructions:

$$(q_j, a_j) = \psi\big(I_j,\ \mathrm{req}_j^{Q},\ \mathcal{H}_{c_j}^{(k)}\big).$$
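One way to picture the Planner's output is as a small record per sample, expanded from the quotas and the diagnostic report. Everything in this sketch is an assumption made for illustration: the `Plan` field names, the `make_plans` helper, and the way failure patterns and instructions are turned into strings. The text does not describe the Planner's internal format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Plan:
    category: str               # c_j
    image_requirements: str     # req_j^I
    question_requirements: str  # req_j^Q
    direction: str              # dir_j, derived from F_c and H_c


def make_plans(quotas: Dict[str, int],
               failure_patterns: Dict[str, List[str]],
               instructions: Dict[str, str]) -> List[Plan]:
    """Emit m_c plans per category, each carrying that category's weaknesses."""
    plans: List[Plan] = []
    for category, m in quotas.items():
        patterns = "; ".join(failure_patterns.get(category, []))
        hint = instructions.get(category, "")
        direction = f"target: {patterns} | follow: {hint}"
        for _ in range(m):
            plans.append(Plan(category, f"image category: {category}", hint, direction))
    return plans


plans = make_plans(
    {"chart": 2, "ocr": 1},
    {"chart": ["legend mismatch"], "ocr": ["text misalignment"]},
    {"chart": "require a numeric final answer", "ocr": "require exact transcription"},
)
```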

The Validation Agent gates sample quality using four checks: category consistency, solvability, answer verifiability, and format compliance. The final acceptance condition is:

$$g(s_j) = g_{\mathrm{cat}} \cdot g_{\mathrm{sol}} \cdot g_{\mathrm{ver}} \cdot g_{\mathrm{fmt}}.$$

If $g(s_j) = 1$, the sample is added to $\mathcal{T}^{(k)}$ and the quota state is updated; otherwise, it is discarded and regenerated.
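The accept-or-regenerate rule can be sketched as a product of binary checks wrapped in a quota-filling loop. The names `gate` and `fill_quotas`, the dictionary sample format, and the `max_attempts` cap are assumptions for illustration; in the paper the four checks are performed by the Validation Agent, whereas here they are arbitrary predicates supplied by the caller.

```python
import itertools
from typing import Callable, Dict, List, Sequence

Sample = Dict[str, str]
Check = Callable[[Sample], bool]


def gate(sample: Sample, checks: Sequence[Check]) -> int:
    """Product of binary checks: 1 only if every check passes."""
    result = 1
    for check in checks:
        result *= int(bool(check(sample)))
    return result


def fill_quotas(quotas: Dict[str, int],
                propose: Callable[[str], Sample],
                checks: Sequence[Check],
                max_attempts: int = 1000) -> List[Sample]:
    """Generate until each category reaches its quota, discarding rejects."""
    dataset: List[Sample] = []
    remaining = dict(quotas)
    attempts = 0
    for category in quotas:
        while remaining[category] > 0:
            if attempts >= max_attempts:
                raise RuntimeError("attempt budget exhausted before quotas were met")
            attempts += 1
            sample = propose(category)
            if gate(sample, checks) == 1:
                dataset.append(sample)
                remaining[category] -= 1  # update quota state
            # otherwise the sample is discarded and a new one is proposed
    return dataset


# Toy generator (assumption): every other proposal has an empty answer.
_ids = itertools.count()


def toy_propose(category: str) -> Sample:
    i = next(_ids)
    return {"category": category, "question": f"q{i}", "answer": "" if i % 2 == 0 else str(i)}


toy_checks = [lambda s: s["answer"] != "", lambda s: s["question"].startswith("q")]
accepted = fill_quotas({"ocr": 2, "chart": 1}, toy_propose, toy_checks)
```

Multiplying the checks makes any single failure veto the sample, which matches the all-or-nothing acceptance condition above.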

Training proceeds via GRPO. For each prompt $x$, the old policy $\pi_{\theta_{\text{old}}}$ generates $G$ trajectories $y_i = (o_{i,1}, \ldots, o_{i,|y_i|}) \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)$. Each trajectory receives a scalar reward $r_i = r(x, y_i)$. GRPO optimizes the clipped surrogate objective:

$$J_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, \{y_i\} \sim \pi_{\theta_{\mathrm{old}}}} \Bigg[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \min\bigg( \rho_{i,t} A_{i,t},\ \mathrm{clip}(\rho_{i,t}, 1 - \varepsilon, 1 + \varepsilon)\, A_{i,t} \bigg) - \beta\, \mathrm{KL}\big(\pi_{\theta} \parallel \pi_{\mathrm{init}}\big) \Bigg]$$

where $\rho_{i,t} = \frac{\pi_{\theta}(o_{i,t} \mid x, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid x, o_{i,<t})}$ is the token-level importance ratio, $\varepsilon$ is the clipping threshold, $\beta > 0$ controls KL regularization, and $\pi_{\text{init}}$ is a reference policy. A key innovation is the group-normalized advantage, which is shared by every token of trajectory $i$ (that is, $A_{i,t} = \hat{A}_i$):

$$\hat{A}_i = \frac{r_i - \mathrm{mean}(r_1, \ldots, r_G)}{\mathrm{std}(r_1, \ldots, r_G)}.$$
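The advantage normalization and the per-token clipped term can be written in a few lines. This is a sketch under stated assumptions: it uses the population standard deviation, adds a small `eps` to the denominator to avoid division by zero when all rewards in a group are equal, and uses `clip_eps = 0.2` as an illustrative default. None of these three choices is given in the text.

```python
import math
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """A_i = (r_i - mean(r)) / std(r), computed within one group of G rollouts."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    # eps is an assumption: it keeps the result finite when std == 0.
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for a single token."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


mixed = group_advantages([1.0, 0.0, 1.0, 0.0])
uniform = group_advantages([1.0, 1.0, 1.0, 1.0])
```

When every rollout in a group receives the same reward, all advantages are zero and the clipped term contributes no learning signal for that prompt.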

From a maximum-entropy perspective, the optimal policy satisfies $\pi^{*}(y \mid x) \propto \pi_{\mathrm{init}}(y \mid x) \exp(r(x, y) / \beta)$, and the KL divergence admits a lower bound:

$$\mathrm{KL}(\pi_{\mathrm{init}} \parallel \pi^{*}) \geq \frac{p(x)\big(1 - p(x)\big)}{2\beta^{2}},$$

where $p(x)$ is the pass rate under $\pi_{\text{init}}$. This bound is maximized near $p = 0.5$, explaining why DPE retains only moderately difficult samples to improve learning efficiency.
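A difficulty filter in this spirit can be sketched by estimating each prompt's pass rate from sampled rollouts and keeping only prompts in a middle band. The thresholds `low = 0.2` and `high = 0.8`, the rule that a reward above zero counts as a pass, and the function names are all illustrative assumptions; the text states only that moderately difficult samples are retained.

```python
from typing import Dict, List


def pass_rate(rewards: List[float]) -> float:
    """Fraction of rollouts that pass (assumption: reward > 0 means pass)."""
    return sum(1 for r in rewards if r > 0) / len(rewards)


def bound_numerator(p: float) -> float:
    """The p(1 - p) factor of the lower bound, largest at p = 0.5."""
    return p * (1.0 - p)


def filter_by_difficulty(rollout_rewards: Dict[str, List[float]],
                         low: float = 0.2,
                         high: float = 0.8) -> List[str]:
    """Keep prompt ids whose empirical pass rate lies in [low, high]."""
    return [pid for pid, rewards in rollout_rewards.items()
            if low <= pass_rate(rewards) <= high]


kept = filter_by_difficulty({
    "easy": [1, 1, 1, 1],   # p = 1.0, p(1 - p) = 0
    "hard": [0, 0, 0, 0],   # p = 0.0, p(1 - p) = 0
    "mid": [1, 0, 1, 0],    # p = 0.5, p(1 - p) = 0.25
})
```

Prompts that are always solved or never solved have $p(1 - p) = 0$, so filtering them out concentrates the training budget where the bound is largest.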

At iteration $k$, DPE generates and validates $\mathcal{T}^{(k)}$, applies difficulty-aware filtering to obtain $\mathcal{T}_{\text{train}}^{(k)}$, and performs GRPO to update the model: $\theta^{(k+1)} = \mathcal{A}_{\text{RL}}\left(\theta^{(k)}; \mathcal{T}_{\text{train}}^{(k)}\right)$. The system then repeats the diagnostic round, progressively strengthening weak capabilities and expanding visual coverage through external image sources.

Experiment

DPE outperforms VisPlay in capability enhancement, training stability, and cross-model transferability, particularly excelling in STEM, OCR, and hallucination mitigation through a closed-loop diagnostic mechanism.
DPE achieves state-of-the-art results with parameter efficiency, surpassing larger models like Qwen2.5-VL-72B and GPT-4o in complex visual math and grounding tasks, highlighting the value of data quality over scale.
Ablation studies confirm DPE’s diagnostic module is essential for sustained improvement, preventing performance oscillation and guiding data generation toward true capability gaps.
DPE’s image retrieval and editing tools significantly expand visual diversity, preventing early plateaus and improving performance on OCR and math reasoning by covering long-tail visual patterns.
Generated data from DPE shows higher and more stable text and image diversity across iterations, avoiding template collapse and maintaining broad semantic and visual coverage.
Quality evaluations reveal DPE consistently produces high-quality, solvable, and visually grounded questions, while VisPlay’s output degrades over time, especially in correctness and structure.
Case studies illustrate DPE’s ability to generate complete, well-structured, and semantically grounded questions, unlike VisPlay’s incomplete or unanswerable examples.

The authors use a diagnostic-guided data evolution framework to iteratively improve vision-language models under low-data conditions, achieving consistent gains across diverse benchmarks including STEM, OCR, and hallucination mitigation. Results show that their method sustains stable performance growth across iterations while outperforming self-evolving baselines and larger state-of-the-art models, particularly in complex reasoning and grounding tasks. The approach proves effective across model scales and relies on targeted data generation rather than volume, with diagnostic feedback ensuring continuous alignment with model weaknesses.

The authors use a multi-agent system to generate training data iteratively, with DPE consistently producing higher-quality questions than VisPlay across all iterations, particularly in solvability and correctness. Results show DPE maintains stable, near-ceiling quality scores while VisPlay’s quality degrades over time, indicating DPE’s diagnostic guidance effectively sustains data reliability. This quality advantage directly supports more stable and effective model evolution compared to self-evolving baselines.

The authors use DPE to generate training data with higher and more stable text and image diversity compared to VisPlay, as measured by mean pairwise cosine distance across iterations. Results show DPE sustains diversity gains over time while VisPlay exhibits degradation, particularly in later iterations, indicating DPE’s mechanisms better prevent distribution collapse and template reversion. This enhanced diversity supports broader semantic and visual coverage, contributing to more robust model performance.
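Mean pairwise cosine distance, the diversity measure named above, can be computed as follows. The sketch assumes each question or image has already been mapped to an embedding vector; which embedding model is used is not stated here, and the function names are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_cosine_distance(embeddings: List[Sequence[float]]) -> float:
    """Average cosine distance over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


diverse = mean_pairwise_cosine_distance([[1.0, 0.0], [0.0, 1.0]])
collapsed = mean_pairwise_cosine_distance([[1.0, 0.0], [1.0, 0.0]])
```

A value near zero indicates that generated samples have collapsed onto near-identical templates, while larger values indicate broader coverage.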

The authors use DPE to enhance Qwen3-VL-8B-Instruct under low-data conditions, achieving state-of-the-art performance across multiple benchmarks including visual math and hallucination mitigation. Results show DPE outperforms larger models like Qwen2.5-VL-72B and GPT-4o in key areas, demonstrating that targeted data generation and diagnostic feedback yield stronger gains than parameter scale alone. The method sustains stable improvements across iterations by focusing on model weaknesses and maintaining high data quality and diversity.

The authors use DPE to iteratively generate high-quality training data from a small seed set, achieving performance gains over static training despite using only 3K samples. Results show consistent improvements across multiple benchmarks, including MMMU, HallusionBench, MathVista, and RealWorldQA, indicating that targeted data generation based on diagnostic feedback enhances model capabilities more effectively than larger static datasets. The method demonstrates stable training dynamics and superior data efficiency, with gains sustained across iterations without performance regression.

From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models

Hongrui Jia Chaoya Jiang Shikun Zhang Wei Ye
