Command Palette
Search for a command to run...
RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards
RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards
Abstract
Training deep research agents—systems that plan, search, evaluate evidence, and synthesize long-form reports—pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators, but as the shared interface that structures policy execution, judge feedback, and agent memory. Based on this view, we introduce RUBRICEM, a rubric-guided reinforcement learning framework that combines stagewise policy decomposition with reflection-based meta-policy training. RUBRICEM first makes research trajectories stage-aware by conditioning planning, evidence gathering, review, and synthesis on self-generated rubrics. It then assigns credit with Stage-Structured GRPO, which uses stagewise rubric judgments to provide denser semantic feedback for long-horizon optimization. In parallel, RUBRICEM trains a shared-backbone reflection meta-policy that distills judged trajectories into reusable rubric-grounded guidance for future attempts. The resulting RUBRICEM-8B achieves strong performance across four representative long-form research benchmarks, outperforming comparable open models and approaching proprietary deep-research systems. Beyond final performance, we perform thorough analyses to understand the key ingredients of RUBRICEM.
One-sentence Summary
The authors introduce RUBRICEM, a rubric-guided reinforcement learning framework combining stagewise policy decomposition with reflection-based meta-policy training to optimize deep research agents beyond verifiable rewards by employing Stage-Structured GRPO for denser semantic feedback and distilling judged trajectories into reusable guidance, enabling RUBRICEM-8B to achieve strong performance across four representative long-form research benchmarks while outperforming comparable open models and approaching proprietary deep-research systems.
Key Contributions
- This work introduces RUBRICEM, a rubric-guided reinforcement learning framework that uses rubrics as a shared interface to structure policy execution, judge feedback, and agent memory. The method makes research trajectories stage-aware by conditioning planning, evidence gathering, review, and synthesis on self-generated rubrics.
- Credit assignment relies on Stage-Structured GRPO, which leverages stagewise rubric judgments to provide denser semantic feedback for long-horizon optimization. A shared-backbone reflection meta-policy runs asynchronously to distill judged trajectories into reusable rubric-grounded guidance without imposing sequential bottlenecks.
- Experiments demonstrate that RUBRICEM-8B achieves strong performance across four representative long-form research benchmarks. The resulting system outperforms comparable open models and approaches proprietary deep-research systems in these evaluations.
Introduction
Training deep research agents requires moving beyond verifiable rewards since their long form outputs lack ground truth answers and standard post training offers little reusable experience. Prior methods rely on verifiable search proxies or imitation data, leaving a gap in handling the coarse and delayed feedback of open ended research trajectories. The authors introduce RUBRICEM, a framework that treats rubrics as a shared interface to structure policy execution and agent memory. This system combines stagewise policy decomposition with reflection based meta policy training to provide denser semantic feedback and distill reusable guidance from judged trajectories.
Dataset
-
Dataset Composition and Sources
- The authors construct a supervised fine-tuning dataset comprising approximately 11,000 samples.
- Data originates from agent trajectories generated by a Gemini teacher model and adapted for Qwen3.
- The final volume is roughly 2,000 samples smaller than the DR Tulu baseline due to rigorous filtering.
-
Filtering and Rejection Rules
- Trajectories lacking a closing
</answer>tag are discarded as a hard reject. - Samples missing valid tool calls in non-final rounds are removed to prevent reliance on internal knowledge.
- Data without required XML structures like
<structured_plan>or<state_evaluation>is excluded. - Any trajectory with two or more consecutive tool errors is rejected to ensure reliability.
- Trajectories lacking a closing
-
Processing and Training Format
- Reasoning tags are converted from
<scratchpad>to<think>for template compatibility. - Tool names are normalized to canonical identifiers such as
google_search. - Each sample is formatted as a single-turn ChatML conversation with system, user, and assistant messages.
- Tool output tokens are masked during training so the model does not memorize search results.
- Reasoning tags are converted from
-
Evaluation Benchmarks
- Performance is measured on four long-form datasets including HealthBench and ResearchQA.
- These benchmarks range from 100 to 1,000 questions covering medical and scientific research domains.
Method
The RubricEM framework operates by treating rubrics as a shared interface across the agent's planning, execution, and learning phases. As illustrated in the framework diagram, the system integrates three core components: a rubric-guided structured trajectory for the task policy, a stage-structured reinforcement learning algorithm for credit assignment, and a reflection meta-policy for experience reuse.
Structured Reasoning Scaffold
To manage long-horizon research tasks, the authors impose an explicit stage structure on agent trajectories. This scaffold decomposes the generation process into four semantically distinct stages: Plan, Research, Review, and Answer. Each stage is marked by XML tags and governed by specific behavioral requirements. In the planning phase, the agent generates task-specific rubrics that define the criteria for success, including a knowledge checklist and negative constraints. These rubrics then guide the subsequent research and synthesis phases. The detailed workflow shows how the agent analyzes needs, defines grading criteria, and plans the search before executing tool calls.
This structure allows the policy to condition its decisions on the current stage, avoiding the aliasing of decision modes that occurs in flat autoregressive processes. A concrete example demonstrates how a query about sleep patterns is broken down into these stages, with specific rubrics generated during planning to direct the research and review steps.
Stage-Structured GRPO and Meta-Policy Training
Standard reinforcement learning methods often broadcast a single terminal reward to all tokens, which is inefficient for long-horizon tasks. RubricEM employs Stage-Structured GRPO (SS-GRPO) to provide finer-grained credit assignment. Instead of a single score, the LLM judge evaluates each stage (Plan, Research, Review, Answer) against stage-specific rubrics. These stagewise scores are combined using a causal stage-dependence matrix to compute returns that account for downstream impact. The training pipeline visualizes how task rollouts are judged to generate discriminative rubrics, which are then used to score the trajectories and update the policy.
Beyond optimizing the task policy, the framework explicitly trains a reflection meta-policy to reuse experience. The task policy and reflection meta-policy share the same backbone. After a task rollout is judged, the backbone samples rubric-grounded reflection candidates. A separate judge scores these candidates based on their utility for within-episode refinement and cross-episode transfer. The highest-scoring reflection is stored in a rubric bank, serving as natural-language memory for future queries. To ensure efficiency, the system uses an asynchronous execution pipeline where reflection generation and training run in parallel with task rollouts, avoiding sequential bottlenecks. The infrastructure diagram highlights this asynchronous design, showing how the training engine consumes deferred reflection batches while the inference engine generates new rollouts.
Experiment
The evaluation assesses RUBRICEM on four representative long-form benchmarks using an infrastructure adapted from DR Tulu. Results demonstrate that the proposed reinforcement learning recipe significantly improves performance over supervised fine-tuning and outperforms strong open baselines while remaining competitive with proprietary systems. Ablation studies validate that stagewise credit assignment and structured scaffolding contribute complementary gains, whereas inference-time experience reuse proves effective only with the learned meta-policy. Furthermore, the model exhibits strong generalization to short-form tasks, confirming that the training teaches transferable tool-use and evidence-grounding skills rather than just long-form report writing.
The the the table compares short-form search performance between the proposed RUBRICEM model and DR Tulu baselines. RUBRICEM demonstrates consistent improvements across all benchmarks, with the RL-fine-tuned version achieving the best overall results. RUBRICEM-8B (RL) achieves the highest average score, surpassing strong open baselines like DR Tulu-8B (RL). The model generalizes effectively to short-form tasks despite being trained primarily on long-form deep research data. RUBRICEM reaches superior performance using fewer RL training steps compared to the DR Tulu baseline.
The authors evaluate RUBRICEM against proprietary and open-source deep research models across multiple benchmarks. Results indicate that RUBRICEM-8B-RL achieves the highest performance among non-proprietary systems, surpassing strong open baselines like DR Tulu and Tongyi DeepResearch. Furthermore, the model demonstrates competitive capabilities against top-tier proprietary systems, particularly outperforming them on the DRB benchmark. RUBRICEM-8B-RL achieves the highest average performance among non-proprietary deep research systems evaluated. The Reinforcement Learning stage yields significant performance gains over the Supervised Fine-Tuning baseline. The model outperforms OpenAI Deep Research on the DRB benchmark while remaining competitive with other closed models overall.
The authors evaluate RUBRICEM against open-source and proprietary deep research models to validate its performance across short-form search and deep research benchmarks. Results indicate that the RL-fine-tuned version generalizes effectively from long-form training data, achieving top performance among non-proprietary systems. Additionally, the model surpasses strong open baselines and remains competitive with top-tier proprietary systems while requiring fewer reinforcement learning training steps.