REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents

Abstract

Large language models are transitioning from general-purpose knowledge engines to real-world problem solvers, yet optimizing them for deep search tasks remains challenging. The central bottleneck lies in the extreme sparsity of high-quality search trajectories and reward signals, arising from the difficulty of scalable long-horizon task construction and the high cost of interaction-heavy rollouts involving external tool calls. To address these challenges, we propose REDSearcher, a unified framework that co-designs complex task synthesis, mid-training, and post-training for scalable search-agent optimization. Specifically, REDSearcher introduces the following improvements: (1) We frame task synthesis as a dual-constrained optimization, where task difficulty is precisely governed by graph topology and evidence dispersion, allowing scalable generation of complex, high-quality tasks. (2) We introduce tool-augmented queries to encourage proactive tool use rather than passive recall. (3) During mid-training, we strengthen core atomic capabilities (knowledge, planning, and function calling), substantially reducing the cost of collecting high-quality trajectories for downstream training. (4) We build a local simulated environment that enables rapid, low-cost algorithmic iteration for reinforcement learning experiments. Across both text-only and multimodal search-agent benchmarks, our approach achieves state-of-the-art performance. To facilitate future research on long-horizon search agents, we will release 10K high-quality complex text search trajectories, 5K multimodal trajectories, and a 1K text RL query set, together with code and model checkpoints.

One-sentence Summary

The REDSearcher team proposes a unified framework for optimizing search agents by co-designing task synthesis, mid-training, and post-training, using graph-constrained task generation and tool-augmented queries to reduce reliance on costly real-world rollouts, achieving SOTA across text and multimodal benchmarks while releasing 15K trajectories, a 1K RL query set, and code.

Key Contributions

  • REDSearcher addresses the scarcity of high-quality search trajectories by synthesizing complex tasks via dual constraints—graph treewidth for logical complexity and evidence dispersion—enabling scalable generation of long-horizon reasoning problems that demand iterative planning and cross-document synthesis.
  • It introduces tool-augmented queries and mid-training reinforcement of core capabilities (knowledge, planning, function calling) to promote proactive tool use and reduce the cost of collecting high-quality trajectories, while a local simulated environment enables rapid, low-cost RL experimentation.
  • Evaluated on text-only and multimodal benchmarks, REDSearcher achieves state-of-the-art performance and releases 10K text search trajectories, 5K multimodal trajectories, and a 1K RL query set to support future research on deep search agents.

Introduction

The authors leverage large language models to tackle long-horizon search tasks—where agents must plan, retrieve, and synthesize information across multiple steps and sources—but note that prior work struggles with sparse high-quality training data and prohibitive costs from live tool interactions. Existing datasets often lack structural complexity and rely on simplistic, linear reasoning, while real-world search demands handling cyclic or fully coupled constraints that require maintaining entangled hypotheses. REDSearcher addresses this by co-designing task synthesis, mid-training, and reinforcement learning: it generates complex tasks using treewidth-guided graph topology and evidence dispersion, injects tool-augmented queries to promote proactive tool use, strengthens core subskills early to reduce rollout costs, and deploys a simulated environment for rapid RL iteration. The result is a scalable, cost-efficient framework that achieves state-of-the-art performance on both text and multimodal search benchmarks, backed by public releases of 15K high-quality trajectories, a 1K RL query set, and other training artifacts.

Dataset

  • The authors construct a synthetic dataset to train deep search agents capable of handling multi-hop, ambiguous, and non-linear queries—tasks that demand iterative tool use and evidence synthesis, which existing open-source datasets lack.

  • The dataset is generated via a scalable, controllable synthesis pipeline that combines signals from local knowledge bases and cached webpages, intentionally increasing difficulty through fuzzing and complex constraints.

  • To ensure quality and challenge, the authors apply a five-stage verifier pipeline (a minimal code sketch follows this list):

    1. LLM solver pre-filter: removes instances solvable without tools.
    2. Retrievability check: filters out questions whose answers don’t appear in top-50 search snippets.
    3. Hallucination/inconsistency check: uses an LLM verifier to detect contradictions between evidence and question-answer pairs.
    4. Agent rollout verification: runs strong tool-using agents across multiple rollouts; keeps instances where at least one rollout succeeds and records pass rate as confidence.
    5. Answer uniqueness check: discards instances with plausible alternative answers to reduce ambiguity.
  • A quality study confirms 85%+ of 500 human-verified instances are logically consistent and grounded; a strong model (DeepSeek-V3.2) achieves ~40% accuracy, while humans solve 47% within 30 minutes—validating the dataset’s realistic difficulty.

  • For training, the authors generate multi-turn tool-calling data simulating ReAct loops, using LLMs to create tool sets, queries, and environmental feedback, avoiding costly real API calls.

  • Long-horizon interaction trajectories (up to 128K context) are synthesized using a local simulated web environment built from Wikipedia and web crawl dumps, ensuring solvability and enabling training on complex, multi-step search tasks.

  • The dataset includes highly intricate, real-world-inspired questions requiring cross-domain reasoning, such as identifying record pressing plants, healthcare facilities, racing events, and historical sites—all grounded in synthesized but plausible evidence.

  • No cropping is applied; metadata is constructed implicitly through the synthesis pipeline, embedding grounding signals (e.g., KB triples, cached passages) and confidence metrics (e.g., agent pass rates) for each instance.
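
The staged filtering described above lends itself to a simple pipeline abstraction. Below is a minimal Python sketch of the five-stage verifier; the predicate functions are hypothetical placeholders for the LLM solver, search API, LLM verifier, and tool-using agent, and their stub return values exist only to keep the sketch runnable.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    question: str
    answer: str
    evidence: list[str]
    confidence: float = 0.0  # agent pass rate, recorded in stage 4

# Hypothetical stand-ins for the LLM- and agent-based checks described
# above; real implementations would call an LLM solver, a search API,
# an LLM verifier, and a tool-using agent.
def solvable_without_tools(inst: Instance) -> bool:
    return False  # stage 1: does an LLM solve it with no tools?

def answer_in_top_snippets(inst: Instance, k: int = 50) -> bool:
    return True   # stage 2: does the answer appear in top-k snippets?

def evidence_consistent(inst: Instance) -> bool:
    return True   # stage 3: no evidence/QA contradictions?

def agent_rollout_pass_rate(inst: Instance, n: int) -> float:
    return 0.25   # stage 4: fraction of successful agent rollouts

def has_unique_answer(inst: Instance) -> bool:
    return True   # stage 5: no plausible alternative answers?

def verify(instances: list[Instance], n_rollouts: int = 8) -> list[Instance]:
    """Keep only instances that survive all five verifier stages."""
    kept = []
    for inst in instances:
        if solvable_without_tools(inst):        # 1. LLM solver pre-filter
            continue
        if not answer_in_top_snippets(inst):    # 2. retrievability check
            continue
        if not evidence_consistent(inst):       # 3. hallucination check
            continue
        rate = agent_rollout_pass_rate(inst, n_rollouts)
        if rate == 0.0:                         # 4. at least one rollout succeeds
            continue
        if not has_unique_answer(inst):         # 5. answer uniqueness
            continue
        inst.confidence = rate                  # pass rate kept as confidence
        kept.append(inst)
    return kept
```

Recording the rollout pass rate as a per-instance confidence score lets downstream training weight or filter examples by empirical difficulty.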

Method

The authors leverage a structured, multi-phase training framework to develop REDSearcher, a tool-augmented agent capable of deep, long-horizon search across text and multimodal domains. The architecture is built upon a scalable task synthesis pipeline, a two-stage mid-training regimen, and a post-training phase that combines supervised fine-tuning with reinforcement learning. Each component is designed to address the sparsity of supervision and the computational cost of real-world interaction.

The core of the method begins with the scalable task synthesis pipeline, which generates complex, verifiable QA pairs by constructing reasoning graphs with controlled structural and distributional complexity. The pipeline initiates with a seed entity set drawn from Wikipedia, from which a directed acyclic graph (DAG) is built using both structured Wikidata relations and web-based hyperlink traversal. This graph is then enriched by an LLM-driven agent to introduce cycles and interlocking constraints, increasing the treewidth and forcing the solver to maintain multiple hypotheses. Subgraph sampling extracts multiple reasoning contexts from each master graph, and an LLM generates natural-language questions anchored to these topologies. A critical innovation is the tool-enforced query evolution: static entities are replaced with operational constraints (e.g., routing queries or citation-based lookups) that require external tool invocation, ensuring that successful completion is contingent on tool use.
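
To make the treewidth control concrete, here is a minimal sketch using networkx: a seed reasoning chain is densified with additional edges until an approximate treewidth target is reached. In the authors' pipeline the enrichment is performed by an LLM agent over verified Wikidata relations and hyperlinks, so the random edge sampling below is purely illustrative.

```python
import random
import networkx as nx
from networkx.algorithms.approximation import treewidth_min_degree

def approx_treewidth(g: nx.DiGraph) -> int:
    """Upper bound on treewidth via the min-degree heuristic."""
    width, _ = treewidth_min_degree(g.to_undirected())
    return width

def densify_to_target(g: nx.DiGraph, target_width: int,
                      rng: random.Random) -> nx.DiGraph:
    """Add interlocking edges (cycles, shared constraints) until the
    approximate treewidth reaches the target, mimicking the LLM-driven
    enrichment step."""
    nodes = list(g.nodes)
    while approx_treewidth(g) < target_width:
        u, v = rng.sample(nodes, 2)
        g.add_edge(u, v)  # the real pipeline would add a verified relation
    return g

# Toy usage: a chain of 8 "entities" densified to treewidth >= 3.
rng = random.Random(0)
g = nx.DiGraph()
nx.add_path(g, range(8))
g = densify_to_target(g, target_width=3, rng=rng)
print(approx_treewidth(g))  # >= 3
```

Higher treewidth forces a solver to keep several partially resolved hypotheses alive at once, which is exactly the behavior that linear reasoning chains fail to elicit.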

To ensure quality and difficulty, a verifier pipeline filters out instances that are solvable without tool use and discards hallucinated or unretrievable QA pairs, while an agent solver performs n rollouts to validate answer consistency. Only QA pairs that survive this multi-stage filtering are retained for training. For multimodal tasks, the pipeline injects visual constraints by anchoring intermediate nodes to images and enforcing cross-modal dependencies, ensuring that visual understanding is necessary for task completion.

The training process is divided into two major phases: mid-training and post-training. The model begins from an open-source LLM checkpoint and progresses through atomic capability acquisition (Stage 1, 32K context) and composite capability development (Stage 2, 128K context) during mid-training. This is followed by agentic supervised fine-tuning (SFT) and reinforcement learning (RL) in the post-training phase.
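
The staged curriculum can be pictured as a simple configuration table; the sketch below is hypothetical, with stage names and fields that are illustrative rather than taken from the authors' released code.

```python
# Hypothetical staged-training configuration (illustrative field names).
STAGES = [
    {"name": "mid_stage1_atomic",    "context_len": 32_768,
     "focus": ["grounding", "planning", "function_calling"]},
    {"name": "mid_stage2_composite", "context_len": 131_072,
     "focus": ["long_horizon_tool_use", "state_tracking"]},
    {"name": "post_sft",             "context_len": 131_072,
     "data": "react_trajectories", "mask_observations": True},
    {"name": "post_rl",              "context_len": 131_072,
     "algorithm": "GRPO", "reward": "binary_answer_match",
     "env": "simulated"},
]
```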

Mid-training is further decomposed into two phases. Phase I focuses on internal cognitive optimization: intent-anchored grounding, which teaches the model to extract relevant facts from noisy web pages under specific query intents, and hierarchical planning, which enables decomposition of ambiguous goals into concrete subtasks. Phase II introduces external environmental interaction, where the model learns to execute tool calls and maintain state across long-horizon trajectories. This staged approach allows the model to warm-start with foundational skills before engaging in costly real-world rollouts.

In post-training, the model undergoes supervised fine-tuning on high-quality ReAct-style trajectories generated in real-world environments using five tool interfaces: search, visit, Python interpreter, Google Scholar, and Google Maps. The SFT objective masks environment observations to prevent gradient contamination. Subsequently, agentic reinforcement learning is applied using GRPO with verifiable rewards. The reward is binary (0/1) based on answer correctness, and advantages are normalized within groups of rollouts per question to stabilize training. To accelerate experimentation, a functionally equivalent simulation environment is used during RL, which mimics real APIs while ensuring evidence completeness and injecting realistic noise. The simulation environment is built from cached web data and includes URL obfuscation to prevent model bias. Asynchronous rollouts and a two-tier load balancing strategy are employed to handle the computational demands of long trajectories.
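
Two of these ingredients are easy to sketch: observation masking in the SFT loss and group-normalized advantages in GRPO. The sketch below assumes binary rewards have already been collected for each question's group of rollouts; the full GRPO objective also includes a clipped policy-ratio term, omitted here for brevity.

```python
import torch
import torch.nn.functional as F

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-normalized advantages. `rewards` has shape
    (n_questions, n_rollouts) with binary 0/1 entries; each rollout's
    advantage is normalized within its question's group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def masked_sft_loss(logits: torch.Tensor, targets: torch.Tensor,
                    observation_mask: torch.Tensor) -> torch.Tensor:
    """Token-level cross-entropy that ignores environment-observation
    tokens (observation_mask == True), so gradients flow only through
    the agent's own reasoning and tool-call tokens."""
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                           targets.view(-1), reduction="none")
    keep = (~observation_mask).view(-1).float()
    return (loss * keep).sum() / keep.sum().clamp(min=1.0)
```

With binary rewards, a group whose rollouts all succeed (or all fail) yields zero advantage everywhere, so only questions with mixed outcomes contribute gradient, which naturally focuses training on tasks at the edge of the model's ability.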

The entire framework is designed to scale efficiently: task synthesis reuses graphs to amortize LLM costs, mid-training avoids real-world interaction until necessary, and RL leverages a curated, agent-verified query set to ensure clean learning signals. The result is a deep-search agent that can iteratively acquire evidence, maintain hypotheses, and synthesize information across multiple sources and modalities.
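
The simulated environment mentioned above can be pictured as a thin wrapper over a local page cache. The sketch below is a hypothetical simplification: it serves cached pages under hash-obfuscated URLs (so the model cannot exploit memorized domain patterns) and exposes the same search/visit interface as the real tools, with a toy keyword search standing in for the authors' retrieval backend.

```python
import hashlib

class SimulatedWebEnv:
    """Hypothetical cached-web environment for cheap RL rollouts."""

    def __init__(self, cached_pages: dict[str, str]):
        # Map obfuscated URL -> page text.
        self._pages = {self._obfuscate(u): text
                       for u, text in cached_pages.items()}

    @staticmethod
    def _obfuscate(url: str) -> str:
        # Hash real URLs so the policy cannot lean on memorized domains.
        digest = hashlib.sha256(url.encode()).hexdigest()[:16]
        return f"https://sim.local/{digest}"

    def search(self, query: str, k: int = 5) -> list[str]:
        """Toy keyword search over the cache; returns obfuscated URLs."""
        words = query.lower().split()
        hits = [u for u, text in self._pages.items()
                if all(w in text.lower() for w in words)]
        return hits[:k]

    def visit(self, url: str) -> str:
        return self._pages.get(url, "404: page not in cache")
```

Because every answer's evidence is guaranteed to exist in the cache, a failed rollout reflects the policy rather than a gap in the environment, which keeps the binary reward signal clean.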

Experiment

  • REDSearcher sets a new state-of-the-art among open-source 30B-parameter agents, outperforming both open and proprietary models on complex benchmarks like GAIA, demonstrating superior parameter efficiency and deep research capability.
  • Mid-training stages progressively enhance performance: Stage I improves grounding and planning, especially on GAIA; Stage II enables robust tool use and long-horizon execution, significantly boosting performance on BrowseComp-ZH.
  • Reinforcement learning further refines capabilities, improving overall scores and reducing tool usage by 10.4% without sacrificing accuracy, indicating more efficient and strategic search behavior.
  • Tool-use analysis reveals REDSearcher relies minimally on parametric knowledge, excelling only when tools are enabled—highlighting strong planning, evidence synthesis, and iterative reasoning over memorization.
  • Multimodal experiments show strong performance across vision-language benchmarks, outperforming large proprietary models and a Qwen3-VL baseline, with capabilities transferring well to text-only tasks.
  • Analysis of tool usage patterns shows adaptive behavior: simpler tasks require fewer turns, while complex ones involve more decomposition, reflection, and verification; RL training reduces unnecessary search steps, especially on easier benchmarks.

The authors use a staged mid-training approach to progressively enhance the model’s agentic capabilities, with each stage building upon the last to improve performance across multiple benchmarks. Results show consistent gains in average scores as the model advances through grounding, planning, and agentic interaction phases, particularly on complex tasks like GAIA and BrowseComp-ZH. This structured training strategy effectively bridges the gap between understanding and action, enabling more robust and goal-consistent behavior in deep search scenarios.

The authors evaluate their multimodal search agent, REDSearcher-MM, across diverse benchmarks and find it outperforms both proprietary and open-source baselines, particularly on complex tasks requiring visual grounding and long-horizon reasoning. Results show consistent gains after reinforcement learning, with improved efficiency in tool usage and stronger performance on challenging multimodal benchmarks like MM-BrowseComp and LiveVQA. The model also demonstrates robust transferability, maintaining strong results on text-only tasks despite being optimized for multimodal inputs.

The authors use a 30B-parameter model with context management to achieve state-of-the-art performance among open-source agents, outperforming larger proprietary models on key benchmarks including GAIA. Progressive mid-training stages and reinforcement learning improve long-horizon search efficiency, refining tool use and reducing redundant tool calls without sacrificing accuracy, while the model also integrates visual and textual evidence effectively across diverse multimodal benchmarks.

Evaluated across multiple challenging benchmarks, including BrowseComp, GAIA, and HLE, REDSearcher achieves competitive or superior scores relative to both open-source and larger proprietary systems, particularly on GAIA, which tests complex agentic reasoning. The authors attribute this strong performance to the architecture and training methodology, including context management and reinforcement learning, which enhance efficiency and long-horizon task execution.

