HyperAIHyperAI

Command Palette

Search for a command to run...

Self-Improving Language Models with Bidirectional Evolutionary Search

Guowei Xu Zhenting Qi Huangyuan Su Weirui Ye Himabindu Lakkaraju Sham M. Kakade Yilun Du

Abstract

Search has been proposed as an effective method for self-improving language models and agentic systems, both for post-training sample generation and for inference. However, widely used methods such as best-of-N sampling and tree search face two fundamental limitations: they are guided by sparse verification signals, and they construct candidates primarily through autoregressive expansion, restricting exploration to regions with substantial model probability mass. To address these, we propose Bidirectional Evolutionary Search (BES), a search framework that couples forward candidate evolution with backward goal decomposition. In the forward search, BES augments standard expansion with evolution operators that recombine partial trajectories to generate candidates that are difficult to obtain from a single model rollout. In the backward search, BES recursively decomposes the original task into checkable subgoals, producing dense intermediate feedback that guides forward search. We provide theoretical motivation showing that candidates generated by expansion-only search are confined to a narrow entropy shell while evolutionary operators can escape it, and that backward search can exponentially reduce the number of required samples to find a correct answer. Experiments show that on challenging post-training tasks where mainstream post-training algorithms fail to improve, BES enables consistent gains, and on three open problem solving benchmarks at inference time, BES outperforms existing open-source frameworks in both average and best-case performance. Code and trained models are available at https://github.com/Embodied-Minds-Lab/BES.

One-sentence Summary

The authors propose Bidirectional Evolutionary Search (BES), a framework for self-improving language models that overcomes the sparse verification and restricted exploration of conventional tree search by coupling forward evolutionary recombination of partial trajectories with backward recursive task decomposition, theoretically enabling candidates to escape narrow entropy shells and exponentially reducing the samples required to find correct answers.

Key Contributions

  • Bidirectional Evolutionary Search (BES) couples forward candidate evolution with backward goal decomposition to enhance self-improvement in language models and agentic systems. The forward search applies combination, translocation, deletion, and crossover operators to recombine partial trajectories, while the backward search recursively decomposes tasks into checkable sub-goals that yield dense intermediate feedback.
  • Theoretical analysis demonstrates that standard autoregressive expansion confines candidates to a narrow entropy shell, whereas evolutionary operators escape this constraint. Backward goal decomposition is further shown to exponentially reduce the sample complexity required to locate correct solutions.
  • Empirical evaluations across post-training and inference settings show that BES consistently discovers high-quality training samples on logical and multi-hop reasoning tasks where GRPO, MaxRL, and Tree-GRPO fail to improve base models. On three open problem-solving benchmarks, BES discovers more stable and higher-quality solutions than OpenEvolve, GEPA, and ShinkaEvolve.

Introduction

Large language models and agentic systems have achieved remarkable performance on complex reasoning and code generation tasks, making efficient sampling a critical factor for both post-training self-improvement and inference-time test-time scaling. Current approaches primarily rely on best-of-N sampling and tree search, but these methods face two fundamental challenges. First, they depend on sparse verification signals that offer only binary or coarse-grained feedback, which weakens search guidance. Second, because they generate candidates through auto-regressive expansion, they remain confined to the model's existing distribution and struggle to explore the low-probability regions where optimal solutions typically reside. To overcome these bottlenecks, the authors introduce Bidirectional Evolutionary Search, which combines forward candidate generation with backward decomposition into verifiable sub-goals to deliver dense feedback. They further adapt evolutionary biology by applying recombination and mutation operators to splice together segments from different reasoning trajectories, enabling the model to escape narrow entropy shells and systematically discover higher-quality solutions.

Dataset

  • Dataset composition and sources: The authors compile a benchmark suite for open problem solving tasks, pairing each task with specialized prompts tailored for their evolutionary search system. The data draws from established coding benchmarks and centers on iterative program improvement.

  • Key details for each subset: The collection features a backward search decomposition prompt that extracts verifiable sub goals from goal tree leaves, alongside four evolution operation prompts labeled DIFF, DIFF_ABLATE, FULL, and CROSS. Each benchmark also records the structure of the optimal program discovered during evaluation.

  • Data usage and processing: The authors employ these prompts to drive both backward search decomposition and forward search mutations. All prompt templates rely on standard Python string formatting placeholders to dynamically inject code content, performance metrics, and textual feedback into the model context.

  • Output structure and metadata: Generated responses follow a strict JSON array format containing fields for entry type, descriptive rationale, verification code, and expected results. Input contexts are carefully structured to isolate current program states, parent code skeletons, and associated metrics, ensuring clear traceability throughout the optimization process.

Method

The Bidirectional Evolutionary Search (BES) framework integrates a forward search with a backward search to overcome the limitations of existing methods that rely on sparse verification signals and constrained candidate generation. The overall architecture, as illustrated in the framework diagram, consists of two coupled processes that iteratively refine candidate solutions. The forward search expands and recombines partial trajectories to generate new candidates, while the backward search decomposes the problem into verifiable sub-goals to provide dense intermediate feedback. This bidirectional interaction allows the forward search to explore a broader solution space guided by rich, structured feedback, enabling the discovery of solutions that are difficult to obtain through autoregressive expansion alone.

The forward search operates on a candidate set P\mathcal{P}P of partial trajectories, where each trajectory is a sequence of steps (e.g., reasoning segments, actions). At each search step, the algorithm applies one of two types of operators to a parent node to produce a child node. Expansion extends a parent trajectory by sampling new steps from the policy πθ\pi_\thetaπθ, as shown in Figure 2(a). This is the standard autoregressive method. In contrast, evolution operators recombine existing trajectories to generate new candidates, thereby escaping the narrow entropy shell of the policy. As detailed in Figure 2, the four evolution operators are: (i) Combination, which merges two trajectories by concatenating their suffixes beyond a shared prefix; (ii) Deletion, which removes an interior step from a trajectory; (iii) Translocation, which replaces a step in one trajectory with a step from another; and (iv) Crossover, which splices the prefix of one trajectory onto the tail of another. The selection of an operator and its parent(s) is guided by a Boltzmann distribution over the backward score, with the temperature annealed over the search budget to shift from exploration to exploitation.

The backward search addresses the sparsity of the final verifier signal by recursively decomposing the original problem into a tree of fine-grained sub-goals. This process, shown in the framework diagram, transforms the problem into a hierarchy where each sub-goal is verifiable. The backward search calculates a score for each forward node by recursively traversing this goal tree. A node's score is a weighted combination of its local verification against a sub-goal and the average score of its children. This provides a dense, interpretable feedback signal that guides the forward search. For instance, a candidate that correctly identifies the artist in a multi-hop reasoning problem but fails to find the record label receives a partial score, allowing it to be selected as a parent for further refinement. The pair score, which measures the joint coverage of a goal tree by two nodes, further encourages the selection of complementary parents.

The BES algorithm alternates between forward and backward steps. The forward search maintains a pool of candidate trajectories and applies expansion or evolution operators. After a fixed number of forward steps, the backward search is invoked to refine the goal tree by decomposing unsolved leaf sub-goals and re-scoring all candidates. This iterative process continues until a terminal candidate with a high score is found or the compute budget is exhausted. The framework's effectiveness is underpinned by theoretical results showing that evolution operators can escape the entropy shell of expansion-only search, and that backward sub-goal decomposition provides an exponential advantage in finding a solution by transforming a multiplicative problem into a sub-goal collection problem. The full pseudocode details the main loop, which orchestrates these components.

Experiment

The experiments evaluate BES across post-training and inference settings, using logical and multi-hop reasoning tasks to validate sample discovery and agent search behavior, alongside open problem-solving benchmarks to assess solution quality and search stability. Qualitatively, BES consistently outperforms existing baselines by effectively identifying high-quality training samples and guiding models to actively engage in multi-step reasoning rather than relying on shortcuts or random guessing. The framework demonstrates robust performance with notably lower variance across runs and maintains reasonable computational overhead, while ablation studies confirm that both the bidirectional search mechanism and evolution operators are essential to its success. Overall, BES provides a reliable and efficient search framework that enhances reasoning capabilities and solution discovery across diverse tasks.

The authors evaluate BES on open problem solving benchmarks, comparing it against open-source frameworks and high-performing closed-source methods. Results show that BES achieves competitive performance across all tasks, often matching or exceeding the best results from other frameworks while demonstrating lower variance and higher stability. BES also outperforms base frameworks in average and best objective values, particularly in Circle Packing and Heilbronn Convex problems. BES achieves competitive performance on open problem solving benchmarks, matching or exceeding results from established open-source frameworks and high-performing closed-source methods. BES demonstrates lower variance across runs compared to other frameworks, indicating more stable and reliable search behavior. BES outperforms base frameworks in both average and best objective values, particularly in Circle Packing and Heilbronn Convex problems.

The authors evaluate BES in both post-training and inference settings, demonstrating its effectiveness in improving logical and multi-hop reasoning through better sampling and search strategies. Results show that BES consistently outperforms baselines in accuracy, search behavior, and stability, while maintaining reasonable computational overhead. The framework's components are validated through ablation studies and cost analysis, confirming the importance of bidirectional search and evolution operators. BES achieves substantial improvements in accuracy and search behavior compared to baselines in multi-hop reasoning tasks. BES outperforms open-source frameworks in open problem solving benchmarks with lower variance and higher stability. Ablation studies confirm that both bidirectional search and evolution operators are essential for BES's effectiveness.

The authors evaluate BES on multi-hop reasoning post-training using the MuSiQue dataset, comparing it against baselines including GRPO and Tree-GRPO. Results show that BES achieves higher accuracy and significantly improves the number of valid search actions and finish ratios compared to all baselines. The method consistently outperforms other approaches across both model sizes, indicating better learning of search behavior and more effective training. BES achieves higher accuracy and better performance on search-related metrics compared to GRPO and Tree-GRPO. BES improves the number of valid search actions and finish ratios, indicating more effective learning of search behavior. BES outperforms baselines on both model scales, demonstrating consistent gains across different model sizes.

The authors evaluate BES on multi-hop reasoning post-training using the MuSiQue dataset with two model sizes. Results show that BES significantly improves accuracy over baselines, with substantial gains on both model scales. BES also achieves higher numbers of valid search actions and better finish ratios, indicating more effective search behavior compared to the baselines. BES achieves higher accuracy than baselines on both model scales, with notable improvements over GRPO and Tree-GRPO. BES produces more valid search actions and higher finish ratios, indicating more effective search behavior. BES outperforms baselines across all metrics, demonstrating superior performance in multi-hop reasoning tasks.

The authors evaluate BES on post-training and inference tasks, comparing it against baselines such as GRPO and Tree-GRPO. Results show that BES consistently improves performance across logical reasoning, multi-hop reasoning, and open problem solving benchmarks, with stable and reliable search behavior. The ablation study confirms the importance of both bidirectional search and answer reweighting components in BES. BES achieves consistent improvements over baselines in both post-training and inference tasks, demonstrating superior sampling and search capabilities. BES shows stable performance with lower variance across runs compared to open-source frameworks, indicating more reliable and consistent search behavior. Ablation studies confirm that both bidirectional search and answer reweighting are essential components contributing to the effectiveness of BES.

The authors evaluate BES across open problem solving benchmarks and multi-hop reasoning tasks in both post-training and inference settings, comparing it against established open-source, closed-source, and baseline methods. These experiments validate that BES consistently enhances logical and multi-hop reasoning by learning more effective sampling and search strategies, resulting in significantly more reliable and stable performance across different model scales. Ablation studies further confirm that bidirectional search, evolution operators, and answer reweighting are essential components driving these improvements. Overall, BES proves to be a robust framework that outperforms existing approaches while maintaining reasonable computational overhead.


Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding
Ready-to-use GPUs
Best Pricing

HyperAI Newsletters

Subscribe to our latest updates
We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning
Powered by MailChimp