Command Palette
Search for a command to run...
CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery
CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery
Abstract
Large language model (LLM)-based evolution is a promising approach for open-ended discovery, where progress requires sustained search and knowledge accumulation. Existing methods still rely heavily on fixed heuristics and hard-coded exploration rules, which limit the autonomy of LLM agents. We present CORAL, the first framework for autonomous multi-agent evolution on open-ended problems. CORAL replaces rigid control with long-running agents that explore, reflect, and collaborate through shared persistent memory, asynchronous multi-agent execution, and heartbeat-based interventions. It also provides practical safeguards, including isolated workspaces, evaluator separation, resource management, and agent session and health management. Evaluated on diverse mathematical, algorithmic, and systems optimization tasks, CORAL sets new state-of-the-art results on 10 tasks, achieving 3-10 times higher improvement rates with far fewer evaluations than fixed evolutionary search baselines across tasks. On Anthropic's kernel engineering task, four co-evolving agents improve the best known score from 1363 to 1103 cycles. Mechanistic analyses further show how these gains arise from knowledge reuse and multi-agent exploration and communication. Together, these results suggest that greater agent autonomy and multi-agent evolution can substantially improve open-ended discovery. Code is available at https://github.com/Human-Agent-Society/CORAL.
One-sentence Summary
Researchers from MIT, NUS, and other institutions introduce CORAL, an autonomous multi-agent framework that replaces rigid heuristics with persistent memory and asynchronous collaboration. This approach achieves state-of-the-art results on diverse optimization tasks, including a 20% improvement on GPU kernel engineering, by enabling sustained knowledge accumulation and agent autonomy.
Key Contributions
- The paper introduces CORAL, a framework for autonomous multi-agent evolution that replaces rigid control with long-running agents utilizing shared persistent memory, asynchronous execution, and heartbeat-based interventions to explore and collaborate on open-ended problems.
- This work establishes a new paradigm by formulating autonomous evolution as a distinct approach that delegates search decisions to agents, enabling them to iteratively refine solutions through knowledge retrieval, contribution, and distillation without fixed heuristics.
- Experiments demonstrate that CORAL sets new state-of-the-art results on 10 diverse tasks with 3–10× higher improvement rates than fixed baselines, while mechanistic analyses confirm that these gains stem from effective knowledge reuse and multi-agent communication.
Introduction
Open-ended discovery in fields like mathematical optimization and systems engineering requires sustained iterative search rather than one-shot generation, yet current LLM-based approaches rely on fixed evolutionary heuristics that limit agent autonomy. These rigid pipelines force agents to follow hard-coded rules for parent selection and exploration, preventing them from adapting their search strategy or effectively reusing knowledge across long horizons. The authors introduce CORAL, a framework that replaces these static controls with autonomous multi-agent evolution where long-running agents collaborate through shared persistent memory and asynchronous execution. By allowing agents to decide what to explore, reflect on progress via heartbeat mechanisms, and accumulate reusable skills, CORAL achieves state-of-the-art results with significantly fewer evaluations than traditional baselines.
Dataset
-
Dataset Composition and Sources The authors utilize a curated collection of 13 evaluation tasks spanning mathematical optimization, systems optimization, and stress-test problems. These tasks are sourced from established benchmarks like ADRS and include specific challenges such as the Erdős Minimum Overlap Problem, Transaction Scheduling, and Kernel Engineering. The data is structured via YAML configuration files that define task metadata, grading logic, and agent parameters.
-
Key Details for Each Subset
- Mathematical Optimization: Includes 6 tasks such as circle packing and signal processing, where agents must solve complex inequalities or minimize overlap integrals.
- Systems Optimization: Comprises 5 tasks like EPLB and LLM-SQL, focusing on algorithmic improvements without web search capabilities.
- Stress-Test Problems: Features high-complexity scenarios like the VLIW SIMD Kernel Builder, which requires optimizing code to reduce cycle counts from a baseline of ~147,734 to a best-known ~1,363.
- Grading Logic: Each task employs a Python-based grader that executes agent solutions in a subprocess, validates constraints, and returns a numerical score or error status.
-
Usage in Model Training and Evaluation The authors use this dataset exclusively for evaluation rather than training, running agents asynchronously to generate performance data. The system supports both single-agent and multi-agent configurations, with the latter allowing up to 4 agents to collaborate on a single task. Evaluation metrics are derived from the ratio of the agent's score against the baseline and best-known solutions, with specific timeouts (e.g., 600s or 1100s) enforced per task.
-
Processing and Metadata Construction
- Shared Persistent Memory: The authors implement a centralized filesystem within the
.coral/public/directory to store artifacts. This includes attempt records (JSON files keyed by commit hash), hierarchical notes (Markdown with YAML frontmatter), and reusable skills. - Artifact Management: Agents interact with shared memory via symbolic links to their isolated worktrees, preventing accidental commits while enabling access to native file tools.
- Concurrency Handling: The system avoids explicit locking by assigning unique filenames for notes and skills and using commit hashes for attempt records, ensuring no file-level conflicts occur during asynchronous execution.
- Metadata Enrichment: Each attempt record captures the agent ID, score, status, parent hash, and detailed feedback, while notes are organized by topic to consolidate knowledge and track open questions.
- Shared Persistent Memory: The authors implement a centralized filesystem within the
Method
The authors propose CORAL, a framework designed to facilitate autonomous multi-agent evolution for open-ended discovery tasks. In this paradigm, the goal is to iteratively discover increasingly strong candidate solutions under evaluator feedback without a known optimal target. The process is abstracted into four stages: Retrieve, Propose, Evaluate, and Update.
Refer to the comparison of search paradigms below:

Traditional methods often rely on fixed evolutionary search where external rules govern the Retrieve and Update stages, limiting the agent's role primarily to Propose. In contrast, CORAL implements autonomous single-agent evolution where the agent controls the timing and realization of all four stages. This is further extended to autonomous multi-agent evolution, where multiple agents run asynchronously and coordinate through shared persistent memory rather than direct communication. This design increases exploration diversity and allows agents to inspire one another indirectly.
The overall workflow of the CORAL framework is illustrated in the diagram below:

The system operates through a central Manager Infra that handles agent lifecycle and heartbeat coordination. Each agent runs in an isolated workspace and executes an autonomous loop consisting of reading attempts, notes, and skills from the shared memory, planning and editing code, running the evaluation grader, and writing new notes and skills back to the shared store. The Shared Persistent Memory is structured as a file system with three root folders: attempts for historical evaluations, notes for observations and reflections, and skills for reusable procedures. To prevent agents from stagnating in local minima, a Heartbeat Monitor triggers periodic interventions such as Reflection (recording notes), Consolidation (organizing notes into skills), and Redirection (pivoting strategy when no improvement is observed).
The underlying software architecture is modular and organized into six key components as shown below:

The Configuration module parses YAML task definitions to initialize the system. The Agent System manages the lifecycle of agents through an AgentManager and HeartbeatRunner, ensuring persistence and handling interruptions. The Grader Hierarchy provides a pluggable evaluation interface where a BaseGrader defines the protocol for scoring candidates, implemented by specific graders like TaskGrader or FunctionGrader. Workspace Setup creates isolated per-agent worktrees with symbolic links to the Hub, which stores the shared persistent memory. Finally, Core Types define the data models used throughout the system, such as Task, Score, and ScoreBundle, ensuring consistent data flow between components.
Experiment
- CORAL is evaluated on mathematical optimization, systems optimization, and challenging stress-test problems, demonstrating that autonomous multi-agent evolution significantly outperforms fixed evolutionary search baselines by achieving new state-of-the-art results on the majority of tasks.
- The autonomous design allows agents to dynamically decide exploration strategies and pivot approaches based on feedback, resulting in much higher improvement rates and faster convergence compared to methods relying on predefined heuristics.
- Multi-agent co-evolution extends the search frontier beyond single-agent capabilities, particularly on complex tasks where individual runs plateau early, by enabling diverse exploration trajectories and the organic diffusion of techniques through shared persistent memory.
- Qualitative analysis reveals that local verification of code before external evaluation and the accumulation of reusable knowledge artifacts are critical drivers of performance, especially for advanced tasks requiring deep architectural insights.
- Ablation studies confirm that the performance gains stem from the co-evolution mechanism and knowledge accumulation rather than simply increased compute resources, with benefits generalizing effectively to open-source model stacks.