Loop Engineering: The Anthropic Playbook for Designing Systems That Prompt Your Agents
Peter Steinberger Boris Cherny Addy Osmani
Abstract
Over the past two years a string of "XX Engineering" terms has tracked the pace of model releases. This note examines the newest of them, Loop Engineering, a term independently surfaced in June 2026 by Peter Steinberger, Boris Cherny, and Addy Osmani, and named in writing by Osmani. Unlike prompt, context, or harness engineering, loop engineering does not teach the practitioner to do the work better; it removes the practitioner from the position of doing the work at all. We define the term, place it as a fourth layer above the harness, and decompose a single turn of a loop into five moves—discovery, handoff, verification, persistence, and scheduling—and the six parts that realize them. We give particular attention to the generator/evaluator separation: empirically, an agent asked to grade its own output tends to praise it, and tuning an independent skeptical evaluator is far more tractable than making a generator critical of its own work. We survey three loops running in practice, from one engineer's morning triage to Stripe's enterprise-scale pipeline merging over 1,300 machine-written pull requests per week, and we catalog four costs that accrue silently—verification debt, comprehension rot, cognitive surrender, and token blowout. We close with a concrete recipe for building a first loop. The central claim is that loops make generation nearly free and leave judgment as the scarce resource; the same loop, built by two people, can yield opposite outcomes.
One-sentence Summary
In this note, Peter Steinberger, Boris Cherny, and Addy Osmani introduce Loop Engineering, a fourth layer above harness engineering in which practitioners stop performing the work themselves and instead design self-prompting agent loops; they decompose each turn into discovery, handoff, verification, persistence, and scheduling, separate the generator from the evaluator because agents grading their own output tend to praise it, and survey real-world loops ranging from a personal morning triage to Stripe’s pipeline merging over 1,300 machine-written pull requests per week, concluding that loops make generation nearly free, leave judgment as the scarce resource, and can produce opposite outcomes in different hands.
Key Contributions
- The note defines loop engineering as a fourth layer above harness engineering, decomposing a single loop turn into five moves (discovery, handoff, verification, persistence, scheduling) and six constituent parts.
- It introduces a generator/evaluator separation, empirically showing that agents overpraise their own outputs and that an independently tuned skeptical evaluator is far more tractable than making a generator self-critical.
- The note surveys three real-world loops, catalogs four hidden costs (verification debt, comprehension rot, cognitive surrender, token blowout), provides a concrete build recipe, and establishes that loops make generation nearly free, concentrating engineering value into judgment as the scarce resource.
Introduction
The authors examine a new paradigm called Loop Engineering, which shifts the practitioner from directly prompting AI coding agents to designing autonomous systems that prompt themselves. This matters because earlier approaches—prompt, context, and harness engineering—all kept a human in the loop, limiting scalability and requiring constant attention. The key limitation of prior work is that the human must act as the clock and decision-maker, unable to step away. The authors’ main contribution is a formal definition of loop engineering, a decomposition of a loop’s turn into five moves (discovery, handoff, verification, persistence, and scheduling), and an emphasis on the generator/evaluator split to maintain judgment while automating generation.
Method
The authors propose a hierarchical framework for engineering AI agents, culminating in a self-running loop architecture. This framework stacks four distinct layers, each expanding the scope of concern. As shown in the figure below, the stack progresses from Prompt Engineering at the base, through Context and Harness Engineering, to Loop Engineering at the top.
Prompt Engineering manages the wording for a single exchange. Context Engineering curates the model's field of view. Harness Engineering equips a single run with tools and actions. Loop Engineering automates the entire process, allowing the system to wake on a schedule, spawn sub-agents, and feed its own output back as input for subsequent rounds.
A functional loop executes a concrete cycle of five moves rather than spinning idly. As illustrated in the diagram below, these moves form a continuous turn that feeds the next iteration.
First, Discovery identifies work worth doing, such as reading CI failures, allowing the agent to find its own tasks. Second, Handoff moves the task to an isolated environment, like a git worktree, to prevent collisions during parallel execution. Third, Verification checks the result, serving as the critical mechanism to reject poor output. Fourth, Persistence saves state to disk so the loop survives context window clearing. Finally, Scheduling triggers the next turn automatically.
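The five moves can be sketched as one function per move plus a `run_turn` that chains them. This is a minimal illustration, not the authors' implementation: `LoopState`, `run_turn`, the `worker` callable, and the `state.json` file name are all hypothetical, a plain directory stands in for a git worktree, and a `tests_passed` flag stands in for a real verifier.

```python
import json
from dataclasses import dataclass, field, asdict
from pathlib import Path


@dataclass
class LoopState:
    """State that must outlive any single context window (hypothetical shape)."""
    turn: int = 0
    done: list = field(default_factory=list)
    rejected: list = field(default_factory=list)


def discover(backlog, state):
    """Discovery: pick the first task not yet handled (stand-in for reading CI failures)."""
    seen = set(state.done) | set(state.rejected)
    return next((task for task in backlog if task not in seen), None)


def handoff(task, root):
    """Handoff: give the task its own directory (stand-in for a git worktree)."""
    workdir = Path(root) / task
    workdir.mkdir(parents=True, exist_ok=True)
    return workdir


def verify(result):
    """Verification: reject any result that does not report passing tests."""
    return bool(result.get("tests_passed", False))


def persist(state, path):
    """Persistence: write state to disk so the next turn can start cold."""
    Path(path).write_text(json.dumps(asdict(state)))


def run_turn(backlog, state, root, worker):
    """Run one turn; return True when scheduling should trigger another."""
    task = discover(backlog, state)
    if task is None:
        return False  # nothing worth doing, so the loop goes back to sleep
    workdir = handoff(task, root)
    result = worker(task, workdir)
    (state.done if verify(result) else state.rejected).append(task)
    state.turn += 1
    persist(state, Path(root) / "state.json")
    return True
```

In a real loop the caller of `run_turn` would be a scheduler (a timer or an event trigger) rather than a `while` loop, and `worker` would be the agent run itself.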
To enable these moves, the architecture relies on six structural parts. Automations trigger the loop based on time or events. Worktrees provide isolation for parallel agents. Skills store permanent project knowledge to reduce intent debt. Connectors link the loop to external tools via protocols like MCP. Sub-agents split the writer from the judge. Memory ensures state persists across days outside the conversation window.
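As one illustration of the worktree part, the sketch below builds a `git worktree add` command that gives each task its own branch and directory, so parallel agents never share a checkout. The `loop/<task_id>` branch naming, the `-wt-` directory suffix, and both function names are assumptions made for this example, not conventions from the source.

```python
import subprocess
from pathlib import Path


def worktree_add_cmd(repo, task_id, base="main"):
    """Build the argument list for `git worktree add -b <branch> <path> <base>`."""
    repo = Path(repo)
    # Place the worktree next to the repository, one directory per task.
    path = repo.parent / f"{repo.name}-wt-{task_id}"
    return ["git", "-C", str(repo), "worktree", "add",
            "-b", f"loop/{task_id}", str(path), base]


def create_worktree(repo, task_id, base="main"):
    """Run the command; raises CalledProcessError if git refuses (e.g. branch exists)."""
    cmd = worktree_add_cmd(repo, task_id, base)
    subprocess.run(cmd, check=True)
    return Path(cmd[-2])  # the worktree path is the second-to-last argument
```

Building the argument list separately from running it keeps the isolation rule easy to inspect and to test without touching a real repository.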
The most critical architectural decision involves the verification module. The authors note that agents tend to praise their own work, which produces a "nodding loop" where errors accumulate. To solve this, the framework applies a Maker-Checker principle. As shown in the figure below, the architecture structurally splits the agent into a Generator and an Evaluator.
The Generator writes the code. The Evaluator, often a different model instructed to assume the code is broken, reviews it. Crucially, the Evaluator acts by running tests or inspecting the DOM rather than just reading code.
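A minimal sketch of this split follows, assuming the Generator is any callable that returns Python source and the Evaluator judges by executing that source against test cases instead of reading it. `Verdict`, `evaluate`, and `generate_then_check` are hypothetical names, and `exec()` appears only to keep the example short; a real loop would run the candidate in a sandboxed process.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    accepted: bool
    reason: str


def evaluate(source, func_name, cases):
    """Skeptical evaluator: the code is treated as broken until every case passes."""
    namespace = {}
    try:
        exec(source, namespace)  # run the candidate rather than reading it
        func = namespace[func_name]
        for args, expected in cases:
            actual = func(*args)
            if actual != expected:
                return Verdict(
                    False,
                    f"{func_name}{args} returned {actual!r}, expected {expected!r}",
                )
    except Exception as exc:  # any crash, including a syntax error, is a rejection
        return Verdict(False, f"{type(exc).__name__}: {exc}")
    return Verdict(True, "all cases passed")


def generate_then_check(generator, func_name, cases, max_attempts=3):
    """Call the generator until the evaluator accepts or attempts run out."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        source = generator(feedback)
        verdict = evaluate(source, func_name, cases)
        if verdict.accepted:
            return source, attempt
        feedback = verdict.reason  # the judge's objection becomes the next prompt
    return None, max_attempts
```

The generator never sees the evaluator's instructions and only receives its objections, which keeps the writer and the judge structurally separate.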
The stop condition is managed by a fresh model checking if a specific goal is met. The code snippet below demonstrates this logic, where a small fast model checks the condition after each turn.
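The original snippet is not reproduced in this text. As a stand-in, here is a minimal sketch of the same idea under stated assumptions: `run_until`, `step`, and `goal_met` are hypothetical names, and `goal_met` represents the call to a small, fast model that sees only the goal and the latest output.

```python
from typing import Callable, Tuple


def run_until(goal: str,
              step: Callable[[int], str],
              goal_met: Callable[[str, str], bool],
              max_turns: int = 10) -> Tuple[int, str]:
    """Run `step` until an independent checker says the goal is met.

    `goal_met` receives only the goal and the latest output, never the
    working agent's conversation, so each check starts from a fresh context.
    """
    for turn in range(1, max_turns + 1):
        output = step(turn)
        if goal_met(goal, output):
            return turn, output
    # A hard cap keeps a never-satisfied goal from consuming tokens indefinitely.
    raise RuntimeError(f"goal not met after {max_turns} turns: {goal!r}")
```

The `max_turns` cap is an addition for this sketch; it is one simple guard against the token blowout cost the note catalogs.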
For large-scale reliability, the authors describe the Stripe Minions pipeline. This architecture interleaves deterministic gates with probabilistic LLM steps. As depicted in the pipeline diagram, the process begins with a human trigger, followed by a deterministic orchestrator assembling context.
The LLM agent writes code, but a hard-coded lint gate runs immediately after; the agent cannot skip this step. If the lint check fails, the agent fixes the code and the gate runs again. Finally, a hard-coded step commits the code, followed by human review. This structure ensures that reliability comes from the quality of the constraints rather than from model size alone.
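The interleaving described above can be sketched as a single function in which deterministic steps are ordinary code and the LLM steps are injected callables. This is an illustration of the pattern only, not Stripe's implementation; `run_pipeline`, its parameters, the status strings, and the `max_fix_rounds` limit are all assumptions.

```python
def run_pipeline(request, assemble_context, agent, lint, commit, max_fix_rounds=2):
    """Interleave deterministic steps with LLM steps; the agent cannot skip the gate."""
    context = assemble_context(request)        # deterministic: orchestrator builds context
    code = agent(context, None)                # probabilistic: LLM writes the first draft
    problems = []
    for round_no in range(max_fix_rounds + 1):
        problems = lint(code)                  # deterministic gate, runs on every draft
        if not problems:
            commit(code)                       # deterministic: hard-coded commit step
            return {"status": "awaiting_human_review", "code": code}
        if round_no < max_fix_rounds:
            code = agent(context, problems)    # probabilistic: LLM attempts a fix
    return {"status": "gate_failed", "problems": problems}
```

Because `lint` and `commit` are called by the pipeline rather than by the agent, no prompt wording can cause them to be skipped, which is the point of putting the gate in code.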
Experiment
The evaluation contrasts local loop/desktop scheduled tasks with cloud routines and GitHub Actions schedule triggers for running background work while the user sleeps. Local scheduling demands that the machine remain powered on but enables frequent execution and direct access to local files, whereas cloud scheduling runs untethered from local state at the cost of a one-hour minimum interval and a clean clone each time. The comparison shows that no single scheduler meets all requirements, and it warns that widely circulated secondhand metrics should be treated as rough references, highlighting the greater reliability of firsthand sources.
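The trade-offs in this comparison can be written down as a small decision function. The sketch encodes only the constraints stated above (the one-hour cloud minimum, the clean clone, the powered-on machine); `pick_scheduler` is a hypothetical helper, and preferring the cloud option when both fit is an assumption of this example, not a recommendation from the source.

```python
from datetime import timedelta

# One-hour minimum interval for cloud scheduling, as stated in the comparison above.
CLOUD_MIN_INTERVAL = timedelta(hours=1)


def pick_scheduler(interval, needs_local_files, machine_always_on):
    """Return 'cloud', 'local', or None when neither option fits."""
    # Cloud runs start from a clean clone, so they cannot see local files.
    cloud_ok = interval >= CLOUD_MIN_INTERVAL and not needs_local_files
    # Local runs can be frequent and read local files, but need the machine on.
    local_ok = machine_always_on
    if cloud_ok:
        return "cloud"  # assumed preference: nothing depends on one machine staying up
    if local_ok:
        return "local"
    return None
```

The `None` branch reflects the section's conclusion that no single scheduler meets every requirement: a frequent job that needs local files on a machine that sleeps has no fit.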