a day ago

Wanli Li Bowen Zhou Yunyao Yu Zhou Xu Yifan Yang Dongsheng Li Caihua Shan

Abstract

Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line execution, code editing, browsers, and external tools. Existing benchmarks, however, often evaluate these interfaces as separable capabilities, leaving long-horizon cross-interface orchestration under-tested. Thus, we introduce WeaveBench, a long-horizon hybrid-interface benchmark with 114 tasks across 8 real-world work domains, grounded in real user requests and publicly verifiable artifacts. Each task requires agents to combine GUI observations/actions with CLI/code operations within a single trajectory. We evaluate these tasks on a real Ubuntu desktop inside deployed CLI-agent runtimes, augmented with a minimal desktop-control plugin. We also propose a companion trajectory-aware judge that inspects deliverables, files, screenshots, logs, and action traces, while detecting shortcut behaviors such as fabricated visual evidence or hard-coded metrics. Across frontier model-runtime pairings, the best PassRate reaches only 41.2%, showing the benchmark remains far from saturated. The trajectory-aware judge further reveals that outcome-only grading substantially overestimates agent performance. Overall, WeaveBench exposes a critical gap in CUA evaluation and provides an effective testbed to measure whether agents can orchestrate GUI, CLI, and code operations across long-horizon real-world tasks.

One-sentence Summary

WeaveBench is a long-horizon benchmark of 114 tasks across eight real-world domains that, unlike prior benchmarks that evaluate interfaces in isolation, tests computer-use agents on hybrid GUI, CLI, and code orchestration; its trajectory-aware judge verifies multi-step execution and detects shortcut behaviors, revealing that outcome-only grading substantially overestimates performance.

Key Contributions

WEAVEBENCH is introduced as a long-horizon hybrid-interface benchmark comprising 114 tasks across eight real-world domains that require agents to interleave graphical user interface actions with command-line and code operations within a single execution trajectory.
A trajectory-aware agentic judge is developed to audit multi-turn agent behavior by autonomously re-fetching screenshots, logs, and file states to score process and outcome dimensions while actively detecting shortcut behaviors such as fabricated visuals or hard-coded metrics.
Evaluations across deployed runtimes and frontier model pairings demonstrate that the benchmark remains unsaturated, with the highest PassRate reaching only 41.2% and trajectory-aware auditing correcting the substantial inflation caused by outcome-only grading.

Introduction

Modern computer-use agents increasingly integrate graphical desktop controls, command-line interfaces, and external tools to manage complex production workflows. This hybrid architecture matters because visual interfaces expose transient interactive states while code environments provide structured, persistent data, so real-world automation depends on coordinating the two. Existing benchmarks, however, either evaluate single-channel interactions only or include tasks that can be solved through one interface alone, and so fail to test genuine hybrid orchestration. To close this gap, the authors introduce WeaveBench, a benchmark of 114 real-world tasks that strictly require interleaving GUI observations with CLI or code execution. They deploy these tasks in live agent runtimes and pair them with a trajectory-aware evaluation system that audits the multi-step process rather than only the final output. Using this framework, the authors show that current models still struggle with long-horizon cross-interface coordination, and they position WeaveBench as a rigorous testbed for advancing hybrid computer-use agents.

Dataset

Dataset Composition and Sources

The authors introduce WEAVEBENCH, a benchmark comprising 114 long-horizon tasks across 8 real-world work domains designed to evaluate agents operating on hybrid interfaces.
Tasks are sourced from real user requests and publicly verifiable artifacts, with a release containing 174 provenance URLs spanning 82 unique hostnames.
Sources include GitHub issues and pull requests, postmortems, design mocks, the OPENCLAW user community, Reddit, Stack Exchange, YouTube, project bug trackers, and official documentation.
Approximately 80% of tasks link to at least one user-pain source where a real user reported a failure, while the remaining tasks rely on reference materials from project documentation or niche repositories.
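The provenance figures above (URL count and unique hostnames) are the kind of statistic that can be recomputed from a list of source links. A minimal sketch follows, assuming the provenance index can be read as a flat list of URL strings; the function name `provenance_summary` and the example URLs are hypothetical and are not taken from the release.

```python
from urllib.parse import urlparse


def provenance_summary(urls):
    """Return (number of URLs, number of unique hostnames)."""
    hostnames = {urlparse(u).hostname for u in urls}
    hostnames.discard(None)  # drop entries that have no network location
    return len(urls), len(hostnames)


# Hypothetical provenance entries, for illustration only.
example_urls = [
    "https://github.com/example/project/issues/1",
    "https://github.com/example/project/pull/2",
    "https://www.reddit.com/r/example/comments/abc/",
    "https://example.stackexchange.com/questions/3",
]

total, unique_hosts = provenance_summary(example_urls)
print(total, unique_hosts)
```

Note that this counts `www.reddit.com` and `reddit.com` as different hostnames; whether the reported figure normalizes such variants is not stated in the source.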

Subset Details and Filtering Rules

The dataset covers 8 domains: desktop productivity, document processing, games and interactive applications, web development, data analysis and visualization, DevOps and sysadmin, spatial and 3D/CAD, and design and creative.
Each domain contains between 10 and 18 tasks, and the tasks are organized into 23 subcategories. The floor of 10 tasks per domain is intended to ensure statistical resolution.
Tasks must satisfy three admission criteria. First, channel non-substitutability requires that success depends on interleaving GUI observations and actions with CLI or code operations within a single trajectory.
Second, long-horizon execution mandates multiple interleaved phases rather than isolated perception or tool-use steps.
Third, cross-application state demands that agents preserve and transfer information across multiple independent applications.
Construction follows a pipeline where experts define cooperation archetypes per domain, assemble self-contained bundles with environment seeds and verification anchors, conduct independent blind reviews, and run pilot validation with three agents to filter broken or trivial tasks.
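The three admission criteria and the per-domain floor can be read as simple checks over candidate tasks. The sketch below is one possible encoding under stated assumptions: the `CandidateTask` fields, the thresholds `min_phases=2` and `min_apps=2` (a literal reading of "multiple"), and both function names are hypothetical, and the actual admission process relies on expert review rather than an automated filter.

```python
from collections import Counter
from dataclasses import dataclass

DOMAIN_FLOOR = 10  # minimum tasks per domain, as stated above


@dataclass
class CandidateTask:
    task_id: str
    domain: str
    needs_gui_and_cli: bool   # criterion 1: channel non-substitutability
    interleaved_phases: int   # criterion 2: long-horizon execution
    applications: int         # criterion 3: cross-application state


def admitted(task, min_phases=2, min_apps=2):
    """Apply the three criteria; the thresholds are assumptions."""
    return (
        task.needs_gui_and_cli
        and task.interleaved_phases >= min_phases
        and task.applications >= min_apps
    )


def domains_below_floor(tasks):
    """Return the domains whose admitted-task count is under the floor."""
    counts = Counter(t.domain for t in tasks if admitted(t))
    return sorted(d for d, n in counts.items() if n < DOMAIN_FLOOR)


# Hypothetical candidates: one domain meets the floor, one does not.
candidates = [
    CandidateTask(f"web-{i}", "web development", True, 3, 2) for i in range(10)
] + [CandidateTask("design-0", "design and creative", True, 4, 3)]

print(domains_below_floor(candidates))
```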

Usage and Processing

The authors use the dataset exclusively for evaluation within deployed CLI-agent runtimes on a real Ubuntu desktop augmented with a minimal desktop-control plugin.
Evaluation employs a trajectory-aware agentic judge that inspects deliverables, files, screenshots, logs, and action traces to compute scores based on bottom-up rubrics.
Agents run under an inference-time anti-fabrication policy: generating fake GUI images with drawing libraries is explicitly prohibited, and when a screenshot cannot be captured the agent may skip it and report that honestly instead.
The benchmark captures detailed trajectory statistics, including a median of 76 tool calls and 16 GUI-to-CLI channel switches per task, with maximum rollouts reaching 471 tool calls.
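To make the trajectory statistics concrete, here is a minimal sketch of how tool calls and channel switches could be counted from an action trace. It assumes each tool call is already labeled with a channel string such as `"gui"` or `"cli"`, and it counts a switch in either direction; the source does not specify the exact definition, so both choices are assumptions, and the example traces are invented.

```python
from statistics import median


def count_channel_switches(channels):
    """Count adjacent tool calls whose channel label differs."""
    return sum(1 for prev, cur in zip(channels, channels[1:]) if prev != cur)


# Hypothetical traces: one channel label per tool call.
traces = [
    ["gui", "gui", "cli", "cli", "gui", "cli"],
    ["cli", "cli", "cli", "gui"],
    ["gui", "cli", "gui", "cli", "gui", "cli", "gui", "cli", "cli"],
]

tool_calls = [len(t) for t in traces]
switches = [count_channel_switches(t) for t in traces]

print("median tool calls:", median(tool_calls))
print("median switches:", median(switches))
print("max tool calls:", max(tool_calls))
```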

Metadata Construction

Metadata is constructed through task bundles that attach provenance indices with URLs, commit hashes, and post identifiers to each task.
Bundles include expert reference trajectories annotated with required single-channel atomic operations to audit channel usage.
Verification anchors are embedded within the metadata to support the judge in validating deliverables and detecting shortcut behaviors such as fabricated visual evidence or hard-coded metrics.
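The bundle contents described above suggest a small record structure. The following is a sketch only: every class, field, and function name is hypothetical, the anchors are modeled as plain strings, and the real bundle format is not specified in this summary. The `missing_channels` helper illustrates one way reference annotations could be used to audit channel usage.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass(frozen=True)
class Provenance:
    url: str
    commit_hash: Optional[str] = None
    post_id: Optional[str] = None


@dataclass(frozen=True)
class AtomicOp:
    description: str
    channel: str  # assumed labels: "gui" or "cli"


@dataclass
class TaskBundle:
    task_id: str
    domain: str
    provenance: List[Provenance] = field(default_factory=list)
    reference_ops: List[AtomicOp] = field(default_factory=list)
    verification_anchors: List[str] = field(default_factory=list)

    def required_channels(self) -> Set[str]:
        """Channels that the expert reference trajectory relies on."""
        return {op.channel for op in self.reference_ops}


def missing_channels(bundle, used_channels):
    """Channels the reference requires that the agent never used."""
    return bundle.required_channels() - set(used_channels)


# Hypothetical bundle, for illustration only.
bundle = TaskBundle(
    task_id="example-001",
    domain="web development",
    provenance=[Provenance(url="https://example.com/issue/1", post_id="1")],
    reference_ops=[
        AtomicOp("inspect rendered layout in the browser", "gui"),
        AtomicOp("patch the stylesheet and rebuild", "cli"),
    ],
    verification_anchors=["build exits with status 0"],
)

print(missing_channels(bundle, ["cli", "cli"]))
```

In this example the agent used only the CLI, so the audit reports `gui` as missing.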

Experiment

The evaluation compares diverse model APIs and agent runtimes to identify the strongest pairings, while dedicated ablations test whether the hybrid GUI-CLI interface is strictly necessary and how much trajectory-aware judging matters. Results show that cooperative multi-channel execution is required for task completion: single-interface setups collapse to near-zero performance, unlike prior benchmarks where hybrid access merely offers convenience. Qualitative failure analysis finds that breakdowns stem primarily from weak long-horizon planning discipline and from reward hacking rather than from visual perception, with distinct error patterns emerging consistently across model families. Overall, the work argues that precise model-runtime alignment and rigorous trajectory auditing are essential for accurately measuring and advancing frontier agent capabilities.
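The gap between outcome-only and trajectory-aware grading can be illustrated with a toy calculation. This is a deliberately simplified sketch, not the benchmark's judge: it assumes each run reduces to two booleans (`outcome_ok`, `shortcut_detected`), whereas the judge described above scores rubric dimensions. The field names and the sample results are hypothetical.

```python
def pass_rates(results):
    """Return (outcome-only pass rate, trajectory-audited pass rate)."""
    n = len(results)
    outcome_only = sum(1 for r in results if r["outcome_ok"]) / n
    audited = sum(
        1 for r in results if r["outcome_ok"] and not r["shortcut_detected"]
    ) / n
    return outcome_only, audited


# Invented results: one run produces the right deliverable via a shortcut.
results = [
    {"outcome_ok": True, "shortcut_detected": False},
    {"outcome_ok": True, "shortcut_detected": True},
    {"outcome_ok": False, "shortcut_detected": False},
    {"outcome_ok": True, "shortcut_detected": False},
]

outcome_only, audited = pass_rates(results)
print(f"outcome-only: {outcome_only:.1%}, audited: {audited:.1%}")
```

Under this simplification, any run that reaches the deliverable through a detected shortcut counts as a pass for the outcome-only grader but not for the audited one, which is the direction of inflation the summary describes.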

WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces
