Claw-Eval Real-World Benchmark Dataset

Date: 2 months ago
Paper: arXiv:2604.06132
License: MIT

Dataset composition:

General: 161 core agent tasks across 24 categories, including communications, finance, operations, and office productivity.
Multimodal: 101 multimodal agent tasks covering scenarios such as webpage generation, video question answering, and document information extraction.
Multi-turn dialogue: 38 multi-turn dialogue tasks in which the agent interacts with simulated users over multiple rounds to clarify their needs and generate suggestions.
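
As a quick check on the composition above, the following sketch tallies the three subsets and prints each one's share of the whole. It assumes the subsets are disjoint, which the page does not state explicitly; the variable names are illustrative only.

```python
# Task counts per subset, copied from the composition list above.
composition = {
    "General": 161,
    "Multimodal": 101,
    "Multi-turn dialogue": 38,
}

# Assumption: the subsets do not share tasks, so the counts can be summed.
total = sum(composition.values())

for name, count in composition.items():
    print(f"{name}: {count} tasks ({count / total:.1%} of {total})")
```

Under that assumption, the benchmark would contain 300 tasks in total.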

Data fields:

task_id: Unique identifier for the task
query: Task instructions or description
fixture: List of auxiliary files required for the task
language: Task language
Category: The domain or category to which the task belongs
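
The fields above could be mirrored in a small typed record when working with the tasks in code. This is a minimal sketch, not the dataset's official loader: the field names are taken from the list above (including the capitalised `Category` key as it is written there), while the Python types, the `TaskRecord` and `parse_task` names, and the sample record are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any

# Field names follow the list above; their types are assumptions.
REQUIRED_FIELDS = ("task_id", "query", "fixture", "language", "Category")


@dataclass(frozen=True)
class TaskRecord:
    task_id: str
    query: str
    fixture: tuple[str, ...]  # auxiliary files; assumed to be file names or paths
    language: str
    category: str


def parse_task(raw: dict[str, Any]) -> TaskRecord:
    """Validate one raw record and convert it to a TaskRecord."""
    missing = [name for name in REQUIRED_FIELDS if name not in raw]
    if missing:
        raise ValueError(f"record is missing fields: {missing}")
    return TaskRecord(
        task_id=str(raw["task_id"]),
        query=str(raw["query"]),
        fixture=tuple(raw["fixture"] or ()),  # tolerate None or an empty list
        language=str(raw["language"]),
        category=str(raw["Category"]),
    )


# Hypothetical record, invented for illustration; not taken from the dataset.
example = parse_task({
    "task_id": "demo-001",
    "query": "Summarise the attached report.",
    "fixture": ["report.pdf"],
    "language": "en",
    "Category": "office productivity",
})
print(example)
```

Failing early on a missing field keeps schema differences visible; if the published files use a different key casing or different types, only `REQUIRED_FIELDS` and `parse_task` would need adjusting.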

Citation

@article{ye2026claw,
title={Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents},
author={Ye, Bowen and Li, Rang and Yang, Qibin and Liu, Yuanxin and Yao, Linli and Lv, Hanglong and Xie, Zhihui and An, Chenxin and Li, Lei and Kong, Lingpeng and others},
journal={arXiv preprint arXiv:2604.06132},
year={2026}
}

Related Datasets

MemLens Multimodal Long Context Benchmark Dataset

19 days ago

VisCoR-55K Visual Inference Dataset

a month ago

AgentTrove Intelligent Agent Interaction Trajectory Dataset

a month ago

LongBlocks Long Context Multilingual Question Answering Dataset

a month ago

MathNet Multimodal Mathematical Benchmark Inference Dataset

a month ago

Fundus Eye Disease Classification Dataset

a month ago

Long-Distance Wildfire & Smoke Detection Dataset

a month ago

QCalEval Quantum Calibration Graph Understanding Dataset

2 months ago

RSRCC Remote Sensing Area Change Understanding Benchmark Dataset

a day ago

PanScale Remote Sensing Pancolor Sharpening Dataset

2 months ago

ParseBench Document Parsing Capability Evaluation Dataset

2 months ago

OpenMementos Context Memory Compressed Dataset

2 months ago

MIA Multistep Inference and Decision Trajectory Dataset

2 months ago

OmniParsingBench Multimodal Parsing Capability Evaluation Dataset

a day ago

MDPBench Multilingual Document Parsing Benchmark Dataset

a day ago
