Claw-Eval Real-World Benchmark Dataset
Claw-Eval is an end-to-end, transparent benchmark dataset for evaluating AI agents on real-world tasks, released in 2026 by Peking University in collaboration with the University of Hong Kong. The accompanying paper is "Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents". The benchmark evaluates how well autonomous agents perform tasks, invoke tools, understand multimodal inputs, and interact with real-world environments. It is used for agent system evaluation, automated task execution, multimodal agent research, and large model capability analysis.
The dataset supports English and Chinese and includes three core task groups: General, Multimodal, and Multi-turn, covering a total of 24 task categories such as communication, finance, office, and productivity tools.
Dataset composition:
- General: Contains 161 core agent tasks, covering 24 categories including communications, finance, operations, and office productivity.
- Multimodal: Includes 101 multimodal agent tasks, covering scenarios such as webpage generation, video question answering, and document information extraction.
- Multi-turn dialogue: Contains 38 multi-turn dialogue tasks that require the agent to interact with simulated users over multiple rounds to clarify needs and generate suggestions.
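The three group sizes listed above add up to 300 tasks. As a minimal sketch of that arithmetic, the snippet below totals the counts and computes each group's share. The dictionary name `TASK_COUNTS` and the helper `group_shares` are hypothetical names chosen for illustration and are not part of the dataset.

```python
# Task counts per group, copied from the dataset composition list.
# TASK_COUNTS and group_shares are illustrative names, not dataset fields.
TASK_COUNTS = {
    "General": 161,
    "Multimodal": 101,
    "Multi-turn": 38,
}


def group_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each group's share of the total, as a percentage rounded to one decimal."""
    total = sum(counts.values())
    return {group: round(100 * n / total, 1) for group, n in counts.items()}


if __name__ == "__main__":
    print("total tasks:", sum(TASK_COUNTS.values()))
    for group, share in group_shares(TASK_COUNTS).items():
        print(f"{group}: {share}%")
```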
Data fields:
- task_id: Unique identifier for the task
- query: Task instructions or task description
- fixture: List of auxiliary files required for the task
- language: Task language
- category: The domain or category to which the task belongs
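The fields above can be modeled as a small typed record. The following is a sketch under stated assumptions, not the dataset's official loader: the class name `ClawEvalTask`, the function `parse_task`, the Python types chosen for each field, and the sample values are all hypothetical. Keys are lowercased before lookup so that either capitalization of the category field is accepted.

```python
from dataclasses import dataclass, field
from typing import Any

# Field names taken from the data fields list; the types are assumptions.
REQUIRED_FIELDS = ("task_id", "query", "fixture", "language", "category")


@dataclass
class ClawEvalTask:
    """Hypothetical in-memory representation of one task record."""
    task_id: str
    query: str
    fixture: list[str] = field(default_factory=list)
    language: str = ""
    category: str = ""


def parse_task(record: dict[str, Any]) -> ClawEvalTask:
    """Validate a raw record and convert it to a ClawEvalTask."""
    # Normalize key casing so "Category" and "category" are treated alike.
    normalized = {key.lower(): value for key, value in record.items()}
    missing = [name for name in REQUIRED_FIELDS if name not in normalized]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return ClawEvalTask(
        task_id=str(normalized["task_id"]),
        query=str(normalized["query"]),
        fixture=list(normalized["fixture"] or []),
        language=str(normalized["language"]),
        category=str(normalized["category"]),
    )


if __name__ == "__main__":
    # Made-up sample record for illustration only; not taken from the dataset.
    sample = {
        "task_id": "example-001",
        "query": "Summarize the attached report.",
        "fixture": ["report.pdf"],
        "language": "en",
        "Category": "office",
    }
    print(parse_task(sample))
```

Validating records up front makes a missing field fail loudly at load time rather than partway through an evaluation run.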
Citation
@article{ye2026claw,
title={Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents},
author={Ye, Bowen and Li, Rang and Yang, Qibin and Liu, Yuanxin and Yao, Linli and Lv, Hanglong and Xie, Zhihui and An, Chenxin and Li, Lei and Kong, Lingpeng and others},
journal={arXiv preprint arXiv:2604.06132},
year={2026}
}