
Abstract
Scaling test-time compute has shown remarkable success in improving the reasoning abilities of large language models (LLMs). In this work, we conduct the first systematic exploration of applying test-time scaling methods to language agents and investigate the extent to which it improves their effectiveness. Specifically, we explore different test-time scaling strategies, including: (1) parallel sampling algorithms; (2) sequential revision strategies; (3) verifiers and merging methods; (4) strategies for diversifying rollouts. We carefully analyze and ablate the impact of different design strategies when applying test-time scaling to language agents, and arrive at the following findings: (1) scaling test-time compute can improve agent performance; (2) knowing when to reflect is important for agents; (3) among the verification and result-merging approaches, the list-wise method performs best; (4) increasing the diversity of rollouts has a positive effect on the agent's task performance.
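
To make the strategies above concrete, here is a minimal, hypothetical sketch of test-time scaling for a language agent: several rollouts are sampled in parallel with varied temperatures (a simple diversification knob), and a list-wise verifier prompt picks the best candidate. The `call_llm` and `run_agent` functions are placeholders for a real LLM API and agent loop; this is not taken from the paper's released code.

```python
def call_llm(prompt: str, temperature: float = 1.0) -> str:
    """Placeholder for an actual LLM API call (e.g., an OpenAI-compatible endpoint)."""
    raise NotImplementedError

def run_agent(task: str, temperature: float) -> str:
    """Placeholder: run one agent rollout and return its final answer/trajectory."""
    return call_llm(f"Solve the task step by step:\n{task}", temperature=temperature)

def listwise_verify(task: str, candidates: list[str]) -> int:
    """List-wise verification: show all candidates at once and ask the LLM for the best index."""
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    reply = call_llm(
        f"Task:\n{task}\n\nCandidate solutions:\n{numbered}\n\n"
        "Reply with the index of the best candidate only.",
        temperature=0.0,
    )
    digits = "".join(ch for ch in reply if ch.isdigit())
    return int(digits) if digits else 0

def scale_test_time_compute(task: str, k: int = 8) -> str:
    """Parallel sampling: draw k diversified rollouts, then merge via list-wise verification."""
    temperatures = [0.5 + 0.1 * i for i in range(k)]  # vary temperature to diversify rollouts
    rollouts = [run_agent(task, t) for t in temperatures]
    return rollouts[listwise_verify(task, rollouts)]
```

A sequential-revision variant would instead feed each rollout (plus verifier feedback) back into the agent for another pass, trading parallel samples for iterative refinement.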