Agentic Entropy-Balanced Policy Optimization (AEPO)
Agentic Entropy-Balanced Policy Optimization (AEPO) was proposed in October 2025 by a joint research team from Renmin University of China and Kuaishou. The findings were published in the paper "Agentic Entropy-Balanced Policy Optimization".
AEPO is an agentic reinforcement learning (RL) algorithm designed to balance entropy in both the rollout phase and the policy update phase. It consists of two core components: (1) a dynamic entropy-balanced rollout mechanism that adaptively allocates the global and branch sampling budgets through entropy pre-monitoring, while imposing a branch penalty on consecutive high-entropy tool-call steps to prevent over-branching; and (2) Entropy-Balanced Policy Optimization, which inserts a stop-gradient operation into the high-entropy clipping term to preserve and properly rescale gradients on high-entropy tokens, while incorporating entropy-aware advantage estimation to prioritize learning on high-uncertainty tokens. Results on 14 challenging datasets show that AEPO consistently outperforms 7 mainstream RL algorithms.
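The rollout-side ideas and the entropy-aware advantage can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the proportional budget split, the linear branch penalty, and the coefficients `base`, `alpha`, `penalty`, and `beta` are all assumptions chosen for illustration. The stop-gradient clipping term is omitted because it needs an autograd framework.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def split_budget(total, root_entropy, tool_entropy):
    """Split `total` rollout samples into (global, branch) budgets.

    Assumed rule (illustrative only): the share given to branching grows
    with the entropy observed after tool calls relative to the entropy of
    the initial question. At least one global sample is always kept.
    """
    denom = root_entropy + tool_entropy
    branch_share = tool_entropy / denom if denom > 0 else 0.5
    branch = min(total - 1, round(total * branch_share))
    return total - branch, branch


def branch_probability(entropy_gain, consecutive_branches,
                       base=0.5, alpha=0.3, penalty=0.2):
    """Probability of branching at a tool-call step.

    Assumed form: a higher entropy gain raises the probability, while each
    consecutive high-entropy branch already taken lowers it, which
    discourages over-branching along a single trajectory.
    """
    raw = base + alpha * entropy_gain - penalty * consecutive_branches
    return min(1.0, max(0.0, raw))


def entropy_aware_advantage(advantage, entropy, mean_entropy, beta=0.2):
    """Rescale an advantage so above-average-entropy tokens weigh more.

    Assumed form: a linear bonus in the entropy deviation from the mean.
    """
    return advantage * (1.0 + beta * (entropy - mean_entropy))


if __name__ == "__main__":
    print(split_budget(16, root_entropy=1.0, tool_entropy=3.0))
    print(branch_probability(1.0, consecutive_branches=0))
    print(branch_probability(1.0, consecutive_branches=2))
    print(entropy_aware_advantage(1.0, entropy=2.0, mean_entropy=1.0))
```

In this sketch, a trajectory that has already branched at several consecutive high-entropy steps sees its branching probability fall toward zero, so the remaining budget is spent on other trajectories.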