
Efficient Reasoning with Balanced Thinking

Yulin Li Tengyao Tu Li Ding Junjie Wang Huiling Zhen Yixin Chen Yong Li Zhuotao Tian

Abstract

Large Reasoning Models (LRMs) have shown remarkable reasoning capabilities, yet they often suffer from overthinking, expending redundant computational steps on simple problems, or underthinking, failing to explore sufficient reasoning paths despite inherent capabilities. These issues lead to inefficiencies and potential inaccuracies, limiting practical deployment in resource-constrained settings. Existing methods to mitigate overthinking, such as suppressing reflective keywords or adjusting reasoning length, may inadvertently induce underthinking, compromising accuracy. Therefore, we propose ReBalance, a training-free framework that achieves efficient reasoning with balanced thinking. ReBalance leverages confidence as a continuous indicator of reasoning dynamics, identifying overthinking through high confidence variance and underthinking via consistent overconfidence. By aggregating hidden states from a small-scale dataset into reasoning mode prototypes, we compute a steering vector to guide LRMs' reasoning trajectories. A dynamic control function modulates this vector's strength and direction based on real-time confidence, pruning redundancy during overthinking and promoting exploration during underthinking. Extensive experiments conducted on four models ranging from 0.5B to 32B, and across nine benchmarks in math reasoning, general question answering, and coding tasks, demonstrate that ReBalance effectively reduces output redundancy while improving accuracy, offering a general, training-free, and plug-and-play strategy for efficient and robust LRM deployment. Code is available at https://github.com/yu-lin-li/ReBalance.

One-sentence Summary

Researchers from Harbin Institute of Technology and collaborating institutes propose ReBalance, a training-free framework that uses confidence-based steering vectors to dynamically balance reasoning depth. This approach effectively mitigates overthinking and underthinking in Large Reasoning Models, enhancing accuracy and efficiency across math, coding, and general question-answering benchmarks without requiring fine-tuning.

Key Contributions

  • The paper introduces ReBalance, a training-free framework that achieves efficient reasoning by leveraging confidence as a continuous indicator to identify overthinking through high variance and underthinking via consistent overconfidence.
  • A steering vector is computed by aggregating hidden states into reasoning mode prototypes, which a dynamic control function modulates in real-time to prune redundancy or promote exploration based on the model's confidence levels.
  • Extensive experiments across four models ranging from 0.5B to 32B and nine benchmarks demonstrate that the method effectively reduces output redundancy while simultaneously improving accuracy in math reasoning, general question answering, and coding tasks.

Introduction

Large Reasoning Models (LRMs) excel at complex tasks but often suffer from inefficiency due to overthinking on simple problems or underthinking on difficult ones, which hinders their deployment in resource-constrained environments. Prior attempts to fix overthinking by suppressing reflection or shortening reasoning chains frequently backfire by inducing underthinking, leading to premature and inaccurate conclusions. The authors leverage confidence as a continuous signal to distinguish between these two states and propose ReBalance, a training-free framework that dynamically steers the model's hidden states to prune redundancy during overthinking while encouraging exploration during underthinking.

Dataset

  • The authors curate a diverse evaluation suite spanning mathematics, science, and coding, drawing from established benchmarks like MATH-500, AIME, GSM8K, GPQA DIAMOND, and LIVECODEBENCH.
  • The dataset composition includes three difficulty tiers: simple sets like GSM8K (1,319 problems) and AMC23 (40 problems); moderate sets like MATH-500 (500 problems); and hard sets including AIME24/AIME25 (30 problems each), GPQA DIAMOND (198 problems), OLYMPIADBENCH (675 problems), and LIVECODEBENCH v1 (400 problems).
  • Specific filtering and sourcing rules apply to each subset, such as using the official 2024/2025 AIME cycles, selecting expert-authored graduate-level questions for GPQA, and ensuring contamination awareness in LIVECODEBENCH by using version v1 with execution-based unit tests.
  • For training and evaluation, the authors utilize standard splits where available, such as the ~7.5k training and ~1.3k test split for GSM8K, while treating other benchmarks as held-out test sets to assess reasoning capabilities.
  • The processing pipeline applies a unified prompt template across all math-related subsets, instructing the model to reason step by step and format the final answer within a boxed notation.
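The unified template described above can be sketched as follows. The exact wording is an assumption on our part; the paper only specifies step-by-step reasoning and a final answer in boxed notation, and the helper name `build_prompt` is hypothetical.

```python
# Minimal sketch of a unified math prompt template. The wording is
# illustrative; the source only requires step-by-step reasoning and a
# \boxed{} final answer.
MATH_TEMPLATE = (
    "{question}\n"
    "Please reason step by step, and put your final answer within \\boxed{{}}."
)

def build_prompt(question: str) -> str:
    """Apply the shared template to one problem from any math subset."""
    return MATH_TEMPLATE.format(question=question)
```

Because a single template is reused across GSM8K, MATH-500, AIME, and the other math subsets, answer extraction stays uniform: the evaluator only has to parse the `\boxed{}` span.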

Method

The authors propose ReBalance, a training-free framework designed to dynamically balance overthinking and underthinking in Large Reasoning Models (LRMs) to improve efficiency without compromising accuracy. The framework operates through a two-stage process involving offline data collection and online inference with dynamic steering. Refer to the framework diagram for a comprehensive overview of the system architecture.

To effectively control the reasoning process, the method first explicitly models reasoning states prone to overthinking or underthinking using stepwise confidence and confidence variance. Overthinking is identified as a state characterized by low confidence and high variance, reflecting unstable or oscillating reasoning trajectories. Conversely, underthinking is defined by persistently high confidence and low variance, indicating premature convergence. Refer to the examples illustrating these distinct reasoning behaviors and the target balanced state.
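These two signals are straightforward to compute from token log-probabilities. The sketch below follows the definitions above; the thresholds `high_conf` and `high_var` are illustrative assumptions, not values from the paper.

```python
import math

def step_confidence(token_logprobs):
    """Mean token log-probability of one reasoning step, mapped to (0, 1]."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def confidence_variance(step_confidences):
    """Variance of stepwise confidence over the trajectory so far."""
    mean = sum(step_confidences) / len(step_confidences)
    return sum((c - mean) ** 2 for c in step_confidences) / len(step_confidences)

def classify_state(confidences, high_conf=0.9, high_var=0.02):
    """Heuristic labels matching the definitions above.
    Thresholds are illustrative assumptions, not values from the paper."""
    var = confidence_variance(confidences)
    mean = sum(confidences) / len(confidences)
    if var > high_var and mean < high_conf:
        return "overthinking"   # unstable, oscillating trajectory
    if mean >= high_conf and var <= high_var:
        return "underthinking"  # premature overconfident convergence
    return "balanced"
```

An oscillating trajectory such as `[0.3, 0.9, 0.4, 0.85]` is flagged as overthinking, while a flat, overconfident one such as `[0.95, 0.96, 0.97]` is flagged as underthinking.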

The framework extracts steering vectors from the hidden states of the LRM to guide the model away from these undesirable modes. During the offline stage, a one-pass data collection is performed on a small-scale seen dataset to identify prototypes for overthinking and underthinking. The authors analyze the linear decodability of confidence signals across layers to automatically select the optimal deep layer for intervention, as visualized in the layer-wise R² analysis. The steering vector is then constructed as the normalized difference between the overthinking and underthinking prototypes, establishing a direction in the latent space for behavior modulation.
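The prototype-and-difference construction can be sketched in a few lines. This is a minimal illustration, assuming the hidden states have already been collected from the selected deep layer; the function name is hypothetical.

```python
import numpy as np

def build_steering_vector(over_hidden, under_hidden):
    """Sketch of the steering-vector construction: each mode's prototype is
    the mean of its collected hidden states, and the steering vector is the
    normalized difference between the two prototypes.

    over_hidden, under_hidden: arrays of shape (n_samples, hidden_dim),
    gathered offline from the selected deep layer.
    """
    proto_over = np.mean(over_hidden, axis=0)    # overthinking prototype
    proto_under = np.mean(under_hidden, axis=0)  # underthinking prototype
    diff = proto_over - proto_under
    return diff / np.linalg.norm(diff)           # unit direction in latent space
```

Normalizing the difference keeps the direction fixed so that only the online control function determines how strongly (and with which sign) the model is steered.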

During online inference, a dynamic control function adaptively modulates the steering strength and direction based on real-time model states. This function takes the current stepwise confidence and confidence variance as inputs to compute a steering weight. The weight is designed to push the model's state away from the nearest reasoning boundary, ensuring the trajectory remains within a balanced region. Refer to the visualization of the control function surface, which demonstrates how the steering strength varies non-linearly based on confidence and variance levels to mitigate both overthinking and underthinking.
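A minimal sketch of such a control step is shown below. The functional form, thresholds, and sign convention (positive weight pushes toward exploration, negative weight prunes redundancy, assuming a steering vector oriented from the underthinking prototype toward the overthinking one) are all illustrative assumptions; the paper's exact control surface is non-linear and not reproduced here.

```python
import numpy as np

def steering_weight(conf, var, high_conf=0.9, high_var=0.02, alpha=1.0):
    """Illustrative control function: the weight grows with distance past the
    nearest reasoning boundary and is near zero in the balanced region.
    Thresholds and scaling are assumptions, not values from the paper."""
    if var > high_var:                  # overthinking signature: damp redundancy
        return -alpha * (var - high_var) / high_var
    if conf > high_conf:                # underthinking signature: push exploration
        return alpha * (conf - high_conf) / (1.0 - high_conf)
    return 0.0                          # balanced region: no intervention

def apply_steering(hidden, steer_vec, conf, var):
    """Add the weighted steering vector to the selected layer's hidden state."""
    return hidden + steering_weight(conf, var) * steer_vec
```

In a real deployment this update would run inside a forward hook at the selected layer, with `conf` and `var` refreshed after each reasoning step.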

Experiment

  • Analysis of reasoning length distributions reveals that existing overthinking mitigation methods often induce underthinking by prematurely truncating necessary steps, whereas the proposed ReBalance method achieves a balanced reduction that preserves accuracy while shortening outputs.
  • Experiments demonstrate that confidence variance and step-level confidence serve as reliable indicators for distinguishing between overthinking (high variance, low confidence) and underthinking (persistently high confidence), enabling fine-grained behavioral control without auxiliary models.
  • Evaluations across diverse benchmarks in mathematics, science, code, and commonsense reasoning confirm that ReBalance significantly reduces token usage and inference latency while improving or maintaining Pass@1 accuracy, outperforming prompt-based and external verifier-based baselines.
  • Ablation studies validate that dynamic control based on confidence signals is superior to static adjustments, and that steering vectors extracted from medium-difficulty datasets generalize effectively across different domains and model sizes.
  • Additional tests on NPU devices and creative writing tasks show that the method maintains robust performance on specialized hardware and preserves or enhances the model's creative expressiveness and linguistic diversity.
