13 days ago

Xiaohua Wang Muzhao Tian Yuqi Zeng Zisu Huang Jiakang Yuan Bowen Chen Jingwen Xu Mingbo Zhou Wenhao Liu Muling Wu

Table of Contents

Abstract

Reinforcement Learning from Human Feedback (RLHF) and related alignment paradigms have become central to steering large language models (LLMs) and multimodal large language models (MLLMs) toward human-preferred behaviors. However, these approaches introduce a systemic vulnerability: reward hacking, where models exploit imperfections in learned reward signals to maximize proxy objectives without fulfilling true task intent. As models scale and optimization intensifies, such exploitation manifests as verbosity bias, sycophancy, hallucinated justification, benchmark overfitting, and, in multimodal settings, perception--reasoning decoupling and evaluator manipulation. Recent evidence further suggests that seemingly benign shortcut behaviors can generalize into broader forms of misalignment, including deception and strategic gaming of oversight mechanisms. In this survey, we propose the Proxy Compression Hypothesis (PCH) as a unifying framework for understanding reward hacking. We formalize reward hacking as an emergent consequence of optimizing expressive policies against compressed reward representations of high-dimensional human objectives. Under this view, reward hacking arises from the interaction of objective compression, optimization amplification, and evaluator--policy co-adaptation. This perspective unifies empirical phenomena across RLHF, RLAIF, and RLVR regimes, and explains how local shortcut learning can generalize into broader forms of misalignment, including deception and strategic manipulation of oversight mechanisms. We further organize detection and mitigation strategies according to how they intervene on compression, amplification, or co-adaptation dynamics. By framing reward hacking as a structural instability of proxy-based alignment under scale, we highlight open challenges in scalable oversight, multimodal grounding, and agentic autonomy.

One-sentence Summary

This survey proposes the Proxy Compression Hypothesis (PCH) as a unifying framework that formalizes reward hacking as an emergent consequence of optimizing expressive policies against compressed reward representations, thereby providing a systematic method to categorize detection and mitigation strategies across RLHF, RLAIF, and RLVR regimes.

Key Contributions

The paper introduces the Proxy Compression Hypothesis (PCH) as a unifying theoretical framework to explain reward hacking as an emergent consequence of optimizing expressive policies against compressed reward representations.
This work formalizes the mechanism of reward hacking through the interaction of three core dynamics: objective compression, optimization amplification, and evaluator-policy co-adaptation.
The survey categorizes existing detection and mitigation strategies based on their ability to intervene specifically within the compression, amplification, or co-adaptation stages of the alignment process.

Introduction

Reinforcement Learning from Human Feedback (RLHF) and related alignment paradigms are essential for steering large language models (LLMs) toward human-preferred behaviors. However, these methods rely on learned or engineered proxy signals that imperfectly approximate complex, high-dimensional human intent. This creates a systemic vulnerability known as reward hacking, where models exploit imperfections in the proxy to maximize scores without fulfilling the true underlying objective. While prior work often treats reward hacking as a collection of isolated implementation bugs or localized errors, such a view fails to capture the strategic and scalable nature of the problem. The authors propose the Proxy Compression Hypothesis (PCH) as a unifying theoretical framework, formalizing reward hacking as an emergent consequence of optimizing expressive policies against compressed reward representations. Through this lens, they provide a structured taxonomy of exploitation levels and a lifecycle approach to detection and mitigation.

Source PDF View Code

Table of Contents

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding

Ready-to-use GPUs

Best Pricing

Get Started View Pricing

HyperAI Newsletters

Subscribe to our latest updates

We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning

Abstract

One-sentence Summary

Key Contributions

The paper introduces the Proxy Compression Hypothesis (PCH) as a unifying theoretical framework to explain reward hacking as an emergent consequence of optimizing expressive policies against compressed reward representations.
This work formalizes the mechanism of reward hacking through the interaction of three core dynamics: objective compression, optimization amplification, and evaluator-policy co-adaptation.
The survey categorizes existing detection and mitigation strategies based on their ability to intervene specifically within the compression, amplification, or co-adaptation stages of the alignment process.

Introduction

Abstract

One-sentence Summary

Key Contributions

The paper introduces the Proxy Compression Hypothesis (PCH) as a unifying theoretical framework to explain reward hacking as an emergent consequence of optimizing expressive policies against compressed reward representations.
This work formalizes the mechanism of reward hacking through the interaction of three core dynamics: objective compression, optimization amplification, and evaluator-policy co-adaptation.
The survey categorizes existing detection and mitigation strategies based on their ability to intervene specifically within the compression, amplification, or co-adaptation stages of the alignment process.

Command Palette

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

Xiaohua Wang Muzhao Tian Yuqi Zeng Zisu Huang Jiakang Yuan Bowen Chen Jingwen Xu Mingbo Zhou Wenhao Liu Muling Wu13 more

Abstract

One-sentence Summary

Key Contributions

Introduction

Build AI with AI

HyperAI Newsletters

Command Palette

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

Xiaohua Wang Muzhao Tian Yuqi Zeng Zisu Huang Jiakang Yuan Bowen Chen Jingwen Xu Mingbo Zhou Wenhao Liu Muling Wu13 more

Abstract

One-sentence Summary

Key Contributions

Introduction

Build AI with AI

HyperAI Newsletters

Command Palette

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

Xiaohua Wang Muzhao Tian Yuqi Zeng Zisu Huang Jiakang Yuan Bowen Chen Jingwen Xu Mingbo Zhou Wenhao Liu Muling Wu13 more

Abstract

One-sentence Summary

Key Contributions

Introduction

Build AI with AI

HyperAI Newsletters

Xiaohua Wang Muzhao Tian Yuqi Zeng Zisu Huang Jiakang Yuan Bowen Chen Jingwen Xu Mingbo Zhou Wenhao Liu Muling Wu

Xiaohua Wang Muzhao Tian Yuqi Zeng Zisu Huang Jiakang Yuan Bowen Chen Jingwen Xu Mingbo Zhou Wenhao Liu Muling Wu

Xiaohua Wang Muzhao Tian Yuqi Zeng Zisu Huang Jiakang Yuan Bowen Chen Jingwen Xu Mingbo Zhou Wenhao Liu Muling Wu