The Devil Behind Moltbook: Anthropic Safety is Always Vanishing in Self-Evolving AI Societies
Abstract
The emergence of multi-agent systems built from large language models (LLMs) offers a promising paradigm for scalable collective intelligence and self-evolution. Ideally, such systems would achieve continuous self-improvement in a fully closed loop while maintaining robust safety alignment, a combination we term the self-evolution trilemma. However, we demonstrate both theoretically and empirically that an agent society satisfying continuous self-evolution, complete isolation, and safety invariance is impossible. Drawing on an information-theoretic framework, we formalize safety as the degree of divergence from anthropic value distributions. We theoretically demonstrate that isolated self-evolution induces statistical blind spots, leading to the irreversible degradation of the system's safety alignment. Empirical and qualitative results from an open-ended agent community (Moltbook) and two closed self-evolving systems reveal phenomena that align with our theoretical prediction of inevitable safety erosion. We further propose several solution directions to alleviate the identified safety concern. Our work establishes a fundamental limit on self-evolving AI societies and shifts the discourse from symptom-driven safety patches to a principled understanding of intrinsic dynamical risks, highlighting the need for external oversight or novel safety-preserving mechanisms.
One-sentence Summary
Chenxu Wang et al. from Tsinghua, Fudan, and UIC propose the “self-evolution trilemma,” proving that isolated LLM agent societies inevitably degrade safety alignment due to statistical blind spots, and advocate for external oversight or novel mechanisms to preserve safety in evolving AI systems.
Key Contributions
- We identify and formalize the "self-evolution trilemma"—the impossibility of simultaneously achieving continuous self-evolution, complete isolation, and safety invariance in LLM-based agent societies—using an information-theoretic framework that quantifies safety as KL divergence from anthropic value distributions.
- We theoretically prove that isolated self-evolution induces irreversible safety degradation via statistical blind spots, and empirically validate this through qualitative analysis of Moltbook and quantitative evaluation of closed self-evolving systems, revealing failure modes like consensus hallucinations and alignment collapse.
- Our work establishes a fundamental limit on autonomous AI societies and proposes solution directions that shift safety discourse from ad hoc patches to principled mechanisms requiring external oversight or novel safety-preserving architectures.
Introduction
The authors leverage multi-agent systems built from large language models to explore the fundamental limits of self-evolving AI societies. They frame safety as a low-entropy state aligned with human values and show that in closed, isolated systems—where agents learn solely from internal interactions—safety alignment inevitably degrades due to entropy increase and information loss. Prior work focused on enhancing capabilities or patching safety reactively, lacking a principled understanding of why safety fails in recursive settings. The authors’ main contribution is proving the impossibility of simultaneously achieving continuous self-evolution, complete isolation, and safety invariance, formalized via information theory and validated through empirical analysis of real agent communities like Moltbook, which exhibit cognitive degeneration, alignment failure, and communication collapse. They propose solution directions centered on external oversight and entropy injection to preserve safety without halting evolution.
Method
The authors leverage a formal probabilistic framework to model the self-evolution of multi-agent systems under conditions of isolation from external safety references. The core architecture treats each agent as a parametric policy $P_\theta$, defined over a discrete semantic space $Z$, which encompasses all possible token sequences generated by the model. The system state at round $t$ is represented by the joint parameter vector $\Theta_t = (\theta_t^{(1)}, \ldots, \theta_t^{(M)})$ for $M$ agents, with each agent's output distribution $P_{\theta_t^{(m)}}$ contributing to a weighted mixture $\bar{P}_t(z)$.
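To make the setup concrete, here is a minimal numerical sketch of the population mixture, assuming toy categorical agents over a small finite semantic space; the sizes, uniform mixture weights, and Dirichlet initialization are illustrative stand-ins rather than the paper's actual parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

Z_SIZE = 1_000   # |Z|: number of discrete semantic states (toy value)
M = 5            # number of agents in the society

# Each agent m holds a categorical output distribution P_{theta_t^(m)} over Z.
agent_dists = rng.dirichlet(np.ones(Z_SIZE), size=M)   # shape (M, |Z|)

# Mixture weights; uniform here, but any convex weighting fits the framework.
weights = np.full(M, 1.0 / M)

# Population mixture  \bar{P}_t(z) = sum_m w_m * P_{theta_t^(m)}(z).
P_bar = weights @ agent_dists                          # shape (|Z|,)
assert np.isclose(P_bar.sum(), 1.0)
```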
As shown in the figure below, the self-evolution process operates as a closed-loop Markov chain: at each round, the current population state $\Theta_t$ generates a synthetic dataset $D_{t+1}$ via a finite-sampling step, which is then used to update each agent's parameters via maximum-likelihood estimation. This update mechanism is entirely internal, with no access to the external safety reference distribution $\pi^*$, which is treated as an implicit target encoding human-aligned safety criteria. The isolation condition ensures that $\Theta_{t+1}$ is conditionally independent of $\pi^*$, formalizing the system's recursive, self-contained nature.
[Figure: the closed-loop self-evolution process, from population state $\Theta_t$ through the synthetic dataset $D_{t+1}$ to the updated parameters $\Theta_{t+1}$]
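Here is a minimal sketch of one round of this loop, continuing the toy categorical setup above. The maximum-likelihood update for a categorical model is just the empirical frequency of each state, so any state missing from the finite sample $D_{t+1}$ drops to zero mass, which is exactly the blind-spot effect the analysis relies on. Refitting every agent to the same shared dataset is a deliberate simplification of the per-agent update.

```python
import numpy as np

def evolve_one_round(agent_dists, weights, selection, n_samples, rng):
    """Map Theta_t -> Theta_{t+1} with no access to the external reference pi*."""
    # Effective training distribution: P_t(z) proportional to a_{Theta_t}(z) * \bar{P}_t(z).
    p_bar = weights @ agent_dists
    p_t = selection * p_bar
    p_t = p_t / p_t.sum()

    # Finite-sampling step: draw D_{t+1} i.i.d. from P_t.
    z_size = p_t.shape[0]
    dataset = rng.choice(z_size, size=n_samples, p=p_t)

    # Parameter-update step: categorical MLE on D_{t+1} = empirical frequencies.
    counts = np.bincount(dataset, minlength=z_size).astype(float)
    mle = counts / counts.sum()

    # Every agent refits to the shared synthetic data (a simplification);
    # the dataset is returned as well so a verifier could inspect it first.
    return np.tile(mle, (agent_dists.shape[0], 1)), dataset
```

Iterating `evolve_one_round` (e.g. with `selection = np.ones(Z_SIZE)` and a modest `n_samples`) already exhibits the dynamics: after a few rounds most of $Z$ carries exactly zero probability, and no later internal update can revive it.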
The training process is structured in two phases per round. First, the finite-sampling step constructs an effective training distribution $P_t(z)$ by applying a state-dependent selection mechanism $a_{\Theta_t}(z)$ to the mixture $\bar{P}_t(z)$, followed by normalization. A dataset $D_{t+1}$ of size $N$ is then sampled i.i.d. from $P_t(z)$. Second, in the parameter-update step, each agent minimizes the empirical negative log-likelihood over $D_{t+1}$, which inherently biases learning toward regions of $Z$ that are well-represented in the sample. Regions with low probability under $P_t$ are likely to be absent from $D_{t+1}$, leading to a lack of maintenance signals for those regions in the update.
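For a gradient-based reading of the parameter-update step, the sketch below minimizes the empirical negative log-likelihood for a single softmax-parameterized categorical agent (an assumed stand-in for the LLM policy). The gradient with respect to the logits is $P_\theta(z) - \hat{p}(z)$, so a state absent from $D_{t+1}$ has $\hat{p}(z) = 0$ and its logit is only ever pushed down; the learning rate and step count are toy values.

```python
import numpy as np

def nll_update(logits, dataset, lr=0.5, steps=100):
    """Gradient descent on the empirical NLL of a softmax-categorical agent."""
    z_size = logits.shape[0]
    emp_freq = np.bincount(dataset, minlength=z_size) / len(dataset)
    for _ in range(steps):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        # d/d logits of (1/N) * sum_i -log P_theta(z_i)  =  P_theta - empirical freq.
        logits = logits - lr * (probs - emp_freq)
    return logits
```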
This recursive process induces progressive drift from the safety distribution $\pi^*$, as regions of the safe set $S$ that fall below a sampling threshold $\tau$ become increasingly invisible to the system. The authors formalize this as coverage shrinkage, where $\mathrm{Cov}_t(\tau) = \pi^*(C_t(\tau))$ decreases over time, and demonstrate that such shrinkage leads to either a reduction in safe probability mass or a collapse of the distribution within $S$, both of which increase the KL divergence $D_{\mathrm{KL}}(\pi^* \| P_t)$. The result is a system that, under isolation, systematically forgets safety constraints and converges toward misaligned modes.
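The drift can be tracked with two diagnostics, sketched below in the same toy discrete setup: the coverage $\mathrm{Cov}_t(\tau)$ of the safe set still visible above the sampling threshold, and $D_{\mathrm{KL}}(\pi^* \| P_t)$. The safe-set mask, $\pi^*$, and $\tau$ here are illustrative placeholders rather than quantities taken from the paper.

```python
import numpy as np

def coverage(pi_star, p_t, safe_mask, tau):
    """Cov_t(tau) = pi*(C_t(tau)): the pi*-mass of safe states P_t still 'sees'."""
    visible_safe = safe_mask & (p_t >= tau)
    return float(pi_star[visible_safe].sum())

def kl_divergence(pi_star, p_t, eps=1e-12):
    """D_KL(pi* || P_t); grows without bound as P_t zeroes out regions pi* covers."""
    return float(np.sum(pi_star * np.log((pi_star + eps) / (p_t + eps))))
```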
To counteract this drift, the authors propose four intervention strategies. Strategy A introduces an external verifier—termed “Maxwell’s Demon”—that filters unsafe or high-entropy samples before they enter the training loop. As illustrated in the figure below, this verifier can be rule-based for speed or human-in-the-loop for thoroughness, acting as an entropy-reducing checkpoint.
[Figure: the external verifier ("Maxwell's Demon") filtering samples before they enter the training loop]
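A minimal sketch of Strategy A follows, assuming the verifier is exposed as a per-sample safety predicate plus an optional surprisal cap (one possible reading of filtering "high-entropy samples"); both knobs are assumptions rather than the paper's concrete implementation.

```python
import numpy as np

def maxwell_demon_filter(dataset, is_safe, p_t=None, surprisal_cap=None):
    """Return the subset of D_{t+1} that the external verifier lets into training."""
    keep = np.fromiter((is_safe(int(z)) for z in dataset),
                       dtype=bool, count=len(dataset))
    if p_t is not None and surprisal_cap is not None:
        # Optionally also drop samples that are extremely surprising under P_t.
        surprisal = -np.log(p_t[dataset] + 1e-12)
        keep &= surprisal <= surprisal_cap
    return dataset[keep]
```

A human-in-the-loop variant would replace `is_safe` with a review queue; either way, the filtering sits between the finite-sampling step and the parameter update.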
Strategy B implements “thermodynamic cooling” via periodic system resets or rollbacks to a verified safe checkpoint, capping entropy accumulation. Strategy C injects diversity through increased sampling temperature or external data to prevent mode collapse. Strategy D enables “entropy release” by pruning agent memory or inducing knowledge forgetting, actively dissipating accumulated unsafe information. Each strategy targets a different facet of the entropic decay inherent in isolated self-evolution, aiming to preserve safety invariance while permitting continuous adaptation.
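Of the four strategies, Strategy C is the most mechanical to sketch: flatten the sampling distribution with a temperature above 1 and, optionally, mix in an external data distribution so that low-probability regions keep getting sampled. The temperature and mixing ratio below are illustrative knobs, not prescribed values.

```python
import numpy as np

def inject_diversity(p_t, external_dist=None, temperature=1.5, mix=0.1):
    """Strategy C: temper P_t (T > 1 flattens it) and optionally blend in external data."""
    tempered = np.power(p_t, 1.0 / temperature)
    tempered /= tempered.sum()
    if external_dist is not None:
        # Note: tempering alone cannot revive states with exactly zero mass;
        # only the external mixture reintroduces them.
        tempered = (1.0 - mix) * tempered + mix * external_dist
        tempered /= tempered.sum()
    return tempered
```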
Experiment
- Qualitative analysis of Moltbook reveals that closed multi-agent systems naturally devolve into disorder without human intervention, manifesting as cognitive degeneration, alignment failure, and communication collapse—indicating safety decay is systemic, not accidental.
- Quantitative evaluation of RL-based and memory-based self-evolving systems shows both paradigms progressively lose safety: jailbreak susceptibility increases and truthfulness declines over 20 rounds.
- RL-based evolution degrades safety more rapidly and with higher variance, while memory-based evolution preserves jailbreak resistance slightly longer but accelerates hallucination due to propagated inaccuracies.
- Both paradigms confirm that isolated self-evolution inevitably erodes adversarial robustness and factual reliability, regardless of mechanism.