Gated Attention
Gated Attention was proposed in May 2025 by the Alibaba Tongyi Qianwen team in collaboration with research teams from the University of Edinburgh, Stanford University, and other universities. The findings were published in the paper "Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free", which won the Best Paper Award at NeurIPS 2025.
The research team systematically investigated a series of gate-enhanced softmax attention variants through large-scale experiments (covering 30 variants on 15B MoE and 1.7B dense models trained on 3.5T tokens). The study found that applying a head-specific sigmoid gate after scaled dot-product attention (SDPA) consistently improves model performance. By evaluating these gating variants, the work clarifies how gating affects the performance and behavior of standard attention layers, showing that the gate introduces non-linearity, induces sparsity, and eliminates attention sinks. These findings deepen the field's understanding of gated attention mechanisms.
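To make the idea concrete, below is a minimal PyTorch sketch of the described technique: standard multi-head attention whose per-head SDPA output is multiplied elementwise by a sigmoid gate before the output projection. The module name, the choice to compute the gate as a linear projection of the layer input, and all hyperparameters are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedMultiHeadAttention(nn.Module):
    """Multi-head attention with a head-specific sigmoid gate on the SDPA output.

    Hypothetical sketch: the gate is a linear projection of the layer input,
    passed through a sigmoid and multiplied elementwise onto each head's
    attention output before the final output projection.
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.gate_proj = nn.Linear(d_model, d_model)  # gate values, one per head dimension
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, _ = x.shape

        def split_heads(h: torch.Tensor) -> torch.Tensor:
            # (batch, seq_len, d_model) -> (batch, n_heads, seq_len, d_head)
            return h.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = map(split_heads, (self.q_proj(x), self.k_proj(x), self.v_proj(x)))

        # Standard causal scaled dot-product attention.
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)

        # Head-specific sigmoid gate computed from the layer input,
        # applied elementwise to the SDPA output.
        gate = torch.sigmoid(split_heads(self.gate_proj(x)))
        attn = attn * gate

        # Merge heads and project back to model dimension.
        attn = attn.transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(attn)


# Example usage (illustrative shapes only):
# layer = GatedMultiHeadAttention(d_model=512, n_heads=8)
# out = layer(torch.randn(2, 16, 512))  # -> (2, 16, 512)
```

Because the sigmoid output lies in (0, 1), the gate can suppress individual head outputs toward zero, which is one intuitive way to see how it adds non-linearity and sparsity on top of the otherwise linear value aggregation.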