MiniMax Sparse Attention

Abstract

Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax attention makes this untenable at deployment scale. We introduce MiniMax Sparse Attention (MSA), a blockwise sparse attention built upon Grouped Query Attention (GQA). A lightweight Index Branch scores key-value blocks and independently selects a Top-k subset for each GQA group, enabling group-specific sparse retrieval while maintaining efficient block-level execution; the Main Branch then performs exact block-sparse attention over only the selected blocks. Designed around a principle of simplicity and scalability, MSA is deliberately streamlined, making it straightforward to deploy efficiently across a broad range of GPUs. To translate sparsity into practical speedups, we co-design MSA with a GPU execution path that uses exp-free Top-k selection and KV-outer sparse attention to improve tensor-core utilization under block-granular access. On a 109B-parameter model with native multimodal training, MSA performs on par with GQA while reducing per-token attention compute by 28.4x at 1M context. Paired with our co-designed kernel, MSA achieves 14.2x prefill and 7.6x decoding wall-clock speedups on H800. Our inference kernel is available at: https://github.com/MiniMax-AI/MSA. A production-grade natively multimodal model powered by MSA has been publicly released at: https://huggingface.co/MiniMaxAI/MiniMax-M3.
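As a rough illustration of the two-branch design described above, the following pure-Python sketch handles a single decoding step for one GQA group. The scoring rule (a dot product between a group-averaged query and mean-pooled block keys) and all function names are assumptions made for illustration only; the abstract states that the Index Branch is lightweight and scores key-value blocks, but not how it does so.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vec(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def select_blocks(group_queries, keys, block_size, top_k):
    """Index branch (assumed form): one score per KV block, shared by the GQA group."""
    q_index = mean_vec(group_queries)  # assumption: group-averaged query as index query
    starts = range(0, len(keys), block_size)
    scores = [dot(q_index, mean_vec(keys[s:s + block_size])) for s in starts]
    ranked = sorted(range(len(scores)), key=lambda b: scores[b], reverse=True)
    return sorted(ranked[:top_k])  # selected block ids, ascending

def softmax_attention(q, keys, values):
    """Exact scaled dot-product attention for one query over the given tokens."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * dot(q, k) for k in keys]
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    dim_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim_v)]

def block_sparse_group_attention(group_queries, keys, values, block_size, top_k):
    """Main branch: every query head in the group attends only to the selected blocks."""
    blocks = select_blocks(group_queries, keys, block_size, top_k)
    token_ids = [t for b in blocks
                 for t in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    sel_k = [keys[t] for t in token_ids]
    sel_v = [values[t] for t in token_ids]
    return blocks, [softmax_attention(q, sel_k, sel_v) for q in group_queries]

# Toy data: 8 tokens in 4 blocks of 2; two query heads sharing one KV head.
keys = [[1.0, 0.0]] * 2 + [[0.0, 1.0]] * 2 + [[-1.0, 0.0]] * 2 + [[0.0, -1.0]] * 2
values = [[float(t)] for t in range(8)]
group_queries = [[0.0, 2.0], [0.0, 4.0]]
blocks, outputs = block_sparse_group_attention(
    group_queries, keys, values, block_size=2, top_k=1)
print(blocks, outputs)
```

Setting top_k to the total number of blocks makes the sketch reduce to dense attention over the full context, which is a convenient sanity check. Selecting whole blocks once per group, rather than individual tokens per head, is what lets all heads in the group read the same contiguous slices of the KV cache.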

One-sentence Summary

MiniMax Sparse Attention (MSA) is a blockwise sparse variant of Grouped Query Attention in which a lightweight Index Branch independently selects Top-k key-value blocks for each GQA group and a Main Branch computes exact attention over only those blocks; on a 109B-parameter model it performs on par with standard GQA while cutting per-token attention compute by 28.4x at a 1M-token context, and a co-designed kernel using exp-free Top-k selection and KV-outer sparse attention delivers 14.2x prefill and 7.6x decoding wall-clock speedups on H800 GPUs.

Key Contributions

  • MiniMax Sparse Attention (MSA) is a blockwise sparse attention mechanism built on Grouped Query Attention (GQA). A lightweight Index Branch scores key-value blocks and independently selects a Top-k subset for each GQA group; the Main Branch then computes exact attention over only the selected blocks.
  • A co-designed GPU execution path uses exp-free Top-k selection and KV-outer sparse attention to improve tensor-core utilization under block-granular access.
  • Evaluations on a 109B-parameter natively multimodal model show that MSA performs on par with standard GQA while reducing per-token attention compute by 28.4x at a 1M-token context; paired with the co-designed kernel, it achieves 14.2x prefill and 7.6x decoding wall-clock speedups on H800.
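The summary does not spell out what "exp-free" means. One reading consistent with the name is that block ranking can skip the exponential entirely: softmax is strictly increasing in each logit, so the Top-k of the raw scores is the same set as the Top-k of the softmax probabilities. The sketch below checks that property on hypothetical block scores; it is an illustration of the underlying fact, not the released kernel's logic.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_indices(scores, k):
    """Indices of the k largest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Hypothetical index-branch scores for six KV blocks (not taken from the paper).
block_logits = [0.3, -1.2, 2.5, 0.9, -0.4, 1.7]

from_logits = top_k_indices(block_logits, 3)           # no exp, no normalization
from_probs = top_k_indices(softmax(block_logits), 3)   # full softmax, then rank
print(from_logits, from_probs)
```

Both rankings agree, so a selection step only needs comparisons on raw scores. In floating point, logits that are extremely close together could round to the same probability after exponentiation, which is another reason to rank before applying exp rather than after.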

Introduction

Long-context modeling in transformer-based language models requires efficient attention mechanisms to mitigate the quadratic compute and memory overhead of dense softmax attention. Prior approaches replace attention with linear or recurrent alternatives, apply fixed content-agnostic sparse patterns, or use adaptive sparsification that either inherits full-attention training costs or suffers from fragmented memory access and unoptimized inference kernels. MSA instead shares one Top-k selection across each GQA group and selects at block granularity, which keeps KV cache reads contiguous while the selection itself remains content-adaptive. The authors further adapt the FlashAttention algorithmic skeleton, choosing a loop ordering tuned to this access pattern, so that theoretical FLOP reductions translate into measurable wall-clock speedups.
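To make the loop-ordering idea concrete, here is a minimal pure-Python sketch of a "KV-outer" traversal combined with FlashAttention-style online softmax. The outer loop walks KV blocks so each block is read once as a contiguous slice, and the inner loop updates only the queries that selected it. The function names, the `selected` layout (one set of block ids per query), and the linear scan over queries are all assumptions for illustration; the actual kernel's scheduling is not described in this summary.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def kv_outer_sparse_attention(queries, keys, values, selected, block_size):
    """selected[i]: set of block ids picked for query i (assumed non-empty)."""
    dim_v = len(values[0])
    run_max = [-math.inf] * len(queries)    # running max logit per query
    run_sum = [0.0] * len(queries)          # running softmax normalizer per query
    acc = [[0.0] * dim_v for _ in queries]  # running unnormalized output per query
    num_blocks = (len(keys) + block_size - 1) // block_size
    for b in range(num_blocks):             # outer loop: KV blocks, each read once
        k_blk = keys[b * block_size:(b + 1) * block_size]
        v_blk = values[b * block_size:(b + 1) * block_size]
        for i, q in enumerate(queries):     # inner loop: queries that selected block b
            if b not in selected[i]:        # a real kernel would use an inverted index
                continue
            scale = 1.0 / math.sqrt(len(q))
            logits = [scale * dot(q, k) for k in k_blk]
            new_max = max(run_max[i], max(logits))
            rescale = math.exp(run_max[i] - new_max)  # 0.0 on a query's first block
            weights = [math.exp(x - new_max) for x in logits]
            run_sum[i] = run_sum[i] * rescale + sum(weights)
            acc[i] = [a * rescale + sum(w * v[j] for w, v in zip(weights, v_blk))
                      for j, a in enumerate(acc[i])]
            run_max[i] = new_max
    return [[a / run_sum[i] for a in acc[i]] for i in range(len(queries))]

def gathered_reference(q, keys, values, blocks, block_size):
    """Direct softmax over the gathered tokens, used only to check the sketch."""
    ids = [t for b in sorted(blocks)
           for t in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * dot(q, keys[t]) for t in ids]
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [sum(w * values[t][j] for w, t in zip(weights, ids)) / total
            for j in range(len(values[0]))]
```

Because the online-softmax update rescales the running sum and accumulator whenever a larger logit appears, the result matches a direct softmax over the gathered tokens up to floating-point rounding, regardless of the order in which blocks are visited.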

Experiment

Two 109B-scale experiments test replacing dense attention with the sparse mechanism: one trains the sparse model from scratch, and the other continues pretraining from a full-attention checkpoint. The from-scratch run shows that the model can stably adapt its representations to learn essential attention structures without hard-coded constraints, while the continued-pretraining run establishes a practical, stable path for converting dense checkpoints. Both routes remain competitive on language, multimodal, and long-context benchmarks despite a strict key-value token budget. Together, the results position the sparse architecture as a scalable alternative to dense attention that sustains long-context capability at substantially lower compute.

