The Evolution of Attention: From MLA to NSA

Step Snap 1 [The Attention Bottleneck Problem]

1. The Hidden Challenge in AI Text Generation

Imagine you're having a conversation with an AI. Every time it responds, a fascinating but invisible challenge occurs:

  • Memory Explosion: The AI must remember everything said so far (stored as "key-value pairs")

  • Quadratic Growth: The longer the conversation, the more pairwise calculations are needed, growing quadratically with its length (a rough cost sketch follows at the end of this step)

  • Resource Limitation: GPUs have limited high-speed memory (SRAM) to process all this data

Why This Matters:

  • Longer conversations become dramatically slower

  • Each active conversation consumes massive GPU memory

  • Fewer users can be served simultaneously

  • Cloud computing costs skyrocket
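
To make the scale of the problem concrete, here is a back-of-the-envelope sketch in Python. The layer count, head count, and head dimension are illustrative assumptions (roughly a mid-size transformer), not any specific model's configuration:

```python
# Back-of-the-envelope: KV-cache size and attention cost vs. context length.
# All model dimensions below are illustrative assumptions, not a real model's config.

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):
    # One key and one value vector per token, per layer, per KV head (fp16 = 2 bytes).
    return seq_len * n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem

def attention_flops(seq_len, n_layers=32, n_heads=32, head_dim=128):
    # QK^T scores plus the weighted sum over values: both scale with seq_len^2.
    return n_layers * n_heads * 4 * seq_len**2 * head_dim

for n in (1_000, 8_000, 64_000):
    print(f"{n:>6,} tokens: KV cache ≈ {kv_cache_bytes(n) / 1e9:5.1f} GB, "
          f"attention ≈ {attention_flops(n) / 1e12:7.1f} TFLOPs")
```

The memory grows linearly with context length while the attention math grows quadratically, and it is exactly this combination that strains GPU memory and compute at the same time.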

Step Snap 2 [First-Generation Solutions: Trading Quality for Speed]

1. The Early Compromises

The first attempts to solve this problem took a straightforward approach, simply using fewer key-value pairs:

Multi-Query Attention (MQA):

  • The Approach: All query heads share a single key-value pair

  • Memory Savings: Dramatic (the KV cache shrinks by a factor equal to the number of query heads, often 8x or more)

  • The Downside: Significant model quality degradation

Grouped-Query Attention (GQA):

  • The Approach: Groups of query heads share key-value pairs

  • Memory Savings: Substantial (2-4x less memory)

  • The Downside: Still compromises quality, though less than MQA

The Tradeoff Problem: Like watching a movie with either fewer pixels or fewer frames: you save resources but lose quality. The short sketch below compares how much KV cache each scheme keeps per token.
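
The per-token arithmetic behind these savings is easy to sketch. The head count, group size, and head dimension are illustrative assumptions:

```python
# KV-cache size per token for standard MHA vs. GQA vs. MQA.
# 32 query heads and head_dim=128 are illustrative assumptions.

n_q_heads, head_dim, bytes_fp16 = 32, 128, 2

def kv_bytes_per_token_per_layer(n_kv_heads):
    # One key + one value vector cached per KV head.
    return n_kv_heads * head_dim * 2 * bytes_fp16

configs = {
    "MHA (1 KV head per query head)": n_q_heads,          # 32 KV heads
    "GQA (groups of 8 share a KV head)": n_q_heads // 8,  # 4 KV heads
    "MQA (all query heads share 1)": 1,
}
for name, n_kv in configs.items():
    print(f"{name:35s} -> {kv_bytes_per_token_per_layer(n_kv):6d} bytes/token/layer")
```

MQA keeps a single KV head no matter how many query heads exist, which is where both its large savings and its quality loss come from.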

Step Snap 3 [DeepSeek's MLA: The Compression Breakthrough]

1. The Clever Data Compression Trick

DeepSeek's Multi-head Latent Attention (MLA) took a fundamentally different approach:

The Core Innovation:

  • Compress, Don't Eliminate: Transform input vectors into smaller "latent" representations

  • Store the Compressed Version: Cache only these compact latent vectors

  • Decompress When Needed: Expand them back to full size during calculation

The Streaming Video Analogy:

  • MQA/GQA: Like watching a video at lower resolution all the time

  • MLA: Like streaming a compressed video file and decompressing it only when watching

The Technical Magic (see the sketch after this list):

  • Down-project high-dimensional vectors to low-dimensional latent space

  • Store only these smaller vectors in memory

  • Use up-projection matrices to reconstruct full information when needed
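
Here is a minimal numpy sketch of the compress-then-decompress idea. It is a simplification under assumed dimensions, with hypothetical weight names (W_down, W_up_k, W_up_v), not DeepSeek's actual MLA code:

```python
import numpy as np

# Minimal sketch of latent KV compression (MLA-style idea, not the exact DeepSeek implementation).
d_model, d_latent, n_heads, head_dim = 4096, 512, 32, 128

rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, d_latent)) * 0.02             # hidden state -> latent
W_up_k = rng.standard_normal((d_latent, n_heads * head_dim)) * 0.02  # latent -> full keys
W_up_v = rng.standard_normal((d_latent, n_heads * head_dim)) * 0.02  # latent -> full values

h = rng.standard_normal((1, d_model))      # hidden state of one new token

# What gets cached: only the small latent vector.
c = h @ W_down                             # shape (1, d_latent) -- this is all we store

# At attention time, keys/values are reconstructed (or the up-projection is absorbed, see the next step).
k = (c @ W_up_k).reshape(n_heads, head_dim)
v = (c @ W_up_v).reshape(n_heads, head_dim)

print("cached per token:", c.size, "values vs.", k.size + v.size, "for full K/V")
```

Only the small latent vector is cached per token instead of full per-head keys and values, which is where the memory savings come from.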

Step Snap 4 [MLA's Technical Innovation]

1. The Mathematical Sleight of Hand

The real genius of MLA lies in how it handles the decompression process:

The "Matrix Absorption" Trick:

  • Clever matrix manipulations allow up-projection matrices to be "absorbed" into other calculations

  • This makes decompression essentially free in computational terms (verified numerically in the short check below)
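
The trick can be checked numerically in a few lines. This is a toy single-head sketch with random placeholder weights, not DeepSeek's implementation: folding the key up-projection into the query gives exactly the same attention scores without ever materializing the decompressed keys.

```python
import numpy as np

# Numerical check of the "matrix absorption" trick for one attention head.
# Dimensions and weights are random placeholders, purely for illustration.
rng = np.random.default_rng(1)
d_latent, head_dim, n_cached = 512, 128, 10

W_uk = rng.standard_normal((d_latent, head_dim))   # up-projection: latent -> key
C = rng.standard_normal((n_cached, d_latent))      # cached latent vectors (one per past token)
q = rng.standard_normal(head_dim)                  # query for the current token

# Naive route: decompress every cached key, then take dot products.
scores_naive = (C @ W_uk) @ q                      # shape (n_cached,)

# Absorbed route: fold W_uk into the query once, then score directly against the latents.
q_absorbed = W_uk @ q                              # shape (d_latent,)
scores_absorbed = C @ q_absorbed                   # shape (n_cached,)

print(np.allclose(scores_naive, scores_absorbed))  # True: same scores, no per-token decompression
```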

The Position Encoding Challenge:

  • Rotary Position Embeddings (RoPE) initially broke the "absorption" trick

  • DeepSeek invented "decoupled RoPE" to solve this problem

  • This allows position information to flow correctly while preserving the compression benefits (a simplified sketch follows this list)
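
A simplified sketch of the decoupled-RoPE idea, with toy dimensions and a minimal rotary helper (both are assumptions for illustration): the position-dependent rotation is applied only to a small extra query/key component, while the main compressed component stays position-free, and the two score contributions are simply added.

```python
import numpy as np

# Toy sketch of decoupled RoPE: a separate small rotary component carries position,
# while the main (compressible/absorbable) component stays position-free.
# Dimensions and the RoPE helper are illustrative, not DeepSeek's implementation.

def rope(x, pos, base=10000.0):
    # Rotary embedding over dimension pairs (half-split convention).
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    angles = pos * freqs
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * np.cos(angles) - x2 * np.sin(angles),
                           x1 * np.sin(angles) + x2 * np.cos(angles)], axis=-1)

rng = np.random.default_rng(2)
d_nope, d_rope = 128, 64          # main (position-free) part and small rotary part

q_nope, k_nope = rng.standard_normal(d_nope), rng.standard_normal(d_nope)  # from the latent path
q_rope, k_rope = rng.standard_normal(d_rope), rng.standard_normal(d_rope)  # small extra per-head part

q_pos, k_pos = 7, 3               # token positions
score = q_nope @ k_nope + rope(q_rope, q_pos) @ rope(k_rope, k_pos)
print("attention logit with decoupled position info:", score)
```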

The Result:

  • Up to 5x less memory usage

  • No quality degradation - sometimes even better than standard attention

  • Model inference becomes dramatically faster

Step Snap 5 [NSA: The Next Evolution]

1. From Memory Efficiency to Computational Efficiency

While MLA brilliantly solves the memory problem, DeepSeek's Native Sparse Attention (NSA) tackles the computational challenge:

The Fundamental Insight:

  • Even with MLA's memory efficiency, the attention calculation itself is still O(n²)

  • As context length grows, the computational cost becomes the bottleneck

  • What if we could selectively calculate attention only for the most important tokens?

The NSA Approach:

  • Native: Build sparsity into the model from the beginning, not as an afterthought

  • Sparse: Calculate attention for only a subset of tokens

  • Attention: Maintain the core attention mechanism, just with fewer calculations

Why "Native" Matters:

  • Earlier sparse-attention approaches such as Longformer, ETC, and sliding-window variants relied on fixed, hand-designed patterns, and many later methods bolted sparsity on only after dense training

  • Post-hoc sparsification is fragile: analyses show that the top 20% of attention weights cover only about 70% of the total attention mass (the sketch after this list shows how such coverage can be measured)

  • Training with native sparsity lets the model adapt to sparse patterns from the beginning
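
The coverage idea can be made concrete with a small measurement utility. This is only a sketch: the attention weights below are randomly generated toy data, so the printed number is illustrative and not a result from any real model.

```python
import numpy as np

# How much attention mass do the top p% of keys capture for each query?
# The randomly generated "attention logits" below are only for illustration.

def topk_coverage(attn_weights, frac=0.20):
    # attn_weights: (n_queries, n_keys), rows already softmax-normalized.
    k = max(1, int(attn_weights.shape[1] * frac))
    top = np.sort(attn_weights, axis=1)[:, -k:]        # largest k weights per query
    return top.sum(axis=1).mean()                      # mean coverage across queries

rng = np.random.default_rng(3)
logits = rng.standard_normal((64, 1024)) * 2.0         # toy logits, mildly peaked
weights = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

print(f"top 20% of keys cover {topk_coverage(weights):.0%} of attention mass (toy data)")
```

Run on real attention maps, this kind of measurement is what motivates building sparsity into training rather than pruning attention afterwards.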

Step Snap 6 [NSA's Triple Attention Strategy]

1. Three Complementary Attention Types

NSA cleverly combines three different attention mechanisms to capture information at different scales:

Compression Attention:

  • Divides the sequence into blocks and compresses each block

  • Captures global patterns across the entire sequence

  • Acts as a "big picture" view of the entire context

  • Similar to MLA's compression but for a different purpose

Selection Attention:

  • Identifies the most important blocks for each query

  • Focuses computational resources on high-value information

  • Selects specific blocks for detailed attention based on relevance

  • The "smart filtering" component that dramatically reduces computation

Window Attention:

  • Attends to tokens in a local window around each position

  • Ensures recent context is never lost

  • Provides a "safety net" for critical nearby information

  • Based on the intuition that nearby tokens are often most relevant

The Gating Mechanism:

  • Learned weights determine how to combine these three attention types

  • Different contexts may rely more on different attention mechanisms

  • Makes the model adaptively balance global, selective, and local views

  • Optimized during training to maximize performance while minimizing computation (see the simplified combination sketch after this list)
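
Here is a heavily simplified, CPU-only sketch of how the three branches combine. Branch internals are reduced to stand-ins, the gates are passed in rather than produced by a learned gating network, and the names (attend, nsa_like_output, block sizes) are assumptions for illustration; real NSA fuses this logic into Triton GPU kernels.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, K, V):
    # Ordinary scaled dot-product attention over whatever K/V a branch hands in.
    w = softmax(q @ K.T / np.sqrt(q.shape[-1]))
    return w @ V

def nsa_like_output(q, K, V, gates, block=16, top_blocks=4, window=64):
    n = K.shape[0]
    # 1) Compression branch: mean-pool each block into one coarse key/value.
    Kc = K[: n // block * block].reshape(-1, block, K.shape[1]).mean(axis=1)
    Vc = V[: n // block * block].reshape(-1, block, V.shape[1]).mean(axis=1)
    o_cmp = attend(q, Kc, Vc)
    # 2) Selection branch: keep only the blocks whose pooled keys score highest for this query.
    block_scores = Kc @ q
    keep = np.argsort(block_scores)[-top_blocks:]
    idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in keep])
    o_slc = attend(q, K[idx], V[idx])
    # 3) Window branch: always attend to the most recent tokens.
    o_win = attend(q, K[-window:], V[-window:])
    # Gates (here supplied directly; in NSA they are learned from the query) blend the three views.
    g = gates / gates.sum()
    return g[0] * o_cmp + g[1] * o_slc + g[2] * o_win

rng = np.random.default_rng(4)
d, n = 64, 1024
q, K, V = rng.standard_normal(d), rng.standard_normal((n, d)), rng.standard_normal((n, d))
out = nsa_like_output(q, K, V, gates=np.array([0.2, 0.5, 0.3]))
print(out.shape)   # (64,) -- one output vector built from three sparse views of the same context
```

Even in this toy form, each query touches far fewer key/value vectors than dense attention would, which is the source of NSA's computational savings.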

Step Snap 7 [Comparing MLA and NSA]

1. Complementary Solutions to Different Problems

MLA's Focus:

  • Primary Goal: Reduce memory consumption

  • Main Benefit: Allows larger batch sizes and longer contexts

  • Technical Approach: Compression of key-value pairs

  • When It Helps Most: Inference stage (especially with KV cache)

  • Performance Impact: Maintains quality while reducing memory footprint

NSA's Focus:

  • Primary Goal: Reduce computational complexity

  • Main Benefit: Faster processing of long contexts

  • Technical Approach: Selective calculation of attention

  • When It Helps Most: Both training and inference

  • Performance Impact: 9x forward pass speedup, 6x backward pass speedup in 64K contexts

The Implementation Edge:

  • NSA uses specialized Triton kernels for optimal GPU utilization

  • Reduces memory traffic by loading shared KV blocks into SRAM once per GQA group of query heads

  • Tuned for modern NVIDIA data-center GPUs such as the H100/H800

  • Particularly effective for reasoning (R1) scenarios requiring long context

Can They Work Together?

  • Yes! MLA and NSA solve different aspects of the same problem

  • They can be combined for both memory efficiency and computational efficiency

  • Together, they enable truly efficient long-context AI systems

  • DeepSeek is likely to incorporate both in future model architectures

Step Snap 8 [The Performance Revolution]

1. Real-World Impact of NSA

The NSA architecture delivers remarkable performance improvements:

Benchmark Results:

  • 64K Context: 9x faster forward pass, 6x faster backward pass compared to Flash Attention

  • Training Speed: Significantly reduced training time (similar speedups)

  • Quality: No degradation in model capabilities - sometimes even improved performance

  • Hardware Utilization: Much more efficient use of GPU resources

The Scaling Benefits:

  • Attending over a 64K-token context costs roughly as much as dense attention over ~7K tokens (64K / 9; see the rough per-query count after this list)

  • Particularly valuable for reasoning-intensive tasks requiring long context (R1 paradigm)

  • As context length increases, the benefits become even more pronounced

  • Potentially saving thousands of GPU hours in large-scale model training
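
A rough count of how many key/value vectors each query touches shows why the benefit grows with length. The block size, number of selected blocks, and window below are assumptions for illustration, so the printed ratio is not the paper's measured 9x speedup, just a sense of the scaling:

```python
# Rough count of key/value vectors each query touches under an NSA-style pattern
# vs. dense attention. Block size, #selected blocks, and window are assumptions.
n, block, top_blocks, window = 64_000, 64, 16, 512

dense_keys = n
sparse_keys = n // block + top_blocks * block + window   # compressed + selected + local
print(f"dense:  {dense_keys:,} keys per query")
print(f"sparse: {sparse_keys:,} keys per query  (~{dense_keys / sparse_keys:.0f}x fewer)")
```

Because the sparse budget grows far more slowly than the context itself, the gap between dense and sparse attention keeps widening as sequences get longer.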

Step Snap 9 [The Future of Attention]

1. Beyond the Horizon

MLA and NSA represent critical steps in attention evolution, but they're not the end:

The Common Pattern:

  • Both solutions found ways to preserve quality while eliminating redundancy

  • Both demonstrate that attention can be optimized without sacrificing performance

  • Both suggest that further efficiency gains are still possible

The Path Forward:

  • DeepSeek is already implementing these technologies in production models

  • On modern GPUs such as the H100/H800, the measured gains (up to ~9x attention speedups at 64K context) compound with MLA's memory savings

  • As contexts grow ever longer, the relative advantage of sparse, compressed attention is expected to keep widening

MLA and NSA show us that the fundamental architecture of language models still has room for radical improvement. Rather than just scaling models larger, these innovations make existing models dramatically more efficient - a key step toward making advanced AI more accessible and affordable.
