LLM Architecture

Introduction

Large Language Models (LLMs) are built upon sophisticated neural network architectures that have evolved dramatically in recent years. This document explores the core architectural components and innovations that power today's most advanced language models.

Transformer Architecture

The foundation of modern LLMs is the Transformer architecture, introduced in the landmark 2017 paper "Attention Is All You Need" by Vaswani et al.

Key Components

  1. Self-Attention Mechanism

    • Allows the model to weigh the importance of different words in a sentence

    • Captures long-range dependencies that were difficult for previous architectures

    • Enables parallel processing of sequence data

  2. Multi-Head Attention

    • Splits attention into multiple "heads" that can focus on different aspects of the input

    • Enhances the model's ability to capture various types of relationships in the data

  3. Feed-Forward Networks

    • Two-layer neural networks applied to each position independently

    • Add non-linearity and increase the model's representational power

  4. Layer Normalization and Residual Connections

    • Stabilize and accelerate training

    • Allow very deep networks to be trained without vanishing or exploding gradients (all four components appear together in the sketch below)
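
To make these four components concrete, the sketch below combines them into a single pre-norm Transformer block in PyTorch. The hyperparameters (d_model, n_heads, d_ff) and class name are illustrative defaults rather than the configuration of any particular model, and the pre-norm placement is one common variant, not the exact layout of the original paper.

```python
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    """Minimal pre-norm Transformer block: self-attention + FFN, each with a residual."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        # Multi-head self-attention: d_model is split across n_heads "heads"
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Position-wise feed-forward network: two linear layers with a non-linearity
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        # Layer normalization before each sub-layer (pre-norm variant)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, attn_mask=None):
        # Self-attention sub-layer with a residual connection
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask)
        x = x + attn_out
        # Feed-forward sub-layer with a residual connection
        x = x + self.ffn(self.norm2(x))
        return x


# Usage: a batch of 2 sequences, 16 tokens each, with embedding size 512
block = TransformerBlock()
out = block(torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```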

Architectural Variations

Decoder-Only Models

  • GPT family, LLaMA, Mistral

  • Trained to predict the next token in a sequence

  • Well-suited for generative tasks
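
The next-token objective behind these models can be sketched in a few lines. The `model` callable below is a placeholder for any decoder-only network that maps token IDs to vocabulary logits while applying a causal mask internally; the helper name and shapes are illustrative.

```python
import torch
import torch.nn.functional as F


def next_token_loss(model, token_ids):
    """One autoregressive training step for a decoder-only language model.

    token_ids: (batch, seq_len) integer tensor of token IDs.
    `model` is assumed to return logits of shape (batch, seq_len - 1, vocab_size).
    """
    # Targets are the inputs shifted left by one: position t predicts token t + 1.
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and time dimensions
        targets.reshape(-1),
    )


# The causal mask used inside such a model: True marks future positions that a
# token is not allowed to attend to (PyTorch's attn_mask convention).
seq_len = 5
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
print(causal_mask)
```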

Encoder-Only Models

  • BERT, RoBERTa

  • Bidirectional context processing

  • Effective for classification and understanding tasks

Encoder-Decoder Models

  • T5, BART

  • Combined architecture for sequence-to-sequence tasks

  • Used for translation, summarization, etc.
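
In practice, the main mechanical difference between these three families is which positions each token may attend to. The toy masks below (1 = may attend) illustrate the patterns; real implementations also combine them with padding masks.

```python
import torch

tgt_len, src_len = 4, 6

# Decoder-only: causal mask, each token sees itself and earlier tokens only.
causal = torch.tril(torch.ones(tgt_len, tgt_len))

# Encoder-only: bidirectional, every token sees every other token.
bidirectional = torch.ones(tgt_len, tgt_len)

# Encoder-decoder: bidirectional over the source, causal over the target, plus
# cross-attention in which every target position sees every source position.
cross_attention = torch.ones(tgt_len, src_len)

print(causal.int())           # lower-triangular
print(bidirectional.int())    # all ones, square
print(cross_attention.int())  # all ones, target_len x source_len
```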

LLM architecture development has been characterized by a "scaling race", with parameter counts growing by several orders of magnitude:

  • BERT (340M parameters)

  • GPT-3 (175B parameters)

  • PaLM (540B parameters)

  • GPT-4 (parameter count undisclosed, widely estimated to be in the trillions)
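
A rough back-of-envelope formula makes these numbers concrete: a decoder-only Transformer with n_layers layers and hidden size d_model has about 12 · n_layers · d_model² non-embedding parameters (4 · d_model² in the attention projections plus 8 · d_model² in the feed-forward block, assuming the common choice d_ff = 4 · d_model). Plugging in the GPT-3 configuration reported by Brown et al. (2020) recovers roughly the published size:

```python
def approx_transformer_params(n_layers: int, d_model: int) -> int:
    """Rough non-embedding parameter count for a decoder-only Transformer.

    Per layer: 4 * d_model**2 for the Q/K/V/output projections and
    8 * d_model**2 for a feed-forward block with d_ff = 4 * d_model.
    Embeddings, biases, and layer norms are ignored.
    """
    return 12 * n_layers * d_model**2


# GPT-3 configuration from Brown et al. (2020): 96 layers, d_model = 12288
print(f"{approx_transformer_params(96, 12288) / 1e9:.0f}B")  # ~174B vs. the reported 175B
```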

Recent Innovations

Mixture of Experts (MoE)

  • Activates only a subset of the network for each input

  • Allows for much larger models with similar computational costs

  • Examples: Mixtral, Switch Transformers
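
The sketch below shows the core idea, top-k routing over a set of feed-forward experts, in PyTorch. The gating scheme, shapes, and class name are generic illustrations rather than the exact Mixtral or Switch Transformer implementations, and the per-expert Python loop is written for clarity, not efficiency.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    """Sparse mixture-of-experts feed-forward layer with top-k routing."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # A small router scores each token against every expert
        self.router = nn.Linear(d_model, n_experts)
        # Each expert is an ordinary position-wise feed-forward network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):
        # x: (batch, seq_len, d_model) -> flatten tokens for routing
        tokens = x.reshape(-1, x.size(-1))
        scores = F.softmax(self.router(tokens), dim=-1)        # (n_tokens, n_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)      # keep only the top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize their weights

        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = chosen[:, k] == i  # tokens routed to expert i in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)


layer = MoELayer()
print(layer(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```

Production MoE layers dispatch tokens to experts in parallel and usually add an auxiliary load-balancing loss so that no single expert is overused.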

State Space Models

  • Mamba, and closely related long-convolution models such as Hyena

  • Replace attention with recurrences or long convolutions that scale linearly (or near-linearly) with sequence length

  • Promising for handling very long contexts
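
As a toy illustration of why these models scale linearly, the sketch below runs a plain diagonal linear state space recurrence: each step costs O(d_state), so a sequence costs O(seq_len). It deliberately omits the selective, input-dependent parameterization and parallel-scan machinery that makes Mamba practical.

```python
import torch


def diagonal_ssm(x, A, B, C):
    """Simple single-channel diagonal state space model.

        h_t = A * h_{t-1} + B * x_t   (elementwise state update)
        y_t = <C, h_t>                (readout)

    x: (seq_len,) input signal; A, B, C: (d_state,) parameters.
    Each step costs O(d_state), so the whole sequence is linear in its length.
    """
    h = torch.zeros_like(A)
    ys = []
    for x_t in x:
        h = A * h + B * x_t
        ys.append(torch.dot(C, h))
    return torch.stack(ys)


d_state, seq_len = 16, 1000
A = torch.rand(d_state) * 0.9   # stable decay factors in (0, 0.9)
B = torch.randn(d_state)
C = torch.randn(d_state)
y = diagonal_ssm(torch.randn(seq_len), A, B, C)
print(y.shape)  # torch.Size([1000])
```

Real implementations evaluate this recurrence with parallel scans or FFT-based convolutions rather than a Python loop; the point here is only the linear scaling in sequence length.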

Future Directions

  • More efficient attention mechanisms

  • Sparse activation patterns

  • Hardware-aware architecture design

  • Modular architectures with specialized components

References

  1. Vaswani, A., et al. (2017). Attention Is All You Need. https://arxiv.org/abs/1706.03762

  2. Brown, T., et al. (2020). Language Models are Few-Shot Learners. https://arxiv.org/abs/2005.14165

  3. Chowdhery, A., et al. (2022). PaLM: Scaling Language Modeling with Pathways. https://arxiv.org/abs/2204.02311
