Large Language Models Explained: Transformers, Scaling Laws, and RLHF
How large language models work — transformer self-attention, Chinchilla scaling laws, emergent capability thresholds, RLHF alignment, and the root causes of hallucination.
GPT-3 Started From 45 Terabytes of Raw Text
OpenAI's GPT-3, released in 2020 with 175 billion parameters, drew on approximately 45 terabytes of compressed Common Crawl plaintext, which quality filtering reduced to about 570 GB before it was combined with curated sources such as books and Wikipedia. It could write essays, generate code, translate languages, and answer questions without any task-specific fine-tuning. No system trained before 2017 could do this. The architectural innovation that made it possible, the Transformer, was introduced in a single 2017 paper: "Attention Is All You Need" by Vaswani, Shazeer, Parmar, and colleagues, most of them at Google. Everything since (BERT, GPT-4, Claude, Gemini, Llama) is a descendant of that architecture.
The Transformer: Self-Attention as the Core Mechanism
Before Transformers, language models processed text sequentially — one word at a time — using recurrent neural networks (RNNs) or LSTMs. Long-range dependencies were difficult to capture because information had to propagate through many sequential steps. Transformers eliminated this bottleneck entirely through self-attention: a mechanism that allows every token in a sequence to directly attend to every other token simultaneously, computing relevance weights regardless of positional distance.
In self-attention, each input token is projected into three vectors: a Query (Q), a Key (K), and a Value (V). Attention weights are computed as softmax(QK^T / √d_k), where d_k is the key dimension, and the layer's output is those weights applied to the values: softmax(QK^T / √d_k)V. The weights determine how much each token's value contributes to another token's representation. Multi-head attention runs this process in parallel across multiple "heads", allowing the model to attend to syntactic relationships, semantic associations, and positional patterns within a single layer.
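The attention formula can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not a production implementation: the function names and the toy Q, K, and V values are assumptions made up for the example, and in a real Transformer Q, K, and V come from learned projection matrices rather than being set by hand.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.

    Q and K are lists of n vectors of length d_k; V is a list of n vectors.
    Returns (outputs, weights), where weights[i][j] is how strongly
    token i attends to token j.
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    weights = []
    for q in Q:
        # Dot product of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights.append(softmax(scores))
    # Each output vector is the attention-weighted sum of the value vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(w_row, V)) for j in range(len(V[0]))]
        for w_row in weights
    ]
    return outputs, weights

# Toy input: 3 tokens with d_k = 2 (arbitrary illustrative values).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Every row of `weights` sums to 1, and every token's weights are computed from the full sequence in one pass, which is what lets attention relate distant tokens without stepping through the positions in between.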
| Architecture Component | Function | Why It Matters |
|---|---|---|
| Self-attention (multi-head) | Captures relationships between all tokens simultaneously | Enables long-range dependency modeling |
| Positional encoding | Adds position information (Transformers have no inherent order) | Distinguishes "dog bites man" from "man bites dog" |
| Feed-forward sublayers | Non-linear transformation of attention output | Adds representational capacity per layer |
| Layer normalization | Stabilizes activations during training | Enables training of very deep networks |
| Residual connections | Skip connections around each sublayer | Prevents vanishing gradients at scale |
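As one concrete instance of the positional encoding row, the original Transformer paper used fixed sinusoidal encodings, which can be sketched as follows. The function name and the small dimensions are assumptions chosen for illustration, and many later models use learned or rotary position schemes instead.

```python
import math

def sinusoidal_encoding(num_positions, d_model):
    """Build a table where, for each position pos and pair index i:
    PE[pos][2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos][2i+1] = cos(pos / 10000^(2i / d_model))
    """
    table = []
    for pos in range(num_positions):
        row = []
        for dim in range(d_model):
            # Even and odd dimensions in a pair share the same frequency.
            exponent = (2 * (dim // 2)) / d_model
            angle = pos / (10000 ** exponent)
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# Toy sizes: 4 positions, 6 dimensions (illustrative values only).
pe = sinusoidal_encoding(num_positions=4, d_model=6)
for pos, row in enumerate(pe):
    print(pos, [round(x, 3) for x in row])
```

Each position gets a distinct vector that is added to the token embedding, so the same word at position 0 and at position 2 enters the attention layers with different representations.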
Scaling Laws and the Chinchilla Revelation
OpenAI's 2020 scaling laws paper (Kaplan et al.) established that a language model's loss falls as a power law in model size, training data, and compute budget: larger models, trained on more data, consistently perform better. The field interpreted this as a mandate to build ever-larger models. Then came DeepMind's Chinchilla paper (Hoffmann et al., 2022).
Chinchilla trained a 70-billion parameter model on 1.4 trillion tokens and outperformed GPT-3 (175B parameters, ~300B tokens) on virtually every benchmark. The finding: for a given compute budget, previous models were significantly undertrained. The "Chinchilla optimal" ratio suggests approximately 20 tokens of training data per model parameter. GPT-3 had only about 1.7 tokens per parameter. The implications reshaped LLM development: Meta's Llama series, Mistral, and subsequent Google models all followed the Chinchilla lesson that smaller models trained on more data can outperform larger undertrained ones.
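The tokens-per-parameter figures can be checked with simple arithmetic. The sketch below uses the parameter and token counts quoted in the text; the helper names are hypothetical, and the compute estimate relies on the common rule of thumb that training costs roughly 6 FLOPs per parameter per token, which is an approximation rather than a measured value.

```python
def tokens_per_param(params, tokens):
    """Ratio of training tokens to model parameters."""
    return tokens / params

def chinchilla_tokens(params, ratio=20):
    """Training tokens suggested by the ~20 tokens/parameter heuristic."""
    return ratio * params

def approx_train_flops(params, tokens):
    """Rough training compute, assuming C ~= 6 * N * D (rule of thumb)."""
    return 6 * params * tokens

# (parameters, training tokens) as quoted in the text.
models = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

for name, (n, d) in models.items():
    print(
        f"{name}: {tokens_per_param(n, d):.1f} tokens/param, "
        f"~{approx_train_flops(n, d):.2e} training FLOPs, "
        f"20:1 heuristic suggests {chinchilla_tokens(n):.2e} tokens"
    )
```

By this heuristic, a 175B-parameter model would call for about 3.5 trillion training tokens, more than ten times what GPT-3 saw.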
Emergent Capabilities: The Threshold Effect
Some LLM capabilities appear abruptly at certain model scales: rather than improving gradually as models grow, they are absent in smaller models and appear suddenly once a size threshold is crossed. Researchers at Google Brain documented this in a 2022 paper (Wei et al.), identifying abilities including:
- Multi-step arithmetic reasoning (chain-of-thought) — emerged around 100B parameters
- Multi-language translation without explicit training on translation pairs
- Code generation from natural language descriptions
- Analogical reasoning in the style of IQ test problems
The interpretation of emergence is contested. Some researchers (Schaeffer et al., 2023) argue that emergence is an artifact of discontinuous evaluation metrics — that the underlying capabilities improve continuously and only the measurement threshold creates the appearance of sudden emergence. The debate has not been resolved, but it has significant implications for AI safety: capabilities that emerge unpredictably are harder to anticipate and constrain.
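The metric-artifact argument can be illustrated with a toy calculation. The per-token accuracies below are hypothetical numbers, not measurements from any model, and the sketch assumes token errors are independent, which is a simplification. The point is only that an all-or-nothing metric such as exact match can look like a sudden jump even when the underlying per-token accuracy improves smoothly.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Chance that every token in an answer is correct,
    assuming independent per-token errors (a simplifying assumption)."""
    return per_token_accuracy ** answer_length

# Hypothetical per-token accuracies for a series of increasingly large models.
per_token = [0.50, 0.70, 0.85, 0.95, 0.99]
answer_length = 10  # tokens that must all be right for an exact match

for p in per_token:
    em = exact_match_rate(p, answer_length)
    print(f"per-token accuracy {p:.2f} -> exact match {em:.3f}")
```

The per-token column rises steadily, while the exact-match column stays near zero for the first models and then climbs steeply, which is the pattern Schaeffer et al. attribute to the choice of metric.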
RLHF: Teaching Models to Be Helpful
Pretraining on internet text produces a model that predicts likely next tokens — including harmful, biased, or misleading text if that is statistically prevalent in the corpus. Reinforcement Learning from Human Feedback (RLHF), pioneered by OpenAI and applied to InstructGPT (Ouyang et al., 2022), adds three post-training steps: supervised fine-tuning on human-written demonstrations; training a reward model on human preference rankings of model outputs; and optimizing the language model against the reward model using proximal policy optimization (PPO). RLHF dramatically improved the helpfulness, harmlessness, and honesty of model outputs — transforming raw pretrained models into usable assistants. Constitutional AI (Anthropic), direct preference optimization (DPO), and other RLHF alternatives have since extended this paradigm.
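The reward-model step can be made concrete with the pairwise preference loss used in InstructGPT-style training: the reward model is pushed to score the human-preferred response above the rejected one. The sketch below shows only that loss on scalar rewards; the function name and the example reward values are assumptions for illustration, and a real reward model is a neural network trained over many comparisons.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Loss = -log(sigmoid(r_chosen - r_rejected)).

    Small when the preferred response scores well above the rejected one,
    large when the ranking is inverted.
    """
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(diff)) == log(1 + exp(-diff)); branch to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# Hypothetical reward scores for a (preferred, rejected) pair of responses.
print(pairwise_preference_loss(2.0, 0.0))  # ranking agrees with the human label
print(pairwise_preference_loss(0.0, 0.0))  # model is indifferent
print(pairwise_preference_loss(0.0, 2.0))  # ranking contradicts the human label
```

Minimizing this loss over many human comparisons yields the scalar reward signal that the PPO step then optimizes the language model against.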
Why LLMs Hallucinate
Hallucination — generating confident but factually incorrect statements — is not a bug to be patched. It is a structural consequence of how LLMs are trained. The model learns to produce statistically plausible continuations of text, not to verify factual accuracy against a ground truth. Several mechanisms drive hallucination:
- Knowledge cutoff: Training data has a cutoff date; the model cannot know events after that date but will still generate plausible-sounding text about them.
- Rare knowledge: Accurate information about obscure topics appears infrequently in training data; the model has weaker statistical support for accurate recall.
- Sycophantic reinforcement: RLHF reward models may inadvertently reward confident, fluent responses over accurate but hedged ones if human raters prefer confident-sounding answers.
- No retrieval mechanism: Base LLMs have no ability to "look up" facts at inference time; everything must be in weights. Retrieval-augmented generation (RAG) partially addresses this.
| LLM | Parameters | Training Tokens | Release Year |
|---|---|---|---|
| GPT-3 (OpenAI) | 175B | ~300B | 2020 |
| PaLM (Google) | 540B | 780B | 2022 |
| Chinchilla (DeepMind) | 70B | 1.4T | 2022 |
| GPT-4 (OpenAI) | ~1.8T (MoE, est.) | ~13T (est.) | 2023 |
| Llama 3.1 (Meta) | 405B | 15T+ | 2024 |