Large Language Models Explained: Transformers, Scaling Laws, and RLHF

GPT-3 Used 45 Terabytes of Text to Learn Language

OpenAI's GPT-3, released in 2020 with 175 billion parameters, was trained largely on web text drawn from Common Crawl: about 45 terabytes of compressed plaintext before filtering, reduced to roughly 570 GB after filtering and supplemented with curated sources such as books and Wikipedia. It could write essays, generate code, translate languages, and answer questions without any task-specific fine-tuning. No system trained before 2017 could do this. The architectural innovation that made it possible, the Transformer, was introduced in a single 2017 paper: "Attention Is All You Need" by Vaswani, Shazeer, Parmar, and colleagues at Google. Nearly every major model since (BERT, GPT-4, Claude, Gemini, Llama) is a descendant of that architecture.

The Transformer: Self-Attention as the Core Mechanism

Before Transformers, language models processed text sequentially, one token at a time, using recurrent neural networks (RNNs) such as LSTMs. Long-range dependencies were difficult to capture because information had to propagate through many sequential steps. Transformers removed this bottleneck through self-attention: a mechanism that lets every token in a sequence attend directly to every other token in parallel, computing relevance weights regardless of positional distance. The trade-off is that the cost of attention grows quadratically with sequence length.

In self-attention, each input token is projected into three vectors: a Query (Q), a Key (K), and a Value (V). Attention weights are computed as softmax(QK^T / √d_k), where d_k is the key dimension, and the layer's output is those weights multiplied by V. The weights determine how much each token's value contributes to every other token's representation; dividing by √d_k keeps the dot products from growing so large that the softmax saturates. Multi-head attention runs this process in parallel across multiple "heads", which lets different heads specialize in, for example, syntactic relationships, semantic associations, or positional patterns within a single layer.
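The attention formula can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not a production implementation: the function names `softmax` and `attention` and the toy Q, K, V values are assumptions made up for the example, and real models learn the projections that produce Q, K, and V.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of vectors, one vector per token.
    Returns (outputs, weights).
    """
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Similarity of this token's query to every token's key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    # Each output is a weighted average of all value vectors
    outputs = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return outputs, weights

# Toy input: 3 tokens, d_k = 2 (hypothetical numbers)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = attention(Q, K, V)
```

Each row of `weights` sums to 1, so every output vector is a mixture of the value vectors, weighted by how well that token's query matches each key.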

Architecture Component	Function	Why It Matters
Self-attention (multi-head)	Captures relationships between all tokens simultaneously	Enables long-range dependency modeling
Positional encoding	Adds position information (Transformers have no inherent order)	Distinguishes "dog bites man" from "man bites dog"
Feed-forward sublayers	Non-linear transformation of attention output	Adds representational capacity per layer
Layer normalization	Stabilizes activations during training	Enables training of very deep networks
Residual connections	Skip connections around each sublayer	Prevents vanishing gradients at scale
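To make the positional-encoding row concrete, here is a small sketch of the sinusoidal scheme from the 2017 paper. Many later models use learned or rotary position schemes instead, so treat this as one option among several. The function names, the 4-dimensional size, and the toy word embeddings are assumptions for illustration only.

```python
import math

def sinusoidal_position(pos, d_model):
    """Sinusoidal encoding for one position: sin on even indices, cos on odd."""
    vec = []
    for i in range(d_model):
        angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

def encode(sentence, embeddings, d_model):
    """Add a position vector to each word's (toy) embedding."""
    out = []
    for pos, word in enumerate(sentence.split()):
        pe = sinusoidal_position(pos, d_model)
        out.append([e + p for e, p in zip(embeddings[word], pe)])
    return out

# Hypothetical 4-dimensional embeddings
embeddings = {
    "dog": [0.1, 0.2, 0.3, 0.4],
    "bites": [0.5, 0.6, 0.7, 0.8],
    "man": [0.9, 1.0, 1.1, 1.2],
}
a = encode("dog bites man", embeddings, 4)
b = encode("man bites dog", embeddings, 4)
```

Here "dog" gets a different vector in `a` (position 0) than in `b` (position 2), which is what lets the attention layers tell the two sentences apart.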

Scaling Laws and the Chinchilla Revelation

OpenAI's 2020 scaling laws paper (Kaplan et al.) found that a language model's test loss falls as a smooth power law in model size, dataset size, and training compute: larger models, trained on more data, predictably perform better. It also suggested that, under a fixed compute budget, most of the extra compute should go to a larger model rather than more data. The field interpreted this as a mandate to build ever-larger models. Then came DeepMind's Chinchilla paper (Hoffmann et al., 2022).

Chinchilla, a 70-billion-parameter model trained on 1.4 trillion tokens, outperformed GPT-3 (175B parameters, ~300B tokens) and DeepMind's own 280B-parameter Gopher, which used a comparable compute budget, across a wide range of benchmarks. The finding: for a given compute budget, previous models were significantly undertrained. The "Chinchilla optimal" ratio suggests approximately 20 tokens of training data per model parameter. GPT-3 had only about 1.7 tokens per parameter. The implications reshaped LLM development: Meta's Llama series, Mistral, and subsequent Google models moved toward smaller models trained on far more data, often well beyond 20 tokens per parameter, because a smaller model is also cheaper to run at inference time.
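The tokens-per-parameter arithmetic is easy to check. The sketch below uses the parameter and token counts quoted in this section; the function names and the hard-coded 20:1 ratio are illustrative assumptions, and the ratio is an approximate rule of thumb, not an exact constant.

```python
CHINCHILLA_RATIO = 20  # approximate tokens per parameter, as quoted in the text

def tokens_per_parameter(params, tokens):
    """How many training tokens the model saw per parameter."""
    return tokens / params

def ratio_implied_tokens(params, ratio=CHINCHILLA_RATIO):
    """Training tokens a ~20:1 ratio would imply for a model of this size."""
    return params * ratio

# (parameters, training tokens) as stated in this article
models = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

ratios = {name: tokens_per_parameter(n, d) for name, (n, d) in models.items()}
gpt3_implied = ratio_implied_tokens(175e9)

for name, r in ratios.items():
    print(f"{name}: {r:.1f} tokens per parameter")
print(f"A 175B model at ~20:1 would need about {gpt3_implied / 1e12:.1f}T tokens")
```

By this rule of thumb, a model the size of GPT-3 would have needed roughly ten times more training data than it actually saw.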

Emergent Capabilities: The Threshold Effect

Some LLM capabilities appear abruptly at certain model scales: rather than improving gradually from small models to large ones, they stay near chance level until a threshold is crossed and then rise sharply. Wei et al. (2022), a team led by Google researchers, documented this pattern, identifying abilities including:

Multi-step arithmetic reasoning (chain-of-thought) — emerged around 100B parameters
Multi-language translation without explicit training on translation pairs
Code generation from natural language descriptions
Analogical reasoning in the style of IQ test problems

The interpretation of emergence is contested. Some researchers (Schaeffer et al., 2023) argue that emergence is an artifact of nonlinear or discontinuous evaluation metrics, such as exact-match accuracy, which gives no credit for a nearly correct answer: the underlying capabilities improve continuously, and only the metric creates the appearance of a sudden jump. The debate has not been resolved, but it has significant implications for AI safety: capabilities that emerge unpredictably are harder to anticipate and constrain.
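The metric-artifact argument can be illustrated with a toy calculation. Suppose per-token accuracy improves smoothly with scale, but the benchmark only counts an answer as correct if all of its tokens are right. The per-token accuracies, the 10-token answer length, and the assumption that token errors are independent are all made up for illustration; this is a sketch of the general idea, not a reproduction of the paper's analysis.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Chance that every token in the answer is right, assuming independent errors."""
    return per_token_accuracy ** answer_length

# Hypothetical per-token accuracies for a series of increasingly large models
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
ANSWER_LENGTH = 10  # tokens in the target answer (assumed)

exact = [exact_match_rate(p, ANSWER_LENGTH) for p in per_token]

for p, e in zip(per_token, exact):
    print(f"per-token accuracy {p:.2f} -> exact match {e:.3f}")
```

The per-token numbers rise steadily, yet exact match stays close to zero for the first few models and then climbs steeply, which is the kind of curve that reads as "emergence" on a benchmark chart.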

RLHF: Teaching Models to Be Helpful

Pretraining on internet text produces a model that predicts likely next tokens, including harmful, biased, or misleading text if that is statistically prevalent in the corpus. Reinforcement Learning from Human Feedback (RLHF), developed in earlier work by OpenAI and DeepMind and applied to language models at scale in InstructGPT (Ouyang et al., 2022), adds three post-training steps: (1) supervised fine-tuning on human-written demonstrations; (2) training a reward model on human preference rankings of model outputs; and (3) optimizing the language model against the reward model using proximal policy optimization (PPO). In human evaluations, RLHF substantially improved how helpful, harmless, and honest model outputs were judged to be, turning raw pretrained models into usable assistants. Constitutional AI (Anthropic), direct preference optimization (DPO), and other variants and alternatives have since extended this paradigm.
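The reward-model step is commonly trained with a pairwise preference loss: given a response that raters preferred and one they rejected, the loss is small when the reward model scores the preferred response higher. The sketch below shows that loss for a single pair. The function name and the example reward values are assumptions for illustration, and a real reward model would be a neural network producing these scores from text.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(reward_chosen - reward_rejected)), computed stably."""
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(d)) equals log(1 + exp(-d)); branch to avoid overflow
    if diff > 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# Hypothetical reward scores for (preferred, rejected) responses
loss_correct_order = pairwise_preference_loss(2.0, 0.0)  # model agrees with raters
loss_tie = pairwise_preference_loss(0.0, 0.0)            # model is indifferent
loss_wrong_order = pairwise_preference_loss(0.0, 2.0)    # model disagrees with raters
```

Minimizing this loss over many human-ranked pairs pushes the reward model to assign higher scores to the kinds of responses raters prefer; the PPO step then optimizes the language model against those scores.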

Why LLMs Hallucinate

Hallucination — generating confident but factually incorrect statements — is not a bug to be patched. It is a structural consequence of how LLMs are trained. The model learns to produce statistically plausible continuations of text, not to verify factual accuracy against a ground truth. Several mechanisms drive hallucination:

Knowledge cutoff: Training data has a cutoff date; the model cannot know events after that date but will still generate plausible-sounding text about them.
Rare knowledge: Accurate information about obscure topics appears infrequently in training data; the model has weaker statistical support for accurate recall.
Sycophantic reinforcement: RLHF reward models may inadvertently reward confident, fluent responses over accurate but hedged ones if human raters prefer confident-sounding answers.
No retrieval mechanism: Base LLMs have no ability to "look up" facts at inference time; everything must be in weights. Retrieval-augmented generation (RAG) partially addresses this.
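The retrieval-augmented generation idea mentioned in the last item can be sketched without any model at all: find the most relevant document, then put it in the prompt so the answer can be grounded in text rather than in weights alone. Everything here is a simplified assumption: the word-overlap scoring, the function names, the prompt wording, and the three-document corpus. Real systems typically use embedding-based search over much larger corpora.

```python
def tokenize(text):
    """Lowercase words with surrounding punctuation stripped."""
    return {w.strip(".,?!") for w in text.lower().split()}

def retrieve(question, documents, k=1):
    """Return the k documents sharing the most words with the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, documents, k=1):
    """Assemble a prompt that places retrieved text ahead of the question."""
    context = "\n".join(retrieve(question, documents, k))
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the context above."
    )

# Toy corpus built from figures quoted in this article
documents = [
    "Chinchilla is a 70B parameter model trained on 1.4T tokens.",
    "GPT-3 has 175B parameters and saw about 300B tokens.",
    "PaLM has 540B parameters.",
]
question = "How many tokens did Chinchilla see in training?"
prompt = build_prompt(question, documents)
print(prompt)
```

The resulting prompt would then be sent to the language model. Retrieval reduces reliance on memorized facts, but the model can still misread or ignore the supplied context, which is why the text above says RAG only partially addresses hallucination.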

LLM	Parameters	Training Tokens	Release Year
GPT-3 (OpenAI)	175B	~300B	2020
PaLM (Google)	540B	780B	2022
Chinchilla (DeepMind)	70B	1.4T	2022
GPT-4 (OpenAI)	~1.8T (MoE, est.)	~13T (est.)	2023
Llama 3.1 (Meta)	405B	15T+	2024
