Large Language Models Explained: Transformers, Scaling Laws, and RLHF

How large language models work — transformer self-attention, Chinchilla scaling laws, emergent capability thresholds, RLHF alignment, and the root causes of hallucination.

The InfoNexus Editorial Team | May 23, 2026 | 9 min read

GPT-3 Used 45 Terabytes of Text to Learn Language

OpenAI's GPT-3, released in 2020 with 175 billion parameters, drew on approximately 45 terabytes of compressed Common Crawl text, which was filtered down to roughly 570 GB and combined with smaller curated corpora before training. It could write essays, generate code, translate languages, and answer questions without any task-specific fine-tuning. No system trained before 2017 could do this. The architectural innovation that made it possible, the Transformer, was introduced in a single 2017 paper: "Attention Is All You Need" by Vaswani, Shazeer, Parmar, and colleagues, most of them at Google. Everything since (BERT, GPT-4, Claude, Gemini, Llama) is a descendant of that architecture.

The Transformer: Self-Attention as the Core Mechanism

Before Transformers, language models processed text sequentially — one word at a time — using recurrent neural networks (RNNs) or LSTMs. Long-range dependencies were difficult to capture because information had to propagate through many sequential steps. Transformers eliminated this bottleneck entirely through self-attention: a mechanism that allows every token in a sequence to directly attend to every other token simultaneously, computing relevance weights regardless of positional distance.

In self-attention, each input token is projected by learned weight matrices into three vectors: a Query (Q), Key (K), and Value (V). Attention weights are computed as softmax(QK^T / √d_k), where d_k is the key dimension, and each token's output is the weighted sum of the value vectors: softmax(QK^T / √d_k)V. The weights determine how much each token's value contributes to another token's representation. Multi-head attention runs this process in parallel across multiple "heads", so that different heads can specialize within a single layer, for example in syntactic relationships, semantic associations, or positional patterns.
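
The formula above can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not a production implementation: the Q, K, and V values are hand-picked toy numbers (in a real model they come from learned projections of token embeddings), and the function name `scaled_dot_product_attention` is chosen here for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(QK^T / sqrt(d_k)) V.

    Q and K are lists of d_k-dimensional vectors and V is a list of
    value vectors, with one row per token in each.
    """
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Similarity of this token's query to every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy inputs for a 3-token sequence (assumed values, not learned).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = scaled_dot_product_attention(Q, K, V)
for i, row in enumerate(weights):
    print(f"token {i} attends with weights {[round(w, 3) for w in row]}")
```

Each row of `weights` sums to 1, and every token receives a weight for every other token regardless of how far apart they are in the sequence, which is the property the paragraph above describes.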

| Architecture Component | Function | Why It Matters |
|---|---|---|
| Self-attention (multi-head) | Captures relationships between all tokens simultaneously | Enables long-range dependency modeling |
| Positional encoding | Adds position information (Transformers have no inherent order) | Distinguishes "dog bites man" from "man bites dog" |
| Feed-forward sublayers | Non-linear transformation of attention output | Adds representational capacity per layer |
| Layer normalization | Stabilizes activations during training | Enables training of very deep networks |
| Residual connections | Skip connections around each sublayer | Prevents vanishing gradients at scale |
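
To make the positional-encoding row concrete, here is a small sketch using the sinusoidal scheme from the original Transformer paper. It is only one option (many later models use learned or rotary position embeddings instead), and the three-word embedding table `emb` is a hypothetical toy, not taken from any real model.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos][2i] = sin(pos / 10000^(2i/d_model)), PE[pos][2i+1] = cos(same angle)."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

# Hypothetical 4-dimensional token embeddings, for illustration only.
emb = {
    "dog":   [1.0, 0.0, 0.0, 0.0],
    "bites": [0.0, 1.0, 0.0, 0.0],
    "man":   [0.0, 0.0, 1.0, 0.0],
}

def encode(tokens, d_model=4):
    """Add the positional encoding to each token's embedding."""
    pe = sinusoidal_positional_encoding(len(tokens), d_model)
    return [[e + p for e, p in zip(emb[t], pe[pos])] for pos, t in enumerate(tokens)]

order_a = ["dog", "bites", "man"]
order_b = ["man", "bites", "dog"]

# Without positions, both sentences are the same set of vectors.
same_without_positions = sorted(emb[t] for t in order_a) == sorted(emb[t] for t in order_b)
# With positions added, the two sets of vectors differ.
same_with_positions = sorted(encode(order_a)) == sorted(encode(order_b))

print(same_without_positions, same_with_positions)
```

Because attention itself treats its inputs as an unordered set, it is the added position signal that lets the model tell the two word orders apart.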

Scaling Laws and the Chinchilla Revelation

OpenAI's 2020 scaling laws paper (Kaplan et al.) found that test loss falls as a power law in model size, dataset size, and training compute: larger models, trained on more data with more compute, predictably perform better. The paper also suggested that, under a fixed compute budget, most of any increase should go toward model size rather than data, and the field largely interpreted this as a mandate to build ever-larger models. Then came DeepMind's Chinchilla paper (Hoffmann et al., 2022).

Chinchilla, a 70-billion-parameter model trained on 1.4 trillion tokens, outperformed GPT-3 (175B parameters, ~300B tokens) on a large range of downstream benchmarks. The finding: for a given compute budget, previous models were too large and significantly undertrained. The "Chinchilla optimal" ratio suggests approximately 20 tokens of training data per model parameter; GPT-3 had only about 1.7 (300B tokens / 175B parameters). The implications reshaped LLM development: Meta's Llama series, Mistral, and subsequent Google models all reflect the shift toward smaller models trained on far more data, which can outperform larger undertrained ones.
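
The tokens-per-parameter arithmetic is easy to check directly. The sketch below uses only the parameter and token counts quoted in this article; the helper names and the default ratio of 20 are illustrative, and the "optimal" figure is a rule of thumb rather than an exact law.

```python
def tokens_per_parameter(params, tokens):
    """Ratio of training tokens to model parameters."""
    return tokens / params

def chinchilla_optimal_tokens(params, ratio=20):
    """Training tokens suggested by the ~20 tokens-per-parameter rule of thumb."""
    return params * ratio

# Figures as quoted in the article.
gpt3_params, gpt3_tokens = 175_000_000_000, 300_000_000_000
chinchilla_params, chinchilla_tokens = 70_000_000_000, 1_400_000_000_000

gpt3_ratio = tokens_per_parameter(gpt3_params, gpt3_tokens)
chinchilla_ratio = tokens_per_parameter(chinchilla_params, chinchilla_tokens)
gpt3_suggested_tokens = chinchilla_optimal_tokens(gpt3_params)

print(f"GPT-3:      {gpt3_ratio:.1f} tokens per parameter")
print(f"Chinchilla: {chinchilla_ratio:.1f} tokens per parameter")
print(f"A 175B model at 20 tokens/param would want {gpt3_suggested_tokens / 1e12:.1f}T tokens")
```

By this rule of thumb, a model of GPT-3's size would call for roughly 3.5 trillion training tokens, more than ten times what it actually saw.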

Emergent Capabilities: The Threshold Effect

Some LLM capabilities appear abruptly at certain model scales: rather than being present in smaller models and improving gradually, they are essentially absent until a scale threshold is crossed and then emerge suddenly. Researchers at Google Brain documented this in a 2022 paper by Wei et al., identifying abilities including:

  • Multi-step arithmetic reasoning (chain-of-thought) — emerged around 100B parameters
  • Multi-language translation without explicit training on translation pairs
  • Code generation from natural language descriptions
  • Analogical reasoning in the style of IQ test problems

The interpretation of emergence is contested. Some researchers (Schaeffer et al., 2023) argue that emergence is an artifact of nonlinear or discontinuous evaluation metrics, such as exact-match accuracy: the underlying capabilities improve continuously, and only the all-or-nothing measurement creates the appearance of sudden emergence. The debate has not been resolved, but it has significant implications for AI safety: capabilities that emerge unpredictably are harder to anticipate and constrain.
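
The metric argument can be illustrated with a toy calculation. The sketch below assumes a task whose answer is 10 tokens long, scored by exact match, with independent per-token errors. The per-token accuracies are invented for illustration and are not measurements from any real model.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Probability that every token of an answer is right, assuming independent errors."""
    return per_token_accuracy ** answer_length

# Hypothetical per-token accuracies for a series of increasingly large models.
per_token_accuracies = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
answer_length = 10  # e.g. a 10-token arithmetic answer

rates = [exact_match_rate(p, answer_length) for p in per_token_accuracies]
for p, r in zip(per_token_accuracies, rates):
    print(f"per-token accuracy {p:.2f} -> exact-match rate {r:.3f}")
```

The per-token accuracy rises steadily, yet the exact-match score stays near zero for the first several models and then climbs steeply. Under these assumptions, a smooth underlying improvement looks like a sudden jump once it is measured with an all-or-nothing metric.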

RLHF: Teaching Models to Be Helpful

Pretraining on internet text produces a model that predicts likely next tokens, including harmful, biased, or misleading text if that is statistically prevalent in the corpus. Reinforcement Learning from Human Feedback (RLHF), which grew out of earlier work by OpenAI and DeepMind researchers (Christiano et al., 2017) and was applied to language models in OpenAI's InstructGPT (Ouyang et al., 2022), adds three post-training steps: supervised fine-tuning on human-written demonstrations; training a reward model on human preference rankings of model outputs; and optimizing the language model against the reward model using proximal policy optimization (PPO). RLHF dramatically improved the helpfulness, harmlessness, and honesty of model outputs, transforming raw pretrained models into usable assistants. Constitutional AI (Anthropic), direct preference optimization (DPO), and other variants and alternatives have since extended or simplified this pipeline.
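
The reward-modeling step is commonly trained with a pairwise loss of the form -log(sigmoid(r_chosen - r_rejected)), which pushes the reward of the human-preferred response above that of the rejected one. The sketch below shows that loss on hand-picked scalar rewards. In practice the rewards come from a neural network scoring a prompt and response, and the function name here is illustrative.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written in a numerically stable form."""
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# Toy reward scores (assumed values) for a preferred and a rejected response.
agrees_with_rater = pairwise_preference_loss(reward_chosen=2.0, reward_rejected=0.0)
undecided = pairwise_preference_loss(reward_chosen=0.0, reward_rejected=0.0)
disagrees_with_rater = pairwise_preference_loss(reward_chosen=0.0, reward_rejected=2.0)

print(f"ranks the pair correctly:   loss = {agrees_with_rater:.3f}")
print(f"scores both the same:       loss = {undecided:.3f}")
print(f"ranks the pair incorrectly: loss = {disagrees_with_rater:.3f}")
```

The loss is small when the reward model agrees with the human ranking and grows as it disagrees, so minimizing it over many comparison pairs teaches the reward model to reproduce rater preferences.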

Why LLMs Hallucinate

Hallucination — generating confident but factually incorrect statements — is not a bug to be patched. It is a structural consequence of how LLMs are trained. The model learns to produce statistically plausible continuations of text, not to verify factual accuracy against a ground truth. Several mechanisms drive hallucination:

  • Knowledge cutoff: Training data has a cutoff date; the model cannot know events after that date but will still generate plausible-sounding text about them.
  • Rare knowledge: Accurate information about obscure topics appears infrequently in training data; the model has weaker statistical support for accurate recall.
  • Sycophantic reinforcement: RLHF reward models may inadvertently reward confident, fluent responses over accurate but hedged ones if human raters prefer confident-sounding answers.
  • No retrieval mechanism: Base LLMs have no ability to "look up" facts at inference time; everything must be in weights. Retrieval-augmented generation (RAG) partially addresses this.
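
The retrieval idea in the last bullet can be sketched as a toy pipeline: find the most relevant passage, then place it in the prompt so the model can answer from the text rather than from its weights. Everything here is a simplified assumption: the word-overlap scoring stands in for a real retriever (typically embedding-based search), the two documents restate facts from this article, and no actual model call is made.

```python
def retrieve(query, documents, top_k=1):
    """Rank documents by word overlap with the query (a toy stand-in for a real retriever)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda d: len(query_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(query, documents):
    """Prepend the retrieved passage(s) to the question."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context above."

documents = [
    "Chinchilla has 70 billion parameters and was trained on 1.4 trillion tokens.",
    "GPT-3 has 175 billion parameters and was released in 2020.",
]

query = "How many tokens was Chinchilla trained on?"
prompt = build_prompt(query, documents)
print(prompt)
```

The prompt that would be sent to the model now contains the relevant fact, which is why retrieval reduces, but does not eliminate, the model's reliance on imperfect recall.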

| LLM | Parameters | Training Tokens | Release Year |
|---|---|---|---|
| GPT-3 (OpenAI) | 175B | ~300B | 2020 |
| PaLM (Google) | 540B | 780B | 2022 |
| Chinchilla (DeepMind) | 70B | 1.4T | 2022 |
| GPT-4 (OpenAI) | ~1.8T (MoE, est.) | ~13T (est.) | 2023 |
| Llama 3.1 (Meta) | 405B | 15T+ | 2024 |