How Machine Learning Models Learn From Data and Make Predictions
Machine learning models find patterns in data through optimization algorithms. Learn how neural networks, gradient descent, and training produce systems that make predictions.
Learning Without Being Explicitly Programmed
In 2012, a neural network called AlexNet entered the ImageNet Large Scale Visual Recognition Challenge and cut the top-5 error rate from roughly 26% to 15%, a far larger single-year improvement than the competition had seen before. The network had never been told the rules for recognizing dogs, cats, or cars. It had been shown 1.2 million labeled images and adjusted its own parameters until it got better at the task. AlexNet triggered the current era of deep learning. Its architecture, its training approach, and the computational pattern it validated now underlie systems that translate languages, generate images, interpret medical scans, and power the large language models that are changing how people interact with software.
Machine learning is not a single algorithm but a family of mathematical approaches united by one principle: rather than writing explicit rules, build a system that finds structure in data and generalizes that structure to new examples. The distinction from traditional programming is fundamental. A spam filter written with explicit rules enumerates characteristics of spam. A machine learning spam filter is trained on millions of labeled emails and discovers characteristics the programmer might never have articulated — and adapts when spammers change tactics.
Three Learning Paradigms
Supervised learning trains on labeled examples — input-output pairs where the correct output is provided. The model learns a function mapping inputs to outputs. Applications: image classification, speech recognition, medical diagnosis, fraud detection, price prediction.
Unsupervised learning finds structure in unlabeled data — patterns, groupings, or compressed representations without being told what to look for. Applications: customer segmentation, anomaly detection, dimensionality reduction, generative modeling.
Reinforcement learning trains an agent to take actions in an environment to maximize cumulative reward, without labeled training data. The agent learns from feedback — reward or penalty — based on outcomes of its actions. Applications: game-playing AI (AlphaGo, OpenAI Five), robotics control, resource optimization.
How a Neural Network Learns
A neural network is a function approximator. It consists of layers of interconnected computational units (neurons), each performing a simple weighted sum followed by a nonlinear activation function. Stacking many layers — deep networks — allows the approximation of arbitrarily complex functions.
A single neuron computes: output = f(w₁x₁ + w₂x₂ + ... + wₙxₙ + b), where x are inputs, w are weights, b is a bias, and f is a nonlinear activation function (ReLU, sigmoid, tanh, etc.). The weights and biases are the learned parameters — there are billions of them in a large modern network.
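The neuron formula above can be sketched in a few lines of plain Python. The input values, weights, and bias below are made up for illustration, and the helper names (`neuron`, `relu`, `sigmoid`) are hypothetical rather than taken from any particular library.

```python
import math

def relu(z):
    """ReLU activation: passes positive values through, zeroes out negatives."""
    return max(0.0, z)

def sigmoid(z):
    """Sigmoid activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation):
    """Compute f(w1*x1 + ... + wn*xn + b) for a single neuron."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Illustrative values only; in a trained network these would be learned.
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.25]
b = 0.1

relu_out = neuron(x, w, b, relu)        # weighted sum is negative, so ReLU gives 0.0
sigmoid_out = neuron(x, w, b, sigmoid)  # roughly 0.343
print(relu_out, sigmoid_out)
```

The same weighted sum produces different outputs depending on the activation, which is why the choice of f is part of the network's design.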
Training: Minimizing the Loss
Training begins with random parameter initialization. For each training example, the network makes a prediction, the prediction is compared to the true label using a loss function (mean squared error for regression, cross-entropy for classification), and the loss quantifies how wrong the prediction was. The goal is to adjust parameters to minimize average loss over the training set.
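As a rough illustration of the two loss functions named above, the sketch below computes mean squared error for a regression example and cross-entropy for a three-class classification example. The numbers are invented, and the cross-entropy version assumes the network outputs raw scores (logits) that are converted to probabilities with a softmax, which is a common setup but not the only one.

```python
import math

def mean_squared_error(predictions, targets):
    """Average of squared differences; used for regression."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[true_class])

mse = mean_squared_error([2.5, 0.0, 2.0], [3.0, -0.5, 2.0])

logits = [2.0, 1.0, 0.1]                   # the model favors class 0
loss_if_right = cross_entropy(logits, 0)   # true label is class 0: small loss
loss_if_wrong = cross_entropy(logits, 2)   # true label is class 2: larger loss
print(mse, loss_if_right, loss_if_wrong)
```

In both cases a lower number means a better prediction, which is what lets a single optimization procedure serve very different tasks.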
Parameters are adjusted using gradient descent: compute the gradient of the loss with respect to every parameter (how much the loss changes in response to a small change in that parameter), then shift each parameter a small amount in the direction that reduces the loss. The step size is the learning rate, a hyperparameter that requires careful tuning. Too large and training oscillates or diverges; too small and training takes impractically long.
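The effect of the learning rate can be seen on a deliberately tiny problem. The sketch below fits a one-parameter model y = w·x to four invented data points whose true slope is 2, using plain gradient descent on mean squared error. The three learning rates are chosen purely to show the regimes described above and are not recommendations.

```python
def mse_gradient(w, xs, ys):
    """Derivative of mean((w*x - y)^2) with respect to w."""
    return sum(2.0 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(xs, ys, learning_rate, steps, w=0.0):
    """Repeatedly step w against the gradient and return the final value."""
    for _ in range(steps):
        w -= learning_rate * mse_gradient(w, xs, ys)
    return w

# Toy data generated from y = 2x, so the best w is exactly 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w_good = gradient_descent(xs, ys, learning_rate=0.05, steps=100)    # converges to about 2
w_small = gradient_descent(xs, ys, learning_rate=0.0001, steps=100) # still far from 2
w_large = gradient_descent(xs, ys, learning_rate=0.14, steps=100)   # overshoots and diverges
print(w_good, w_small, w_large)
```

For this particular loss each step multiplies the error (w - 2) by (1 - 15 × learning rate), so a rate above roughly 0.133 makes the error grow and flip sign on every step.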
Backpropagation is the algorithm for efficiently computing gradients in neural networks. Applying the chain rule of calculus, it propagates gradient information backward from the output layer through each layer to the input, computing every parameter's gradient in one backward pass. David Rumelhart, Geoffrey Hinton, and Ronald Williams popularized backpropagation for neural networks in their 1986 paper; Hinton shared the 2024 Nobel Prize in Physics with John Hopfield for foundational discoveries that enable machine learning with artificial neural networks.
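To make the chain-rule idea concrete, here is a minimal sketch of backpropagation through a hypothetical network with one input, one tanh hidden unit, and one linear output, trained with squared error. The parameter values are arbitrary, and the finite-difference check at the end is just one common way to confirm that hand-derived gradients are right.

```python
import math

def forward(params, x):
    """Forward pass: h = tanh(w1*x + b1), y = w2*h + b2."""
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)
    y = w2 * h + b2
    return h, y

def loss_fn(params, x, target):
    """Squared error between the network output and the target."""
    _, y = forward(params, x)
    return (y - target) ** 2

def backward(params, x, target):
    """Backward pass: apply the chain rule from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    h, y = forward(params, x)
    dL_dy = 2.0 * (y - target)         # loss -> output
    dL_dw2 = dL_dy * h                 # output -> w2
    dL_db2 = dL_dy                     # output -> b2
    dL_dh = dL_dy * w2                 # output -> hidden activation
    dL_dz = dL_dh * (1.0 - h * h)      # through tanh: d tanh(z)/dz = 1 - tanh(z)^2
    dL_dw1 = dL_dz * x                 # pre-activation -> w1
    dL_db1 = dL_dz                     # pre-activation -> b1
    return [dL_dw1, dL_db1, dL_dw2, dL_db2]

def numerical_gradient(params, x, target, eps=1e-6):
    """Central finite differences, used only to check the analytic gradients."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((loss_fn(up, x, target) - loss_fn(down, x, target)) / (2 * eps))
    return grads

params = [0.5, -0.3, 1.2, 0.1]  # arbitrary illustrative values: w1, b1, w2, b2
analytic = backward(params, x=0.8, target=1.0)
numeric = numerical_gradient(params, x=0.8, target=1.0)
print(analytic)
print(numeric)
```

The backward pass reuses `dL_dy` and `dL_dz` for several parameters, which is the source of the efficiency: the numerical check needs two forward passes per parameter, while backpropagation needs one forward and one backward pass in total.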
Key Training Concepts
| Concept | Definition | Problem It Addresses | Common Implementation |
|---|---|---|---|
| Batch size | Number of examples per gradient update | Balances noise vs. computation per update | Mini-batches of 32–512 |
| Epoch | One full pass through the training data | Training typically requires many epochs | 10–100+ epochs typical |
| Regularization | Techniques to prevent overfitting | Overfitting: model memorizes training data | Dropout, L2 weight decay, data augmentation |
| Validation set | Held-out data to tune hyperparameters | Prevents tuning to test set | Typically 10–20% of training data |
| Learning rate schedule | Varying the learning rate over the course of training | Large steps early, fine-grained adjustment as training progresses | Cosine annealing, warmup+decay |
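Several rows of the table can be sketched together. The example below shows how mini-batches and epochs determine the number of gradient updates, and one possible warmup-plus-cosine learning rate schedule. The function names, the linear warmup shape, and all numeric settings are assumptions made for illustration; real frameworks offer their own schedulers with different details.

```python
import math
import random

def minibatches(examples, batch_size, rng):
    """Yield shuffled mini-batches; one full sweep over them is one epoch."""
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [examples[i] for i in indices[start:start + batch_size]]

def warmup_cosine_lr(step, total_steps, base_lr, warmup_steps, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay toward min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine

# 10 toy examples, batch size 4: each epoch has batches of size 4, 4, and 2.
examples = list(range(10))
rng = random.Random(0)
epochs, batch_size = 3, 4
updates = 0
for epoch in range(epochs):
    for batch in minibatches(examples, batch_size, rng):
        updates += 1  # a real training loop would compute gradients on `batch` here
print(updates)  # 3 epochs x 3 batches per epoch = 9 updates

# Learning rate at a few points of a hypothetical 100-step run.
schedule = [warmup_cosine_lr(s, 100, base_lr=0.1, warmup_steps=10) for s in (0, 10, 55, 100)]
print(schedule)
```

The schedule rises to the base rate during warmup, sits at half the base rate midway through the decay phase, and reaches the minimum at the final step.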
Overfitting and Generalization
A model that memorizes training data perfectly but fails on new examples has overfit. It has learned noise and quirks of the training set rather than general structure. A model that performs similarly on training and test data generalizes well. The balance between model capacity (size, complexity) and training data size largely determines whether overfitting occurs. Common techniques for reducing overfitting include:
- Dropout: during training, randomly set a fraction of neurons to zero on each forward pass. This prevents neurons from co-adapting and forces the network to learn redundant representations. Proposed by Hinton and colleagues in 2012 and described in full by Srivastava et al. (2014), dropout remains a widely used regularizer.
- Data augmentation: create additional training examples by applying transformations — flipping images, adding noise, rotating, translating — that don't change the underlying class. Effectively multiplies dataset size without collecting new data.
- Early stopping: monitor validation loss during training and stop when it begins rising (indicating overfitting), even if training loss continues to fall.
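Two of the techniques in the list above are simple enough to sketch directly. The dropout function below uses the "inverted" variant, which rescales the surviving activations during training so nothing needs to change at inference time; that rescaling is a common implementation choice, not something specified above. The early stopping function assumes a hypothetical `patience` setting (how many epochs without improvement to tolerate), and the validation losses are invented.

```python
import random

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero each activation with probability drop_prob,
    and scale the survivors by 1 / (1 - drop_prob) to keep the expected sum."""
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch): stop once validation loss has failed
    to improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

rng = random.Random(0)
dropped = dropout([1.0] * 8, drop_prob=0.5, rng=rng)        # each entry is 0.0 or 2.0
unchanged = dropout([1.0] * 8, drop_prob=0.5, rng=rng, training=False)
print(dropped, unchanged)

# Invented validation losses: improving until epoch 3, then rising.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch, best_epoch)  # stops at epoch 5, keeps the model from epoch 3
```

In practice the parameters saved at `best_epoch` are the ones kept, since later epochs fit the training data better but the validation data worse.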
Architecture Innovations That Changed the Field
| Architecture | Year | Key Innovation | Primary Application |
|---|---|---|---|
| Convolutional Neural Network (CNN) | 1989/1998 | Local filters + weight sharing; translation invariance | Image recognition, video |
| Long Short-Term Memory (LSTM) | 1997 | Gated memory cells for long-range sequence dependencies | Speech, language, time series |
| ResNet | 2015 | Skip connections enabling 100+ layer deep networks | Image recognition, vision backbones |
| Transformer | 2017 | Self-attention mechanism; parallelizable over sequence positions | NLP, vision, multimodal models |
| Diffusion model | 2020 | Iterative denoising for generative modeling | Image/video/audio generation |
The Transformer architecture, introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need," replaced recurrent networks for sequence modeling with a self-attention mechanism that computes relationships between all positions in a sequence simultaneously. This parallelizability made training on massive datasets practical and enabled the scaling that produced GPT, BERT, PaLM, Gemini, and Claude. GPT-3 (2020) had 175 billion parameters; some later frontier models are reported to exceed a trillion.
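The core of self-attention is compact enough to sketch in plain Python. The version below implements single-head scaled dot-product attention and, as a simplifying assumption, uses the token vectors directly as queries, keys, and values; a real Transformer first passes them through learned projection matrices and uses several heads. The three two-dimensional "tokens" are invented.

```python
import math

def softmax(scores):
    """Convert raw scores to weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def self_attention(queries, keys, values):
    """Scaled dot-product attention: every position attends to every position."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this position's query to every key, scaled by sqrt(d_k).
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy token vectors (no learned projections in this sketch).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each iteration of the loop over queries is independent of the others, so the whole computation can be expressed as a few matrix multiplications and run in parallel, unlike a recurrent network that must process positions one after another.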
The Scaling Laws and Emergent Behavior
A 2020 paper from OpenAI (Kaplan et al.) found that language model loss follows approximate power laws as model size, training data, and compute scale up, which makes it possible to extrapolate how much improvement a given investment in scale is likely to produce. This finding helped justify the strategy of investing billions in model training.
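The extrapolation idea can be illustrated with a single-variable power law of the form L(N) = (N_c / N)^α, where N is the parameter count. A power law is a straight line on log-log axes, so it can be fitted with ordinary linear regression and then extended to larger N. The constants and data below are synthetic values chosen for illustration; they are not the values reported in the paper.

```python
import math

def power_law_loss(n_params, n_c, alpha):
    """Loss predicted by L(N) = (n_c / N) ** alpha."""
    return (n_c / n_params) ** alpha

def fit_power_law(sizes, losses):
    """Fit n_c and alpha by linear regression of log(loss) on log(size).
    log L = alpha * log(n_c) - alpha * log(N), so slope = -alpha."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    alpha = -slope
    n_c = math.exp(intercept / alpha)
    return n_c, alpha

# Synthetic "measurements" from small models, generated with made-up constants.
TRUE_N_C, TRUE_ALPHA = 1e13, 0.08
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [power_law_loss(n, TRUE_N_C, TRUE_ALPHA) for n in sizes]

n_c, alpha = fit_power_law(sizes, losses)
predicted = power_law_loss(1e11, n_c, alpha)  # extrapolate 100x beyond the largest model
print(alpha, n_c, predicted)
```

With noiseless synthetic data the fit recovers the constants almost exactly; real measurements are noisy, and extrapolating far beyond the fitted range carries correspondingly more uncertainty.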
Larger models also exhibit emergent capabilities: qualitative abilities (chain-of-thought reasoning, in-context learning, code generation) that appear to arise abruptly at certain scales and are largely absent in smaller models. This behavior is not fully understood theoretically, raising both excitement and uncertainty: we can build systems capable of tasks their creators didn't specifically train them for, but we cannot yet reliably predict what capabilities will emerge at what scale or how to ensure those capabilities align with intended uses. The mathematics of learning (gradient descent, backpropagation, stochastic optimization) turns out to be powerful enough to extract structure from data at scales that produce qualitatively new kinds of behavior. That combination of predictable scaling and unpredictable capabilities is what sets the current period in machine learning apart.