How Deep Learning Works: Layers, Weights, and Gradient Descent

The Technique That Taught Computers to Recognize Cats on YouTube

In 2012, Google Brain researchers Andrew Ng and Jeff Dean built a neural network spread across 16,000 CPU cores and trained it on 10 million unlabeled images sampled from YouTube videos. The network spontaneously developed a neuron that responded strongly to cat faces, one of the most common subjects in online video, without ever being told what a cat was. The experiment demonstrated that deep neural networks could learn high-level concepts from raw data at scale without supervision. That same year, a deep convolutional network (AlexNet) won the ImageNet visual recognition challenge by a wide margin over non-neural methods; results exceeding reported human-level accuracy on that benchmark followed around 2015. Within a decade, deep learning had reshaped most areas of AI, from speech recognition to protein structure prediction.

The Architecture of a Deep Neural Network

A deep neural network (DNN) consists of an input layer, multiple hidden layers, and an output layer. Each layer contains units (neurons) that receive weighted inputs from the previous layer, apply a nonlinear activation function, and pass results to the next layer.

Input layer: Receives raw data — pixel values for images, token embeddings for text, sensor readings for time series. No computation occurs here; values are passed directly forward.
Hidden layers: The "deep" in deep learning. Successive hidden layers learn increasingly abstract representations of the input. Early layers detect simple features (edges, frequencies); later layers combine these into complex concepts (faces, words, intentions). Modern architectures range from about a dozen hidden layers to more than a thousand in the deepest residual networks.
Output layer: Produces the final prediction. For classification, a softmax activation produces a probability distribution over classes. For regression, a linear activation outputs a continuous value.

Each connection between neurons has an associated weight (a floating-point number). The network also has bias terms at each layer. Together, these parameters — potentially billions in large models — determine the network's behavior. Training means finding the parameter values that minimize prediction error on the training data.
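To make the layer structure concrete, here is a minimal sketch of a forward pass in plain Python. The layer sizes (4 inputs, two hidden layers of 8 units, 3 output classes), the uniform weight initialization, and all function names are illustrative assumptions, not a recommended design.

```python
import math
import random

def relu(values):
    # Nonlinear activation applied element-wise in the hidden layers
    return [max(0.0, v) for v in values]

def softmax(values):
    # Subtract the maximum first for numerical stability
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    # weights[j][i] is the weight on the connection from input i to neuron j
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    # Assumed initialization: small uniform random weights, zero biases
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(x, layers):
    activation = x  # input layer: values are passed through unchanged
    for weights, biases in layers[:-1]:
        activation = relu(dense(activation, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(activation, weights, biases))  # class probabilities

def count_parameters(layers):
    # One weight per connection plus one bias per neuron
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)

rng = random.Random(0)
layers = [init_layer(4, 8, rng), init_layer(8, 8, rng), init_layer(8, 3, rng)]
probs = forward([0.2, -1.0, 0.5, 0.7], layers)
print(probs)                     # three probabilities that sum to 1
print(count_parameters(layers))  # 139 = (4*8+8) + (8*8+8) + (8*3+3)
```

Even this toy network has 139 parameters; the same counting rule applied to much wider and deeper layers is how large models reach billions.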

Activation Functions: Introducing Nonlinearity

Without nonlinear activation functions, a network with multiple layers is mathematically equivalent to a single linear layer and cannot learn complex patterns. Common activation functions include:

Function	Formula	Range	Common Use
ReLU	max(0, x)	[0, ∞)	Hidden layers (most CNNs, fully connected)
Leaky ReLU	max(0.01x, x)	(−∞, ∞)	Addresses "dying ReLU" problem
GELU	x × Φ(x)	(−∞, ∞)	Transformer hidden layers (BERT, GPT)
Sigmoid	1/(1+e⁻ˣ)	(0, 1)	Binary classification outputs
Softmax	eˣⁱ/Σeˣʲ	(0, 1), sums to 1	Multi-class classification outputs
Tanh	(eˣ−e⁻ˣ)/(eˣ+e⁻ˣ)	(−1, 1)	RNNs, older architectures
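The formulas in the table can be written out directly. The sketch below uses only the Python standard library; it takes Φ in the GELU row to be the standard normal cumulative distribution function (computed here with math.erf), and the function names are chosen for illustration.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # A small negative slope keeps a nonzero gradient for x < 0
    return max(slope * x, x)

def gelu(x):
    # x * Phi(x), with Phi the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sigmoid(x):
    # Branch on the sign to avoid overflow in exp for large |x|
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def tanh(x):
    return math.tanh(x)

def softmax(values):
    # Operates on a whole vector; shifting by the max avoids overflow
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), leaky_relu(x), gelu(x), sigmoid(x), tanh(x))
print(softmax([1.0, 2.0, 3.0]))
```

Note that softmax differs from the others in taking a vector rather than a single value, since each output depends on every input.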

Loss Functions: Measuring Prediction Error

The loss function quantifies how wrong the network's predictions are. Training seeks to minimize the loss. Common loss functions:

Cross-entropy loss (log loss): For classification, L = −Σᵢ yᵢ log(ŷᵢ), where yᵢ is 1 for the true class and 0 otherwise (a one-hot label) and ŷᵢ is the predicted probability of class i. Penalizes confident wrong predictions heavily.
Mean Squared Error (MSE): For regression: L = (1/n)Σ(y − ŷ)². Penalizes large errors quadratically.
Binary cross-entropy: For binary classification: L = −[y log(ŷ) + (1−y) log(1−ŷ)].
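The three losses above can be sketched in a few lines of Python. The clipping constant EPS is an assumption added here to avoid taking log(0); it is not part of the formulas themselves.

```python
import math

EPS = 1e-12  # assumed clipping constant to keep log() finite

def cross_entropy(y_true, y_pred):
    # y_true is a one-hot list; y_pred is a list of class probabilities
    return -sum(y * math.log(max(p, EPS)) for y, p in zip(y_true, y_pred))

def mse(y_true, y_pred):
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n

def binary_cross_entropy(y, p):
    # y is 0 or 1; p is the predicted probability of the positive class
    p = min(max(p, EPS), 1.0 - EPS)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

one_hot = [0, 1, 0]
confident_right = cross_entropy(one_hot, [0.05, 0.90, 0.05])  # about 0.105
confident_wrong = cross_entropy(one_hot, [0.90, 0.05, 0.05])  # about 3.0
print(confident_right, confident_wrong)
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # (0 + 0 + 4) / 3
print(binary_cross_entropy(1, 0.5))           # log(2), about 0.693
```

The gap between the two cross-entropy values shows the heavy penalty for a confident wrong prediction: the loss grows without bound as the probability assigned to the true class approaches zero.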

Backpropagation: How Networks Learn

Backpropagation is the algorithm that computes the gradient of the loss function with respect to every weight in the network. It uses the chain rule of calculus to propagate error signals backward from the output layer through each hidden layer to the input, efficiently computing partial derivatives for all parameters in a single backward pass.
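As a worked illustration of the chain rule, consider a network with a single hidden layer. The symbols below (W₁, b₁, W₂, b₂ for the parameters, σ for the hidden activation, f for the output activation, δ for the backpropagated error) are notation assumed for this sketch rather than taken from a particular source.

```latex
\begin{aligned}
z_1 &= W_1 x + b_1, & a_1 &= \sigma(z_1), \\
z_2 &= W_2 a_1 + b_2, & \hat{y} &= f(z_2), \qquad L = L(\hat{y}, y) \\[4pt]
\delta_2 &= \frac{\partial L}{\partial z_2}, &
\frac{\partial L}{\partial W_2} &= \delta_2\, a_1^{\top}, \qquad
\frac{\partial L}{\partial b_2} = \delta_2 \\
\delta_1 &= \left(W_2^{\top} \delta_2\right) \odot \sigma'(z_1), &
\frac{\partial L}{\partial W_1} &= \delta_1\, x^{\top}, \qquad
\frac{\partial L}{\partial b_1} = \delta_1
\end{aligned}
```

Each error term δ is computed from the one in the layer after it, which is why a single backward sweep yields every gradient. For a softmax output with cross-entropy loss, or a sigmoid output with binary cross-entropy, the output error simplifies to δ₂ = ŷ − y.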

The process per training step:

Forward pass: Input data flows through the network; output is computed.
Loss computation: Compare output to true label; compute loss.
Backward pass: Compute gradient of loss with respect to each weight using the chain rule, propagating backward from output to input.
Weight update: Adjust each weight by a small step in the direction that reduces the loss: w ← w − η × ∂L/∂w, where η is the learning rate.
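The four steps can be sketched for a tiny network with two inputs, two tanh hidden units, and one sigmoid output trained with binary cross-entropy. The starting weights, the single training example, and the learning rate of 0.5 are arbitrary values chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(params, x, y, lr):
    W1, b1, w2, b2 = params

    # 1. Forward pass: hidden activations, then the output probability
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y_hat = sigmoid(sum(w * hj for w, hj in zip(w2, h)) + b2)

    # 2. Loss computation: binary cross-entropy against the true label
    loss = -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

    # 3. Backward pass: chain rule from the output back toward the input
    delta_out = y_hat - y                        # dL/dz at the sigmoid output
    grad_w2 = [delta_out * hj for hj in h]
    grad_b2 = delta_out
    delta_hidden = [w * delta_out * (1 - hj ** 2)  # tanh'(z) = 1 - tanh(z)^2
                    for w, hj in zip(w2, h)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hidden]
    grad_b1 = delta_hidden

    # 4. Weight update: w <- w - lr * dL/dw for every parameter
    new_W1 = [[w - lr * g for w, g in zip(row, grow)]
              for row, grow in zip(W1, grad_W1)]
    new_b1 = [b - lr * g for b, g in zip(b1, grad_b1)]
    new_w2 = [w - lr * g for w, g in zip(w2, grad_w2)]
    new_b2 = b2 - lr * grad_b2
    return (new_W1, new_b1, new_w2, new_b2), loss

params = ([[0.1, -0.2], [0.3, 0.4]], [0.0, 0.0], [0.5, -0.5], 0.0)
x, y = [1.0, 2.0], 1.0
losses = []
for _ in range(20):
    params, loss = train_step(params, x, y, lr=0.5)
    losses.append(loss)
print(losses[0], losses[-1])  # the loss falls as the steps are repeated
```

In practice, frameworks compute the backward pass automatically, but the arithmetic they perform follows this same pattern.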

Gradient Descent and Optimization

Gradient descent iteratively adjusts weights to minimize the loss function. Several variants exist with different computational efficiency and convergence properties:

Method	Batch Size	Update Frequency	Characteristics
Batch Gradient Descent	Full dataset	Once per epoch	Stable but slow for large datasets
Stochastic Gradient Descent (SGD)	1 sample	Every sample	Noisy updates; can escape local minima
Mini-batch SGD	32–512 samples	Every mini-batch	Standard approach; GPU-efficient
Adam	Mini-batch	Every mini-batch	Adaptive learning rates; widely used for LLMs
AdamW	Mini-batch	Every mini-batch	Adam + weight decay regularization; standard for transformers
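To show how the update rules differ, the sketch below minimizes a toy one-parameter loss L(w) = (w − 3)² with plain gradient descent and with Adam. The toy loss, the starting point, and the hyperparameter values are assumptions for illustration; the optional weight_decay argument applies the decoupled decay that distinguishes AdamW.

```python
import math

def grad(w):
    # Gradient of the toy loss L(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

def sgd(w, steps, lr=0.1):
    for _ in range(steps):
        w -= lr * grad(w)  # fixed step size for every update
    return w

def adam(w, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.0):
    m = 0.0  # running average of gradients (first moment)
    v = 0.0  # running average of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w -= lr * weight_decay * w    # decoupled decay (AdamW); no-op when 0
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)  # per-parameter step size
    return w

print(sgd(0.0, steps=100))   # approaches the minimum at w = 3
print(adam(0.0, steps=500))  # also approaches 3, along a different path
```

With many parameters, Adam keeps a separate m and v for each one, which is what gives every weight its own effective learning rate.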

Overfitting and Regularization

A model with too many parameters relative to training data can memorize training examples rather than learning generalizable patterns — this is overfitting. Techniques to combat it:

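Dropout: During training, randomly deactivate a fraction (typically 10–50%) of neurons at each forward pass, which forces the network to learn redundant representations. At test time all neurons are active, so activations must be rescaled to match their expected value during training: the original formulation multiplies outputs by the keep probability at test time (halving them for a 50% rate), while the now-common "inverted" variant instead divides the surviving activations by the keep probability during training and leaves test-time outputs unchanged.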
Weight decay (L2 regularization): Adds λ||w||² to the loss, penalizing large weights and encouraging simpler functions.
Data augmentation: Artificially expand the training set by applying random transformations (image flips, crops, brightness changes) that preserve labels.
Early stopping: Monitor validation loss during training; stop when it begins to increase, indicating overfitting.
Batch normalization: Normalize layer inputs to zero mean and unit variance within each mini-batch; accelerates training and acts as implicit regularization.
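As one concrete example from this list, here is a minimal sketch of dropout in its "inverted" form, where surviving activations are scaled up during training so that nothing needs to change at test time. The function name and arguments are hypothetical and not tied to any particular library.

```python
import random

def dropout(activations, drop_prob, training, rng=random):
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(activations)  # test time: every neuron stays active
    keep_prob = 1.0 - drop_prob
    # Zero each activation with probability drop_prob; scale survivors by
    # 1 / keep_prob so the expected value of each output is unchanged
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)
acts = [1.0] * 10000
train_out = dropout(acts, 0.5, training=True, rng=rng)
test_out = dropout(acts, 0.5, training=False)

mean_train = sum(train_out) / len(train_out)
print(mean_train)        # close to 1.0: the mean is preserved on average
print(test_out == acts)  # True: test-time outputs are untouched
```

With a 50% drop rate, each training-time output is either 0 or twice its original value, so the layer's average output matches what the next layer sees at test time.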

Why Deep Networks Need GPUs

Training a modern deep learning model involves billions of floating-point multiplications per forward pass, repeated millions of times over training. A GPU with thousands of cores (an NVIDIA H100 has 16,896 CUDA cores) can execute these matrix multiplications in parallel, with speedups that can reach roughly 100–1,000x over a CPU depending on the model and hardware. The H100 is rated at approximately 989 TFLOPS (teraflops) of dense BF16 tensor operations, the core computation in modern neural networks; the often-quoted 1,979 TFLOPS figure assumes structured sparsity. Training GPT-4 reportedly required approximately 25,000 A100 GPUs running for about 90 days, consuming roughly 50 GWh of electricity, although OpenAI has not confirmed these figures. The availability of massive GPU clusters is now a primary determinant of which organizations can develop frontier AI systems.
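A little arithmetic puts these numbers in perspective. The sketch below takes the reported GPT-4 figures at face value and adds one assumed example, a fully connected layer with 4,096 inputs and 4,096 outputs, to show where the multiplications come from.

```python
# Reported (unconfirmed) training figures quoted above
gpus = 25_000
days = 90
energy_gwh = 50

gpu_hours = gpus * days * 24         # 54,000,000 GPU-hours
energy_kwh = energy_gwh * 1_000_000  # 1 GWh = 1,000,000 kWh
kw_per_gpu = energy_kwh / gpu_hours  # implied average draw per GPU

print(gpu_hours)             # 54000000
print(round(kw_per_gpu, 3))  # 0.926

# Assumed example layer: 4,096 inputs fully connected to 4,096 outputs.
# Each output needs one multiply and one add per input, per example.
n_in, n_out = 4096, 4096
flops_per_example = 2 * n_in * n_out
print(flops_per_example)     # 33554432
```

The implied figure of about 0.93 kW per GPU is an all-in average derived from the reported totals, not a measured value. The layer calculation shows that a single wide layer already costs tens of millions of operations per example, before multiplying by the number of layers, examples, and training steps.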
