How Deep Learning Works: Layers, Weights, and Gradient Descent

Deep learning trains multi-layered neural networks to recognize patterns in data using backpropagation and gradient descent. Discover the mechanics that power modern AI systems.

The InfoNexus Editorial Team | May 16, 2026 | 9 min read

The Technique That Taught Computers to Recognize Cats on YouTube

In 2012, a Google Brain team that included Andrew Ng and Jeff Dean built a neural network spread across 16,000 CPU cores and trained it on 10 million unlabeled still frames taken from YouTube videos. The network spontaneously developed a neuron that responded strongly to cat faces, one of the most common subjects in online video, without ever being told what a cat was. The experiment demonstrated that deep neural networks could learn high-level concepts from raw, unlabeled data at scale. That same year, the deep convolutional network AlexNet won the ImageNet visual recognition challenge by a wide margin; systems exceeding reported human accuracy on that benchmark followed around 2015. Within a decade, deep learning had reshaped most of AI, from speech recognition to protein structure prediction.

The Architecture of a Deep Neural Network

A deep neural network (DNN) consists of an input layer, multiple hidden layers, and an output layer. Each layer contains units (neurons) that receive weighted inputs from the previous layer, apply a nonlinear activation function, and pass results to the next layer.

  • Input layer: Receives raw data — pixel values for images, token embeddings for text, sensor readings for time series. No computation occurs here; values are passed directly forward.
  • Hidden layers: The "deep" in deep learning. Each hidden layer learns increasingly abstract representations of the input. Early layers detect simple features (edges, frequencies); later layers combine these into complex concepts (faces, words, intentions). Depth varies widely: modern architectures range from about a dozen hidden layers to more than a thousand.
  • Output layer: Produces the final prediction. For classification, a softmax activation produces a probability distribution over classes. For regression, a linear activation outputs a continuous value.

Each connection between neurons has an associated weight (a floating-point number). The network also has bias terms at each layer. Together, these parameters — potentially billions in large models — determine the network's behavior. Training means finding the parameter values that minimize prediction error on the training data.
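
As a rough illustration of the structure described above, the following sketch builds a small fully connected network in plain Python and counts its parameters. The layer sizes, the random initialization, and the helper names (`dense`, `init_layer`, `forward`, `count_parameters`) are assumptions chosen for the example, not part of any particular library.

```python
import random

def relu(x):
    return max(0.0, x)

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(weighted sum of inputs + bias)."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (an illustrative choice)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def count_parameters(layers):
    """Total number of weights plus biases across all layers."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)

def forward(x, layers):
    """ReLU on hidden layers, identity on the output layer (raw scores)."""
    for i, (w, b) in enumerate(layers):
        is_last = i == len(layers) - 1
        x = dense(x, w, b, (lambda z: z) if is_last else relu)
    return x

rng = random.Random(0)
sizes = [4, 8, 8, 3]  # hypothetical: 4 inputs, two hidden layers of 8 units, 3 outputs
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]

scores = forward([0.2, -1.0, 0.5, 0.3], layers)
print(len(scores), count_parameters(layers))  # 3 outputs, 139 parameters
```

Even this toy network has 139 parameters (40 + 72 + 27); the same counting rule applied to much wider and deeper layers is how large models reach billions.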

Activation Functions: Introducing Nonlinearity

Without nonlinear activation functions, a network with multiple layers is mathematically equivalent to a single linear layer and cannot learn complex patterns. Common activation functions include:

| Function | Formula | Range | Common Use |
| --- | --- | --- | --- |
| ReLU | max(0, x) | [0, ∞) | Hidden layers (most CNNs, fully connected) |
| Leaky ReLU | max(0.01x, x) | (−∞, ∞) | Addresses "dying ReLU" problem |
| GELU | x × Φ(x) | ≈ [−0.17, ∞) | Transformer hidden layers (BERT, GPT) |
| Sigmoid | 1/(1 + e⁻ˣ) | (0, 1) | Binary classification outputs |
| Softmax | exp(xᵢ) / Σⱼ exp(xⱼ) | (0, 1), sums to 1 | Multi-class classification outputs |
| Tanh | (eˣ − e⁻ˣ)/(eˣ + e⁻ˣ) | (−1, 1) | RNNs, older architectures |
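
The functions in the table can be written in a few lines each. The sketch below uses only Python's standard library; the overflow-safe forms of sigmoid and softmax are an implementation choice for the example, and the final lines illustrate with made-up scalar weights why stacking linear layers without an activation adds no expressive power.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x

def gelu(x):
    # x * Phi(x), with Phi the standard normal CDF expressed through erf.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sigmoid(x):
    # Branch on the sign so exp() never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def softmax(xs):
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def tanh(x):
    return math.tanh(x)

# Two linear "layers" with no activation collapse into a single linear map.
w1, b1, w2, b2 = 2.0, 1.0, -3.0, 0.5   # hypothetical scalar weights
stacked = lambda x: w2 * (w1 * x + b1) + b2
single = lambda x: (w2 * w1) * x + (w2 * b1 + b2)

print(softmax([1.0, 2.0, 3.0]))
print(math.isclose(stacked(1.7), single(1.7)))  # True
```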

Loss Functions: Measuring Prediction Error

The loss function quantifies how wrong the network's predictions are. Training seeks to minimize the loss. Common loss functions:

  • Cross-entropy loss (log loss): For classification: L = −Σᵢ yᵢ log(ŷᵢ), where yᵢ is 1 for the true class and 0 otherwise, and ŷᵢ is the predicted probability for class i. Penalizes confident wrong predictions heavily.
  • Mean Squared Error (MSE): For regression: L = (1/n) Σᵢ (yᵢ − ŷᵢ)², averaged over n examples. Penalizes large errors quadratically.
  • Binary cross-entropy: For binary classification: L = −[y log(ŷ) + (1−y) log(1−ŷ)].
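
A minimal sketch of the three losses above, in plain Python. The `EPS` clamp that guards against log(0) and the example probabilities are assumptions made for the illustration.

```python
import math

EPS = 1e-12  # clamp to avoid log(0); an illustrative choice

def cross_entropy(y_true, y_pred):
    """y_true is a one-hot list, y_pred a list of predicted probabilities."""
    return -sum(y * math.log(max(p, EPS)) for y, p in zip(y_true, y_pred))

def mse(y_true, y_pred):
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n

def binary_cross_entropy(y, p):
    """y is 0 or 1, p is the predicted probability of class 1."""
    p = min(max(p, EPS), 1.0 - EPS)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

one_hot = [0, 1, 0]  # the true class is index 1
confident_right = cross_entropy(one_hot, [0.05, 0.90, 0.05])
confident_wrong = cross_entropy(one_hot, [0.90, 0.05, 0.05])
print(confident_right, confident_wrong)
```

With these made-up predictions, the confidently correct case costs about 0.105 while the confidently wrong case costs about 3.0, which is the heavy penalty the cross-entropy bullet describes.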

Backpropagation: How Networks Learn

Backpropagation is the algorithm that computes the gradient of the loss function with respect to every weight in the network. It uses the chain rule of calculus to propagate error signals backward from the output layer through each hidden layer to the input, efficiently computing partial derivatives for all parameters in a single backward pass.
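
To make the chain rule concrete, consider the smallest possible case: a single sigmoid neuron with one input x, weight w, bias b, and binary cross-entropy loss. This single-neuron setup is an assumption chosen for illustration; in a full network the same product of local derivatives is extended layer by layer.

```latex
\begin{aligned}
z &= wx + b, \qquad \hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}, \\
L &= -\left[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\right], \\
\frac{\partial L}{\partial w}
  &= \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}
   = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \cdot \hat{y}(1 - \hat{y}) \cdot x
   = (\hat{y} - y)\, x .
\end{aligned}
```

The gradient is simply the prediction error times the input, so a weight attached to a large input receives a proportionally larger correction.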

The process per training step:

  1. Forward pass: Input data flows through the network; output is computed.
  2. Loss computation: Compare output to true label; compute loss.
  3. Backward pass: Compute gradient of loss with respect to each weight using the chain rule, propagating backward from output to input.
  4. Weight update: Adjust each weight by a small step in the direction that reduces the loss: w ← w − η × ∂L/∂w, where η is the learning rate.
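
The four steps above can be sketched for a tiny network with one tanh hidden layer and a single linear output, trained on one example with squared error. The network shape, the learning rate of 0.02, the random seed, and all function names are assumptions for this illustration; real frameworks compute the backward pass automatically.

```python
import math
import random

def forward(x, params):
    """Step 1: hidden activations h and prediction y_hat."""
    W1, b1, W2, b2 = params
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y_hat = sum(w * hi for w, hi in zip(W2, h)) + b2
    return h, y_hat

def loss_fn(y_hat, y):
    """Step 2: squared error for a single example."""
    return (y_hat - y) ** 2

def backward(x, y, h, y_hat, params):
    """Step 3: gradients via the chain rule, from output back to input."""
    W1, b1, W2, b2 = params
    d_yhat = 2.0 * (y_hat - y)                      # dL/dy_hat
    dW2 = [d_yhat * hi for hi in h]                 # output-layer weights
    db2 = d_yhat
    # Through tanh: d tanh(z)/dz = 1 - tanh(z)^2 = 1 - h^2
    d_z = [d_yhat * w * (1.0 - hi * hi) for w, hi in zip(W2, h)]
    dW1 = [[dz * xi for xi in x] for dz in d_z]     # hidden-layer weights
    db1 = list(d_z)
    return dW1, db1, dW2, db2

def sgd_step(params, grads, lr):
    """Step 4: w <- w - lr * dL/dw for every parameter."""
    W1, b1, W2, b2 = params
    dW1, db1, dW2, db2 = grads
    W1 = [[w - lr * g for w, g in zip(row, grow)] for row, grow in zip(W1, dW1)]
    b1 = [b - lr * g for b, g in zip(b1, db1)]
    W2 = [w - lr * g for w, g in zip(W2, dW2)]
    b2 = b2 - lr * db2
    return W1, b1, W2, b2

rng = random.Random(0)
params = (
    [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)],  # W1: 3 x 2
    [0.0, 0.0, 0.0],                                             # b1
    [rng.uniform(-1, 1) for _ in range(3)],                      # W2
    0.0,                                                         # b2
)
x, y = [0.5, -1.0], 1.0   # one made-up training example
losses = []
for step in range(100):
    h, y_hat = forward(x, params)
    losses.append(loss_fn(y_hat, y))
    grads = backward(x, y, h, y_hat, params)
    params = sgd_step(params, grads, lr=0.02)

print(losses[0], losses[-1])  # the loss should shrink over the 100 steps
```

A common sanity check for hand-written gradients like these is to compare them against a finite-difference estimate, nudging one weight by a small amount and measuring the change in loss.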

Gradient Descent and Optimization

Gradient descent iteratively adjusts weights to minimize the loss function. Several variants exist with different computational efficiency and convergence properties:

| Method | Batch Size | Update Frequency | Characteristics |
| --- | --- | --- | --- |
| Batch Gradient Descent | Full dataset | Once per epoch | Stable but slow for large datasets |
| Stochastic Gradient Descent (SGD) | 1 sample | Every sample | Noisy updates; can escape local minima |
| Mini-batch SGD | 32–512 samples | Every mini-batch | Standard approach; GPU-efficient |
| Adam | Mini-batch | Every mini-batch | Adaptive per-parameter learning rates; widely used for LLMs |
| AdamW | Mini-batch | Every mini-batch | Adam with decoupled weight decay; standard for transformers |
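
To show how the update rules differ in practice, the sketch below fits a one-variable linear model with mini-batch SGD and with Adam. The synthetic data (y = 3x + 1), the hyperparameters, and the function names are assumptions for the example; the Adam step follows the commonly published form with bias correction, and setting `weight_decay` above zero gives the decoupled AdamW variant.

```python
import math
import random

def minibatches(data, batch_size, rng):
    """Shuffle once per epoch, then yield consecutive slices."""
    data = list(data)
    rng.shuffle(data)
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

def grad_mse(params, batch):
    """Gradient of mean squared error for the model y_hat = w * x + b."""
    w, b = params
    n = len(batch)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in batch) / n
    return [gw, gb]

def sgd_update(params, grads, lr):
    return [p - lr * g for p, g in zip(params, grads)]

def adam_update(params, grads, state, lr=0.05, beta1=0.9, beta2=0.999,
                eps=1e-8, weight_decay=0.0):
    """One Adam step; weight_decay > 0 gives the decoupled (AdamW) variant."""
    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g      # running mean of gradients
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g  # running mean of squared gradients
        m_hat = state["m"][i] / (1 - beta1 ** t)                     # bias correction
        v_hat = state["v"][i] / (1 - beta2 ** t)
        p = p - lr * weight_decay * p          # decay applied directly to the weight
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params

def train(update, epochs=100, batch_size=32, seed=0):
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
    data = [(x, 3.0 * x + 1.0) for x in xs]    # synthetic target: w = 3, b = 1
    params = [0.0, 0.0]
    for _ in range(epochs):
        for batch in minibatches(data, batch_size, rng):
            params = update(params, grad_mse(params, batch))
    return params

sgd_params = train(lambda p, g: sgd_update(p, g, lr=0.05))
adam_state = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}
adam_params = train(lambda p, g: adam_update(p, g, adam_state))
print("SGD :", sgd_params)
print("Adam:", adam_params)
```

Both runs should end close to w = 3 and b = 1. The point of the comparison is the shape of the update: SGD scales every gradient by the same learning rate, while Adam divides each parameter's step by a running estimate of that parameter's gradient magnitude.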

Overfitting and Regularization

A model with too many parameters relative to training data can memorize training examples rather than learning generalizable patterns — this is overfitting. Techniques to combat it:

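  • Dropout: During training, randomly deactivate a fraction p (typically 10–50%) of neurons at each forward pass, which forces the network to learn redundant representations. At test time all neurons stay active. To keep expected activations consistent, outputs are either scaled by the keep probability (1 − p) at test time or, in the now-common "inverted" variant, the surviving activations are scaled up by 1/(1 − p) during training.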
  • Weight decay (L2 regularization): Adds λ||w||² to the loss, penalizing large weights and encouraging simpler functions.
  • Data augmentation: Artificially expand the training set by applying random transformations (image flips, crops, brightness changes) that preserve labels.
  • Early stopping: Monitor validation loss during training; stop when it begins to increase, indicating overfitting.
  • Batch normalization: Normalize layer inputs to zero mean and unit variance within each mini-batch; accelerates training and acts as implicit regularization.
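
Three of the techniques above are simple enough to sketch directly. The code below shows inverted dropout (surviving activations are scaled by 1/(1 − p) during training so nothing needs rescaling at test time), the L2 penalty term, and a patience-based early-stopping rule. The function names, the patience value, and the validation-loss sequence are made up for the illustration.

```python
import random

def dropout(activations, p_drop, training, rng):
    """Inverted dropout: zero each unit with probability p_drop and scale
    the survivors by 1 / (1 - p_drop); identity when not training."""
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def l2_penalty(weights, lam):
    """The lambda * ||w||^2 term added to the loss."""
    return lam * sum(w * w for w in weights)

def early_stopping_epoch(val_losses, patience):
    """Return the epoch with the best validation loss, stopping the scan
    once the loss has failed to improve for `patience` epochs in a row."""
    best, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

rng = random.Random(0)
acts = [1.0] * 20000
dropped = dropout(acts, p_drop=0.5, training=True, rng=rng)
print(sum(dropped) / len(dropped))   # close to 1.0: the expected value is preserved

history = [1.0, 0.8, 0.7, 0.75, 0.9, 0.6]   # hypothetical validation losses
print(early_stopping_epoch(history, patience=2))  # 2
```

Note that with a patience of 2 the scan stops before reaching the final value of 0.6. That is the trade-off of early stopping: a short patience saves compute but can miss a later improvement.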

Why Deep Networks Need GPUs

Training a modern deep learning model involves billions of floating-point multiplications per forward pass, repeated millions of times over training. A GPU with thousands of cores (an NVIDIA H100 SXM has 16,896 CUDA cores) can execute these matrix multiplications in parallel, giving a speedup often quoted at roughly 100–1,000x over a CPU for deep learning workloads, depending on the model and hardware. NVIDIA rates the H100 SXM at approximately 989 TFLOPS (teraflops) of dense BF16 tensor operations, the core computation in modern neural networks, or 1,979 TFLOPS with structured sparsity. According to unconfirmed reports, training GPT-4 required approximately 25,000 A100 GPUs running for about 90 days, consuming roughly 50 GWh of electricity. The availability of massive GPU clusters is now a primary determinant of which organizations can develop frontier AI systems.
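
A quick back-of-the-envelope check helps put these figures in proportion. The sketch below only does arithmetic on the reported (and unconfirmed) numbers quoted above; the 4096 x 4096 layer size and the function name `dense_layer_flops` are assumptions for the example.

```python
def dense_layer_flops(n_in, n_out, batch=1):
    """Approximate FLOPs for one dense layer's forward pass:
    one multiply and one add per weight, per example."""
    return 2 * n_in * n_out * batch

# Reported figures from the text above (unconfirmed estimates).
gpus = 25_000
days = 90
energy_gwh = 50

gpu_hours = gpus * days * 24            # total GPU-hours of training
kwh_total = energy_gwh * 1_000_000      # 1 GWh = 1,000,000 kWh
kw_per_gpu = kwh_total / gpu_hours      # implied average draw per GPU

print(gpu_hours)                        # 54000000
print(round(kw_per_gpu, 2))             # 0.93
print(dense_layer_flops(4096, 4096))    # 33554432
```

The implied draw of about 0.93 kW per GPU would have to cover servers, networking, and cooling as well as the accelerators themselves, so the three reported figures are at least mutually plausible. The last line shows why parallel hardware matters: a single 4096 x 4096 layer already costs about 33.5 million floating-point operations per example.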
