How Neural Networks Work: Layers, Weights, and Learning from Data

The Brain Metaphor and Its Limits

The name "neural network" borrows from neuroscience: both biological brains and artificial neural networks consist of many interconnected units that pass signals to one another. But the analogy should not be taken too literally. The artificial neuron is a far simpler object than the biological one — essentially a weighted sum followed by a nonlinear transformation — and modern deep learning systems bear only a superficial resemblance to how brains actually process information. The biological metaphor was useful for inspiring the field, but the mathematics that makes neural networks work is closer to calculus and linear algebra than to neuroscience.

What neural networks genuinely share with brains is a key design principle: rather than following explicit hand-coded rules, they learn patterns directly from examples. Show a network thousands of labeled images of cats and dogs, and it gradually adjusts its internal parameters until it can reliably distinguish the two, without ever being told what features to look for. This data-driven learning is what makes neural networks powerful, flexible, and sometimes mysterious.

Neurons, Layers, and Architecture

An artificial neuron receives a set of numerical inputs, multiplies each by a corresponding weight, sums the results, adds a scalar called the bias, and then passes the total through an activation function that introduces nonlinearity. Without nonlinearity, a stack of layers would collapse to a single linear transformation, severely limiting what the network could represent. Common activation functions include the rectified linear unit (ReLU), which outputs zero for negative inputs and the identity for positive ones; the sigmoid, which squashes outputs to (0,1); and the softmax, used in output layers for classification to produce a probability distribution over classes.
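The computation described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than how any real framework implements it; the function names and the input, weight, and bias values are hypothetical and chosen only for the example.

```python
import math

def relu(z):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, z)

def sigmoid(z):
    """Logistic sigmoid: maps a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Turn a list of raw scores into a probability distribution."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs, plus the bias, passed through an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

# Hypothetical values, for illustration only.
inputs = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 0.25]
bias = 0.1

print(neuron(inputs, weights, bias, relu))     # weighted sum is negative, so ReLU gives 0.0
print(neuron(inputs, weights, bias, sigmoid))  # a value between 0 and 1
print(softmax([2.0, 1.0, 0.1]))                # three probabilities that sum to 1
```

Softmax is applied to a whole vector of scores rather than to a single neuron's sum, which is why it takes a list here.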

Neurons are organized into layers. The first layer, the input layer, receives the raw data — pixel values, token embeddings, or sensor readings. One or more hidden layers transform the representation through successive rounds of weighted summation and activation. The final output layer produces the network's answer: a class label, a continuous value, a probability, or a sequence of tokens. A network with many hidden layers is called a deep neural network, giving rise to the term deep learning.

Weights as Encoded Knowledge

The weights are the network's memory. Before training, they are typically initialized to small random values. After training, each weight encodes something about the statistical structure of the data the network has seen. In a network trained to recognize faces, early-layer weights learn to detect edges and color gradients; mid-layer weights combine these into eyes, noses, and mouths; late-layer weights assemble those parts into face-level representations. This hierarchical feature learning is one of the most important properties of deep networks and is what allows them to handle raw, high-dimensional inputs directly rather than requiring hand-engineered features.

The number of weights in a modern large network is staggering. GPT-3 has 175 billion parameters, and some more recent language and multimodal systems are reported to have even more. Each parameter is a single floating-point number, but their collective arrangement encodes an extraordinary amount of learned knowledge about language, images, and the world. Storing and manipulating these weights efficiently requires specialized hardware, such as GPUs and TPUs, that can perform trillions of floating-point operations per second in parallel.
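To give a sense of scale, the following back-of-the-envelope sketch estimates the storage needed for the weights alone at the parameter count quoted above. The bytes-per-parameter figures are assumptions about the numeric format; real deployments vary, and the estimate ignores activations, optimizer state, and other overhead.

```python
PARAMS = 175_000_000_000  # parameter count quoted above for GPT-3

# Assumed storage formats and their size per parameter, in bytes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def weight_memory_gb(n_params, bytes_per_param):
    """Storage for the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bytes_per_param / 1e9

for fmt, size in BYTES_PER_PARAM.items():
    print(f"{fmt}: {weight_memory_gb(PARAMS, size):.0f} GB")
# float32: 700 GB
# float16: 350 GB
# int8: 175 GB
```

Even at 16-bit precision the weights exceed the memory of a single accelerator, which is one reason such models are typically spread across many devices.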

Forward Pass: Making a Prediction

During a forward pass, data flows from the input layer through each hidden layer to the output layer, with each layer applying its weights and activation function to produce a new representation. The computation is deterministic given the current weights: the same input produces the same output, apart from any randomness added deliberately afterwards, such as sampling from a language model's output distribution. This forward pass is what happens at inference time. When you type a query into a chatbot or upload an image for classification, you are triggering a series of matrix multiplications and activation functions that transform your input into the model's response.

The key operation at each layer is a matrix multiplication: the input vector is multiplied by the weight matrix for that layer, the bias vector is added, and then the activation function is applied element-wise to the result. For large networks, this is computationally expensive, which is why efficient inference on modern AI systems requires hardware accelerators. Techniques like quantization (reducing the numerical precision of weights) and pruning (removing low-magnitude weights) are used to make inference faster and cheaper without significantly degrading accuracy.
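The per-layer operation can be sketched as follows for a tiny network with two inputs, three hidden units, and one output. The weights, biases, and helper names are hypothetical, and plain Python lists stand in for the optimized matrix routines a real system would use.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu_vec(values):
    """Apply ReLU element-wise."""
    return [max(0.0, v) for v in values]

def identity(values):
    """No activation; used here for the output layer."""
    return list(values)

def dense(x, weights, bias, activation):
    """One layer: matrix multiply, add the bias, apply the activation."""
    z = [s + b for s, b in zip(matvec(weights, x), bias)]
    return activation(z)

# Hypothetical parameters for a 2 -> 3 -> 1 network.
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
b1 = [0.0, -1.0, 0.5]
W2 = [[1.0, 2.0, -1.0]]
b2 = [0.25]

def forward(x):
    """Forward pass: input -> hidden layer (ReLU) -> output layer."""
    hidden = dense(x, W1, b1, relu_vec)
    return dense(hidden, W2, b2, identity)

print(forward([2.0, 1.0]))  # [1.75]
```

Calling forward twice with the same input gives the same result, since nothing in the computation depends on anything other than the input and the fixed parameters.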

Loss Functions and the Learning Signal

To train a neural network, you need a way to measure how wrong its predictions are. This is the job of the loss function (also called the cost or objective function). For classification tasks, a common choice is cross-entropy loss, which compares the network's output probability distribution to the true one-hot label and penalizes confident wrong predictions heavily. For regression tasks, mean squared error measures the average squared difference between predictions and targets.
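Both losses are short enough to write out directly. The sketch below assumes a one-hot label, so cross-entropy reduces to the negative log of the probability assigned to the true class; the example predictions are made up for illustration.

```python
import math

def cross_entropy(probs, true_index):
    """Cross-entropy against a one-hot label: -log of the true class probability."""
    return -math.log(probs[true_index])

def mean_squared_error(predictions, targets):
    """Average squared difference between predictions and targets."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared) / len(squared)

# The true class is index 0 in both cases.
confident_right = [0.90, 0.05, 0.05]
confident_wrong = [0.05, 0.90, 0.05]

print(cross_entropy(confident_right, 0))           # about 0.105
print(cross_entropy(confident_wrong, 0))           # about 2.996
print(mean_squared_error([2.0, 4.0], [1.0, 6.0]))  # 2.5
```

The confident wrong prediction is penalized roughly thirty times more than the confident right one, which is the behavior described above.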

The loss function produces a single number for each training example (or batch of examples) that summarizes the network's error. The goal of training is to minimize this number by adjusting the weights. The challenge is that the loss landscape — the high-dimensional surface that maps every possible weight configuration to a loss value — is extraordinarily complex, with millions or billions of dimensions and countless local minima, saddle points, and plateaus. Navigating this landscape efficiently and reliably is the central problem of neural network optimization.

Backpropagation and Gradient Descent

Backpropagation is the algorithm that computes how much each weight contributed to the prediction error. It applies the chain rule of calculus recursively, working backwards from the loss through each layer to produce a gradient: a vector of partial derivatives that measures how the loss changes as each weight changes. Because the gradient points in the direction of steepest increase, moving each weight in the opposite direction reduces the loss. Backpropagation was popularized for neural networks by the 1986 paper of Rumelhart, Hinton, and Williams, building on earlier work on the same idea, and it is the computational foundation of almost all neural network training.
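The chain rule is easiest to see on the smallest possible case. The sketch below assumes a single sigmoid neuron with one input and a squared-error loss (a simplification chosen for this example, not something the text prescribes), computes the gradient analytically, and compares it with a finite-difference estimate as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x, y):
    """Return the pre-activation, the activation, and the loss."""
    z = w * x + b
    a = sigmoid(z)
    loss = 0.5 * (a - y) ** 2
    return z, a, loss

def backward(w, b, x, y):
    """Chain rule, working backwards from the loss to the parameters."""
    _, a, _ = forward(w, b, x, y)
    dloss_da = a - y               # derivative of 0.5 * (a - y)**2
    da_dz = a * (1.0 - a)          # derivative of the sigmoid
    dloss_dz = dloss_da * da_dz    # chain rule through the activation
    dloss_dw = dloss_dz * x        # because z = w * x + b
    dloss_db = dloss_dz
    return dloss_dw, dloss_db

def numerical_dloss_dw(w, b, x, y, eps=1e-6):
    """Central finite difference, used only to check the analytic result."""
    up = forward(w + eps, b, x, y)[2]
    down = forward(w - eps, b, x, y)[2]
    return (up - down) / (2.0 * eps)

# Hypothetical parameter and data values.
w, b, x, y = 0.8, -0.2, 1.5, 1.0
analytic, _ = backward(w, b, x, y)
print(analytic, numerical_dloss_dw(w, b, x, y))  # the two values should agree closely
```

In a multi-layer network the same pattern repeats: each layer multiplies the gradient arriving from the layer above by its own local derivative and passes the result down.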

Once the gradients are available, gradient descent updates the weights by subtracting a small multiple of the gradient from each weight. The scaling factor is the learning rate, a hyperparameter that controls the size of each update step. Too large a learning rate causes the optimization to oscillate or diverge; too small a learning rate makes training prohibitively slow. In practice, most modern training uses adaptive optimizers such as Adam or AdamW, which scale each parameter's step using running averages of its past gradients and often converge faster and more robustly than vanilla gradient descent.
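The effect of the learning rate shows up even on a one-dimensional toy problem. The sketch below runs vanilla gradient descent on the hypothetical loss f(w) = (w - 3)^2, whose minimum is at w = 3; the loss and the three learning rates are assumptions chosen to make the contrast visible.

```python
def gradient(w):
    """Derivative of the toy loss f(w) = (w - 3)**2."""
    return 2.0 * (w - 3.0)

def gradient_descent(w0, learning_rate, steps):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w

for lr in (0.01, 0.1, 1.1):
    print(lr, gradient_descent(0.0, lr, 50))
# lr = 0.01: still well short of 3 after 50 steps (too slow)
# lr = 0.1:  very close to 3
# lr = 1.1:  overshoots further on every step and diverges
```

Adaptive optimizers reduce, but do not remove, this sensitivity, so the learning rate usually still needs tuning.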

Overfitting, Regularization, and Generalization

A network that performs perfectly on its training data but fails on new examples is said to overfit. Overfitting occurs when the network memorizes the noise and idiosyncrasies of the training set rather than learning the underlying patterns. It is more likely with large networks trained on small datasets. Detecting overfitting requires holding out a validation set — data not used during training — and monitoring whether validation loss increases while training loss continues to fall.
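The monitoring described above can be sketched as a check over recorded loss curves. The function name, the patience parameter, and the loss values are all hypothetical; a real training loop would record these numbers once per epoch.

```python
def first_overfit_epoch(train_losses, val_losses, patience=2):
    """Return the epoch with the best validation loss if validation loss then
    fails to improve for `patience` epochs while training loss keeps falling.
    Return None if that pattern does not appear."""
    best_epoch = 0
    for epoch in range(1, len(val_losses)):
        if val_losses[epoch] < val_losses[best_epoch]:
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            train_still_falling = train_losses[epoch] < train_losses[best_epoch]
            return best_epoch if train_still_falling else None
    return None

# Made-up curves: training loss keeps falling, validation loss turns upward.
train = [1.00, 0.70, 0.50, 0.35, 0.25, 0.18]
val = [1.05, 0.80, 0.65, 0.66, 0.70, 0.78]

print(first_overfit_epoch(train, val))  # 2
```

Epochs are counted from zero here, so the result says validation loss was lowest after the third pass over the data.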

Several techniques combat overfitting. Dropout randomly zeroes out a fraction of neurons during each training step, forcing the network to develop redundant representations. L2 regularization (weight decay) adds a penalty proportional to the sum of squared weights to the loss, discouraging very large weight values. Data augmentation artificially expands the training set by applying random transformations — flips, crops, color jitter — to existing examples. Early stopping halts training when validation performance stops improving. Together, these techniques help ensure that a trained network generalizes from the training distribution to the real world.
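Two of these techniques fit in a few lines each. The sketch below uses the common "inverted" form of dropout, which rescales the surviving activations so their expected value is unchanged; that rescaling is an assumption added for this example, as are the function names and values.

```python
import random

def dropout(activations, drop_prob, rng):
    """Inverted dropout: zero each activation with probability drop_prob and
    scale the survivors by 1 / (1 - drop_prob). Used during training only."""
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

def l2_penalty(weights, strength):
    """Penalty proportional to the sum of squared weights."""
    return strength * sum(w * w for w in weights)

rng = random.Random(0)  # seeded so the run is repeatable
print(dropout([1.0] * 8, 0.5, rng))    # a mix of 0.0 and 2.0
print(l2_penalty([3.0, -4.0], 0.01))   # 0.25, added to the data loss
```

At inference time dropout is switched off and all neurons are used.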

The Scaling Laws and Modern Deep Learning

One of the most striking findings of recent AI research is that neural network performance scales predictably with three factors: the number of parameters, the amount of training data, and the amount of compute. Empirical scaling laws, documented by researchers at OpenAI and DeepMind, show power-law improvements in loss as each factor increases. This has motivated a "scaling hypothesis": the idea that simply training larger models on more data will continue to yield better and more capable AI systems.
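A power law means that multiplying a factor by a fixed amount multiplies the loss by a fixed amount. The sketch below illustrates this for parameter count alone, using the form loss = (N_c / N)^alpha. The constants are hypothetical placeholders, not fitted values from any published study, and the form ignores data and compute limits.

```python
def power_law_loss(n_params, n_c, alpha):
    """Loss as a power law in parameter count: (n_c / n_params) ** alpha."""
    return (n_c / n_params) ** alpha

# Hypothetical constants, chosen only to show the shape of the curve.
N_C = 1e13
ALPHA = 0.08

for n in (1e8, 1e9, 1e10):
    print(f"{n:.0e} parameters -> loss {power_law_loss(n, N_C, ALPHA):.3f}")

# Each tenfold increase in parameters scales the loss by the same factor.
print(10 ** -ALPHA)  # about 0.83 for this alpha
```

On a log-log plot this relationship is a straight line, which is how such trends are usually identified in practice.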

The practical consequence has been an arms race in model scale. Where ResNet-50, a landmark image classification network from 2015, had about 25 million parameters, the largest modern language and multimodal models are reported to have hundreds of billions. The infrastructure required to train and serve these models (clusters of thousands of specialized chips, petabytes of storage, sophisticated distributed training frameworks) represents a new kind of industrial-scale scientific research, concentrated in a handful of well-resourced organizations. Understanding neural networks at the mathematical level described here is the foundation for understanding why this scaling works, and what its limits might ultimately be.
