Neural Networks Explained: From Perceptron to Transformer

Frank Rosenblatt Built the First Learning Machine in 1957

The Perceptron, developed by Cornell psychologist Frank Rosenblatt and funded by the US Navy, is best remembered not as software but as a machine. Rosenblatt first simulated it on an IBM 704, but its best-known form, the Mark I Perceptron, was hardware: a grid of 400 photocells wired through adjustable potentiometers to its output units. It could learn to distinguish simple visual patterns by adjusting those potentiometers based on whether its output was correct. The 1969 book "Perceptrons" by Marvin Minsky and Seymour Papert proved mathematically that a single-layer Perceptron cannot solve problems that are not linearly separable, including the simple XOR function. Funding for neural network research declined sharply, contributing to the first AI winter. The limitation was real, but the solution, adding hidden layers, already existed in theory. Multi-layer networks were not yet trained in practice because no efficient algorithm for doing so was widely known.
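The learning rule and the XOR limitation described above can be sketched in a few lines. This is a minimal software illustration, not a model of the original hardware: the function name `train_perceptron`, the learning rate, and the epoch limit are assumptions chosen for the example.

```python
def train_perceptron(samples, epochs=25, lr=1.0):
    """Perceptron rule: change the weights only when the output is wrong."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), target in samples:
            output = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            error = target - output          # -1, 0, or +1
            if error != 0:
                w[0] += lr * error * x1
                w[1] += lr * error * x2
                b += lr * error
                mistakes += 1
        if mistakes == 0:                    # a full pass without errors
            return w, b, True
    return w, b, False                       # never separated the classes


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

_, _, and_converged = train_perceptron(AND)
_, _, xor_converged = train_perceptron(XOR)
print("AND learned:", and_converged)   # linearly separable
print("XOR learned:", xor_converged)   # not linearly separable
```

AND can be separated by a single line in the input plane, so the rule settles on weights that classify every example. No such line exists for XOR, so at least one example is misclassified on every pass, however many epochs are allowed.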

Backpropagation: The Algorithm That Changed Everything

Multi-layer neural networks require a method to assign credit (or blame) for errors backward through each layer of weights. Backpropagation does this by applying the chain rule of calculus layer by layer, from the output back to the input. For each training example, the network makes a prediction, computes an error (loss), calculates how each weight contributed to that error using partial derivatives, and adjusts every weight in proportion to its gradient, in the direction that reduces the loss. Rumelhart, Hinton, and Williams published the definitive formulation in Nature in 1986, though earlier versions existed. The algorithm requires activation functions that are differentiable, at least almost everywhere, and the choice of activation strongly affects how well training works. ReLU (rectified linear unit, popularized by Nair and Hinton in 2010) was a practical breakthrough: it does not saturate for positive inputs, so it typically trains faster than sigmoid and tanh, while remaining differentiable everywhere except at zero.
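The predict, measure, differentiate, adjust cycle can be made concrete with a tiny network. The sketch below assumes a 2-input, 2-hidden-unit, 1-output network with sigmoid activations and a squared-error loss; the weight values and helper names (`forward`, `backprop`, `bump_first_weight`) are illustrative choices, not taken from the 1986 paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Return the hidden activations and the prediction for one input."""
    W1, b1, W2, b2 = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    pred = sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)
    return hidden, pred


def loss(params, x, target):
    _, pred = forward(params, x)
    return 0.5 * (pred - target) ** 2


def backprop(params, x, target):
    """Analytic gradients via the chain rule, output layer first."""
    W1, b1, W2, b2 = params
    hidden, pred = forward(params, x)
    # dL/dz_out = dL/dpred * dpred/dz_out (sigmoid derivative)
    delta_out = (pred - target) * pred * (1.0 - pred)
    grad_W2 = [delta_out * h for h in hidden]
    # Send the error signal back through W2 and the hidden sigmoids.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(W2, hidden)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hidden]
    return grad_W1, delta_hidden, grad_W2, delta_out


def bump_first_weight(params, delta):
    """Copy params with W1[0][0] shifted by delta (for the numeric check)."""
    W1, b1, W2, b2 = params
    W1 = [row[:] for row in W1]
    W1[0][0] += delta
    return W1, b1, W2, b2


params = ([[0.5, -0.3], [0.8, 0.2]], [0.1, -0.1], [0.7, -0.4], 0.05)
x, target = [1.0, 0.0], 1.0
grad_W1, grad_b1, grad_W2, grad_b2 = backprop(params, x, target)

# Check one analytic gradient against a central finite difference.
eps = 1e-5
numeric = (loss(bump_first_weight(params, eps), x, target)
           - loss(bump_first_weight(params, -eps), x, target)) / (2 * eps)
gradients_agree = abs(numeric - grad_W1[0][0]) < 1e-8

# One gradient-descent step: move every weight against its gradient.
lr = 0.5
W1, b1, W2, b2 = params
stepped = (
    [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(W1, grad_W1)],
    [b - lr * g for b, g in zip(b1, grad_b1)],
    [w - lr * g for w, g in zip(W2, grad_W2)],
    b2 - lr * grad_b2,
)
loss_decreased = loss(stepped, x, target) < loss(params, x, target)
print(gradients_agree, loss_decreased)
```

The finite-difference check is a common sanity test: if the analytic gradient from the chain rule disagrees with the numeric slope, the backward pass has a bug.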

Activation Function	Formula	Vanishing Gradient?	Primary Use
Sigmoid	1 / (1 + e^(-x))	Yes (saturates at extremes)	Output layer (binary classification)
Tanh	(e^x - e^(-x)) / (e^x + e^(-x))	Yes (milder than sigmoid)	RNNs; hidden layers (older)
ReLU	max(0, x)	No (but "dying ReLU" possible)	CNN hidden layers (most common)
GELU	x·Φ(x), where Φ is the standard Gaussian CDF	Minimal	Transformer hidden layers
Softmax	e^(x_i) / Σ_j e^(x_j)	N/A	Output layer (multi-class classification)
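The formulas in the table translate directly into code. The sketch below uses only the standard library; the derivative helpers (`d_sigmoid`, `d_tanh`, `d_relu`) are added for illustration to show what saturation means numerically.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def gelu(x):
    # x * Phi(x), with Phi the standard Gaussian CDF written via erf
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def softmax(xs):
    m = max(xs)                          # subtract the max for stability
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]


def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2


def d_relu(x):
    return 1.0 if x > 0 else 0.0


for x in (0.0, 5.0):
    print(x, d_sigmoid(x), d_tanh(x), d_relu(x))
```

At zero the sigmoid gradient peaks at 0.25 and the tanh gradient at 1, and both shrink toward zero as the input grows in magnitude. The ReLU gradient stays at exactly 1 for every positive input, which is why it does not saturate there.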

CNN, RNN, and Transformer: Three Architectures, Three Problems

Deep learning's power comes from matching architecture to data structure. The three dominant architecture families each exploit different structural properties of the data they process.

Convolutional Neural Networks (CNNs): The content of an image is largely independent of position: a cat is a cat whether it appears in the top-left or the bottom-right of the frame. CNNs exploit this by applying learned filters (kernels) that slide across spatial positions, reusing the same weights at every location. This parameter sharing dramatically reduces the number of weights needed compared with a fully connected network, while capturing local spatial structure. LeCun's LeNet-5 (1998) demonstrated CNNs on handwritten digit recognition; AlexNet (2012, Krizhevsky, Sutskever, Hinton) used CNNs with GPU training to win ImageNet with a 15.3% top-5 error rate, roughly 10 percentage points better than the runner-up's 26.2%.
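Weight sharing can be illustrated with a hand-written sliding filter. The sketch below is plain Python with no padding or stride; the plus-shaped pattern, the 8×8 image size, and the helper names are assumptions made for the example, and the kernel is fixed by hand rather than learned.

```python
def conv2d_valid(image, kernel):
    """Slide one kernel over every position, reusing the same weights."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def argmax2d(grid):
    """Row and column of the first maximal entry."""
    best = max(max(row) for row in grid)
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value == best:
                return r, c


PLUS = [[0, 1, 0],
        [1, 1, 1],
        [0, 1, 0]]


def make_image(top, left, size=8):
    """Blank image with the plus pattern placed at (top, left)."""
    image = [[0] * size for _ in range(size)]
    for i in range(3):
        for j in range(3):
            image[top + i][left + j] = PLUS[i][j]
    return image


# The same 9 weights detect the pattern wherever it appears.
peak_a = argmax2d(conv2d_valid(make_image(1, 1), PLUS))
peak_b = argmax2d(conv2d_valid(make_image(4, 3), PLUS))
print(peak_a, peak_b)

conv_weights = 3 * 3                  # one shared kernel
dense_weights = (8 * 8) * (6 * 6)     # every input to every output
print(conv_weights, dense_weights)
```

Moving the pattern moves the peak of the response map by the same offset, without any new weights. A fully connected layer mapping the same 64 inputs to the same 36 outputs would need 2,304 weights and would have to learn each position separately.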

Recurrent Neural Networks (RNNs): Sequential data (language, time series, audio) requires models that maintain state across positions. RNNs achieve this through a hidden state that is updated at each timestep and passed to the next, creating a form of memory. The problem: gradients tend to either vanish or explode when backpropagated through many timesteps. Long Short-Term Memory (LSTM) networks, introduced by Hochreiter and Schmidhuber in 1997, largely mitigate the vanishing-gradient problem with gating mechanisms (input, forget, and output gates in the now-standard form) that control what information to preserve or discard.
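The vanishing and exploding behaviour follows from the chain rule multiplying one factor per timestep. The sketch below assumes a one-unit RNN, h_t = tanh(w·h_{t-1} + u·x_t), with hypothetical weights; the exploding case uses zero inputs so that tanh stays in its linear regime, which is a deliberate simplification.

```python
import math


def gradient_through_time(w, u, inputs, h0=0.0):
    """d h_T / d h_0 for the scalar RNN h_t = tanh(w*h_{t-1} + u*x_t)."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + u * x)
        grad *= w * (1.0 - h * h)    # chain rule: one factor per timestep
    return grad


steps = 50
vanishing = gradient_through_time(w=0.5, u=1.0, inputs=[1.0] * steps)
exploding = gradient_through_time(w=1.5, u=1.0, inputs=[0.0] * steps)
print(vanishing)
print(exploding)
```

Each factor is the recurrent weight times the tanh slope. When its magnitude stays below 1 the product shrinks geometrically, and when it stays above 1 the product grows geometrically, so fifty timesteps are enough to lose or overwhelm the learning signal.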

Transformers: Transformers (Vaswani et al., 2017) replaced recurrence entirely with self-attention, allowing direct computation of relationships between any two positions in a sequence regardless of distance. This enables parallel processing of entire sequences during training and scales far better with compute than recurrent models. Transformers now dominate NLP, protein structure prediction (AlphaFold), image recognition (Vision Transformers), and audio processing.
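Self-attention itself is a short computation. The sketch below is a single-head, scaled dot-product version in plain Python; as a simplification it uses the raw token vectors as queries, keys, and values, whereas a real Transformer layer applies learned projections first. The toy vectors are made up for the example.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    """Scaled dot-product attention with Q = K = V = X (no projections)."""
    d = len(X[0])
    scale = math.sqrt(d)
    weights = []
    for q in X:                                   # every position queries...
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in X]
        weights.append(softmax(scores))           # ...every other position
    outputs = [[sum(w * v[j] for w, v in zip(row, X)) for j in range(d)]
               for row in weights]
    return outputs, weights


# Tokens 0 and 3 are identical but sit at opposite ends of the sequence.
X = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(X)
print(weights[0])
```

Token 0 attends to token 3 exactly as strongly as to itself, because the score depends only on content and not on distance. The weight matrix has one entry per pair of positions, which is the source of the quadratic memory cost of vanilla attention.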

The Universal Approximation Theorem

George Cybenko (1989) proved that a neural network with a single hidden layer of sufficient width, using a sigmoidal activation function, can approximate any continuous function on a compact subset of R^n to arbitrary precision. Kurt Hornik (1991) showed that the result stems from the multilayer feedforward architecture itself rather than from the specific choice of activation. This theorem is often cited as the theoretical foundation for neural network capability, but it is frequently misunderstood. It guarantees that a network approximating any given continuous function exists; it says nothing about whether gradient descent can find that network, how much data is needed to train it, or how wide the hidden layer must be (potentially enormous). Deep networks with multiple narrower layers often generalize better in practice than the shallow-wide networks the theorem describes.
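The existence claim can be illustrated by constructing such a network by hand. The sketch below is not the proof and involves no training: it sets the weights of one hidden layer of steep sigmoids so that their sum forms a staircase tracking a target function on [0, 1]. The target sin(2πx), the steepness constant, and all function names are assumptions for the example.

```python
import math


def sigmoid(z):
    # Numerically stable form, safe for very large |z|
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def build_network(f, width, steepness_per_unit=50.0):
    """Hand-set weights: one steep sigmoid step per grid cell of [0, 1]."""
    grid = [i / width for i in range(width + 1)]
    k = steepness_per_unit * width
    units = []
    for i in range(1, width + 1):
        midpoint = 0.5 * (grid[i - 1] + grid[i])
        jump = f(grid[i]) - f(grid[i - 1])
        units.append((k, -k * midpoint, jump))   # (input w, bias, output w)
    return f(grid[0]), units


def predict(network, x):
    offset, units = network
    return offset + sum(v * sigmoid(w * x + b) for w, b, v in units)


def max_error(f, width, samples=1001):
    net = build_network(f, width)
    points = [i / (samples - 1) for i in range(samples)]
    return max(abs(predict(net, x) - f(x)) for x in points)


def target(x):
    return math.sin(2.0 * math.pi * x)


for width in (10, 50, 200):
    print(width, max_error(target, width))
```

The error falls as the hidden layer widens, which is the content of the theorem. The construction also shows its limits: the width needed grows with the precision demanded, and nothing here says gradient descent would find these weights.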

Overfitting and Regularization

A neural network with enough parameters can memorize its training data perfectly, achieving zero training loss while generalizing poorly to new examples. This is overfitting. It is not a failure of the model to learn; it is a failure to learn the right things.

Dropout (Srivastava et al., 2014): Randomly sets a fraction of neurons to zero during each training step, which discourages neurons from co-adapting and forces the network to learn redundant representations. It can be interpreted as approximately training an ensemble of many weight-sharing sub-networks simultaneously.
L2 Regularization (Weight Decay): Adds a penalty term proportional to the sum of squared weights to the loss, discouraging any single weight from growing very large. Under plain gradient descent this is equivalent to shrinking every weight by a small constant factor at each update, hence the name weight decay.
Batch Normalization (Ioffe and Szegedy, 2015): Normalizes layer activations within each mini-batch during training. It was originally motivated as a way to reduce internal covariate shift, allows higher learning rates, and has implicit regularization effects.
Early Stopping: Monitor validation loss during training and stop when it begins increasing even though training loss continues to fall. It is among the simplest and most widely used ways to limit overfitting.
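Three of the techniques above are small enough to sketch directly. The code below assumes the "inverted" dropout variant common in modern frameworks (rescaling at training time rather than at test time), plain SGD for the weight-decay step, and a patience-based stopping rule; all function names, the patience value, and the validation losses are made up for illustration.

```python
import random


def dropout(activations, rate, rng):
    """Inverted dropout: zero a fraction `rate`, rescale the survivors."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def sgd_step_l2(weights, grads, lr, weight_decay):
    """SGD on loss + (weight_decay / 2) * sum(w^2): each weight also shrinks."""
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]


def early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a recorded validation curve."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, current in enumerate(val_losses):
        if current < best_loss:
            best_loss, best_epoch = current, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch          # no improvement for too long
    return len(val_losses) - 1, best_epoch


rng = random.Random(0)                         # fixed seed for repeatability
dropped = dropout([1.0] * 10000, rate=0.5, rng=rng)
print(sum(dropped) / len(dropped))             # stays close to 1 on average

decayed = sgd_step_l2([1.0, -2.0], grads=[0.0, 0.0], lr=0.1, weight_decay=0.5)
print(decayed)                                 # weights shrink with zero gradient

val_losses = [0.9, 0.7, 0.6, 0.62, 0.65, 0.7, 0.8]   # hypothetical curve
stop_epoch, best_epoch = early_stopping(val_losses, patience=2)
print(stop_epoch, best_epoch)
```

In practice the weights saved at `best_epoch` are the ones kept, since the later epochs have already started to overfit. Rescaling inside `dropout` keeps the expected activation unchanged, so no adjustment is needed when dropout is switched off at inference time.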

Architecture	Strength	Weakness	Dominant Application Domain
CNN	Spatial hierarchy, parameter efficiency	Limited long-range sequence modeling	Image recognition, object detection
LSTM/RNN	Sequential state memory	Long-range gradient decay; cannot parallelize across timesteps	Time series; legacy NLP
Transformer	Global attention, massive parallelism, scale	Quadratic memory vs. sequence length (vanilla)	NLP, vision, multimodal, protein structure
Graph Neural Network	Non-Euclidean structure (molecules, social networks)	Oversmoothing at depth	Drug discovery, recommendation, chemistry

Deep Learning's Unresolved Problems

Neural networks remain empirically powerful but theoretically poorly understood. Optimization landscapes span billions of dimensions, one per parameter, and researchers still cannot fully explain why gradient descent finds good solutions as reliably as it does. Interpretability, meaning the understanding of what individual neurons or layers represent, is an active research frontier. The double descent phenomenon, in which test error improves again as model size grows past the classical overfitting peak, confounds classical statistical learning theory. These are not academic footnotes; they are active obstacles to building reliably trustworthy AI systems at scale.
