Neural Networks for Beginners: How AI Mimics the Brain (Part 5)

AI Fundamentals Series · Part 5 of 10 — Previous: Part 4: Data — The Fuel That Powers AI — Next: Part 6: Natural Language Processing

Opening the Black Box

So far in this series, you have seen that AI systems learn from data rather than following explicit rules (Part 3), and that the quality of that data determines much of an AI system's usefulness (Part 4). One crucial question remains: what is actually happening inside the model when it learns? What mathematical structure is doing the learning?

The answer, for most modern AI, is a neural network. This article demystifies neural networks without requiring you to understand any calculus or linear algebra. The goal is to give you a clear conceptual picture of how these structures work, because that picture will make Parts 6, 7, and 8 much easier to follow.

Inspiration from Biology (With Important Caveats)

The neural network concept was inspired by the structure of the human brain. Brains consist of roughly 86 billion neurons, each connected to thousands of others. When a neuron receives enough electrical signals from neighboring neurons, it “fires” and sends a signal onward. Learning, in biological brains, involves strengthening or weakening the connections between neurons based on experience.

Artificial neural networks borrow this general architecture: many simple units (artificial neurons), each connected to others, with adjustable connection strengths (weights). The similarity is real but should not be overstated. Artificial neural networks are much simpler than biological brains, operate on fundamentally different principles, and were not designed to precisely model neuroscience. They are better thought of as a powerful mathematical tool that was inspired by biology.

The Artificial Neuron

An artificial neuron is a simple mathematical function. Here is what it does, step by step:

Receive inputs: it receives several numbers as input. These might be pixel values from an image, word embeddings from text, or outputs from neurons in the previous layer.
Multiply by weights: each input is multiplied by a corresponding weight, a number that represents how important that particular input is. A high positive weight means the input strongly pushes the neuron toward activating. A high negative weight means the input strongly pushes against activation. A near-zero weight means the input is mostly ignored.
Sum everything up: all the weighted inputs are added together, along with a constant called a bias (which lets the neuron activate even when all inputs are zero, if needed).
Apply an activation function: the sum is passed through a mathematical function that introduces non-linearity. Without this step, stacking layers would gain nothing: any number of purely linear layers collapses into a single linear function, which cannot represent complex patterns. Common choices include ReLU (Rectified Linear Unit, which outputs zero for negative inputs and the input itself for positive ones) and sigmoid (which squashes any input into a value between 0 and 1).
Output: the result is passed to neurons in the next layer.

That is the entirety of a single artificial neuron: a weighted sum followed by a non-linear activation. Nothing mysterious. The power comes from combining millions or billions of these simple units in a structured architecture.
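The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function names and the example numbers are made up for this article.

```python
import math

def relu(x):
    """ReLU: zero for negative inputs, the input itself otherwise."""
    return max(0.0, x)

def sigmoid(x):
    """Sigmoid: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs, plus a bias, passed through an activation."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(total)

# Illustrative values only: three inputs, three weights, one bias.
inputs = [0.5, -1.0, 2.0]
weights = [0.8, 0.2, -0.5]
bias = 0.1

# The weighted sum here is about -0.7, so ReLU outputs 0.0
# and sigmoid outputs a value a little above 0.33.
print(neuron(inputs, weights, bias, relu))
print(neuron(inputs, weights, bias, sigmoid))
```

Swapping the activation function changes the output while the weighted sum stays the same, which is why the choice of activation is treated as a separate design decision.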

Layers: How Neurons Are Organized

Neurons in a neural network are arranged in layers. The standard structure has three types:

Input Layer

The first layer receives the raw data. For an image classifier, the input layer might have one neuron for each pixel in the image. A 28×28 pixel grayscale image has 784 pixels, so the input layer has 784 neurons. For text, inputs are typically numerical representations of words (more on this in Part 6).
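As a rough sketch of what "one neuron per pixel" means, the snippet below flattens a 28×28 grid into a single list of 784 input values. The image contents are invented, and scaling pixels to the range 0 to 1 is a common preprocessing habit assumed here rather than something the network requires.

```python
# A made-up 28x28 grayscale image: rows of pixel intensities from 0 to 255.
image = [[(row * 28 + col) % 256 for col in range(28)] for row in range(28)]

# Flatten row by row, scaling each pixel to the range 0..1 (an assumed convention).
input_vector = [pixel / 255.0 for row in image for pixel in row]

print(len(input_vector))  # 784, one value per input neuron
```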

Hidden Layers

Between the input and output layers lie one or more hidden layers. These layers do the computational heavy lifting. Each hidden layer learns to detect progressively more abstract patterns:

The first hidden layer might learn to detect simple edges and color gradients in an image.
The second hidden layer might combine those edges into textures and shapes like curves, circles, and corners.
The third hidden layer might recognize higher-level structures like eyes, noses, or wheels.
Deeper layers might recognize full objects like faces or cars.

The word “deep” in “deep learning” refers simply to networks with many hidden layers (often dozens or hundreds of them). Depth is what enables the hierarchical feature learning that makes modern AI so powerful.

Output Layer

The final layer produces the network's answer. For image classification into ten categories, the output layer might have ten neurons, each representing one category. The category whose neuron has the highest activation is the network's prediction. For language generation, the output layer might have one neuron for each possible word (or token) in the vocabulary, perhaps 50,000 neurons. In the simplest scheme, the network picks the token with the highest activation as the next one to generate; practical systems often sample among the highest-scoring tokens instead.
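Here is a small sketch of how ten output activations might be turned into a prediction. The scores are invented, and the softmax step (which converts raw scores into probabilities that sum to 1) is a common choice assumed for illustration; the text above only requires picking the highest activation.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw activations of ten output neurons (categories 0 to 9).
scores = [1.2, 0.3, 4.1, -0.5, 2.2, 0.0, 1.0, -1.3, 0.7, 3.0]

probabilities = softmax(scores)
predicted = max(range(len(probabilities)), key=lambda i: probabilities[i])

print(predicted)  # 2, the category with the highest activation
```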

Forward Pass: From Input to Prediction

When data enters the network and flows from the input layer through the hidden layers to the output layer, this is called a forward pass. At each layer, every neuron takes its inputs from the previous layer, computes its weighted sum plus bias, applies its activation function, and passes the result forward.

The first time you run a forward pass on a freshly initialized network, the output will be essentially random, because the weights were initialized randomly and have not been tuned yet. The entire purpose of training is to adjust those weights so that the forward pass produces correct predictions.
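A forward pass through a tiny untrained network might look like the sketch below. The layer sizes (4 inputs, 5 hidden neurons, 3 outputs), the input values, and all function names are assumptions chosen to keep the example small.

```python
import random

def relu(x):
    return max(0.0, x)

def identity(x):
    return x

def random_layer(n_inputs, n_neurons, rng):
    """Create one layer with random weights and zero biases (untrained)."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
               for _ in range(n_neurons)]
    biases = [0.0] * n_neurons
    return weights, biases

def layer_forward(inputs, weights, biases, activation):
    """Each neuron: weighted sum of all inputs, plus bias, then activation."""
    return [activation(sum(x * w for x, w in zip(inputs, neuron_weights)) + b)
            for neuron_weights, b in zip(weights, biases)]

def forward_pass(inputs, layers):
    """Feed the inputs through every layer in order."""
    values = inputs
    for weights, biases, activation in layers:
        values = layer_forward(values, weights, biases, activation)
    return values

rng = random.Random(0)  # fixed seed so the run is repeatable
hidden_w, hidden_b = random_layer(4, 5, rng)
output_w, output_b = random_layer(5, 3, rng)
layers = [(hidden_w, hidden_b, relu), (output_w, output_b, identity)]

output = forward_pass([0.2, 0.7, 0.1, 0.9], layers)
print(output)  # three numbers with no meaning yet, since nothing has been trained
```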

Backpropagation: How the Network Learns

Training a neural network involves repeatedly doing two things: a forward pass to generate a prediction, and then a backward pass (backpropagation) to correct the weights. Here is the intuition:

Make a prediction: run the input through the network and get an output.
Measure the error: compare the output to the correct answer using a loss function (a formula that measures how wrong the prediction was). If the network predicted 0.2 probability for the correct class but the correct answer was 1.0, the loss is high.
Assign blame: mathematically determine how much each weight contributed to the error. Weights that led to a large error get more blame. This is what backpropagation computes — it traces the error backward through the network, layer by layer, calculating each weight's contribution.
Nudge the weights: adjust each weight slightly in the direction that would have reduced the error. This adjustment process is called gradient descent. The size of the adjustment is controlled by a parameter called the learning rate.
Repeat: do this for millions of training examples, and the weights gradually converge toward values that make accurate predictions across the whole dataset.
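The loop above can be sketched for the simplest possible case: a single sigmoid neuron learning the logical OR of two inputs. With only one neuron, "assigning blame" is a one-step calculation; in a real multi-layer network, backpropagation repeats the same idea layer by layer. The dataset, the learning rate of 0.5, the 500 passes, and the zero starting weights are all assumptions made for this illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(inputs, weights, bias):
    return sigmoid(sum(x * w for x, w in zip(inputs, weights)) + bias)

def average_loss(data, weights, bias):
    """Cross-entropy loss: large when the predicted probability is far from the target."""
    total = 0.0
    for inputs, target in data:
        p = predict(inputs, weights, bias)
        total += -(target * math.log(p) + (1 - target) * math.log(1 - p))
    return total / len(data)

def train(data, learning_rate=0.5, epochs=500):
    # Zero starting weights are fine for a single neuron (an assumption of this sketch).
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in data:
            p = predict(inputs, weights, bias)   # make a prediction
            error = p - target                   # slope of the loss with respect to the weighted sum
            for i, x in enumerate(inputs):       # blame for each weight is error * its input
                weights[i] -= learning_rate * error * x   # nudge against the slope
            bias -= learning_rate * error
    return weights, bias

# Toy dataset: the logical OR of two inputs.
data = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]

loss_before = average_loss(data, [0.0, 0.0], 0.0)
weights, bias = train(data)
loss_after = average_loss(data, weights, bias)

print(loss_before, loss_after)  # the loss after training is much lower than before
```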

Here is an analogy: imagine you are trying to find the lowest point in a hilly landscape while blindfolded. You can only feel which direction is downhill right where you are standing. Gradient descent is exactly this: at each step, you feel for the slope and take a small step downhill. Over many steps, you move toward the bottom of a valley (a low-loss region).
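The blindfolded walk can be sketched with a one-dimensional landscape. The loss curve below, a simple bowl whose lowest point is at w = 3, is invented for illustration; real loss landscapes have millions of dimensions and many valleys.

```python
def loss(w):
    """A made-up landscape: a bowl whose lowest point is at w = 3."""
    return (w - 3.0) ** 2

def slope(w):
    """The slope of the bowl at position w (the derivative of the loss)."""
    return 2.0 * (w - 3.0)

def descend(start, learning_rate, steps):
    w = start
    for _ in range(steps):
        w -= learning_rate * slope(w)  # take a small step downhill
    return w

w_final = descend(start=-4.0, learning_rate=0.1, steps=100)
print(w_final)  # very close to 3.0, the bottom of the bowl
```

The learning rate sets the step size: too small and progress is slow, too large and each step can overshoot the valley.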

Why “Deep” Networks Are So Powerful

Shallow networks (one or two hidden layers) can theoretically approximate any continuous function given enough neurons, but in practice they often need an impractically large number of neurons to learn complex real-world patterns. Deep networks tend to learn hierarchical representations far more efficiently: each layer builds on the abstractions learned by the previous layer, so the network can capture sophisticated patterns with far fewer total parameters.

The key insight that unlocked deep learning around 2006–2012 was finding ways to train very deep networks without the error signal shrinking toward zero as it propagated backward through many layers, a problem called the vanishing gradient problem. Better activation functions (like ReLU), better initialization methods, and, a few years later, architectural tricks like skip connections (used in ResNet) all helped to mitigate it.
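A back-of-the-envelope sketch shows why the backward signal can vanish. It makes a strong simplifying assumption: the signal is multiplied only by the activation function's slope at each layer, ignoring the weights entirely. The sigmoid's slope is never more than 0.25, while ReLU's slope is exactly 1 for positive inputs.

```python
def gradient_scale(local_slope, num_layers):
    """How much a backward signal is scaled after passing through num_layers layers."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= local_slope
    return scale

SIGMOID_MAX_SLOPE = 0.25  # the steepest the sigmoid ever gets (at input 0)
RELU_ACTIVE_SLOPE = 1.0   # ReLU's slope for any positive input

print(gradient_scale(SIGMOID_MAX_SLOPE, 20))  # roughly 9e-13: the signal has all but vanished
print(gradient_scale(RELU_ACTIVE_SLOPE, 20))  # 1.0: the signal passes through unchanged
```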

How Large Are Modern Neural Networks?

The scale of modern neural networks is staggering:

Model	Year	Approximate Parameters
LeNet (digit recognition)	1998	~60,000
AlexNet	2012	~60 million
GPT-2	2019	~1.5 billion
GPT-3	2020	~175 billion
GPT-4 (unofficial estimate; not publicly disclosed)	2023	~1 trillion

Each “parameter” is one adjustable number in the network: a weight or a bias. Training these models requires adjusting all of those parameters over a huge number of update steps on enormous datasets, which is why AI training requires vast computational infrastructure.
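To get a feel for how parameter counts grow, here is a small sketch that counts the parameters of a fully connected network, where every neuron has one weight per input plus one bias (biases are usually counted as parameters too). The layer sizes, 784 inputs, 128 hidden neurons, and 10 outputs, are a hypothetical example.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection between the two layers
        total += n_out         # one bias per neuron in the receiving layer
    return total

print(count_parameters([784, 128, 10]))  # 101770
```

Even this toy network has over a hundred thousand parameters, more than LeNet in the table above, which hints at how quickly the numbers climb as layers get wider and deeper.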

Common Neural Network Architectures Beyond the Basics

The fully connected (dense) neural network we have described so far — where every neuron connects to every neuron in adjacent layers — is the conceptual foundation. But real-world applications use specialized architectures that are adapted to specific types of data. Understanding their names will help you follow AI news and research:

Convolutional Neural Networks (CNNs)

Designed for image data, CNNs use small sliding filters that detect local patterns (edges, textures) and share those filters across the entire image. This dramatically reduces the number of parameters compared to a fully connected network and makes the model naturally suited to spatial data. We explore CNNs thoroughly in Part 7: Computer Vision.
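The sliding-filter idea can be sketched in plain Python. The tiny 4×4 image and the 2×2 filter below are invented; the filter responds where pixel values jump from dark (0) to bright (1) between neighboring columns, which makes it a crude vertical-edge detector.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image; at each position, sum the element-wise products.

    (Strictly speaking this is cross-correlation, which is what CNN layers
    typically compute in practice.)
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(k_rows) for j in range(k_cols)))
        output.append(row)
    return output

# A made-up image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# A hypothetical filter that reacts to a dark-to-bright step between columns.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, kernel)
print(feature_map)  # each of the 3 rows is [0, 2, 0]: the edge sits in the middle
```

The same four filter weights are reused at every position, which is the parameter sharing described above: a fully connected layer producing the same 3×3 output from 16 pixels would need 16 × 9 = 144 weights instead.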

Recurrent Neural Networks (RNNs)

Designed for sequential data like text or time series, RNNs process inputs one step at a time and maintain a “memory” of previous steps through a hidden state that is updated at each position. RNNs were the dominant approach for language processing until Transformers largely replaced them after 2017. Their main weakness was difficulty capturing long-range dependencies — information from many steps back was often lost or distorted.
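The hidden-state update can be sketched with a deliberately tiny RNN whose "memory" is a single number. The weights (0.5 and 0.9), the tanh activation, and the input sequence are assumptions picked for illustration; real RNNs use vectors and learned weights.

```python
import math

def rnn_step(x, hidden, w_input, w_hidden, bias):
    """Combine the current input with the previous hidden state."""
    return math.tanh(w_input * x + w_hidden * hidden + bias)

def run_rnn(sequence, w_input=0.5, w_hidden=0.9, bias=0.0):
    hidden = 0.0  # the memory starts out empty
    states = []
    for x in sequence:
        hidden = rnn_step(x, hidden, w_input, w_hidden, bias)
        states.append(hidden)
    return states

# A signal at the first step, followed by silence.
states = run_rnn([1.0, 0.0, 0.0, 0.0, 0.0])
print(states)  # each value is smaller than the last: the memory of the first input fades
```

The steadily shrinking values are a small-scale picture of the long-range dependency weakness: the further back an input is, the less of it survives in the hidden state.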

Transformer Networks

The architecture that underpins virtually all modern large language models. Instead of processing sequences step by step, Transformers process all positions simultaneously using a mechanism called self-attention that allows every position to directly attend to every other position. This largely sidesteps the long-range dependency problem and scales extremely well with increased compute and data. We cover Transformers in detail in Part 6.
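A heavily simplified sketch of self-attention follows. It omits the learned query, key, and value projections that real Transformers use, so each position's vector plays all three roles; the three input vectors are invented. What it does show is the core idea: every position scores every other position, and its output is a weighted blend of all of them.

```python
import math

def softmax(scores):
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Simplified self-attention with no learned projections (an assumption of this sketch)."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Score this position against every position (scaled dot product).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        attention = softmax(scores)  # attention weights sum to 1
        # Blend all positions' vectors according to the attention weights.
        blended = [sum(a * v[d] for a, v in zip(attention, vectors))
                   for d in range(dim)]
        outputs.append(blended)
    return outputs

# Three positions, each represented by a made-up 2-dimensional vector.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(sequence)
print(attended)  # one blended vector per position
```

Nothing in the loop depends on the result for another position, so all positions can be computed at the same time, unlike the step-by-step processing of an RNN.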

Graph Neural Networks (GNNs)

Designed for data that is naturally represented as a graph — nodes connected by edges — such as molecular structures, social networks, and knowledge graphs. GNNs are particularly important in drug discovery and chemistry, where molecular structures are naturally modeled as graphs of atoms connected by chemical bonds.
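One round of the neighbor-to-neighbor information sharing that GNNs rely on can be sketched as below. The graph is a water molecule (one oxygen bonded to two hydrogens), each atom's feature is simply its atomic number, and the update rule, averaging a node with its neighbors, is an assumed simplification with no learned weights.

```python
def message_passing_round(features, neighbors):
    """Update every node to the average of its own feature and its neighbors' features."""
    updated = {}
    for node, own_feature in features.items():
        group = [own_feature] + [features[n] for n in neighbors[node]]
        updated[node] = sum(group) / len(group)
    return updated

# Water molecule: atoms are nodes, chemical bonds are edges.
features = {"O": 8.0, "H1": 1.0, "H2": 1.0}  # atomic numbers as a toy feature
neighbors = {"O": ["H1", "H2"], "H1": ["O"], "H2": ["O"]}

updated = message_passing_round(features, neighbors)
print(updated)  # O becomes 10/3 (about 3.33); each H becomes 4.5
```

After one round, each atom's value reflects what it is bonded to; repeating the round lets information travel further across the graph.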

The Intuition Behind Overfitting and Generalization

One of the most important practical concepts in neural network training is the distinction between memorizing and generalizing. A sufficiently large network can memorize the training data perfectly — achieving zero error on every training example. But a model that has merely memorized the training data without learning the underlying patterns will perform poorly on new, unseen examples. This failure mode is called overfitting.

Here is an analogy. Imagine a student who prepares for a history exam by memorizing the exact answers to past exam questions, without understanding the underlying history. They will ace any question that appeared on a previous exam, but fail on any new question that requires actual understanding. An overfitted neural network is doing the same thing: it has “memorized” the training data rather than learning the underlying patterns.

Researchers use several techniques to combat overfitting:

Regularization: penalizing models for having very large weight values, which tends to prevent them from fitting noise in the training data
Dropout: randomly disabling a fraction of neurons during each training step, forcing the network to learn redundant representations
Early stopping: monitoring performance on a held-out validation set and stopping training when validation performance stops improving, even if training error continues to decrease
Data augmentation: artificially expanding the training dataset by applying transformations (rotations, crops, color shifts for images), exposing the model to more varied examples
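Two of these techniques are simple enough to sketch directly. The dropout function below rescales the surviving activations so their expected total stays the same (a common convention, assumed here), and the early stopping function uses a hypothetical "patience" setting: how many epochs without improvement to tolerate. The validation losses are invented.

```python
import random

def dropout(activations, drop_probability, rng):
    """Randomly zero out activations; scale up the survivors to compensate."""
    keep_probability = 1.0 - drop_probability
    return [a / keep_probability if rng.random() < keep_probability else 0.0
            for a in activations]

def early_stopping_epoch(validation_losses, patience):
    """Return the epoch with the best validation loss seen before patience ran out."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

rng = random.Random(0)
print(dropout([1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 0.5, rng))  # a mix of 0.0 and 2.0 values

# Made-up validation losses: improving at first, then getting worse.
validation_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
print(early_stopping_epoch(validation_losses, patience=2))  # 3
```

In the early stopping example, the loss bottoms out at epoch 3 and then rises for two epochs in a row, so training stops and the weights from epoch 3 would be kept.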

A model that performs nearly as well on new data as on its training data is said to generalize well. Achieving generalization, rather than mere memorization, is the central challenge of machine learning practice.

Key Takeaways

A neural network is a collection of simple mathematical units (neurons) arranged in layers.
Each neuron computes a weighted sum of its inputs, adds a bias, and applies a non-linear activation function.
Information flows from input layer through hidden layers to the output layer in a forward pass.
Backpropagation traces the prediction error backward through the network to work out each weight's share of the blame; gradient descent then adjusts each weight slightly toward better performance.
Depth (many layers) enables hierarchical feature learning, which is why “deep learning” is so powerful.
The largest modern models have hundreds of billions of parameters, tuned on datasets that can run to trillions of tokens.

Now that you understand the structure of neural networks, we can apply that knowledge to specific domains. Part 6 will show how neural networks are adapted to understand and generate human language — the foundation of chatbots, translation, search, and much more.