What Is Reinforcement Learning and How It Trained AlphaGo
Understand reinforcement learning, where AI agents learn through trial and error. Explore rewards, policies, Q-learning, and how DeepMind used RL to master the game of Go.
What Is Reinforcement Learning?
Reinforcement learning (RL) is a branch of machine learning where an agent learns to make decisions by interacting with an environment and receiving feedback in the form of rewards or penalties. Unlike supervised learning, which requires labeled examples of correct answers, RL agents discover optimal behavior through trial and error, gradually learning which actions lead to the best outcomes.
The concept is inspired by behavioral psychology -- specifically the idea that animals and humans learn through consequences. A dog learns to sit on command because the behavior is followed by a treat (positive reward). A child learns not to touch a hot stove because the behavior is followed by pain (negative reward). RL formalizes this intuition into a mathematical framework that computers can use.
RL is particularly powerful for problems where the correct action is not obvious, where sequences of decisions matter, and where the environment is complex or poorly understood. It has achieved superhuman performance in games, enabled robots to learn physical skills, and is being applied to challenges from drug discovery to traffic optimization.
Key Concepts: Agents, States, Actions, and Rewards
Every RL problem involves four core elements. The agent is the learner or decision-maker -- the algorithm that takes actions and learns from their consequences. The environment is everything the agent interacts with, including the rules, physics, and other entities in the system.
At each moment, the environment is in a particular state that describes its current configuration. In a chess game, the state is the position of all pieces on the board. In a self-driving car, the state includes the car's speed, position, surrounding vehicles, traffic signals, and road conditions.
The agent observes the state and selects an action from the available options. The environment responds by transitioning to a new state and providing a reward signal -- a numerical value indicating how good or bad the action was. The agent's objective is to learn a policy -- a strategy for choosing actions -- that maximizes the total cumulative reward over time, not just the immediate reward from a single action.
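The observe, act, and reward loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CoinFlipEnv` environment, the `always_heads` policy, and the helper names are hypothetical and chosen for brevity, not taken from any RL library.

```python
import random

class CoinFlipEnv:
    """Hypothetical toy environment: guess a biased coin for a few steps."""

    def __init__(self, p_heads=0.8, horizon=5, seed=0):
        self.p_heads = p_heads
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the state is simply the step index

    def step(self, action):
        # action: 0 = guess tails, 1 = guess heads
        coin = 1 if self.rng.random() < self.p_heads else 0
        reward = 1.0 if action == coin else -1.0
        self.t += 1
        done = self.t >= self.horizon
        return self.t, reward, done

def always_heads(state):
    """A fixed policy: a mapping from state to action."""
    return 1

def run_episode(env, policy):
    """Observe the state, act, receive a reward, repeat until done."""
    state = env.reset()
    rewards = []
    done = False
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        rewards.append(reward)
    return rewards

rewards = run_episode(CoinFlipEnv(), always_heads)
total = sum(rewards)  # the cumulative reward the agent tries to maximize
print(rewards, total)
```

A learning agent would replace `always_heads` with a policy that it adjusts based on the rewards it collects.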
Exploration vs. Exploitation
One of the fundamental challenges in RL is the exploration-exploitation dilemma. The agent must balance two competing needs: exploiting actions it already knows yield good rewards, and exploring new actions that might yield even better rewards.
Imagine you have discovered a restaurant you enjoy. You could eat there every night (exploitation), but if you never try other restaurants (exploration), you might miss an even better option. On the other hand, if you spend all your time trying new restaurants, you might waste many evenings on mediocre meals. The optimal strategy involves exploring enough to build a good understanding of the options while increasingly exploiting the best ones as knowledge grows.
RL algorithms handle this tradeoff through strategies like epsilon-greedy, where the agent usually takes the best-known action but occasionally (with probability epsilon) takes a random action to explore. More sophisticated approaches use methods like Upper Confidence Bound (UCB) or Thompson sampling, which direct exploration toward actions with high uncertainty rather than choosing randomly.
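As a rough illustration of epsilon-greedy, the sketch below runs it on a multi-armed bandit, a one-state problem where each action ("arm") pays a noisy reward. The function name, the arm means, and the Gaussian noise are assumptions made for the example.

```python
import random

def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=5000, seed=0):
    """Estimate each arm's value while mostly pulling the best-known arm."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # how often each arm was pulled
    estimates = [0.0] * n_arms   # running average reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            # exploit: pick the arm with the highest current estimate
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = rng.gauss(true_means[arm], 1.0)  # noisy payoff (assumed)
        counts[arm] += 1
        # incremental update of the running mean
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates

counts, estimates = epsilon_greedy_bandit([0.2, 0.5, 1.0])
print(counts, [round(e, 2) for e in estimates])
```

With a small epsilon, most pulls should end up on the arm with the highest true mean, while the occasional random pull keeps the estimates for the other arms from going stale.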
Q-Learning and Value Functions
Q-learning is one of the most foundational RL algorithms. It works by learning a function Q(s, a) that estimates the expected cumulative reward of taking action a in state s and then following the optimal policy thereafter. Once this Q-function is accurately learned, the optimal policy is simply to choose the action with the highest Q-value in each state.
The Q-function is updated iteratively using the Bellman equation, which expresses the value of a state-action pair in terms of the immediate reward plus the discounted value of the best action in the next state. A discount factor (typically between 0.9 and 0.99) controls how much the agent values future rewards relative to immediate ones -- a lower discount factor makes the agent more short-sighted.
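A minimal sketch of this update, assuming a hypothetical five-state corridor where the agent starts at the left end and earns a reward of 1 for reaching the right end. The environment, the learning rate `alpha`, and the other hyperparameters are illustrative choices.

```python
import random

N_STATES = 5          # states 0..4 on a line; state 4 is terminal
ACTIONS = (-1, +1)    # move left or move right

def step(state, action):
    """Deterministic corridor: reward 1.0 only on reaching the last state."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                # exploit, breaking ties at random
                best = max(q[(state, a)] for a in ACTIONS)
                action = rng.choice([a for a in ACTIONS if q[(state, a)] == best])
            next_state, reward, done = step(state, action)
            # Bellman target: immediate reward plus the discounted value
            # of the best action in the next state (0 if the episode ended)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            target = reward + gamma * best_next
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = next_state
    return q

q = q_learning()
greedy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}
print(greedy)
```

After training, the greedy policy read off the table should move right in every non-terminal state, and the Q-values should shrink by roughly a factor of `gamma` for each extra step away from the goal.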
Traditional Q-learning stores Q-values in a table, which works for problems with small, discrete state and action spaces. For complex problems like video games or robotics, where the state space is enormous or continuous, deep Q-networks (DQN) replace the table with a neural network that approximates the Q-function. DeepMind introduced DQN in 2013 on a handful of Atari games, and its 2015 follow-up showed that a single algorithm could learn dozens of Atari games directly from pixel inputs, reaching or exceeding human-level play on many of them. It was a landmark result that reignited interest in RL.
Policy Gradient Methods
An alternative approach to value-based methods like Q-learning is policy gradient methods, which directly learn the policy -- the mapping from states to actions -- without first learning a value function. The policy is represented as a parameterized function (often a neural network) that outputs the probability of taking each action given the current state.
The algorithm adjusts the policy parameters using gradient ascent on the expected cumulative reward. Actions that led to high rewards have their probabilities increased, while actions that led to low rewards have their probabilities decreased. The REINFORCE algorithm is the simplest policy gradient method, though it suffers from high variance in its gradient estimates.
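The idea can be sketched with REINFORCE on a three-armed bandit, using a softmax policy over one preference value per action. The arm means, learning rate, and function names are assumptions for illustration, and no baseline is used, so the updates are noisy.

```python
import math
import random

def softmax(prefs):
    """Turn preference values into action probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_bandit(true_means, lr=0.1, steps=3000, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * len(true_means)  # policy parameters, one per action
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs)[0]
        reward = rng.gauss(true_means[action], 0.5)  # noisy payoff (assumed)
        # For a softmax policy, d log pi(action) / d theta[k]
        # equals (1 if k == action else 0) - probs[k].
        for k in range(len(theta)):
            indicator = 1.0 if k == action else 0.0
            # gradient ascent step, weighted by the sampled reward
            theta[k] += lr * reward * (indicator - probs[k])
    return softmax(theta)

probs = reinforce_bandit([-0.5, 0.0, 1.0])
print([round(p, 3) for p in probs])
```

Because each update is scaled by a single sampled reward, individual steps can point in the wrong direction. On average, though, probability mass should shift toward the arm with the highest mean.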
Actor-critic methods, including widely used algorithms such as Proximal Policy Optimization (PPO), combine the strengths of value-based and policy-based approaches. An actor-critic architecture has two learned components, typically neural networks: an "actor" that learns the policy and a "critic" that learns the value function. The critic's value estimates serve as a reference point for the actor's updates, which reduces variance in the policy gradient estimates and leads to more stable and efficient learning.
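A minimal one-step actor-critic sketch, assuming a hypothetical five-state corridor with a reward of 1 at the right end. Tables stand in for the two neural networks, and all names and hyperparameters are illustrative. The critic's temporal-difference (TD) error replaces the raw return as the learning signal for the actor.

```python
import math
import random

N_STATES = 5          # states 0..4 on a line; state 4 is terminal
ACTIONS = (-1, +1)    # move left or move right

def step(state, action):
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic(episodes=2000, actor_lr=0.1, critic_lr=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    prefs = [[0.0, 0.0] for _ in range(N_STATES)]  # actor: action preferences
    values = [0.0] * N_STATES                      # critic: state values
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            probs = softmax(prefs[state])
            idx = rng.choices((0, 1), weights=probs)[0]
            next_state, reward, done = step(state, ACTIONS[idx])
            next_value = 0.0 if done else values[next_state]
            # TD error: how much better or worse the step went than expected
            td_error = reward + gamma * next_value - values[state]
            values[state] += critic_lr * td_error  # critic update
            for k in range(2):                     # actor update
                indicator = 1.0 if k == idx else 0.0
                prefs[state][k] += actor_lr * td_error * (indicator - probs[k])
            state = next_state
    return prefs, values

prefs, values = actor_critic()
right_probs = [softmax(prefs[s])[1] for s in range(N_STATES - 1)]
print([round(p, 2) for p in right_probs])
```

The actor update has the same form as REINFORCE, but it is scaled by the critic's TD error rather than by a full sampled return, which is where the variance reduction comes from.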
How AlphaGo Mastered the Game of Go
The game of Go was long considered the ultimate challenge for game-playing AI. The board is a 19x19 grid of 361 intersections, which allows approximately 10^170 legal positions -- vastly more than the roughly 10^80 atoms in the observable universe. Traditional game tree search methods that worked for chess were impractical for Go, both because of the number of possible moves at each turn and because Go positions are hard to evaluate with hand-written rules.
DeepMind's AlphaGo combined deep learning and reinforcement learning in a groundbreaking architecture. First, a supervised learning phase trained a neural network on roughly 30 million positions from human expert games to predict the expert's next move. This gave the network a strong starting policy that mimicked human play patterns.
Then, reinforcement learning took over. AlphaGo played millions of games against earlier versions of itself, using policy gradient methods to improve on the policy learned from human games. A separate value network learned to evaluate board positions, estimating the probability of winning from any given state. During actual gameplay, AlphaGo combined these neural networks with Monte Carlo tree search (MCTS), using the policy network to guide which branches to explore and the value network to evaluate positions, so the search did not have to rely solely on playing simulated games to completion.
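To make the interplay between the two networks concrete, here is a simplified sketch of a move-selection rule in the style used by AlphaGo's tree search. Each candidate move is scored by its average value so far plus an exploration bonus that is proportional to the policy network's prior and shrinks as the move is visited. The function name, the dictionaries, and the numbers are hypothetical, and the real system involves much more, including tree expansion, value backup, and parallel search.

```python
import math

def select_move(priors, visit_counts, total_values, c_puct=1.0):
    """Pick the move maximizing Q + U at one node of the search tree.

    priors:       move -> probability from the policy network (assumed given)
    visit_counts: move -> number of times the search has tried the move
    total_values: move -> sum of value estimates backed up through the move
    """
    parent_visits = sum(visit_counts.values())

    def score(move):
        n = visit_counts[move]
        q = total_values[move] / n if n > 0 else 0.0  # average value so far
        # Exploration bonus: large for moves the policy likes,
        # decaying as the move accumulates visits.
        u = c_puct * priors[move] * math.sqrt(parent_visits) / (1 + n)
        return q + u

    return max(priors, key=score)

priors = {"A": 0.6, "B": 0.3, "C": 0.1}
visit_counts = {"A": 10, "B": 2, "C": 0}
total_values = {"A": 5.0, "B": 1.5, "C": 0.0}
print(select_move(priors, visit_counts, total_values))
```

In this made-up position, move B is chosen. It has a lower prior than A, but its higher average value and low visit count give it the best combined score.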
In March 2016, AlphaGo defeated Lee Sedol, one of the world's top Go players, 4 games to 1. Its successor, AlphaGo Zero, learned entirely through self-play without any human game data. DeepMind reported that it overtook the version that beat Lee Sedol after about three days of training and surpassed every earlier version within 40 days. This demonstrated that RL could discover strategies superior to centuries of accumulated human knowledge.
Applications Beyond Games
While games have been RL's most visible success, the technology is increasingly applied to real-world problems. In robotics, RL enables robots to learn manipulation skills like grasping objects, walking, and assembling components through simulated and real-world practice rather than explicit programming.
Data center optimization is another success story. Google has reported that a learning-based control system developed with DeepMind cut the energy used for cooling in its data centers by roughly 30 to 40 percent. The system continuously adjusts many variables -- fan speeds, cooling tower settings, and pump configurations -- to maintain safe temperatures while minimizing power use.
In healthcare, RL is being explored for treatment optimization, where the agent learns to recommend medication dosages or treatment sequences that maximize patient outcomes over time. In finance, RL agents learn trading strategies and portfolio management policies. The common thread across these applications is sequential decision-making under uncertainty, the fundamental problem that RL was designed to solve.