Reinforcement Learning Explained: How AI Learns by Trial and Error

An Algorithm That Taught Itself to Play Atari Games From Raw Pixels

In 2013, DeepMind introduced a system called Deep Q-Network (DQN) that learned to play Atari video games directly from raw pixel inputs and game score signals, with no game-specific knowledge, features, or rules provided. The original 2013 paper covered seven games; the 2015 follow-up published in Nature extended the evaluation to 49 games, and DQN reached at least 75% of a professional human tester's score on 29 of them, including Breakout (where it discovered, without being told, the strategy of digging a tunnel through the bricks so the ball bounces behind the wall) and Pong. Apart from the pixels, the only learning signal was the change in numerical game score. From this sparse reward, it developed sophisticated strategies through trial and error across millions of game frames. This work is widely regarded as the beginning of the deep reinforcement learning era.

The Reinforcement Learning Framework

Reinforcement learning (RL) is a computational framework for decision-making in which an agent learns to act in an environment to maximize a cumulative numerical reward signal. The fundamental components are:

Agent: The learner and decision-maker (a neural network, a software system, a robot controller).
Environment: Everything outside the agent that it interacts with (a game, a simulated physics world, real hardware, a financial market).
State (s): A representation of the current situation. May be the raw observation (pixels) or a structured encoding.
Action (a): A choice the agent can make from the set of available actions.
Reward (r): A scalar signal the environment sends to the agent after each action, indicating how good or bad the action was in context.
Policy (π): The agent's strategy — a mapping from states to actions (or probability distributions over actions).

The agent's goal is to find a policy π* that maximizes the expected cumulative discounted reward (the return): G = Σₜ γᵗ rₜ, summed over time steps t = 0, 1, 2, …, where γ ∈ [0,1) is a discount factor that weights near-term rewards more heavily than distant ones.
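As a small illustration of the return formula, the sketch below computes G for a finite list of rewards. The function name discounted_return and the example reward values are assumptions made for illustration only.

```python
def discounted_return(rewards, gamma=0.99):
    """Return G = sum over t of gamma**t * rewards[t] for a finite episode."""
    g = 0.0
    # Work backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Three rewards of 1 with gamma = 0.5 give 1 + 0.5 + 0.25.
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

With a smaller γ the later rewards contribute less, which is the sense in which the discount factor favors near-term reward.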

The Markov Decision Process

Reinforcement learning problems are formally modeled as Markov Decision Processes (MDPs), defined by a tuple (S, A, P, R, γ):

Symbol	Meaning	Example (chess)
S	State space	All possible board positions (~10⁴⁴)
A	Action space	All legal moves from current position
P(s'|s,a)	Transition probability	Deterministic (opponent's response not included)
R(s,a,s')	Reward function	+1 for win, −1 for loss, 0 otherwise
γ	Discount factor	~0.99 (future wins nearly as valuable)

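The Markov property means that the current state contains all relevant information for future decisions: given the state, the earlier history adds nothing. When the transition and reward functions are known and the state space is small enough to enumerate, this property lets dynamic programming methods such as value iteration compute optimal policies efficiently.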
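To make the dynamic programming idea concrete, here is a minimal value iteration sketch on a tiny MDP. The three-state MDP, its action names, and its probabilities are invented for illustration; only the (S, A, P, R, γ) structure comes from the definition above.

```python
# Hypothetical 3-state MDP, invented for illustration; state 2 is terminal.
# P[s][a] lists (probability, next_state, reward) outcomes,
# which encodes both P(s'|s,a) and R(s,a,s').
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(0.8, 2, 1.0), (0.2, 0, 0.0)]},
    2: {},
}

def backup(V, outcomes, gamma):
    """Expected one-step return: sum of P(s'|s,a) * (R + gamma * V(s'))."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)

def value_iteration(P, gamma=0.9, tol=1e-10):
    """Repeat Bellman optimality backups until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s, actions in P.items():
            if not actions:  # terminal state keeps value 0
                continue
            best = max(backup(V, o, gamma) for o in actions.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

def greedy_policy(P, V, gamma=0.9):
    """Pick, in each non-terminal state, the action with the best backup."""
    return {s: max(acts, key=lambda a: backup(V, acts[a], gamma))
            for s, acts in P.items() if acts}

V = value_iteration(P)
policy = greedy_policy(P, V)
```

Because each backup only needs the current state's outgoing transitions, no history has to be stored, which is exactly what the Markov property guarantees.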

Key RL Algorithms

Q-Learning and Deep Q-Networks

Q-learning learns the action-value function Q(s,a): the expected cumulative discounted reward for taking action a in state s and then following the optimal policy. After each observed transition (s, a, r, s'), the estimate is nudged toward a target derived from the Bellman optimality equation: Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') − Q(s,a)], where α is the learning rate. In Deep Q-Networks (DQN), a neural network approximates Q(s,a). Key innovations in DQN: experience replay (storing past transitions and sampling randomly to break temporal correlations) and target networks (using a periodically frozen copy of the Q-network to stabilize training).
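The update rule can be sketched in tabular form on a toy problem. The five-state chain environment, the step function, and the hyperparameters below are assumptions chosen for illustration; this is not DQN, which replaces the table with a neural network and adds experience replay and a target network. The behaviour policy here is uniformly random, which Q-learning tolerates because it is an off-policy method.

```python
import random

N_STATES = 5      # hypothetical chain: states 0..4, state 4 is terminal
ACTIONS = (0, 1)  # 0 = move left, 1 = move right

def step(s, a):
    """Deterministic toy dynamics: reward 1 only on reaching the last state."""
    s2 = max(0, s - 1) if a == 0 else s + 1
    done = s2 == N_STATES - 1
    return s2, (1.0 if done else 0.0), done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.choice(ACTIONS)  # uniformly random behaviour policy
            s2, r, done = step(s, a)
            # Terminal states contribute no future value to the target.
            target = r if done else r + gamma * max(Q[s2])
            # Q(s,a) <- Q(s,a) + alpha * [target - Q(s,a)]
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q

Q = q_learning()
# Greedy action in each non-terminal state after training.
greedy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
```

On this chain the learned greedy policy should move right in every state, since that is the only way to reach the rewarding terminal state.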

Policy Gradient Methods

Instead of learning value functions, policy gradient methods directly optimize the policy π by gradient ascent on expected reward. REINFORCE (1992) is the classic algorithm; more stable variants include:

Actor-Critic methods: Combine a policy network (actor) with a value function (critic) to reduce variance in gradient estimates.
Proximal Policy Optimization (PPO): Developed by OpenAI in 2017. It clips policy updates to prevent large destabilizing changes and is widely used for its robustness and simplicity. PPO was the RL algorithm in OpenAI's InstructGPT work on reinforcement learning from human feedback (RLHF), the approach ChatGPT builds on, and it remains a common choice for RLHF training.
Soft Actor-Critic (SAC): Maximizes both reward and policy entropy (encouraging exploration). It is off-policy and relatively sample-efficient, and is a common choice for continuous-control tasks such as robotics.
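The clipping idea in PPO can be illustrated with a short sketch of its clipped surrogate objective. The function name, argument names, and the default clip_eps of 0.2 (a commonly used value) are assumptions for illustration; a real implementation would compute this over batches with an automatic differentiation library and add value and entropy terms.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).

    logp_new / logp_old are log-probabilities of the taken actions under
    the updated and the data-collecting policy; advantages are estimates
    of how much better each action was than average.
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the minimum removes any gain from pushing the ratio
        # beyond the clip range in the direction the advantage favors.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Ratio 1.5 with a positive advantage is capped at 1.2 * advantage.
capped = ppo_clip_objective([math.log(1.5)], [0.0], [1.0])
```

When the new policy's probability ratio moves outside the clip range in the favorable direction, the objective stops improving, so gradient ascent has no incentive to make the update larger.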

Landmark Achievements in Deep RL

System	Year	Task	Algorithm	Achievement
DQN	2013/2015	Atari games	Deep Q-learning	At least 75% of a human tester's score on 29/49 Atari games
AlphaGo	2016	Go (19×19)	Monte Carlo Tree Search + policy/value networks	Defeated 18-time world champion Lee Sedol 4–1
AlphaGo Zero	2017	Go	Self-play RL from scratch	Defeated AlphaGo 100–0 without human game data
AlphaStar	2019	StarCraft II	Multi-agent RL	Grandmaster level; top 0.2% of human players
OpenAI Five	2019	Dota 2	PPO + self-play	Defeated world champions OG 2–0
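AlphaFold 2	2020	Protein folding	Evoformer network trained with supervised learning (not RL)	Highly accurate structure prediction on a 50-year grand challenge in biology; listed as a related deep learning milestone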

Exploration vs. Exploitation

A fundamental tension in RL is the explore-exploit dilemma. An agent must exploit known high-reward actions to maximize current performance, but explore unfamiliar actions to discover potentially better ones. Simple strategies include:

ε-greedy: With probability ε, take a random action; with probability 1−ε, take the greedy (best known) action. ε typically decays over training.
Upper Confidence Bound (UCB): Choose actions based on both estimated value and uncertainty — actions with high uncertainty get a bonus to encourage exploration.
Intrinsic motivation: Reward curiosity — add a bonus for visiting novel states. Effective in sparse-reward environments where extrinsic rewards are rare.
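The first two strategies can be sketched as action-selection functions for a multi-armed bandit. The function names, the argument layout (a list of value estimates and a list of pull counts), and the UCB1 form of the exploration bonus are assumptions for illustration; other UCB variants use different bonus terms.

```python
import math
import random

def epsilon_greedy(values, epsilon, rng=random):
    """With probability epsilon pick a random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))
    return max(range(len(values)), key=lambda a: values[a])

def ucb1(values, counts, c=2.0):
    """Pick the action maximizing value + sqrt(c * ln(total) / count)."""
    # Try every action once before trusting any estimate.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    total = sum(counts)
    return max(
        range(len(values)),
        key=lambda a: values[a] + math.sqrt(c * math.log(total) / counts[a]),
    )

# With epsilon = 0 the choice is purely greedy.
greedy_choice = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
# A rarely tried arm gets a large bonus and is selected despite a lower estimate.
explore_choice = ucb1([0.5, 0.4], counts=[100, 1])
```

The UCB bonus shrinks as an action's count grows, so exploration is directed at actions whose values are still uncertain rather than spread uniformly as in ε-greedy.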

RL in the Real World: Robotics, LLMs, and Beyond

Reinforcement learning from human feedback (RLHF) is now central to training large language models including ChatGPT, Claude, and Gemini. Human raters compare model outputs and indicate which is better; these preferences train a reward model; and an RL algorithm, commonly PPO, then fine-tunes the LLM to produce outputs that score higher under the reward model. In robotics, RL has been used to train locomotion and manipulation controllers for legged robots (including at Boston Dynamics), to teach factory robots assembly tasks, and, often combined with imitation learning, to help research robots learn from demonstrations. In drug discovery, RL is used to generate candidate molecules that optimize multiple drug-like properties simultaneously. RL is no longer confined to games; it is increasingly embedded in systems that make consequential decisions in the physical and digital world.
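The reward-model step in RLHF can be illustrated with a pairwise preference loss. The text does not say which loss is used, so the Bradley-Terry style loss below, −log σ(r_chosen − r_rejected), is an assumption based on a commonly described formulation, and the function name preference_loss is hypothetical. The inputs stand for the scalar scores a reward model assigns to the preferred and the rejected output.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(reward_chosen - reward_rejected)).

    Low when the reward model scores the human-preferred output higher,
    high when it ranks the pair the wrong way round.
    """
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(d)) = log(1 + exp(-d)); branch to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# Equal scores mean the model is undecided: the loss is log(2).
undecided = preference_loss(0.0, 0.0)
```

Minimizing this loss over many rated pairs pushes the reward model to score preferred outputs higher, which gives the subsequent RL stage a learned reward to optimize.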