What Is a Distributed System: CAP Theorem, Consistency, and Why It's Hard

What Is a Distributed System?

A distributed system is a collection of autonomous computing elements — processes, servers, or services — that communicate and coordinate their actions by passing messages over a network, while appearing to users as a single coherent system. The definition captures both what makes distributed systems useful (the combined capabilities of many machines) and what makes them hard (coordination across an unreliable network). A distributed system can provide more computing power, storage, or availability than any single machine, but it does so by introducing a new class of failure modes and design challenges absent from single-machine programs.

Distributed systems are everywhere in modern computing. The web application you use daily is likely a distributed system: your request hits a load balancer, which forwards it to one of many application servers, which queries a distributed database, which may in turn consult a distributed cache and a distributed message queue before returning a response. Each of these components runs on multiple machines simultaneously, coordinating through network messages. When you book a flight, reserve a hotel, process a payment, or stream a video, multiple distributed systems are involved, each handling their portion of the work while maintaining consistency with the others.

The theoretical foundation for distributed systems was largely established by a series of landmark papers and impossibility results in the 1970s through 1990s. Leslie Lamport's work on logical clocks and distributed algorithms, Fisher, Lynch, and Paterson's FLP impossibility result (showing that no asynchronous distributed consensus algorithm can guarantee termination in the presence of a single faulty process), and Eric Brewer's CAP theorem together defined the fundamental constraints within which all distributed systems must operate. Understanding these constraints is essential for building systems that work correctly under real-world conditions.

The CAP Theorem Explained

The CAP theorem, formulated by Eric Brewer in a 2000 keynote and proved by Gilbert and Lynch in 2002, states that a distributed system can guarantee at most two of the following three properties simultaneously: Consistency (C), Availability (A), and Partition tolerance (P). Consistency means that every read receives the most recent write or an error — all nodes see the same data at the same time. Availability means that every request receives a response (though not necessarily the most recent data). Partition tolerance means that the system continues to operate even when network messages between nodes are delayed or lost arbitrarily.

The crucial insight is that network partitions — failures that prevent some nodes from communicating with others — are not optional. Networks fail. Links drop packets, routers crash, and physical connections are severed. In any distributed system that operates across more than a single datacenter, and often even within a single datacenter, the possibility of a partition cannot be eliminated. This means partition tolerance is not a choice to be made: a distributed system must be partition-tolerant or it stops working whenever the network is unreliable. The real choice, during a partition, is between consistency and availability: the system must decide whether to refuse requests (to maintain consistency) or continue serving requests with potentially stale data (to maintain availability).

The practical implications are significant. A database that chooses consistency over availability during a partition will reject writes (and possibly reads) rather than risk returning stale data. This is the right choice for systems where serving stale data would cause serious harm — financial transaction processing, inventory management, reservation systems. A database that chooses availability over consistency will continue to accept reads and writes during a partition, accepting that nodes might temporarily diverge and must reconcile after the partition heals. This is appropriate for systems where occasional stale data is acceptable — social media feeds, shopping cart items, view counts. The CAP theorem doesn't tell you which choice to make; it tells you that you must make a choice and design for the consequences.

Consistency Models: A Spectrum

Consistency in distributed systems is not binary — there is a rich spectrum of consistency models between strict serializable consistency and complete inconsistency. Understanding this spectrum is essential for designing systems that provide the right guarantees for their use case while minimizing coordination overhead. Stronger consistency guarantees require more coordination (more network messages, higher latency) and come at a greater availability cost during partitions. Weaker consistency allows better performance and availability at the cost of more complex application-level reasoning about data correctness.

Linearizability (also called strong consistency or atomic consistency) is the strongest useful consistency model. It guarantees that operations appear to execute atomically at a single point in time, and that all clients see a consistent, real-time view of all data. Any read following a write returns the written value; operations have a global order consistent with real time. Linearizability is what most developers expect from a single-machine database, and it's what makes systems easy to reason about. The cost is high: achieving linearizability requires that all writes be acknowledged by a quorum of replicas before returning success, adding round-trip latency and coupling availability to quorum availability.

Eventual consistency is the weakest useful consistency model. It guarantees only that if no new updates are made to a given data item, eventually all reads will return the last updated value. There is no bound on how long "eventually" might be, and at any given point, different nodes may return different values for the same key. Amazon's Dynamo paper (2007) popularized eventual consistency as a design choice for high-availability, high-throughput storage systems where consistency could be relaxed in exchange for extreme availability and low latency. Many NoSQL databases default to eventual consistency. The application-level challenge is that programs must be written to tolerate and handle inconsistency — a significant cognitive burden that is frequently underestimated.

Consensus: The Core Problem of Distributed Systems

Many fundamental distributed systems problems reduce to consensus: how do a set of processes agree on a single value, even in the presence of failures? Leader election, distributed transactions, replicated state machines, and distributed locking all require consensus. The FLP impossibility result establishes that no deterministic consensus algorithm can guarantee termination in an asynchronous distributed system where even a single process may fail. This doesn't mean consensus is impossible in practice — it means that practical algorithms must make assumptions about timing (synchrony) or probabilistic guarantees, rather than offering absolute deterministic guarantees.

Paxos, developed by Leslie Lamport in the 1980s (published 1998), is the foundational consensus algorithm. It proceeds in two phases: a prepare/promise phase in which a proposer seeks commitment from a majority of acceptors not to accept proposals with lower IDs; and an accept/accepted phase in which the proposer asks acceptors to accept a specific value, which is decided when a majority accepts. Paxos is correct but notoriously difficult to understand and implement correctly. Its variants — Multi-Paxos, Fast Paxos, Flexible Paxos — address specific performance and complexity concerns, but remain complex. Google's Chubby distributed lock service and its successor systems, and Apache ZooKeeper, are built on Paxos variants.

Raft, designed by Diego Ongaro and John Ousterhout (published 2014), was explicitly designed to be more understandable than Paxos while providing equivalent safety guarantees. It separates consensus into distinct subproblems — leader election, log replication, safety — and makes decisions at each step clearer. A Raft cluster elects a single leader that handles all client requests; the leader replicates log entries to followers before acknowledging success. If the leader fails, a new election selects a new leader from among the up-to-date followers. Raft is the basis for CockroachDB, TiKV (used in TiDB), etcd (Kubernetes' distributed configuration store), and Consul. Its relative clarity has made it the algorithm of choice for new distributed systems development.

Distributed Transactions and the Two-Phase Commit

A distributed transaction must succeed or fail atomically across multiple nodes or services. In a relational database, ACID transactions guarantee this within a single database instance. Distributing a transaction across multiple database shards, multiple services, or multiple data stores removes the single coordinator that enforces atomicity, requiring explicit distributed coordination protocols.

Two-Phase Commit (2PC) is the classic protocol for distributed transactions. In the prepare phase, a coordinator asks each participant (database shard or service) to prepare the transaction — to verify that it can commit and to lock the affected resources. If all participants respond "ready," the coordinator sends a commit message in the commit phase; if any participant responds "abort" or fails to respond, the coordinator sends an abort message and all participants roll back. 2PC guarantees atomicity: either all participants commit or all abort. Its weakness is that it is blocking: if the coordinator fails after participants have responded "ready" but before sending commit or abort, participants remain locked, waiting for a decision that never comes — the "coordinator failure" problem.

Three-Phase Commit (3PC) was designed to address coordinator failure but introduces its own complexity and is rarely used in practice. Practical distributed systems have largely moved to saga patterns for managing distributed transactions in microservices architectures. A saga breaks a distributed transaction into a sequence of local transactions, each in a separate service, with compensating transactions that undo previously completed steps if a later step fails. Sagas avoid the blocking and coordinator dependency of 2PC at the cost of eventual (rather than immediate) consistency during the saga's execution — a window during which the system is in a partially committed state. Designing correct compensating transactions and handling all failure cases in a saga requires careful engineering and comprehensive testing.

Replication, Partitioning, and Data Distribution

Distributed databases achieve horizontal scalability through two fundamental techniques: replication and partitioning. Replication copies data across multiple nodes, improving read throughput (reads can be distributed across replicas) and availability (if one node fails, others can serve requests). Partitioning (also called sharding) divides data into subsets distributed across multiple nodes, improving write throughput and storage capacity by distributing load.

Leader-follower (primary-replica) replication designates one node as the leader that accepts writes; the leader replicates changes asynchronously or synchronously to follower replicas that serve reads. Synchronous replication (waiting for at least one follower to confirm before acknowledging the write to the client) improves durability but adds write latency. Asynchronous replication reduces write latency but risks data loss if the leader fails before replicating to any follower. The choice between synchronous and asynchronous replication involves the same fundamental trade-offs as the CAP theorem: durability and consistency versus performance and availability.

Multi-leader replication allows multiple nodes to accept writes, improving write availability and reducing latency in geographically distributed deployments by directing writes to the nearest leader. The challenge is write conflict resolution: if two clients write to the same record on different leaders simultaneously, both writes are initially accepted, creating a conflict that must be detected and resolved. Conflict resolution strategies — last-write-wins (by timestamp or version number), merge operations, application-defined resolution functions — each have failure modes and must be chosen with understanding of the specific application's data semantics. CRDTs (Conflict-free Replicated Data Types) are data structures designed so that concurrent updates can always be merged automatically without conflicts — a powerful technique for specific data types like counters, sets, and text documents.

Observability and Debugging Distributed Systems

Distributed systems are notoriously difficult to debug. A request that traverses ten services may fail in the eleventh, and tracing the failure requires correlating log entries across all twelve services, each potentially running on dozens or hundreds of nodes and generating thousands of log lines per second. Traditional debugging tools — adding print statements, attaching a debugger — don't work: you can't pause a production distributed system to step through code, and the failure may only manifest under specific load conditions or timing that a test environment can't replicate.

Distributed tracing addresses the observability challenge by propagating a unique trace identifier through all service calls in a request's causal chain. Each service records a "span" — a record of its portion of work, including duration, status, and contextual attributes — associated with the trace ID. Collecting all spans for a trace and reconstructing the call tree reveals exactly which service calls occurred, in what order, with what latency, and where failures occurred. Jaeger, Zipkin, Honeycomb, and Lightstep are widely used distributed tracing systems. The OpenTelemetry project provides a vendor-neutral SDK and wire protocol for collecting traces, metrics, and logs, enabling consistent instrumentation across polyglot microservices environments.

Chaos engineering — the practice of deliberately introducing failures into a running system to test resilience — emerged at Netflix as a methodology for building genuine confidence in distributed system reliability. Netflix's Chaos Monkey randomly terminates virtual machine instances in production; Chaos Kong takes down entire AWS regions. The principle is that if failures are inevitable, regularly exercising failure scenarios and fixing the resulting gaps in resilience is better than avoiding failures and discovering them at catastrophic scale. Chaos engineering frameworks like LitmusChaos and Gremlin provide controlled injection of network failures, resource exhaustion, and process termination for systematic resilience testing. The practice has spread beyond Netflix to become a recognized software engineering discipline, though adoption in risk-averse industries has been slower than among internet-scale companies with higher fault tolerance for planned experiments.

What Is a Distributed System: CAP Theorem, Consistency, and Why It's Hard