Tradeoffs in modern distributed system design are everywhere — the CAP theorem is only part of the story. And it is still valid but we have a ton of critique on it. Why is the CAP theorem true? What does the CAP theorem explain? Do we have an alternative?
The first version of the CAP principle appeared as ACID versus BASE. But then it changed its definition slightly and acquired a proof, which turned it into a theorem.
The CAP theorem states that a distributed system can have at most two properties out of three simultaneously:
Consistency
Consistency here is quite different from the consistency guaranteed in ACID transactions(more about this later). In the formal proof, this property is called atomicity and is expressed in the existence of a distributed system of a common order of operations, which is similar to the order occurring in a non-distributed system. It is essentially linearizability. Thus, each read operation must return the value set in the last write operation.
Availability
The existence of this property means, that every query received by a healthy node entails an answer. Author, Eric Brewer originally put forward a softer requirement: "Almost all queries must get answers," but a strict version of this requirement was used in the proof of the theorem.
In practice, a time constraint should be added to this definition, since lack of it can lead to a result equivalent to no answer.
Partition tolerance
A system with this property will remain functional if an arbitrary number of messages sent between nodes on the network are lost. Partitioning means that all packets from the nodes of one partition do not reach the nodes of the other partition.
In practice, those messages whose delivery has exceeded a certain time limit are also considered lost. This is a consequence of the requirements for the functioning of the system, as well as a necessity in the case of asynchronous communication between nodes — the presence of asynchrony makes it difficult to detect lost nodes — you can not determine whether the message was not sent/received at all, or it goes slow due to delays in the network.
CAP explained
The easiest way to understand CAP is to think of two partitions on different nodes. Allowing at least one node to update its state will cause the nodes to become inconsistent, resulting in a loss of C. Likewise, if the choice is made to remain consistent, one side of the partition must act as if it is unavailable, resulting in the loss of A. Only when the nodes interact can both consistency and accessibility be preserved, leading to the loss of P.
The proof of the CAP theorem is considered in the context of asynchronous systems, characterized by the absence of timing(the nodes of the system make decisions based only on the messages received and local calculations), and in the context of partially synchronous systems(with timing, and the coincidence of time at nodes is not necessary — the main thing is that they equally count the rate of time flow).
Implications of the theorem
A consequence of the theorem for asynchronous systems is that only three combinations of consistency, availability, and partition tolerance are possible:
AP. Systems of this type respond to queries, but the data returned may not always be up-to-date, with slower updates of the data but "always" available. Examples of such a system are DNS, DynamoDB, and Cassandra.
CP. Systems of this type always return up-to-date data, but some, or even all, nodes in the system may not respond if partitioned. It gives atomic updates but can lead to timeouts. NoSQL databases such as Google BigTable, MongoDB, HBase, and Redis are all systems of this type.
CA. Systems of this type always return up-to-date data when there are no partitions. Because of the last limitation, usually, such systems are used only within one machine. Examples are classical relational databases.
In reality, we choose between CP and AP because CA is a monolith without partitions. For large-scale systems, designers cannot abandon P and therefore have a difficult choice between C and A.
In CA, node failure means complete unavailability of the service. But this does not disable scalability, since we can clone independent monoliths and distribute the load over them
Critique of the CAP Theorem
There are quite a few criticisms of the CAP theorem. For all its seemingly clear concept of triple constraint, the CAP theorem is criticized for oversimplifying important concepts, leading to a misunderstanding of its original meaning.
For example, in fact, the choice between consistency and availability is really only made when partitioning or failure occurs. At other times, no tradeoffs are required. Moreover, these choices can happen many times within the same system, the choices may change depending on the operation or even the specific data or the user.
Furthermore, in its original definition, the CAP theorem ignores time delays, although in practice latency and partitioning are deeply connected. In practice, the CAP nature occurs during a timeout, the period when the system must make a fundamental decision.
Finally, all three properties are more continuous than binary. Obviously, availability is continuous from 0 to 100 percent, but there are also many levels of consistency, and even partitions are nuanced, including disagreements within the system about whether a partition exists.
PACELC Theorem
The PACELC theorem described by Daniel J. Abadi is considered an alternative approach to the design of distributed systems. It is based on the CAP model, but in addition to consistency, availability, and partition tolerance it also includes latency and logical exclusion between combinations of these concepts.
Abadi divides system operation into two modes — operation without and with partitioning. Accordingly, the choices for the designer of a distributed system are as follows: in the case of partitioning, one must choose between availability and consistency, otherwise, that is, in the case of normal system operation, one must choose between reduced response time and consistency.
Conclusion
The formulation of the CAP theorem has been a significant event in the community, and studies of its impact on distributed systems design have shown that designers of distributed systems should not limit the system to two properties - they should strive to maximize the guarantees required in each particular case. For this purpose, it is reasonable to divide the system into segments, each of which has its own requirements, and to design the system based on the requirements of each of the segments.
The requirements of the CAP theorem apply to distributed systems in general, not only to databases. Today, however, there is no need for such a categorical division of systems into CA/CP/AP classes as the CAP theorem provides. Moreover, such classification is even considered erroneous, since the broad interpretation of CAP does not take into account the needs of modern distributed databases and Big Data applications, and therefore its application to these systems is not quite correct.
PACELC extends and clarifies the CAP theorem, regulating the need to find a compromise between time delay and data consistency in distributed Big Data systems.
Additional materials
- CAP Twelve Years Later: How the "Rules" Have Changed
- Cassandra: The Definitive Guide
- Please stop calling databases CP or AP and A Critique of the CAP Theorem by Martin Kleppmann