The following article analyses the applicability of the CAP theorem to Big Data. I will explain the CAP theorem, explore its three characteristics, and prove the theorem on an example closely related to a Big Data use case. I will also briefly discuss a couple of possible ways to deal with CAP-related issues in distributed Big Data applications, and offer an overview of the implementations that best fit each pair of the CAP properties.
What is the CAP Theorem?
Eric Brewer proposed the CAP theorem in early 2000. It essentially asserts that a distributed computer system cannot concurrently provide more than two of the following three assurances:
- Consistency
- Availability
- Partition Tolerance
I have created the following diagram (Figure 1) to clarify that only two of the three attributes of a distributed system can be supported simultaneously. I found that CAP theorem diagrams of this type almost always miss illustrating two of the most important factors. One is that the intersection of all three attributes should never be available (red cross area); the other is that the CAP theorem always refers to distributed networks (multiple network-connected server nodes).
Promises of CAP Theorem
So, let’s briefly explain the above illustration and the three promises summarized by the CAP theorem.
Consistency (also called ‘Atomic consistency’)
In the CAP theorem, consistency refers to a characteristic of a single request/response operation sequence. I will try to explain this in as plain terms as possible. Consistency in the CAP theorem essentially guarantees that if we write information to Server Node #1, then requesting that information from Server Node #2 will always return the same or newer data than the data we saved to Server Node #1.
And that is fundamentally the principle of consistency. Simply said, for a distributed network to be consistent, all data must be accurate across the entire distributed server structure. It is important to note that in a consistent network of server nodes it is entirely possible to receive a newer transaction on the node we are working with (and to which we are connected), but never information that is older than the information we saved during the last operation.
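The "same or newer data" guarantee can be sketched in a few lines of code. This is a minimal, illustrative simulation (the `Node` and `ConsistentCluster` classes are hypothetical, not a real database API): a write is acknowledged only after it has been replicated to every node, so a later read from any node returns data at least as new as what we wrote.

```python
class Node:
    """One server node holding a single versioned value."""
    def __init__(self, name):
        self.name = name
        self.version = 0
        self.value = None

    def apply(self, version, value):
        # Accept only newer versions -- never regress to older data.
        if version > self.version:
            self.version, self.value = version, value

class ConsistentCluster:
    """Synchronously replicates every write to all nodes before acknowledging."""
    def __init__(self, nodes):
        self.nodes = nodes
        self.next_version = 1

    def write(self, value):
        version = self.next_version
        self.next_version += 1
        for n in self.nodes:      # replicate everywhere first...
            n.apply(version, value)
        return version            # ...then acknowledge the write

node1, node2 = Node("Node #1"), Node("Node #2")
cluster = ConsistentCluster([node1, node2])

v = cluster.write("order-42")
# Reading from Node #2 now returns data at least as new as what we wrote.
assert node2.version >= v and node2.value == "order-42"
```

The key design point is that the acknowledgement comes after replication, which is exactly what rules out reading stale data from the other node.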
Availability
The availability part of the CAP theorem is equally simple to explain. When we refer to the availability of a distributed system, we are in essence discussing the accessibility and readiness promise of the CAP theorem. In terms of a distributed network of servers, this means that even if the server node we are trying to communicate with has failed, we will always be connected to another node that is available, contains the same or newer information, and responds to our queries.
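The failover behaviour described above can be sketched as a client that tries nodes in turn and accepts the first live response. This is a hedged, illustrative example (the class names and the `alive` flag are assumptions, not a real driver API):

```python
class Node:
    """A server node that may be up or down."""
    def __init__(self, name, alive=True):
        self.name, self.alive = name, alive

    def query(self):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"response from {self.name}"

def query_cluster(nodes):
    """Return the first successful response; fail only if every node is down."""
    for node in nodes:
        try:
            return node.query()
        except ConnectionError:
            continue  # that node failed -- fall through to the next one
    raise ConnectionError("no node available")

nodes = [Node("Node #1", alive=False), Node("Node #2")]
print(query_cluster(nodes))  # Node #1 failed, so Node #2 answers instead
```

From the client's point of view, the system stays available as long as at least one node can answer.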
Partition Tolerance
As should be obvious by now, the CAP theorem discusses a distributed network, so we are always talking about more than a single node. Partition tolerance in a distributed network speaks to the fact that server nodes must continue working even if there is a communications breakdown between two of the network nodes.
That should not be hard to clarify. Suppose we have two datacenters linked by a WAN (wide area network). Now imagine that the WAN connection is severed and the datacenters are suddenly disconnected. At this point, the network is partitioned: a node in one datacenter can no longer exchange information with the rest of the network in the other datacenter.
In order to meet the Partition Tolerance characteristic of the CAP theorem, the distributed system must guarantee that it keeps operating even when information can no longer flow from one server node to another.
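One common way a system keeps operating through a partition is to queue replication traffic while the link is down and deliver it once the link heals. The sketch below is illustrative only (an in-memory toy, not a real replication protocol):

```python
class Node:
    """A node that queues replication while the WAN link is down."""
    def __init__(self, name):
        self.name, self.data, self.outbox = name, {}, []
        self.peer, self.link_up = None, True

    def write(self, key, value):
        self.data[key] = value                 # keep serving during the partition
        if self.link_up:
            self.peer.data[key] = value        # replicate immediately...
        else:
            self.outbox.append((key, value))   # ...or queue it for later

    def heal_link(self):
        """Link restored: deliver the queued writes to the peer."""
        self.link_up = True
        for key, value in self.outbox:
            self.peer.data[key] = value
        self.outbox.clear()

a, b = Node("DC1"), Node("DC2")
a.peer, b.peer = b, a

a.link_up = False        # WAN severed
a.write("k", "v1")       # DC1 still accepts the write
assert "k" not in b.data # DC2 has not seen it yet

a.heal_link()            # partition ends
assert b.data["k"] == "v1"  # DC2 catches up
```

Note the trade-off this makes visible: the node stays operational through the partition, but the peer is temporarily behind, which is exactly the tension the rest of this article explores.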
Proof of CAP Theorem
To recap the CAP theorem in relation to distributed Big Data solutions (such as NoSQL databases), it is important to reiterate that in such distributed systems it is not possible to guarantee all three characteristics (Availability, Consistency, and Partition Tolerance) at the same time.
We can have at most two of these characteristics; there is no option to achieve all three of them concurrently.
Let’s prove the CAP theorem with an example. Suppose we have our two datacenters again, where each datacenter contains a single server node hosting a distributed NoSQL database (Figure 2).
A + P
Let’s assume that in our example, we have achieved Availability (the two data centers will never go down) as well as Partition Tolerance (the NoSQL nodes in both datacenters will work even if they cannot communicate).
That is great, but it immediately means that we cannot promise Consistency, since the NoSQL nodes will go out of sync as soon as we write new information to one of them. The nodes will continue to accept database transactions (separately), but won’t be able to exchange those transactions to keep each other in sync. And that means our system can be available and partition tolerant, but cannot simultaneously guarantee consistency.
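The divergence described above is easy to demonstrate. The following is a minimal sketch (hypothetical, in-memory "nodes", not a real NoSQL client): with the WAN link down, both nodes keep accepting writes, so their copies of the same key silently diverge and consistency is lost.

```python
class Node:
    """An always-available node that replicates only while the link is up."""
    def __init__(self, name):
        self.name, self.data = name, {}
        self.peer, self.link_up = None, True

    def write(self, key, value):
        self.data[key] = value          # always accept the write (Available)
        if self.link_up and self.peer:
            self.peer.data[key] = value  # replicate only while the WAN is up

dc1, dc2 = Node("DC1"), Node("DC2")
dc1.peer, dc2.peer = dc2, dc1

dc1.write("user:1", "alice@old.example")   # link up: both nodes agree
assert dc2.data["user:1"] == "alice@old.example"

dc1.link_up = dc2.link_up = False          # WAN severed (partition)
dc1.write("user:1", "alice@new.example")   # each side keeps working...
dc2.write("user:1", "alice@other.example")

# ...but the two copies no longer match: consistency is gone.
assert dc1.data["user:1"] != dc2.data["user:1"]
```

Both datacenters honoured every request (A) and survived the severed link (P), yet a read now depends on which node you happen to reach.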
A + C
This situation is practically impossible to achieve in a distributed system such as the one outlined in our example.
We can have both datacenters online and Available, as well as keep the data Consistent between the two NoSQL nodes, but the nature of WAN networks is that they can go down, meaning we cannot guarantee Partition Tolerance.
C + P
This can be demonstrated as follows. Let’s assume that we have data Consistency; in other words, the two NoSQL nodes always contain up-to-date, consistent information. Moreover, let’s also assume that we are able to uphold Partition Tolerance (keeping the system working through WAN faults while avoiding any desynchronization of our data).
This part of the CAP theorem speaks to the fact that we will lose Availability as soon as one of the two NoSQL nodes goes down, proving that we cannot be consistent and partition tolerant as well as always available.
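This trade-off can also be sketched in code. The example below is a hedged illustration (hypothetical classes, not a real database API): to stay consistent, a write must reach both nodes, so as soon as one node is down the write is refused and the system is no longer available.

```python
class Node:
    """A replica that may be up or down."""
    def __init__(self, name, alive=True):
        self.name, self.alive, self.data = name, alive, {}

def consistent_write(nodes, key, value):
    """All-or-nothing write: refuse unless every replica can accept it."""
    if not all(n.alive for n in nodes):
        raise ConnectionError("replica down: rejecting write to stay consistent")
    for n in nodes:
        n.data[key] = value

n1, n2 = Node("Node #1"), Node("Node #2")
consistent_write([n1, n2], "k", "v1")   # both nodes up: the write succeeds

n2.alive = False                        # one node goes down...
try:
    consistent_write([n1, n2], "k", "v2")  # ...and writes are now refused
except ConnectionError as e:
    print("unavailable:", e)
```

Refusing the write keeps the surviving copies identical (C), at the direct cost of turning clients away (no A).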
CAP Theorem and Big Data Solutions
As we have just learned, the CAP theorem declares that any data system that lives in a network can achieve at most two of the three most desired properties.
That said, it depends on the solution we are trying to architect. The question should always be: “What do we want to do with Big Data?”, because existing solutions are tailored to different combinations of the three characteristics.
Let’s look at some of the examples:
Availability + Partition Tolerance
Examples of AP NoSQL Databases:
- Key-Value Oriented (Dynamo, Voldemort, Tokyo Cabinet, KAI)
- Column Oriented (Cassandra)
- Document Oriented (SimpleDB, CouchDB, Riak)
Availability + Consistency
Examples of AC Databases:
- Relational (MySQL, PostgreSQL, Aster Data, Greenplum)
- Column Oriented (Vertica)
Consistency + Partition Tolerance
Examples of CP NoSQL Databases:
- Key-Value Oriented (Scalaris, Berkeley DB, MemcacheDB, Redis)
- Column Oriented (Bigtable, Hypertable, HBase)
- Document Oriented (MongoDB, Terrastore)
As per Eric Brewer, the father of the CAP theorem, we are limited to two of the three characteristics, and which ones we choose is up to us; however, “by explicitly handling partitions, designers can optimize consistency and availability, thereby achieving some trade-off of all three” (Brewer, E., 2012).

Regarding Big Data solutions, the CAP theorem can be quite deceptive, because nowadays WAN interruptions are infrequent, which means there is rarely a good reason to give up the Consistency or Availability of our distributed system, as it is going to be extremely rare that a partition ever disrupts it.

I haven’t found a better conclusion to this topic than the statement of Eric Brewer in his 2012 revision of the CAP theorem, in which he concludes that “All three properties are more continuous than binary. Availability is continuous from 0 to 100 percent, there are many levels of consistency, and even partitions have nuances. Exploring these nuances requires pushing the traditional way of dealing with partitions, which is the fundamental challenge. Because partitions are rare, CAP should allow perfect C and A most of the time, but when partitions are present or perceived, a strategy is in order.” (Brewer, E., 2012).
Brewer, E. (2012). CAP twelve years later: How the “rules” have changed. Computer, 45(2), 23-29. (Accessed: 27 January 2017).
Pokorny, J. (2013). NoSQL databases: a step to database scalability in a web environment. International Journal of Web Information Systems, 9(1), 69-82. (Accessed: 27 January 2017).
Bailis, P., & Ghodsi, A. (2013). Eventual consistency today: Limitations, extensions, and beyond. Communications of the ACM, 56(5), 55-63. (Accessed: 28 January 2017).
Gajendran, S. K. (2012). A survey on NoSQL databases. The university of Illinois. (Accessed: 28 January 2017).
Bailis, P., Davidson, A., Fekete, A., Ghodsi, A., Hellerstein, J. M., & Stoica, I. (2013). Highly available transactions: Virtues and limitations. Proceedings of the VLDB Endowment, 7(3), 181-192. (Accessed: 28 January 2017).
Gilbert, S., & Lynch, N. (2012). Perspectives on the CAP Theorem. Computer, 45(2), 30-36. (Accessed: 28 January 2017).
Brewer, E. (2010, July). A certain freedom: thoughts on the CAP theorem. In Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing (pp. 335-335). ACM. (Accessed: 29 January 2017).
Hugg, J. (2015) Disambiguating ACID and CAP. Available at: https://www.voltdb.com/blog/disambiguating-acid-and-cap (Accessed: 29 January 2017).
Availability (2017) in Wikipedia. Available at: https://en.wikipedia.org/wiki/Availability (Accessed: 29 January 2017).
NoSQL: What’s in it for me? – 推酷 (2013) Available at: http://www.tuicool.com/articles/AFjUZn (Accessed: 29 January 2017).
Perry, M. (2010) CAP theorem. Available at: https://www.youtube.com/watch?v=Jw1iFr4v58M (Accessed: 29 January 2017).