The problem of scalability and the CAP theorem

The requirement for a system to be scalable means that a system that supports a business now should also be able to support the same business, with the same quality of service, as it grows. Suppose a database stores 1 GB of data and effectively processes 100 queries per second. What if the business grows 100 times? Will the database be able to process 10,000 queries per second over 100 GB of data? Maybe not right away, and maybe not in the same installation. However, a scalable solution should be ready to be expanded to handle the load as soon as that is needed.

Usually, scalability comes together with a distributed architecture. If a database is able to utilize the power of multiple computers, then to scale it one only needs to add more computers to the cluster. There are solutions that do this. In fact, many NoSQL databases (even if they implement SQL) are distributed systems. One of the most spectacular examples of a scalable solution is Cassandra, a distributed NoSQL database. There are clusters of tens of thousands of nodes, operating on petabytes of data and performing hundreds of millions of operations per day. This would never be possible for a single server.

On the other hand, these systems are less flexible in functionality. Simply speaking, Cassandra provides key-value storage and does not provide consistency in the sense of the ACID principles. PostgreSQL, in contrast, is a relational database management system that supports large, concurrent, isolated transactions and integrity constraints.

There is a theorem in computer science, formulated by Eric Brewer in 2000 and called the CAP theorem, which states that a distributed data storage system can support only two of the following three features:

  • Consistency (defined differently from the ACID principles): Any read operation always returns the most current state of the data, that is, after the last write operation, regardless of which node was queried
  • Availability: Any request to the system is always successful (but the result may not be consistent)
  • Partition tolerance: A system can survive even if some parts of it are not available or if a cluster is separated

The theorem is named for the first letters of these features: CAP. It says that not all three features are possible at the same time. With a distributed system in mind, one may expect that it should tolerate the unavailability of some nodes. To achieve consistency, the system has to be able to coordinate all reads and writes, even if they happen on different nodes. To guarantee availability, the system needs to have copies of the same data on different nodes and to keep them in sync. Under these definitions, the two features cannot both be achieved when some nodes are not available: a node that cannot reach the others must either refuse the request or answer with data that may be out of date.
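The trade-off can be made concrete with a toy model. The following Python sketch is only an illustration and not how any real database is implemented: the `Replica` and `TinyCluster` classes, the `prefer_consistency` flag, and the `partitioned` switch are hypothetical names invented for this example. It models two replicas connected by a link that can be cut, and shows the two possible reactions to a write that arrives while the link is down.

```python
class Replica:
    """One node that stores a single value (hypothetical toy model)."""

    def __init__(self, name):
        self.name = name
        self.value = None


class TinyCluster:
    """Two replicas joined by a network link that can be cut."""

    def __init__(self, prefer_consistency):
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False
        self.prefer_consistency = prefer_consistency

    def write(self, node, value):
        other = self.b if node is self.a else self.a
        if self.partitioned and self.prefer_consistency:
            # Consistency first: refuse the write rather than let replicas diverge.
            raise RuntimeError("unavailable: the other replica cannot be reached")
        node.value = value
        if not self.partitioned:
            # The link is up, so the change is copied to the other replica at once.
            other.value = value

    def read(self, node):
        return node.value


# Availability first: the write succeeds, but the other replica is stale.
ap = TinyCluster(prefer_consistency=False)
ap.write(ap.a, "v1")
ap.partitioned = True
ap.write(ap.a, "v2")
stale = ap.read(ap.b)

# Consistency first: the replicas never disagree, but the write is rejected.
cp = TinyCluster(prefer_consistency=True)
cp.write(cp.a, "v1")
cp.partitioned = True
try:
    cp.write(cp.a, "v2")
    cp_write_ok = True
except RuntimeError:
    cp_write_ok = False

print(stale, ap.read(ap.a), cp_write_ok, cp.read(cp.b))
```

In the first cluster both requests succeed, yet a read from node `b` still returns the old value. In the second cluster the two nodes always agree, but the write issued during the partition fails.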

Cassandra provides availability and partition tolerance, so it is cAP:

  • Not consistent: It is possible that a read operation performed on one node does not see the data written on another node, until the data is replicated to all the nodes that are supposed to have it. This will not block the read operation.
  • Available: Read and write operations are always successful. 
  • Partition tolerant: If a node breaks, it does not break the whole database. When the broken node is returned to the cluster, it receives the missing changes in the data automatically.

One can raise the consistency level when working with Cassandra to make it fully consistent, but in that case it will no longer be partition tolerant: an operation that has to reach every replica fails as soon as one of them cannot be reached.
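The effect of the consistency level can be sketched with simple arithmetic. The snippet below is an illustration under stated assumptions, not Cassandra's implementation: it assumes a replication factor of 3 and uses two hypothetical helper functions, `replicas_overlap` and `tolerated_failures`. The level names ONE, QUORUM, and ALL correspond to Cassandra's consistency levels. A read is certain to see the latest write only when the set of replicas that acknowledged the write and the set consulted by the read must share at least one node.

```python
def replicas_overlap(replication_factor, write_acks, read_acks):
    """True when every read set must share a replica with every write set."""
    return write_acks + read_acks > replication_factor


def tolerated_failures(replication_factor, acks_required):
    """Number of replicas that may be down while the operation still succeeds."""
    return replication_factor - acks_required


rf = 3                  # assumed replication factor
quorum = rf // 2 + 1    # majority of the replicas

levels = {"ONE": 1, "QUORUM": quorum, "ALL": rf}
for name, acks in levels.items():
    print(
        name,
        "consistent reads:", replicas_overlap(rf, acks, acks),
        "replicas that may fail:", tolerated_failures(rf, acks),
    )
```

With these assumptions, ONE keeps working with two replicas down but may return stale data, while ALL always returns the latest data but cannot tolerate the loss of a single replica.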

Relational databases that comply with the ACID principles are consistent and available. PostgreSQL is CAp:

  • Consistent: Once a session commits a transaction, all other sessions can immediately see the results. Intermediate states of the data are not visible.
  • Available: When the database is running, all the queries will succeed if they are correct. 
  • Not partition tolerant: If part of the data becomes unavailable, the database stops working.

For example, banking operations require consistency. If money is transferred from one account to another, we expect that either both accounts' balances are updated or neither is. If, for some reason, the application that performs the transfer crashes after it has updated one of the accounts, the whole transaction will be canceled and the data will return to the previous consistent state. Availability of the data is also important. Even if the transaction is committed, what if the hard disk fails and the whole database is lost? Therefore, all operations have to be replicated to backup storage, and a transaction can only be considered successful when it is both consistent and durable. A money transfer in a bank can take some time because consistency is extremely important and performance has lower priority.
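The all-or-nothing behavior of such a transfer can be tried out in a few lines. The sketch below uses Python's built-in `sqlite3` module as a stand-in so that it runs without a database server; the same pattern of one transaction around both updates applies to PostgreSQL. The `accounts` table, the `transfer` and `balances` functions, and the `crash_midway` flag are hypothetical names chosen for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " id INTEGER PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
conn.commit()


def transfer(conn, src, dst, amount, crash_midway=False):
    """Move money between two accounts inside a single transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
            if crash_midway:
                # Simulate the application failing after the first update.
                raise RuntimeError("simulated crash")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
        return True
    except RuntimeError:
        return False


def balances(conn):
    """Return the current balances as a {id: balance} dictionary."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))


failed = transfer(conn, 1, 2, 30, crash_midway=True)
after_crash = balances(conn)   # the first update has been rolled back

succeeded = transfer(conn, 1, 2, 30)
after_success = balances(conn)

print(failed, after_crash, succeeded, after_success)
```

After the simulated crash neither balance has changed, because the first update was rolled back together with the rest of the transaction. Only the second call, which completes both updates, changes the data.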

If the online banking system is unavailable for a while because the bank needs to restore a database from a backup, the customers can tolerate the inconvenience because they do not want to lose their money.

On the other hand, if someone likes a picture on Instagram, the system tracks this action in the context of the picture and also in the context of the user who liked it. If either of the two operations fails, the data will not be consistent, but that is not critical. This does not mean that the data Instagram operates on is less valuable. The requirements are different. There are millions of users and billions of pictures, and nobody wants to wait when using the service. If likes are not displayed correctly, it does not matter too much, but if the whole system is unavailable for some time while it is recovered to a consistent state after a failure, the users may decide to leave.

This means that different business requirements call for different solutions, but unfortunately there is a natural limitation that makes it impossible to achieve everything at once.
