Chapter 1. Bird's Eye View of Cassandra

Imagine that we have turned back the clock to the 1990s and you are an application architect. Whenever you are required to select a suitable database technology for your applications, what kind would you choose? I bet that 95 percent (or more) of the time you would select a relational database.

Relational databases have been the dominant data management solution since the 1970s. At that time, application systems were usually siloed. The users of an application and their usage patterns were known and under control. The workload that had to be catered for by the relational database could be determined and estimated. Apart from the workload consideration, the data model could also be structured in normalized forms as recommended by relational theory. Moreover, relational databases provide many benefits such as support for transactions, data consistency, and isolation. Relational databases fit these purposes very well. Therefore, it is not difficult to understand why the relational database has been so popular and why it is the de facto standard for persistent data stores in application development.

Nonetheless, with the proliferation of the Internet and the numerous web applications running on it, control over the users and their usage patterns (hence the scale), the workload generated, and the flexibility of the data model was lost. Typical examples of these web applications were global e-commerce websites, social media sites, video community websites, and so on. They generated a tremendous amount of data in a very short period of time. It should also be noted that the data generated by these applications was not only structured, but also semi-structured and even unstructured. Since relational databases were the de facto standard at that time, developers and architects had few alternatives and were forced to tweak them to support these web applications, even though they knew that relational databases were suboptimal and had many limitations. It became apparent that a different kind of enabling technology had to be found to meet these challenges.

We are in an era of information explosion, as a result of the ever-increasing amount of user-generated data and content on the Web and in mobile applications. The generated data is not only large in volume and fast in velocity, but also diverse in variety. Such rapidly growing data of different varieties is often termed Big Data.

No one has a clear, formal definition of Big Data. People generally agree, however, that its most fundamental characteristics are large volume, high velocity, and great variety. Big Data imposes real, new challenges on information systems that have adopted traditional ways of handling data. These systems were not designed for web scale, nor can they be enhanced to reach it cost effectively. Due to this, you might find yourself asking whether or not we have any alternatives.

Challenges come with opportunities on the flip side. A new breed of data management products was born. The most recent answer to the question in the last paragraph is NoSQL.

What is NoSQL?

The need to tackle the Big Data challenges has led to the emergence of new data management technologies and techniques. Such technologies and techniques are rather different from the ubiquitous relational database technology that has been used for over 40 years. They are collectively known as NoSQL.

NoSQL is an umbrella term for data stores that are not based on the relational data model. It encompasses a great variety of database technologies and products. As shown in the following figure, The Data Platforms Landscape Map, and as listed at http://nosql-database.org/, there are over 150 different database products that belong to the non-relational school. Cassandra is one of the most popular ones. Other popular NoSQL database products include, to name just a few, MongoDB, Riak, Redis, and Neo4j.

The Data Platforms Landscape Map (Source: 451 Research)

So, what kinds of benefits are provided by NoSQL? When compared to the relational database, NoSQL addresses the following areas, which the relational data model does not handle well:

  • Huge volume of structured, semi-structured, and unstructured data
  • Flexible data model (schema) that is easy to change
  • Scalability and performance for web-scale applications
  • Lower cost
  • Impedance mismatch between the relational data model and object-oriented programming
  • Built-in replication
  • Support for agile software development

Note

Limitations of NoSQL Databases

Many NoSQL databases do not support transactions. They use replication extensively, so the data in the cluster might be momentarily inconsistent (although it is eventually consistent). In addition, range queries are limited or unavailable in many NoSQL databases. Furthermore, a flexible schema might make efficient searches harder.

The huge volume of structured, semi-structured, and unstructured data was mentioned earlier. What I want to stress here is that different NoSQL databases provide different solutions for each kind of data. The primary factor to be considered is the NoSQL database type, which will be introduced in the subsequent section.

All NoSQL databases provide a flexible data model that is easy to change, and some might even be schemaless. In a relational database, the relational data model is called a schema. You need to understand the data to be stored in a relational database, design the data model according to relational database theory, and define the schema upfront in the relational database before you can actually store data inside it. It is a very structured approach for structured data. It is a prescriptive data modeling process. It is absolutely fine if the data model is stable, because not many changes are required. But what if the data model keeps changing in the future and you do not know what needs to be changed? You cannot prescribe comprehensively in advance. This leads to many inevitable remedies, such as data patching, in order to change the schema.

Conversely, in NoSQL databases, you need not prescribe comprehensively. You only need to describe what is to be stored. You are not bound by the relational database theory. You are allowed to change the data model whenever necessary. The data model is schemaless and is a living object. It evolves as life goes on. It is a descriptive data modeling process.
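To make the descriptive approach a little more concrete, here is a minimal sketch in Python. It does not use the API of any real NoSQL product; the `users` list and the field names are hypothetical and only stand in for a schemaless collection. The point is simply that a later record can carry new fields without any schema change or data patching.

```python
# Hypothetical in-memory "collection"; this is NOT the API of a real database.
users = []

# First iteration of the application: only a name and an email are stored.
users.append({"id": 1, "name": "Alice", "email": "alice@example.com"})

# A later iteration adds new fields. No schema is altered and the
# existing record is left untouched.
users.append({
    "id": 2,
    "name": "Bob",
    "email": "bob@example.com",
    "twitter": "@bob",
    "interests": ["databases", "cycling"],
})

# The price of this flexibility: readers must tolerate missing fields.
handles = [user.get("twitter", "(none)") for user in users]
```

The application code, rather than the database, now carries the knowledge of which fields may be absent.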

Scalability and performance for web-scale applications refer to the ability of the system to be scaled, preferably horizontally, to support web-scale workloads without considerably deteriorating system performance. Relational databases can typically be scaled out only to a cluster consisting of a very small number of nodes. This imposes a rather low ceiling on web-scale applications that use relational databases. In addition, changing the schema in a clustered relational database is a big task of high complexity. The processing power required to do this is so significant that system performance is inevitably affected. Most NoSQL databases were created to serve web-scale applications. They natively support horizontal scaling with very little degradation in performance.

Now let us talk about money. Traditionally, most high-end relational databases are commercial products that require their users to pay huge software license fees. Besides, to run these high-end relational databases, the underlying hardware servers are usually high-end as well. The result is that the hardware and software costs of running a powerful relational database are exceptionally large. In contrast, the majority of NoSQL databases are open source and community-driven, meaning that the software license cost, if there is any, is typically an order of magnitude less than that of commercial databases. NoSQL databases are able to run on commodity machines, which are more prone to churn and crashes. Therefore, the machines are usually configured as a cluster. High-end hardware servers are not needed, and so the hardware cost is tremendously reduced. It should be noted that when NoSQL databases are put into production, some support cost is still required, but it is usually much less than that of commercial products.

There exists a generation gap between the relational data model and object-oriented programming. The relational data model was a product of the 1970s, whereas object-oriented programming became very popular in the 1990s. The root cause, known as impedance mismatch, is the inherent difficulty of representing a record or a table in a relational data model with the object-oriented model. Although there are workarounds for this difficulty, most application developers still find it frustrating to bring the two together.

Note

Impedance Mismatch

Impedance mismatch is the difference between the relational model and the in-memory data structures that are usually encountered in object-oriented programming languages.
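The mismatch can be illustrated with a small Python sketch. The `Order` and `OrderLine` classes and the two row layouts are hypothetical and were chosen only for this example. An order is one nested object in memory, but it has to be flattened into rows for two tables before it can be stored relationally, and reassembled when it is read back.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OrderLine:
    product: str
    quantity: int


@dataclass
class Order:
    order_id: int
    customer: str
    lines: List[OrderLine] = field(default_factory=list)


def to_rows(order):
    """Flatten one in-memory object into rows for two hypothetical tables."""
    order_row = (order.order_id, order.customer)
    line_rows = [
        (order.order_id, position, line.product, line.quantity)
        for position, line in enumerate(order.lines)
    ]
    return order_row, line_rows


def from_rows(order_row, line_rows):
    """Reassemble the object, which is what a join does on the read path."""
    order_id, customer = order_row
    ordered = sorted(line_rows, key=lambda row: row[1])
    lines = [OrderLine(product, qty) for (_, _, product, qty) in ordered]
    return Order(order_id, customer, lines)


order = Order(42, "Alice", [OrderLine("book", 2), OrderLine("pen", 5)])
order_row, line_rows = to_rows(order)
restored = from_rows(order_row, line_rows)
```

The `to_rows` and `from_rows` functions are exactly the kind of translation code that developers, or an object-relational mapping layer, must write and maintain.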

Built-in replication is a feature that most NoSQL databases provide to support high availability in a cluster of many nodes. It is usually automatic and transparent to application developers. Such a feature is also available in relational databases, but the database administrators usually have to configure, manage, and operate it themselves.

Finally, relational databases do not support agile software development very well. Agile software development is iterative by nature. The software architecture and data model emerge and evolve as the project proceeds in order to deliver the product incrementally. Hence, the need to change the data model to meet new requirements arises frequently. Relational databases are structured and do not accommodate change easily. NoSQL can provide such flexibility for agile software development teams by virtue of its schemaless characteristic. Even better, NoSQL databases usually allow the changes to be implemented in real time without any downtime.

NoSQL database types

Now you know the benefits of NoSQL databases, but the products that fall under the NoSQL umbrella are quite varied. How can you select the right one for yourself among so many NoSQL databases? The choice of NoSQL database really depends on the use cases at hand. The most important factor to consider here is the NoSQL database type, which can be subdivided into four main categories:

  • Key/value pair store
  • Column-family store
  • Document-based repository
  • Graph database

The NoSQL database type dictates the data model that you can use. It is beneficial to understand each of them in more depth.

Key/value pair store

The key/value pair store is the simplest NoSQL database type. A key/value store is similar in concept to the Windows registry, or to a map or hash in Java or C#. Each data item is represented as an attribute name, which is the key, together with its value. The key/value pair is also the basic unit stored in the database. Examples of NoSQL databases of the key/value pair type are Amazon Dynamo, Berkeley DB, Voldemort, and Riak.

Internally, key/value pairs are stored in a data structure called a hash map. A hash map is popular because it provides very good performance when accessing data. The key of a key/value pair is unique and can be looked up very quickly.
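As a rough illustration, a key/value store can be sketched in Python on top of the built-in `dict`, which is itself a hash map. The `KeyValueStore` class and the key naming are assumptions made for this example and do not mirror the API of any product named above.

```python
class KeyValueStore:
    """A toy key/value store backed by Python's dict, which is a hash map."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Keys are unique: writing an existing key replaces its value.
        self._data[key] = value

    def get(self, key, default=None):
        # Lookup by exact key takes constant time on average.
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("user:1001:name", "Alice")
store.put("user:1001:name", "Alice Smith")  # same key, value is overwritten
name = store.get("user:1001:name")
```

Note that the only efficient access path is the exact key. Nothing in this structure helps with a question such as "all keys between A and B".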

Key/value pairs can be stored and distributed on disk as well as in memory. When held in memory, a key/value store can be used as a cache which, depending on the caching algorithm, can considerably reduce disk I/O and hence boost performance significantly.
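The following Python sketch shows one possible caching algorithm, least recently used (LRU) eviction, in front of a slow lookup. LRU is only one choice among many and is an assumption of this example. The `LRUCache` class, the `load_from_disk` callback, and the `disk` dictionary that stands in for disk storage are all hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """In-memory key/value cache that evicts the least recently used key."""

    def __init__(self, capacity, load_from_disk):
        self._capacity = capacity
        self._load = load_from_disk   # called only on a cache miss
        self._items = OrderedDict()   # oldest entry first
        self.disk_reads = 0

    def get(self, key):
        if key in self._items:
            # Cache hit: mark the key as most recently used.
            self._items.move_to_end(key)
            return self._items[key]
        # Cache miss: pay for one "disk" read and remember the result.
        self.disk_reads += 1
        value = self._load(key)
        self._items[key] = value
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # evict least recently used
        return value


disk = {"a": 1, "b": 2, "c": 3}  # stands in for slow disk storage
cache = LRUCache(capacity=2, load_from_disk=disk.__getitem__)
for key in ["a", "b", "a", "a", "c", "a"]:
    cache.get(key)
```

In this run, six requests cause three reads from the simulated disk, because the frequently requested key stays in memory.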

On the flip side, key/value stores have some drawbacks, such as a lack of support for range queries, no way to operate on multiple keys simultaneously, and possible issues with load balancing.

Column-family store

A column in this context is not the same as a column in a relational table. In the NoSQL world, a column is a data structure that contains a key, a value, and a timestamp. Thus, it can be regarded as a combination of a key/value pair and a timestamp. Examples are Google Bigtable, Apache Cassandra, and Apache HBase. They provide optimized performance for queries over very large datasets.

A column-family store is basically a multi-dimensional map. It stores columns of data together as a row, which is associated with a row key. This contrasts with rows of data in a relational database. Unlike a relational database, a column-family store does not need to store null columns, and so it consumes much less disk space. Moreover, columns are not bound by a rigid schema and you are not required to define the schema upfront.

The key component of a column is usually called the primary key or the row key. Columns are stored in a sorted manner by the row key. All the data belonging to a row key is stored together. As such, read and write operations of the data can be confined to a local node, avoiding unnecessary inter-node network traffic in a cluster. This mechanism makes the data lookup and retrieval extremely efficient.
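The structure described above can be sketched as a map of maps in Python. This is a toy model, not how Cassandra or any other product stores data internally. The `ColumnFamily` class is hypothetical, and two behaviors are assumptions of this sketch: the timestamp is used to let the most recent write win, and the columns of a row are returned sorted by column name.

```python
from collections import namedtuple

# A column is a name (key), a value, and a timestamp.
Column = namedtuple("Column", ["name", "value", "timestamp"])


class ColumnFamily:
    """Toy multi-dimensional map: row key -> column name -> Column."""

    def __init__(self):
        self._rows = {}

    def insert(self, row_key, name, value, timestamp):
        row = self._rows.setdefault(row_key, {})
        current = row.get(name)
        # Assumption of this sketch: the newest timestamp wins.
        if current is None or timestamp >= current.timestamp:
            row[name] = Column(name, value, timestamp)

    def get_row(self, row_key):
        # All columns of a row are kept together and returned in name order.
        row = self._rows.get(row_key, {})
        return [row[name] for name in sorted(row)]


users = ColumnFamily()
users.insert("alice", "email", "alice@example.com", timestamp=1)
users.insert("alice", "city", "Hong Kong", timestamp=1)
users.insert("bob", "email", "bob@example.com", timestamp=1)  # no city column
users.insert("alice", "email", "alice@new.example.com", timestamp=2)
users.insert("alice", "email", "stale@example.com", timestamp=1)  # ignored
```

The row for `bob` simply has no `city` column. Nothing is stored for it, which is the point made earlier about null columns.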

A column-family store is not the best solution for systems that require ACID transactions, and it lacks support for the aggregate queries, such as SUM(), that relational databases provide.

Document-based repository

A document-based repository is designed for documents or semi-structured data. Its basic unit associates each key, a primary identifier, with a complex data structure called a document. A document can contain many different key/value pairs, key/array pairs, or even nested documents. Therefore, a document-based repository does not adhere to a fixed schema. Examples are MongoDB and CouchDB.

In practice, a document is usually a loosely structured set of key/value pairs in the form of JavaScript Object Notation (JSON). Document-based repository manages a document as a whole and avoids breaking up a document into fragments of key/value pairs. It also allows document properties to be associated with a document.

As a document database does not adhere to a fixed schema, search performance is not guaranteed. There are generally two approaches to querying a document database. The first is to use materialized views that are prepared in advance (as in CouchDB). The second is to use indexes defined on the document values (as in MongoDB), which behave in the same way as a relational database index.
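The second approach can be sketched in a few lines of Python. The `DocumentStore` class below is hypothetical and is not the MongoDB or CouchDB API. It keeps whole documents as nested structures and maintains one index on a chosen field, assuming that the indexed values are hashable. The `find_by_scan` method shows the alternative, which has to inspect every document.

```python
import json
from collections import defaultdict


class DocumentStore:
    """Toy document store with a single index on one document field."""

    def __init__(self, indexed_field):
        self._docs = {}
        self._field = indexed_field
        self._index = defaultdict(set)  # field value -> set of document ids

    def save(self, doc_id, document):
        old = self._docs.get(doc_id)
        if old is not None and self._field in old:
            # Remove the stale index entry before replacing the document.
            self._index[old[self._field]].discard(doc_id)
        self._docs[doc_id] = document
        if self._field in document:
            self._index[document[self._field]].add(doc_id)

    def get(self, doc_id):
        return self._docs.get(doc_id)

    def find_by_index(self, value):
        # Direct lookup: only matching ids are touched.
        return sorted(self._index.get(value, set()))

    def find_by_scan(self, field, value):
        # Without an index, every document must be examined.
        return sorted(doc_id for doc_id, doc in self._docs.items()
                      if doc.get(field) == value)


store = DocumentStore(indexed_field="city")
store.save("u1", {"name": "Alice", "city": "Hong Kong",
                  "tags": ["admin"], "address": {"zip": "0000"}})
store.save("u2", {"name": "Bob", "city": "London"})
store.save("u3", {"name": "Carol", "city": "Hong Kong"})
store.save("u4", {"name": "Dave"})  # documents need not share fields

as_json = json.dumps(store.get("u1"))  # a document maps naturally to JSON
```

Both query methods return the same identifiers. The difference is the amount of work done, which grows with the number of documents for the scan but not for the index lookup.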

Graph database

Graph databases are designed for storing information about networks, such as a social network. A graph is used to represent a highly connected network that is composed of nodes and their relationships. The nodes and relationships can have individual properties. Prominent graph databases include Neo4j and FlockDB.

Owing to the unique characteristics of a graph, graph databases commonly provide APIs for rapid traversal of graphs.
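As an illustration of what such a traversal does, the Python sketch below walks a tiny social graph breadth first. It is not the API of Neo4j or FlockDB. The node names, the `FOLLOWS` relationship type, and the `within_hops` function are all made up for this example.

```python
from collections import deque

# Nodes with individual properties (hypothetical sample data).
nodes = {
    "alice": {"age": 30},
    "bob": {"age": 25},
    "carol": {"age": 41},
    "dave": {"age": 35},
    "erin": {"age": 28},
}

# Outgoing relationships per node: (relationship type, target node).
edges = {
    "alice": [("FOLLOWS", "bob"), ("FOLLOWS", "carol")],
    "bob": [("FOLLOWS", "dave")],
    "dave": [("FOLLOWS", "erin")],
}


def within_hops(start, max_hops):
    """Breadth-first traversal: nodes reachable in at most max_hops steps."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distance[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for _relationship, neighbor in edges.get(current, []):
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    del distance[start]
    return distance


reachable = within_hops("alice", 2)
```

Each step follows relationships directly from one node to its neighbors, which is the access pattern that graph databases are built to make fast.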

Graph databases are particularly difficult to scale out with sharding, because traversing a graph whose nodes reside on different machines does not perform very well. It is also not straightforward to update all, or a subset, of the nodes at the same time.

So far, you have grasped the fundamentals of the NoSQL family. Since this book concentrates on Apache Cassandra and its data model, you need to know what Cassandra is and have a basic understanding of what its architecture is, so that you can select and leverage the best available options when you are designing your NoSQL data model and application.
