Data modeling considerations

Apart from modeling by query, we need to bear in mind a few other important points when designing a Cassandra data model. We can also take advantage of a few good patterns, which are introduced in this section.

Data duplication

Denormalization is an evil in a relational data model, but not in Cassandra. Indeed, it is a good and common practice. This practice is grounded in the fact that Cassandra does not require a high-end disk storage subsystem; it runs happily on commodity-grade hard drives, and hence disk space is cheap. Data duplication as a result of denormalization is by no means a problem anymore; Cassandra welcomes it.
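As a sketch of this practice (the table and column names here are hypothetical), the same stock quote data can be written to two tables, each keyed for a different query, instead of joining a normalized table at read time:

    CREATE TABLE quotes_by_symbol (
        symbol text,
        quote_date timestamp,
        price float,
        PRIMARY KEY (symbol, quote_date)  -- supports "all quotes for a symbol"
    );

    CREATE TABLE quotes_by_date (
        quote_date timestamp,
        symbol text,
        price float,
        PRIMARY KEY (quote_date, symbol)  -- supports "all quotes on a date"
    );

Writes go to both tables, trading cheap disk space for fast, join-free reads.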

Sorting

In a relational database, sorting can be easily controlled using the ORDER BY clause in a SQL query. Alternatively, a secondary index can be created to further speed up the sorting operations.

In Cassandra, however, sorting is decided by design: you must determine how data is to be compared for a column family at the time of its creation. The comparator of the column family dictates how the columns within a row are ordered on reads; that is, columns are sorted by their column names according to the comparator.
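In CQL terms, the comparator corresponds to the clustering order, which must be fixed when the table is created. A minimal sketch with hypothetical names:

    CREATE TABLE temperature_by_sensor (
        sensor_id text,
        reading_time timestamp,
        value float,
        PRIMARY KEY (sensor_id, reading_time)
    ) WITH CLUSTERING ORDER BY (reading_time DESC);

    -- Readings within a partition come back newest first, with no ORDER BY needed
    SELECT * FROM temperature_by_sensor WHERE sensor_id = 'th-001' LIMIT 10;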

Wide row

It is common to use wide rows for ordering, grouping, and efficient filtering. Alternatively, you can use skinny rows, which hold only a small, fairly static set of columns. The main factor to consider is the number of columns a row contains.

It is worth noting that for a column family storing skinny rows, the same column names are stored repeatedly in every row. Although this wastes some storage space, it is not a problem on inexpensive commodity hard disks.
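The contrast can be sketched in CQL with hypothetical tables. In the skinny design, each partition holds one entry with a fixed handful of columns; in the wide design, a clustering column lets a single partition grow to a large, variable number of entries:

    -- Skinny row: one entry per partition, a static set of columns
    CREATE TABLE user_profile (
        username text PRIMARY KEY,
        email text,
        city text
    );

    -- Wide row: one partition per user, one entry per login
    CREATE TABLE user_logins (
        username text,
        login_time timestamp,
        ip_address text,
        PRIMARY KEY (username, login_time)
    );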

Bucketing

Even though a wide row can accommodate up to 2 billion columns, that is a hard limit, and because a row is never split across nodes, a very large row can still fill up a node. To work around the 2 billion column limit, and to keep any single row from growing too large, we can use a technique called bucketing to split the data across multiple nodes.

Bucketing requires the client application to generate a bucket ID, which is often a random number. By including the bucket ID in a composite partition key, you can break up the data and distribute its segments to different nodes. However, bucketing should not be abused: splitting data across multiple nodes forces read operations to consume extra resources to merge and reorder the data. It is therefore expensive and should be used only as a last resort.
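A minimal sketch of bucketing with hypothetical names is shown below. The client generates the bucket value at write time; at read time it must query every bucket and merge the results itself:

    CREATE TABLE clicks_by_user (
        username text,
        bucket int,               -- e.g. a client-generated random number from 0 to 9
        click_time timestamp,
        url text,
        PRIMARY KEY ((username, bucket), click_time)
    );

Because the partition key is now the pair (username, bucket), one user's data is spread over up to ten partitions, and hence potentially ten different nodes.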

Valueless column

Column keys themselves can store values, as shown in the Modeling by query section. Because there is no "Not Null" concept in Cassandra, a column value can be left empty without any problem. Deliberately storing data in the column key while leaving the column value empty, known as a valueless column, is a common practice with Cassandra.

One motivation for valueless columns is the sort-by-column-key feature of Cassandra. Nonetheless, there are some limitations and caveats. The maximum size of a column key is 64 KB, in contrast to 2 GB for a column value, so the space in a column key is limited. Furthermore, using a timestamp alone as a column key can result in collisions when two entries are written with the same timestamp.
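In CQL terms, a valueless column corresponds to a table whose columns all belong to the primary key, so the data lives entirely in the keys. A sketch with hypothetical names:

    CREATE TABLE users_by_city (
        city text,
        username text,
        PRIMARY KEY (city, username)  -- no non-key columns at all
    );

    INSERT INTO users_by_city (city, username) VALUES ('London', 'alice');

    -- Usernames come back sorted, courtesy of the column-key ordering
    SELECT username FROM users_by_city WHERE city = 'London';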

Time-series data

What is time-series data? It is anything that varies over time, such as processor utilization, sensor data, clickstreams, and stock tickers. The stock quote data model introduced earlier is one such example. Cassandra is a perfect fit for storing time-series data. Why? Because one row can hold as many as 2 billion columns, and each row is laid out sequentially on disk, based on the storage model. Therefore, Cassandra can handle voluminous time-series data in a blazing fast fashion. The time-to-live (TTL) feature is another excellent way to simplify data housekeeping.
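As a brief illustration (names are hypothetical), a TTL set at write time makes each measurement expire automatically, with no housekeeping job required:

    CREATE TABLE cpu_utilization (
        host text,
        ts timestamp,
        pct float,
        PRIMARY KEY (host, ts)
    );

    -- Each reading silently disappears after 7 days (604,800 seconds)
    INSERT INTO cpu_utilization (host, ts, pct)
    VALUES ('app01', '2014-06-01 12:00:00', 42.5)
    USING TTL 604800;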

In the second half of this book, a complete stock quote technical analysis application will be developed to further explain the details of using Cassandra to handle time-series data.
