Chapter 4. Indexes

There is no doubt that Cassandra can store a gigantic volume of data. However, that capacity is of little use if we cannot efficiently find what we want in such a data abyss. Cassandra provides very good support for searching and retrieving the desired data through the primary index and the secondary index.

In this chapter, we will look at how Cassandra uses the primary index and the secondary index to locate data. With an understanding of both, we can then design a high-performance data model.

Primary index

Cassandra is a column-based database. Each row can have a different number of columns. A cell is the placeholder for a value and its timestamp, and it is identified by a row and a column. Each cell can store a value of less than 2 GB. The rows are grouped into partitions. The number of cells per partition is limited: the number of rows times the number of columns must be less than 2 billion. Each row is identified by a row key, which determines the node that stores the row. The row keys of a table make up its primary key, and the primary index is simply the index created on the primary key.

A primary key can be defined on a single column or on multiple columns. In either case, the first component of a table's primary key is the partition key. Each node stores one data partition of the table and maintains its own primary index for the data that it manages. Therefore, each node knows which range of row keys it manages, and rows can be located by scanning the row indexes on the relevant replicas only. The range of primary keys that a node manages is determined by the partition key and by a cluster-wide configuration parameter called the partitioner. Cassandra provides three choices of partitioner, which will be covered later in this chapter.
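To make the idea of a partition key selecting a node more concrete, here is a minimal Python sketch of hash-based placement. It is a toy model and not Cassandra's implementation: the three node names, the token assignments, the ring size, and the helper names token_for and node_for are all assumptions made for illustration, and replication is ignored.

```python
import hashlib
from bisect import bisect_left

# Hypothetical ring: each node owns the token range that ends at its token.
RING_SIZE = 2 ** 127
NODE_TOKENS = {
    "node-a": RING_SIZE // 3,
    "node-b": 2 * RING_SIZE // 3,
    "node-c": RING_SIZE - 1,
}


def token_for(partition_key: str) -> int:
    """Hash the partition key to a token in [0, RING_SIZE)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def node_for(partition_key: str) -> str:
    """Return the node whose token range contains the key's token."""
    owners = {token: node for node, token in NODE_TOKENS.items()}
    tokens = sorted(owners)
    # First node token that is >= the key's token owns the key.
    idx = bisect_left(tokens, token_for(partition_key))
    return owners[tokens[idx]]


if __name__ == "__main__":
    for symbol in ("0001.HK", "0002.HK", "0003.HK"):
        print(symbol, "->", node_for(symbol))
```

The point of the sketch is that placement depends only on the partition key: the same key always hashes to the same token, so every read or write for that key is routed to the same node.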

A primary key is defined with the CQL keywords PRIMARY KEY, together with the column(s) to be indexed. Imagine that we want to store daily stock quotes in a Cassandra table called dayquote01. The CREATE TABLE statement creates a table with a simple primary key that involves only one column, as shown in the following screenshot:

[Screenshot: CREATE TABLE statement for dayquote01 with a single-column primary key]

The symbol field is assigned as the primary key of the dayquote01 table. This means that all the rows of the same symbol are stored on the same node, which makes the retrieval of these rows very efficient.

Alternatively, the primary key can be defined by an explicit PRIMARY KEY clause, as shown in the following screenshot:

[Screenshot: CREATE TABLE statement for dayquote01 with an explicit PRIMARY KEY clause]

Unlike relational databases, Cassandra does not enforce a unique constraint on the primary key, so there is no such thing as a primary key violation in Cassandra. An INSERT statement that uses an existing row key is allowed. Therefore, in CQL, INSERT and UPDATE act in the same way, a behavior known as UPSERT. For example, we can insert two records with the same symbol into the table dayquote01 and no primary key violation is reported, as shown in the following screenshot:

[Screenshot: two INSERT statements into dayquote01 with the same symbol, followed by the query result]

The returned query result contains only one row, not two rows as might be expected. This is because the primary key is the symbol, and the row in the second INSERT statement overwrote the record created by the first INSERT statement. There is no warning for a duplicate primary key. Cassandra simply and quietly updated the row. This silent UPSERT behavior can sometimes cause undesirable effects in the application logic.

Tip

It is very important for an application developer to handle duplicate primary key situations in the application logic. Do not rely on Cassandra to check the uniqueness for you.
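The silent overwrite and the kind of application-side check the tip calls for can be modeled with a short Python sketch. This is an in-memory stand-in and not a Cassandra client: the table dictionary, the functions upsert and insert_unique, and the close_price column are hypothetical names chosen for illustration.

```python
# Hypothetical in-memory stand-in for the dayquote01 table, keyed by symbol.
table = {}


def upsert(symbol, **columns):
    """Model of CQL INSERT/UPDATE: write to the row named by the key.

    No uniqueness check is made, so an existing row is modified in place.
    """
    table.setdefault(symbol, {}).update(columns)


def insert_unique(symbol, **columns):
    """Application-side guard: refuse to write when the key already exists."""
    if symbol in table:
        raise KeyError(f"duplicate primary key: {symbol}")
    upsert(symbol, **columns)


upsert("0001.HK", close_price=11.0)
upsert("0001.HK", close_price=12.0)  # silently replaces the earlier value

if __name__ == "__main__":
    print(table)
    try:
        insert_unique("0001.HK", close_price=13.0)
    except KeyError as err:
        print("rejected:", err)
```

Note that a check-then-write guard like insert_unique is only safe in this single-process model. Against a real cluster, two clients could both pass the check before either writes, so the guard needs to be designed with concurrency in mind.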

In fact, the reason why Cassandra behaves like this becomes clearer when we see how the internal storage engine stores the row, as shown by Cassandra CLI in the following screenshot:

[Screenshot: Cassandra CLI output showing the internal storage of the row with row key 0001.HK]

The row key is 0001.HK. It is used to locate the node that stores the row. Whenever we insert or update a row with the same row key, Cassandra blindly locates the row and modifies its columns accordingly, even when an INSERT statement has been used.
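The storage behavior described above can be pictured as a map from row key to timestamped cells. The following Python sketch is a simplified model under stated assumptions, not the actual storage engine: the storage dictionary, the write function, the counter used as a clock, and the open_price and close_price columns are all hypothetical.

```python
import itertools

# Stand-in for write timestamps: a simple increasing counter.
_clock = itertools.count(1)

# row key -> {column name: (value, timestamp)}
storage = {}


def write(row_key, columns, timestamp=None):
    """Locate the row by its key and set each given cell.

    A cell is replaced only when the incoming timestamp is not older than
    the stored one. Ties go to the incoming write in this sketch.
    """
    ts = next(_clock) if timestamp is None else timestamp
    row = storage.setdefault(row_key, {})
    for name, value in columns.items():
        current = row.get(name)
        if current is None or ts >= current[1]:
            row[name] = (value, ts)


write("0001.HK", {"open_price": 10.0, "close_price": 11.0})
write("0001.HK", {"close_price": 12.0})  # touches only the close_price cell

if __name__ == "__main__":
    print(storage["0001.HK"])
```

In this model the second write does not create a second row. It finds the existing row by its key and replaces only the cells it names, which is why the first write's open_price cell is still present afterwards.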

Although a single column primary key is not uncommon, a primary key composed of more than one column is much more practical.
