There is no doubt that Cassandra can store a gigantic volume of data effortlessly. However, that capacity is meaningless if we cannot efficiently find what we want in such a data abyss. Cassandra provides very good support for searching and retrieving the desired data through its primary index and secondary indexes.
In this chapter, we will look at how Cassandra uses the primary index and the secondary index to locate data. After developing an understanding of them, we can then design a high-performance data model.
Cassandra is a column-based database. Each row can have a different number of columns. A cell is the placeholder for a value and its timestamp, and it is identified by a row and a column. Each cell can store a value of less than 2 GB. Rows are grouped into partitions. The maximum number of cells per partition is limited by the condition that the number of rows times the number of columns must be less than 2 billion. Each row is identified by a row key that determines which machine stores the row. In other words, the row key determines the node location of the row. A list of row keys of a table is known as a primary key. The primary index is simply the index created on the primary key.
A primary key can be defined on a single column or on multiple columns. In either case, the first component of a table's primary key is the partition key. Each node stores a data partition of the table and maintains its own primary index for the data that it manages. Therefore, each node knows which ranges of row keys it manages, and rows can be located by scanning the row indexes only on the relevant replicas. The range of primary keys that a node manages is determined by the partition key and a cluster-wide configuration parameter called the partitioner. Cassandra provides three choices of partitioner, which will be covered later in this chapter.
A primary key can be defined by the CQL keywords PRIMARY KEY, with the column(s) to be indexed. Imagine that we want to store the daily stock quotes into a Cassandra table called dayquote01. The CREATE TABLE statement creates a table with a simple primary key that involves only one column, as shown in the following screenshot:
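The screenshot is not reproduced here. A minimal sketch of such a statement might look like the following. Only the table name dayquote01 and the symbol column come from the text; the remaining column names and types are assumptions chosen for illustration.

```cql
-- Sketch only: every column except symbol is a hypothetical placeholder.
CREATE TABLE dayquote01 (
    symbol      varchar PRIMARY KEY,  -- single-column primary key, which is also the partition key
    price_time  timestamp,            -- assumed: date of the quote
    open_price  float,                -- assumed
    high_price  float,                -- assumed
    low_price   float,                -- assumed
    close_price float,                -- assumed
    volume      double                -- assumed
);
```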
The symbol field is designated as the primary key of the dayquote01 table. This means that all the rows of the same symbol are stored on the same node. Hence, this makes the retrieval of these rows very efficient.
Alternatively, the primary key can be defined by an explicit PRIMARY KEY clause, as shown in the following screenshot:
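As the screenshot is not reproduced here, the following is a hedged sketch of the explicit form. Apart from dayquote01 and symbol, the column names and types are assumptions for illustration. The resulting table is equivalent to one declared with an inline PRIMARY KEY on symbol.

```cql
-- Sketch only: every column except symbol is a hypothetical placeholder.
-- If a table named dayquote01 already exists, drop it first or pick another name.
CREATE TABLE dayquote01 (
    symbol      varchar,
    price_time  timestamp,   -- assumed
    open_price  float,       -- assumed
    high_price  float,       -- assumed
    low_price   float,       -- assumed
    close_price float,       -- assumed
    volume      double,      -- assumed
    PRIMARY KEY (symbol)     -- explicit clause naming the primary key column
);
```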
Unlike relational databases, Cassandra does not enforce a unique constraint on the primary key, so there is no such thing as a primary key violation in Cassandra. An INSERT statement using an existing row key is allowed. Therefore, in CQL, INSERT and UPDATE act in the same way, which is known as UPSERT. For example, we can insert two records into the table dayquote01 with the same symbol and no primary key violation is raised, as shown in the following screenshot:
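The screenshot is not reproduced here, so the following sketch illustrates the idea instead. The symbol 0001.HK is the row key used in this chapter; the columns price_time and close_price and all the values are made up for illustration and assume that dayquote01 was created with such columns.

```cql
-- Two inserts that share the same primary key value (symbol).
-- Column names other than symbol, and all values, are hypothetical.
INSERT INTO dayquote01 (symbol, price_time, close_price)
VALUES ('0001.HK', '2014-06-01', 11.0);

INSERT INTO dayquote01 (symbol, price_time, close_price)
VALUES ('0001.HK', '2014-06-02', 12.0);  -- no error is raised here

-- Query the table to see what was actually stored.
SELECT * FROM dayquote01;
```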
The returned query result contains only one row, not two as one might expect. This is because the primary key is the symbol, and the row in the latter INSERT statement overwrote the record that was created by the former INSERT statement. There is no warning for a duplicate primary key. Cassandra simply and quietly updated the row. This silent UPSERT behavior might sometimes cause undesirable effects in the application logic.
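Where silent overwriting is a concern, one possible safeguard in Cassandra 2.0 and later is a conditional insert (a lightweight transaction). The sketch below is not part of the original example; the column names other than symbol and the values are hypothetical, and conditional writes cost more than plain writes, so they are best used sparingly.

```cql
-- Insert only when no row with this primary key exists yet.
-- Columns other than symbol and all values are hypothetical.
INSERT INTO dayquote01 (symbol, price_time, close_price)
VALUES ('0001.HK', '2014-06-02', 12.0)
IF NOT EXISTS;
-- The result includes an [applied] column that tells the application
-- whether the row was written or an existing row blocked the insert.
```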
In fact, the reason why Cassandra behaves like this becomes clearer when we know how the internal storage engine stores the row, as shown by Cassandra CLI in the following screenshot:
The row key is 0001.HK. It is used to locate the node that stores the row. Whenever we insert or update a row with the same row key, Cassandra blindly locates the row and modifies the columns accordingly, even though an INSERT statement has been used.
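One way to observe how a row key maps to a position in the cluster is the CQL token function, which returns the token that the configured partitioner computes for the partition key. The query below is a small sketch against the dayquote01 table. The value returned depends on the partitioner in use, so no sample output is given.

```cql
-- Show the partitioner token computed for the row key 0001.HK.
-- Rows written with the same symbol always yield the same token,
-- which is why they always land on the same replicas.
SELECT symbol, token(symbol)
FROM dayquote01
WHERE symbol = '0001.HK';
```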
Although a single-column primary key is not uncommon, a primary key composed of more than one column is much more practical.