As Cassandra only allows each table to have one primary key, it supports secondary index on columns other than those in the primary key. The benefit is a fast, efficient lookup of data matching the indexed columns in the WHERE
clause. Each table can have more than one secondary index. Cassandra uses secondary indexes to find the rows that are not using the row key. Behind the scenes, the secondary index is implemented as a separate, hidden table that is maintained automatically by the internal process of Cassandra. As with relational databases, keeping secondary indexes up to date is not free, so unnecessary indexes should be avoided.
Therefore, the secondary index will not be able to know immediately the locations of all matched rows without having examined all the nodes in the cluster. This makes the performance of the secondary index unpredictable.
In addition, the secondary index cannot be created on the primary key itself.
We can use the CQL CREATE INDEX
statement to create an index on a column after we define a table. For example, we might want to add a column sector
to indicate the sector that the stock belongs to, as shown in the following screenshot:
If we want to search dayquote06
for symbols that belong to Properties
, we might run the command, as shown in the following screenshot:
As sector
is not in the primary key, we cannot query Cassandra directly by sector
. Instead, we can create a secondary index on the column sector
to make this possible, as shown in the following screenshot:
The index name dayquote06_sector_idx
is optional, but must be unique within the keyspace. Cassandra assigns a name such as dayquote06_idx
if you do not provide a name. We can now query Cassandra for daily stock quotes by sector
.
You can see that the columns in the primary key are not present in the WHERE
predicate clause in the previous screenshot and Cassandra uses the secondary index to look for the rows matching the selection condition.
Cassandra supports multiple secondary indexes on a table. The WHERE
clause is executed if at least one column is involved in a secondary index. Thus, we can use multiple conditions in the WHERE
clause to filter the results. When multiple occurrences of data match a condition in the WHERE
predicate clause, Cassandra selects the least frequent occurrence of a condition to process first so as to have a better query efficiency.
When a potentially expensive query is attempted, such as a range query, Cassandra requires the ALLOW FILTERING
clause, which can apply additional filters to the result set for values of other non-indexed columns. It works very slowly because it scans all rows in all nodes. The ALLOW FILTERING
clause is used to explicitly direct Cassandra to execute that potentially expensive query on any WHERE
clause without creating secondary indexes, despite unpredictability of the performance.
The secondary index is best on a table that has many rows that contain fewer unique values, that is low cardinality in the relational database terminologies, which is counterintuitive to the relational people. The more unique values that exist in a particular column, the more overhead you will have to query and maintain the index. Hence, it is not suitable for querying a huge volume of records for a small number of results.
Do index the columns with values that have low cardinality. Cassandra stores secondary indexes only for local rows in the data node as a hash-multimap or as bitmap indexes, you can refer to it at https://issues.apache.org/jira/browse/CASSANDRA-1472.
Secondary indexes should be avoided in the following situations:
An index on a high-cardinality column will incur many seeks for very few results. For columns containing unique values, using an index for convenience is fine from a performance perspective, as long as the query volume to the indexed column family is moderate and not under constant load.
Cassandra stores tombstones (a marker in a row that indicates that a column was deleted. During compaction, marked columns are deleted in the index (a hidden table) until the tombstone limit reaches 100 K cells. After exceeding this limit, the query that uses the indexed value will fail.
A query on an indexed column in a large cluster typically requires collating responses from multiple data partitions. The query response slows down as more machines get added to the cluster.