Secondary indexes

As Cassandra only allows each table to have one primary key, it supports secondary index on columns other than those in the primary key. The benefit is a fast, efficient lookup of data matching the indexed columns in the WHERE clause. Each table can have more than one secondary index. Cassandra uses secondary indexes to find the rows that are not using the row key. Behind the scenes, the secondary index is implemented as a separate, hidden table that is maintained automatically by the internal process of Cassandra. As with relational databases, keeping secondary indexes up to date is not free, so unnecessary indexes should be avoided.

Note

The major difference between a primary index and a secondary index is that the primary index is a distributed index used to locate the node that stores the row key, whereas the secondary index is a local index just to index the data on the local node.

Therefore, the secondary index will not be able to know immediately the locations of all matched rows without having examined all the nodes in the cluster. This makes the performance of the secondary index unpredictable.

Note

The secondary index is the most efficient when using equality predicates. This is indeed a limitation that must have at least one equality predicate clause to hopefully limit the set of rows that need to be read into memory.

In addition, the secondary index cannot be created on the primary key itself.

Note

Caveat!

Secondary indexes in Cassandra are NOT equivalent to those in the traditional RDBMS. They are not akin to a B-tree index in RDBMS. They are mostly like a hash. So, the range queries do not work on secondary indexes in Cassandra, only equality queries work on secondary indexes.

We can use the CQL CREATE INDEX statement to create an index on a column after we define a table. For example, we might want to add a column sector to indicate the sector that the stock belongs to, as shown in the following screenshot:

Secondary indexes

If we want to search dayquote06 for symbols that belong to Properties, we might run the command, as shown in the following screenshot:

Secondary indexes

As sector is not in the primary key, we cannot query Cassandra directly by sector. Instead, we can create a secondary index on the column sector to make this possible, as shown in the following screenshot:

Secondary indexes

The index name dayquote06_sector_idx is optional, but must be unique within the keyspace. Cassandra assigns a name such as dayquote06_idx if you do not provide a name. We can now query Cassandra for daily stock quotes by sector.

You can see that the columns in the primary key are not present in the WHERE predicate clause in the previous screenshot and Cassandra uses the secondary index to look for the rows matching the selection condition.

Multiple secondary indexes

Cassandra supports multiple secondary indexes on a table. The WHERE clause is executed if at least one column is involved in a secondary index. Thus, we can use multiple conditions in the WHERE clause to filter the results. When multiple occurrences of data match a condition in the WHERE predicate clause, Cassandra selects the least frequent occurrence of a condition to process first so as to have a better query efficiency.

When a potentially expensive query is attempted, such as a range query, Cassandra requires the ALLOW FILTERING clause, which can apply additional filters to the result set for values of other non-indexed columns. It works very slowly because it scans all rows in all nodes. The ALLOW FILTERING clause is used to explicitly direct Cassandra to execute that potentially expensive query on any WHERE clause without creating secondary indexes, despite unpredictability of the performance.

Secondary index do's and don'ts

The secondary index is best on a table that has many rows that contain fewer unique values, that is low cardinality in the relational database terminologies, which is counterintuitive to the relational people. The more unique values that exist in a particular column, the more overhead you will have to query and maintain the index. Hence, it is not suitable for querying a huge volume of records for a small number of results.

Tip

Do index the columns with values that have low cardinality. Cassandra stores secondary indexes only for local rows in the data node as a hash-multimap or as bitmap indexes, you can refer to it at https://issues.apache.org/jira/browse/CASSANDRA-1472.

Secondary indexes should be avoided in the following situations:

  • On high-cardinality columns for a small number of results out of a huge volume of rows

    An index on a high-cardinality column will incur many seeks for very few results. For columns containing unique values, using an index for convenience is fine from a performance perspective, as long as the query volume to the indexed column family is moderate and not under constant load.

  • In tables that use a counter column
  • On a frequently updated or deleted column

    Cassandra stores tombstones (a marker in a row that indicates that a column was deleted. During compaction, marked columns are deleted in the index (a hidden table) until the tombstone limit reaches 100 K cells. After exceeding this limit, the query that uses the indexed value will fail.

  • To look for a row in a large partition

    A query on an indexed column in a large cluster typically requires collating responses from multiple data partitions. The query response slows down as more machines get added to the cluster.

Tip

Important points to take note of

  • Don't index on high-cardinality columns
  • Don't use index in tables having a counter column
  • Don't index on a frequently updated or deleted column
  • Don't abuse the index to look for a row in a large partition
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset