Performance tuning is a large and complex topic that could fill a whole course by itself. We can only scratch the surface of it in this short section. As with monitoring in the previous section, operating system-specific performance tuning techniques are beyond the scope of this book.
Based on the information given by the monitoring tools and the system log, we can discover opportunities for performance tuning. The first things we usually watch are the Java heap memory and garbage collection. The JVM's configuration settings are controlled in the environment settings file for Cassandra, cassandra-env.sh, located in /etc/cassandra/. An example is shown in the following screenshot:
The file already contains boilerplate options calculated to suit the host system. It also comes with explanatory comments that help us tweak specific JVM parameters and the startup options of a Cassandra instance when we experience real issues; otherwise, these boilerplate options should not be altered.
Detailed documentation on how to tune the JVM for Cassandra can be found at http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_tune_jvm_c.html.
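To give a feel for how the boilerplate heap values are derived from the host, the following Python sketch mirrors the sizing rule as it is commonly described in the comments of the stock cassandra-env.sh for this Cassandra generation: the maximum heap is max(min(1/2 RAM, 1024 MB), min(1/4 RAM, 8192 MB)), and the young generation is min(100 MB per CPU core, 1/4 of the heap). The function name and its parameters are hypothetical, and the rule itself should be checked against the copy of cassandra-env.sh on your own node before relying on it.

```python
def default_heap_sizes_mb(system_memory_mb, cpu_cores):
    """Approximate the default heap sizing rule described in cassandra-env.sh.

    Assumption: the rule is max(min(RAM/2, 1024), min(RAM/4, 8192)) for the
    maximum heap and min(100 * cores, heap/4) for the young generation.
    Verify this against your own cassandra-env.sh.
    """
    half_capped = min(system_memory_mb // 2, 1024)
    quarter_capped = min(system_memory_mb // 4, 8192)
    max_heap_mb = max(half_capped, quarter_capped)

    # Young generation: at most 100 MB per core, and at most a quarter of the heap.
    new_size_mb = min(100 * cpu_cores, max_heap_mb // 4)
    return max_heap_mb, new_size_mb


# A host with 16 GB of RAM and 8 cores
print(default_heap_sizes_mb(16384, 8))  # (4096, 800)
```

The point of the sketch is only that the defaults scale with the host; overriding them by hand is worthwhile only when monitoring shows a real problem.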
Another area we should pay attention to is caching. Cassandra includes integrated caching and distributes cache data around the cluster. At the table level, we will focus on the partition key cache and the row cache.
The partition key cache, or key cache for short, is a cache of the partition index for a table. Using the key cache saves processor time and memory. However, with only the key cache enabled, Cassandra must still read the requested data rows from disk.
The row cache is similar to a traditional cache. When a row is accessed, the entire row is pulled into memory, merging from multiple SSTables when required, and cached. This prevents Cassandra from retrieving that row using disk I/O again, which can tremendously improve read performance.
When both row cache and partition key cache are configured, the row cache returns results whenever possible. In the event of a row cache miss, the partition key cache might still provide a hit that makes the disk seek much more efficient.
However, there is one caveat. Cassandra caches all the rows of a partition when reading that partition. So if the partition is large, or only a small portion of the partition is read every time, the row cache might not be beneficial. The row cache is very easy to misuse, and a misused row cache can exhaust the JVM heap, causing Cassandra to fail. That is why the row cache is disabled by default.
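The lookup order described above, and the whole-partition caveat, can be illustrated with a toy model. This is a minimal sketch, not Cassandra's real read path: the class ToyTableReader, its counters, and the in-memory dictionary standing in for SSTables on disk are all hypothetical.

```python
class ToyTableReader:
    """Toy model of row cache -> key cache -> disk lookup order."""

    def __init__(self, sstable, use_row_cache=False, use_key_cache=True):
        self.sstable = sstable  # partition key -> {clustering key: value}
        self.row_cache = {} if use_row_cache else None
        self.key_cache = {} if use_key_cache else None
        self.index_lookups = 0  # times the partition index had to be consulted
        self.data_reads = 0     # times row data had to be read from "disk"

    def read(self, partition_key):
        # 1. A row cache hit answers the read without touching disk at all.
        if self.row_cache is not None and partition_key in self.row_cache:
            return self.row_cache[partition_key]

        # 2. On a key cache miss, the partition index must be consulted first.
        if self.key_cache is None or partition_key not in self.key_cache:
            self.index_lookups += 1
            if self.key_cache is not None:
                self.key_cache[partition_key] = True  # stands in for a file offset

        # 3. Even with a key cache hit, the rows themselves come from disk.
        self.data_reads += 1
        partition = self.sstable[partition_key]

        # The row cache keeps the WHOLE partition, however little the client needs.
        if self.row_cache is not None:
            self.row_cache[partition_key] = partition
        return partition


data = {"user1": {n: "item" for n in range(1000)}}

keys_only = ToyTableReader(data, use_row_cache=False)
keys_only.read("user1")
keys_only.read("user1")
print(keys_only.index_lookups, keys_only.data_reads)  # 1 2

with_rows = ToyTableReader(data, use_row_cache=True)
with_rows.read("user1")
with_rows.read("user1")
print(with_rows.index_lookups, with_rows.data_reads)  # 1 1
print(len(with_rows.row_cache["user1"]))              # 1000
```

In the second reader the repeat read never reaches disk, but the cache now holds all 1,000 rows of the partition even if the application only ever wanted one of them.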
Either the nodetool info command or JMX MBeans can provide assistance in monitoring cache. We should make changes to cache options in small, incremental adjustments, and then monitor the effects of each change using the nodetool utility. The last two lines of output of the nodetool info command, as seen in the following figure, contain the Row Cache and Key Cache metrics of ubtc02:
In the event of high memory consumption, we can consider tuning data caches.
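When making incremental cache adjustments, it helps to reduce the cache lines of nodetool info to a single hit rate that can be compared before and after each change. The sketch below assumes each cache line contains the phrase "N hits, M requests"; that layout, the function name cache_hit_rates, and the sample text are assumptions for illustration (the sample values are made up and are not output from ubtc02), so adjust the pattern to whatever your version of nodetool actually prints.

```python
import re

# Assumed line shape: "<Key Cache|Row Cache> : ..., <N> hits, <M> requests, ..."
CACHE_LINE = re.compile(
    r"^(?P<name>Key Cache|Row Cache)\s*:.*?"
    r"(?P<hits>\d+) hits, (?P<requests>\d+) requests"
)


def cache_hit_rates(nodetool_info_output):
    """Return {cache name: hit rate, or None when there were no requests}."""
    rates = {}
    for line in nodetool_info_output.splitlines():
        match = CACHE_LINE.match(line.strip())
        if match:
            hits = int(match["hits"])
            requests = int(match["requests"])
            rates[match["name"]] = hits / requests if requests else None
    return rates


# Made-up sample text in the assumed layout, not real node output.
sample = """\
Key Cache        : size 1152 (bytes), capacity 51380224 (bytes), 28 hits, 39 requests
Row Cache        : size 0 (bytes), capacity 0 (bytes), 0 hits, 0 requests
"""

for name, rate in cache_hit_rates(sample).items():
    print(name, "no requests yet" if rate is None else f"{rate:.1%}")
```

A key cache hit rate that stays low after a capacity increase suggests the extra memory is not paying for itself and the change should be reverted.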
We use CQL to enable or disable caching by altering the caching property of a table. For instance, we use the ALTER TABLE statement to enable the row cache for watchlist:
ALTER TABLE watchlist WITH caching='ROWS_ONLY';
Other available table caching options include ALL, KEYS_ONLY, and NONE. They are quite self-explanatory and we do not go through each of them here.
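For quick reference, the four caching options can be summarized as a small lookup table, together with a helper that builds the corresponding ALTER TABLE statement. This is only a sketch: the names CACHING_OPTIONS and alter_caching_statement are hypothetical, and the helper merely formats a string for a trusted table name; it does not connect to a cluster.

```python
# Which caches each table-level caching option enables.
CACHING_OPTIONS = {
    "ALL": ("key cache", "row cache"),
    "KEYS_ONLY": ("key cache",),
    "ROWS_ONLY": ("row cache",),
    "NONE": (),
}


def alter_caching_statement(table, option):
    """Build the CQL statement that sets a table's caching option.

    Illustration only: the table name is inserted as-is, so pass trusted input.
    """
    option = option.upper()
    if option not in CACHING_OPTIONS:
        raise ValueError(f"unknown caching option: {option}")
    return f"ALTER TABLE {table} WITH caching = '{option}';"


print(alter_caching_statement("watchlist", "KEYS_ONLY"))
# ALTER TABLE watchlist WITH caching = 'KEYS_ONLY';
```

The generated statement can be pasted into cqlsh, which makes it easy to switch a table back to its previous setting if monitoring shows the change did not help.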
Further information about data caching can be found at http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_configuring_caches_c.html.