Understanding Elasticsearch caching

One of the most important parts of Elasticsearch, although not always visible to its users, is caching. Caching allows Elasticsearch to store commonly used data in memory and reuse it on demand. Of course, we can't cache everything: we usually have far more data than memory, and building caches can be quite expensive in terms of performance. In this chapter, we will look at the different caches exposed by Elasticsearch, discuss how they are used, and see how we can control their usage. Hopefully, this information will allow you to better understand how this great search server works internally.

The filter cache

The filter cache is the simplest of all the caches available in Elasticsearch. It is used during query time to cache the results of filters used in queries. We already talked about filters in the Handling filters and why it matters section of Chapter 2, Power User Query DSL, but let's look at a simple example. Let's assume that we have the following query:

{
 "query" : {
  "filtered" : {
   "query" : {
    "match_all" : {}
   },
   "filter" : {
    "term" : {
     "category" : "romance"
    }
   }
  }
 }
}

The preceding query will return all the documents that have the romance term in the category field. As you can see, we've used the match_all query combined with a filter. Now, after the initial query, every query that contains the same filter will reuse the results of that filter and save precious I/O and CPU resources.

Filter cache types

There are two types of filter caches available in Elasticsearch: the node-level filter cache (the default one) and the index-level filter cache. This gives us the possibility of choosing whether the filter cache depends on an index or on a node. As we can't always predict where a given index (or rather its shards and replicas) will be allocated, it is not recommended that you use the index-level filter cache, because we can't predict the memory usage in such cases.

Node-level filter cache configuration

The default and recommended filter cache type is configured for all shards allocated to a given node (set by using the index.cache.filter.type property with the node value or by not setting that property at all). Elasticsearch allows us to use the indices.cache.filter.size property to configure the size of this cache. We can either use a percentage value, such as 10% (the default value), or a static memory value, such as 1024mb. If we use a percentage value, Elasticsearch will calculate it as a percentage of the maximum heap memory given to a node.

The node-level filter cache is a Least Recently Used (LRU) cache, which means that when entries need to be removed, the ones that were least recently used will be evicted first in order to make room for newer entries.
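
For example, a fragment of elasticsearch.yml that bounds the node-level filter cache (the value here is purely illustrative) could look as follows:

# elasticsearch.yml fragment: limit the node-level filter cache
# to 20 percent of the heap (example value)
indices.cache.filter.size: 20%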

Index-level filter cache configuration

The second type of filter cache that Elasticsearch allows us to use is the index-level filter cache. We can control its behavior with the following properties (a combined example follows this list):

  • index.cache.filter.type: This property sets the type of the cache, which can take the values of resident, soft, weak, and node (the default one). By using this property, Elasticsearch allows us to choose the implementation of the cache. The entries in the resident cache can't be removed by the JVM unless we want them to be removed (either by using the API or by setting a maximum size or an expiration time), and this type is generally recommended because of that (filling up the filter cache can be expensive). The soft and weak filter cache types can be cleared by the JVM when it lacks memory, with the difference that, when reclaiming memory, the JVM will clear the weakly referenced objects first and only then the softly referenced ones. The node value tells Elasticsearch to use the node-level filter cache.
  • index.cache.filter.max_size: This property specifies the maximum number of cache entries that can be stored in the filter cache (the default is -1, which means unbounded). You need to remember that this setting is not applied to the whole index but to a single segment of a shard of the index, so the memory usage will differ depending on how many shards (and replicas) there are for the given index and how many segments the index contains. Generally, the default, unbounded filter cache is fine when used with the soft type and queries that are written with cache reuse in mind.
  • index.cache.filter.expire: This property specifies the expiration time of an entry in the filter cache, which is unbounded (set to -1) by default. If we want our filter cache entries to expire when not accessed, we can set the maximum time of inactivity. For example, if we would like our cache entries to expire after 60 minutes of inactivity, we should set this property to 60m.
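
To illustrate, a sketch of index creation that combines these properties (the index name and the values are illustrative only) could look as follows:

curl -XPOST 'localhost:9200/example_index' -d '{
 "settings" : {
  "index.cache.filter.type" : "resident",
  "index.cache.filter.max_size" : 10000,
  "index.cache.filter.expire" : "60m"
 }
}'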

Note

If you want to read more about the soft and weak references in Java, please refer to the Java documentation, especially the Javadocs, for these two types: http://docs.oracle.com/javase/8/docs/api/java/lang/ref/SoftReference.html and http://docs.oracle.com/javase/8/docs/api/java/lang/ref/WeakReference.html.

The field data cache

The field data cache is used when we run queries that involve operations working on uninverted data. For such operations, Elasticsearch needs to load all the values for a given field and store them in memory; this is what we call the field data cache. This cache is used by Elasticsearch when we use faceting, aggregations, scripting, or sorting on a field's values. When first executing an operation that requires data uninverting, Elasticsearch loads all the data for that field into memory. Yes, that's right: all the data from a given field is loaded into memory by default and is never removed from it. Elasticsearch does this to be able to provide fast document-based access to the values in a field. Remember that building the field data cache is usually expensive from the hardware resources point of view, because the data for the whole field needs to be loaded into memory, and this requires both I/O operations and CPU resources.
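
For example, a simple query that sorts on a field (here we assume a books index with a numeric year field, used purely for illustration) will force Elasticsearch to uninvert and load the year field data the first time it is executed:

curl -XGET 'localhost:9200/books/_search?pretty' -d '{
 "query" : {
  "match_all" : {}
 },
 "sort" : [
  { "year" : "asc" }
 ]
}'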

Note

One should remember that for every field that we sort on or use faceting on, the data for each and every term needs to be loaded into memory. This can be expensive, especially for high cardinality fields: the ones with numerous different terms in them.

Field data or doc values

Lucene doc values and their implementation in Elasticsearch are getting better and better with each release. With the release of Elasticsearch 1.4.0, they are almost as fast as the field data cache. The thing is that doc values are calculated during indexing time and are stored on disk along with the index, so they require very little heap space. If you are using operations that require large amounts of field data cache, you should consider using doc values for such fields. You only need to add the doc_values property and set it to true for such fields, and Elasticsearch will do the rest.

Note

At the time of writing this, Elasticsearch does not allow using doc values on analyzed string fields. You can use doc values with all the other field types.

For example, if we would like to set our year field to use doc values, we would change its configuration to the following one:

"year" : {
 "type" : "long",
 "ignore_malformed" : false,
 "index" : "analyzed",
 "doc_values" : true
}

After you reindex your data, Elasticsearch will use doc values (instead of the field data cache) for the operations that require uninverted data in the year field, for example, aggregations.

Node-level field data cache configuration

Since Elasticsearch 0.90.0, we are allowed to use the following properties to configure the node-level field data cache, which is the default field data cache if we don't alter the configuration (an example follows this list):

  • indices.fielddata.cache.size: This specifies the maximum size of the field data cache, either as a percentage value such as 20%, or an absolute memory size such as 10gb. If we use a percentage value, Elasticsearch will calculate it as a percentage of the maximum heap memory given to a node. By default, the field data cache size is unbounded and should be monitored, as it can consume a vast amount of the memory given to the JVM.
  • indices.fielddata.cache.expire: This property specifies the expiration time of an entry in the field data cache, which is set to -1 by default, meaning that the entries in the cache won't expire. If we want our field data cache entries to expire when not accessed, we can set the maximum time of inactivity. For example, if we would like our cache entries to expire after 60 minutes of inactivity, we should set this property to 60m. Please remember that the field data cache is very expensive to rebuild, so expiration should be considered with caution.
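
For example, a fragment of elasticsearch.yml combining both properties (the values are illustrative only) could look as follows:

# elasticsearch.yml fragment: bound the field data cache (example values)
indices.fielddata.cache.size: 30%
indices.fielddata.cache.expire: 60m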

Note

If we want to be sure that Elasticsearch will use the node-level field data cache, we should set the index.fielddata.cache.type property to the node value or not set that property at all.

Index-level field data cache configuration

Similar to the index-level filter cache, we can also use the index-level field data cache, but again, it is not recommended for the same reasons: it is hard to predict which shards or indices will be allocated to which nodes. Because of this, we can't predict the amount of memory that will be used for the field data cache of each index, and we can run into memory-related issues when Elasticsearch does rebalancing, for example.

However, if you know what you are doing and which implementation you want to use, the resident or soft field data cache, you can set the index.fielddata.cache.type property to resident or soft. As we already discussed when describing the filter cache, the entries in the resident cache can't be removed by the JVM unless we want them to be, and this cache type is generally recommended when using the index-level field data cache, because rebuilding the field data cache is expensive and will affect the performance of Elasticsearch queries. The soft field data cache can be cleared by the JVM when it lacks memory.

Field data cache filtering

In addition to the previously mentioned configuration options, Elasticsearch allows us to choose which field values are loaded into the field data cache. This can be useful in some cases, especially if you remember that sorting, faceting, and aggregations use the field data cache to calculate their results. Elasticsearch allows three types of field data loading filtering: by term frequency, by regex, or by a combination of both methods.

Let's discuss an example where field data filtering can be useful: excluding terms with low frequency from faceting results. For example, we may need to do this because we know that we have some terms with spelling mistakes in the index, and these misspellings are certainly rare ones. We don't want to bother calculating aggregations for them, so we can remove them from the data and correct them in our data source, or we can remove them from the field data cache by filtering. This will not only exclude them from the results returned by Elasticsearch, but it will also lower the field data memory footprint, because less data will be stored in memory. Now let's look at the filtering possibilities.

Adding field data filtering information

In order to introduce field data cache filtering, we need to add an additional object to our field definition in the mappings: the fielddata object with its child object, filter. So our extended field definition for some abstract tag field would look as follows:

"tag" : {
 "type" : "string",
 "index" : "not_analyzed",
 "fielddata" : {
  "filter" : {
  ...
  }
 }
}

We will see what to put in the filter object in the upcoming sections.

Filtering by term frequency

Filtering by term frequency allows us to load only the terms that have a frequency higher than the specified minimum (the min parameter) and lower than the specified maximum (the max parameter). The term frequency bounded by the min and max parameters is not computed for the whole index but per segment, which is very important, because these frequencies will differ between segments. The min and max parameters can be specified either as a percentage (for example, 1 percent is 0.01 and 50 percent is 0.5), or as an absolute number.

In addition to this, we can include the min_segment_size property that specifies the minimum number of documents a segment should contain in order to be taken into consideration while building the field data cache.

For example, if we would like to store in the field data cache only the terms that come from segments with at least 100 documents and that have a segment term frequency between 1 and 20 percent, we should have mappings similar to the following ones:

{
 "book" : {
  "properties" : {
   "tag" : {
    "type" : "string",
    "index" : "not_analyzed",
    "fielddata" : {
     "filter" : {
      "frequency" : {
       "min" : 0.01,
       "max" : 0.2,
       "min_segment_size" : 100
      }
     }
    }
   }
  }
 }
}

Filtering by regex

In addition to filtering by term frequency, we can also filter by a regular expression. In such a case, only the terms that match the specified regex will be loaded into the field data cache. For example, if we only want to load the data from the tag field for terms that are probably Twitter tags (starting with the # character), we should have the following mappings:

{
 "book" : {
  "properties" : {
   "tag" : {
    "type" : "string",
    "index" : "not_analyzed",
    "fielddata" : {
     "filter" : {
      "regex" : "^#.*"
     }
    }
   }
  }
 }
}

Filtering by regex and term frequency

Of course, we can combine the previously discussed filtering methods. So, if we want the field data cache to hold the tag field data of only those terms that start with the # character, come from segments with at least 100 documents, and have a segment term frequency between 1 and 20 percent, we should have the following mappings:

{
 "book" : {
  "properties" : {
   "tag" : {
    "type" : "string",
    "index" : "not_analyzed",
    "fielddata" : {
     "filter" : {
      "frequency" : {
       "min" : 0.1,
       "max" : 0.2,
       "min_segment_size" : 100
      },
      "regex" : "^#.*"
     }
    }
   }
  }
 }
}

Note

Remember that the field data cache is not built during indexing but at query time and, because of that, we can change the filtering at runtime by updating the fielddata section using the Mappings API. However, one has to remember that after changing the field data loading filtering settings, the cache should be cleared using the clear cache API described in the Clearing the caches section of this chapter.
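
For example, a sketch of such a runtime change for the tag field (assuming the books index with the mapping shown earlier; the new frequency bounds are illustrative only) could look as follows:

curl -XPUT 'localhost:9200/books/_mapping/book' -d '{
 "book" : {
  "properties" : {
   "tag" : {
    "type" : "string",
    "index" : "not_analyzed",
    "fielddata" : {
     "filter" : {
      "frequency" : {
       "min" : 0.05,
       "max" : 0.2
      }
     }
    }
   }
  }
 }
}'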

The filtering example

So now, let's go back to the example from the beginning of the filtering section. What we want to do is exclude the terms with the lowest frequency from the faceting results; in our case, these are the terms with a frequency lower than 50 percent. Of course, this threshold is very high, but our example uses only four documents; in production, you would usually use lower values. In order to do this, we will create a books index with the following command:

curl -XPOST 'localhost:9200/books' -d '{
 "settings" : {
  "number_of_shards" : 1,
  "number_of_replicas" : 0
 },
 "mappings" : {
  "book" : {
   "properties" : {
    "tag" : {
     "type" : "string",
     "index" : "not_analyzed",
     "fielddata" : {
      "filter" : {
       "frequency" : {
        "min" : 0.5,
        "max" : 0.99
       }
      }
     }
    }
   }
  }
 }
}'

Now, let's index some sample documents using the bulk API (the code is stored in the regex.json file provided with the book):

curl -s -XPOST 'localhost:9200/_bulk' --data-binary '
{ "index": {"_index": "books", "_type": "book", "_id": "1"}}
{"tag":["one"]}
{ "index": {"_index": "books", "_type": "book", "_id": "2"}}
{"tag":["one"]}
{ "index": {"_index": "books", "_type": "book", "_id": "3"}}
{"tag":["one"]}
{ "index": {"_index": "books", "_type": "book", "_id": "4"}}
{"tag":["four"]}
'

Now, let's check a simple terms aggregation by running the following query (as we already discussed, faceting and aggregations use the field data cache to operate):

curl -XGET 'localhost:9200/books/_search?pretty' -d '{
 "query" : {
  "match_all" : {}
 },
 "aggregations" : {
  "tag" : {
   "terms" : {
    "field" : "tag"
   }
  }
 }
}'

The response for the preceding query would be as follows:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  .
  .
  .
  "aggregations" : {
  "tag" : {
         "doc_count_error_upper_bound" : 0,
         "sum_other_doc_count" : 0,
         "buckets" : [ {
           "key" : "one",
"doc_count" : 3 }]
}
}

As you can see, the terms aggregation was only calculated for the one term, and the four term was omitted. If we assume that the four term was misspelled, then we have achieved what we wanted.

Field data formats

The field data cache is not a simple functionality; it is implemented to save as much memory as possible. Because of this, Elasticsearch exposes a few formats for the field data cache, depending on the data type. We can set the format of the internal data stored in the field data cache by specifying the format property inside the fielddata object for a field, for example:

"tag" : {
 "type" : "string",
 "fielddata" : {
  "format" : "paged_bytes"
 }
}

Let's now look at the possible formats.

String-based fields

For string-based fields, Elasticsearch exposes three formats of the field data cache. The default format is paged_bytes, which stores unique occurrences of the terms sequentially and maps documents to these terms. This data is stored in memory. The second format is fst, which stores the field data cache in a structure called a Finite State Transducer (FST, http://en.wikipedia.org/wiki/Finite_state_transducer), which results in lower memory usage compared to the default format, but is also slower. Finally, the third format is doc_values, which results in computing the field data cache entries during indexing and storing them on disk along with the index files. This format is almost as fast as the default one, but its memory footprint is very low. However, it can't be used with analyzed string fields. Also, field data filtering is not supported for the doc_values format.

Numeric fields

For numeric-based fields, we have two options when it comes to the format of the field data cache. The default array format stores the data in an in-memory array. The second format is doc_values, which uses doc values to store the field data, meaning that the field data cache entries will be computed during indexing and stored on disk along with the index files. Field data filtering is not supported for the doc_values format.
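
For example, if we would like the year field discussed earlier to use doc values by means of the format property (instead of the doc_values property shown before), its definition could look as follows:

"year" : {
 "type" : "long",
 "fielddata" : {
  "format" : "doc_values"
 }
}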

Geographical-based fields

For geo-point based fields, we have options similar to the numeric fields: the default array format, which stores longitudes and latitudes in an array, or doc_values, which uses doc values to store the field data. Of course, field data filtering is not supported for the doc_values format.

Field data loading

In addition to what we have written already, Elasticsearch allows us to configure how the field data cache is loaded. As we already mentioned, by default, the field data cache is loaded when it is needed for the first time: during the first execution of a query that needs uninverted data. We can change this behavior by including the loading property and setting it to eager. This will make Elasticsearch load the field data cache eagerly whenever new data appears, instead of waiting for the first query that needs it. Therefore, to make the field data cache for the tag field load eagerly, we would configure it in the following way:

"tag" : {
 "type" : "string",
 "fielddata" : {
  "loading" : "eager"
 }
}

We can also completely disable the field data cache loading by setting the format property to disabled. For example, to disable loading the field data cache for our tag field, we can change its configuration to the following one:

"tag" : {
 "type" : "string",
 "fielddata" : {
  "format" : "disabled"
 }
}

Please note that functionalities that require uninverted data (such as aggregations) won't work on such defined fields.

The shard query cache

A new cache, introduced in Elasticsearch 1.4.0, can help with query performance: the shard query cache, which is responsible for caching local results for each shard. As you remember, when Elasticsearch executes a query, it is sent to all the relevant shards and executed on them. The partial results are then returned to the node that requested them and are combined there. The shard query cache is about caching these partial results at the shard level.

Note

At the time of writing this, the only cached search_type was count. Therefore, the documents returned by a query will not be cached, but the total number of hits, aggregations, and suggestions returned by each shard will be, speeding up subsequent queries. Note that this is likely to change in future versions of Elasticsearch.

The shard query cache is not enabled by default. However, there are two ways of enabling it. We can do this by adding the index.cache.query.enable property and setting it to true in the settings of our index, or by updating the index settings in real time with a command like this:

curl -XPUT 'localhost:9200/mastering/_settings' -d '{
 "index.cache.query.enable" : true
}'

The second option is to enable the shard query cache per request. We can do this by using the query_cache URI parameter set to true on a per-query basis. The thing to remember is that passing this parameter overrides the index-level setting. An example request could look as follows:

curl -XGET 'localhost:9200/books/_search?search_type=count&query_cache=true' -d '{
 "query" : {
  "match_all" : {}
 },
 "aggregations" : {
  "tags" : {
   "terms" : {
    "field" : "tag"
   }
  }
 }
}'

The good thing about the shard query cache is that it is invalidated and refreshed automatically. Whenever a shard's contents change, Elasticsearch will update the contents of the cache automatically, so the results of a cached and a non-cached query will always be the same.

Setting up the shard query cache

By default, Elasticsearch will use up to 1 percent of the heap size given to a node for the shard query cache. This means that all indices present on a node can use up to 1 percent of the total heap memory for the query cache. We can change this by setting the indices.cache.query.size property in the elasticsearch.yml file.

In addition to this, we can control the expiration time of the cache by setting the indices.cache.query.expire property. For example, if we would like the cache to be automatically expired after 60 minutes, we should set the property to 60m.
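
For example, a fragment of elasticsearch.yml with both properties set (the values are illustrative only) could look as follows:

# elasticsearch.yml fragment: shard query cache settings (example values)
indices.cache.query.size: 2%
indices.cache.query.expire: 60m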

Using circuit breakers

Because queries can put a lot of pressure on Elasticsearch resources, Elasticsearch allows us to use so-called circuit breakers, which prevent it from using too much memory in certain functionalities. Elasticsearch estimates the memory usage and rejects the execution of a query if certain thresholds are met. Let's look at the available circuit breakers and what they can help us with.

The field data circuit breaker

The field data circuit breaker will prevent request execution if the estimated memory usage for the request is higher than the configured values. By default, Elasticsearch sets indices.breaker.fielddata.limit to 60%, which means that no more than 60 percent of the JVM heap is allowed to be used for the field data cache.

We can also configure the multiplier that Elasticsearch uses for estimates (the estimated values are multiplied by this property value) by using the indices.breaker.fielddata.overhead property. By default, it is set to 1.03.
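
For example, to use a lower limit and an explicitly set overhead for the field data circuit breaker, we could put the following (illustrative) lines in elasticsearch.yml:

# elasticsearch.yml fragment: field data circuit breaker (example values)
indices.breaker.fielddata.limit: 40%
indices.breaker.fielddata.overhead: 1.03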

Note

Please note that before Elasticsearch 1.4.0, indices.breaker.fielddata.limit was called indices.fielddata.breaker.limit and indices.breaker.fielddata.overhead was called indices.fielddata.breaker.overhead.

The request circuit breaker

Introduced in Elasticsearch 1.4.0, the request circuit breaker allows us to configure Elasticsearch to reject the execution of the request if the total estimated memory used by it will be higher than the indices.breaker.request.limit property (set to 40% of the total heap memory assigned to the JVM by default).

Similar to the field data circuit breaker, we can set the overhead by using the indices.breaker.request.overhead property, which defaults to 1.

The total circuit breaker

In addition to the previously described circuit breakers, Elasticsearch 1.4.0 introduced the notion of the total circuit breaker, which defines the total amount of memory that can be used across all the other circuit breakers. We can configure it using indices.breaker.total.limit, and it defaults to 70% of the JVM heap.

Note

Please remember that all the circuit breakers can be dynamically changed on a working cluster using the Cluster Update Settings API.
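
For example, a sketch of such a dynamic change using the Cluster Update Settings API (the value is illustrative only) could look as follows:

curl -XPUT 'localhost:9200/_cluster/settings' -d '{
 "transient" : {
  "indices.breaker.total.limit" : "60%"
 }
}'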

Clearing the caches

As we've mentioned earlier, sometimes it is necessary to clear the caches. Elasticsearch allows us to clear the caches using the _cache REST endpoint. Let's look at the usage possibilities.

Index, indices, and all caches clearing

The simplest thing we can do is just clear all the caches by running the following command:

curl -XPOST 'localhost:9200/_cache/clear'

Of course, as we are used to, we can choose a single index or multiple indices to clear the caches for them. For example, if we want to clear the cache for the mastering index, we should run the following command:

curl -XPOST 'localhost:9200/mastering/_cache/clear'

If we want to clear caches for the mastering and books indices, we should run the following command:

curl -XPOST 'localhost:9200/mastering,books/_cache/clear'

Clearing specific caches

By default, Elasticsearch clears all the caches when running the cache clear request. However, we are allowed to choose which caches should be cleared and which ones should be left alone. Elasticsearch allows us to choose the following behavior:

  • Filter caches can be cleared by setting the filter parameter to true. In order to exclude this cache type from being cleared, we should set this parameter to false. Note that the filter cache is not cleared immediately; Elasticsearch schedules it to be cleared within the next 60 seconds.
  • The field data cache can be cleared by setting the field_data parameter to true. In order to exclude this cache type from being cleared, we should set this parameter to false.
  • To clear the cache of identifiers used for parent-child relationships, we can set the id_cache parameter to true. Setting this property to false will exclude that cache from being cleared.
  • The shard query cache can be cleared by setting the query_cache parameter to true. Setting this parameter to false will exclude the shard query cache from being cleared.

For example, if we want to clear all the caches apart from the filter and shard query caches for the mastering index, we could run the following command:

curl -XPOST 'localhost:9200/mastering/_cache/clear?field_data=true&filter=false&query_cache=false'