NRT, flush, refresh, and transaction log

In an ideal search solution, when new data is indexed, it is instantly available for searching. At first glance, this is exactly how Elasticsearch works, even in distributed environments. However, this is not the whole truth, and we will show you why.

Let's start by indexing an example document to the newly created index using the following command:

curl -XPOST localhost:9200/test/test/1 -d '{ "title": "test" }'

Now, let's replace this document, and let's try to find it immediately. In order to do this, we'll use the following command chain:

curl -XPOST localhost:9200/test/test/1 -d '{ "title": "test2" }' ; curl -XGET 'localhost:9200/test/test/_search?pretty'

The preceding command will probably result in a response that is very similar to the following one:

{"_index":"test","_type":"test","_id":"1","_version":2,"created":false}{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "test",
      "_type" : "test",
      "_id" : "1",
      "_score" : 1.0,
      "_source":{ "title": "test" }
    } ]
  }
}

We see two responses glued together. The first line is the response to the indexing command, the first command we've sent. As you can see, everything is correct: we've updated the document (look at _version). The second command, our search query, should return the document with the title field set to test2; however, as you can see, it returned the first document. What happened? Before we answer this question, let's take a step back and discuss how the underlying Apache Lucene library makes newly indexed documents available for searching.

Updating the index and committing changes

As we already know from the Introducing Apache Lucene section in Chapter 1, Introduction to Elasticsearch, during the indexing process, new documents are written into segments. The segments are independent indices, which means that queries running in parallel to indexing should, from time to time, add newly created segments to the set of segments used for searching. Apache Lucene does this by creating subsequent (because of the write-once nature of the index) segments_N files, which list the segments in the index. This process is called committing. Lucene performs it in a safe way: we can be sure that either all changes hit the index or none of them do. If a failure happens, we can be sure that the index will be in a consistent state.

Let's return to our example. The first operation adds the document to the index but doesn't run the commit command in Lucene. This is exactly how it works. However, even a commit is not enough for the data to be available for searching. The Lucene library uses an abstraction called Searcher to access the index, and this object needs to be refreshed.

After a commit operation, the Searcher object needs to be reopened in order for it to see the newly created segments. This whole process is called refresh. For performance reasons, Elasticsearch tries to postpone costly refreshes: by default, a refresh is not performed after indexing a single document (or a batch of them); instead, the Searcher is refreshed every second. This is quite often, but sometimes applications require the refresh operation to be performed more frequently than once every second. When this happens, you can consider using another technology or verifying whether the requirement is really necessary. If needed, you can force a refresh by using the Elasticsearch API. In our example, we can add the following command:

curl -XGET localhost:9200/test/_refresh

If we add the preceding command before the search, Elasticsearch will respond as we expected and return the document with the title field set to test2.
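As a sketch, the whole corrected sequence from our example looks as follows (assuming a local Elasticsearch node listening on port 9200, as in the commands above):

```shell
# Replace the document, force a refresh, then search.
# Without the _refresh call, the search may still return the old version.
curl -XPOST localhost:9200/test/test/1 -d '{ "title": "test2" }'
curl -XGET localhost:9200/test/_refresh
curl -XGET 'localhost:9200/test/test/_search?pretty'
```

This time, the search response should contain the document with the title field set to test2.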

Changing the default refresh time

The time between automatic Searcher refresh operations can be changed by using the index.refresh_interval parameter either in the Elasticsearch configuration file or by using the Update Settings API, for example:

curl -XPUT localhost:9200/test/_settings -d '{
  "index" : {
    "refresh_interval" : "5m"
  }
}'

The preceding command will change the automatic refresh to be performed every 5 minutes. Please remember that data indexed between refreshes won't be visible to queries.

Note

As we said, the refresh operation is costly when it comes to resources. The longer the refresh interval, the faster your indexing will be. If you are planning a very large indexing operation and don't need the data to be visible until the indexing ends, you can consider disabling refresh by setting the index.refresh_interval parameter to -1 and setting it back to its original value after the indexing is done.
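The procedure from this note can be sketched as the following pair of commands. The 1s value used to restore the interval is the default refresh interval and is an assumption here; use whatever value your index had originally:

```shell
# Disable automatic refresh before a large indexing run
curl -XPUT localhost:9200/test/_settings -d '{
  "index" : { "refresh_interval" : "-1" }
}'

# ... perform the heavy indexing here ...

# Restore the refresh interval afterwards (1s is the default)
curl -XPUT localhost:9200/test/_settings -d '{
  "index" : { "refresh_interval" : "1s" }
}'
```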

The transaction log

Apache Lucene can guarantee index consistency and all-or-nothing indexing, which is great. However, this cannot ensure that no data will be lost when a failure happens while writing data to the index (for example, when there isn't enough space on the device, the device is faulty, or there aren't enough file handles available to create new index files). Another problem is that frequent commits are costly in terms of performance (as you may recall, a single commit triggers the creation of a new segment, and this can trigger segment merges). Elasticsearch solves these issues by implementing a transaction log. The transaction log holds all uncommitted operations and, from time to time, Elasticsearch creates a new log for subsequent changes. When something goes wrong, the transaction log can be replayed to make sure that none of the changes were lost. All of these tasks happen automatically, so the user may not be aware that a commit was triggered at a particular moment. In Elasticsearch, the moment when the information from the transaction log is synchronized with the storage (which is the Apache Lucene index) and the transaction log is cleared is called flushing.

Note

Please note the difference between the flush and refresh operations. In most cases, refresh is what you want: it is all about making new data available for searching. The flush operation, on the other hand, is used to make sure that all the data is correctly stored in the index and that the transaction log can be cleared.

In addition to automatic flushing, a flush can be forced manually using the flush API. For example, we can flush the data stored in the transaction logs of all indices by running the following command:

curl -XGET localhost:9200/_flush

Or, we can run the flush command for a particular index, which in our case is the one called library:

curl -XGET localhost:9200/library/_flush
curl -XGET localhost:9200/library/_refresh

In the second example, we used flush together with refresh, which, after flushing the data, opens a new Searcher.

The transaction log configuration

If the default behavior of the transaction log is not sufficient, Elasticsearch allows us to configure how the transaction log is handled. The following parameters can be set in the elasticsearch.yml file, as well as by using the Update Settings API, to control the transaction log behavior:

  • index.translog.flush_threshold_period: This defaults to 30 minutes (30m). It controls the time after which a flush is forced automatically, even if no new data was written to the transaction log. In some cases, a flush can cause a lot of I/O operations, so sometimes it's better to flush more often with less data accumulated in the log.
  • index.translog.flush_threshold_ops: This specifies the maximum number of operations after which the flush operation will be performed. By default, Elasticsearch does not limit these operations.
  • index.translog.flush_threshold_size: This specifies the maximum size of the transaction log. If the size of the transaction log is equal to or greater than this value, the flush operation will be performed. It defaults to 200 MB.
  • index.translog.interval: This defaults to 5s and describes how often Elasticsearch checks whether a flush is needed. Elasticsearch randomizes the actual interval to be between the defined value and twice that value.
  • index.gateway.local.sync: This defines how often the transaction log should be synchronized to disk using the fsync system call. The default is 5s.
  • index.translog.disable_flush: This option allows us to disable automatic flushing. By default, flushing is enabled, but sometimes it is handy to disable it temporarily, for example, during the import of a large number of documents.
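As a sketch, the same parameters can also be placed in the elasticsearch.yml file. The values below are illustrative examples only, not recommendations (in particular, 5000 for flush_threshold_ops is an arbitrary figure, since that setting has no limit by default):

```yaml
# Illustrative transaction log settings in elasticsearch.yml
index.translog.flush_threshold_period: 30m
index.translog.flush_threshold_ops: 5000
index.translog.flush_threshold_size: 200mb
index.translog.interval: 5s
index.gateway.local.sync: 5s
```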

Note

All of the mentioned parameters are specified for an index of our choice, but they define the behavior of the transaction log for each of that index's shards.

In addition to setting the previously mentioned properties in the elasticsearch.yml file, we can also set them by using the Update Settings API. For example, the following command disables flushing for the test index:

curl -XPUT localhost:9200/test/_settings -d '{
 "index" : {
   "translog.disable_flush" : true
  }
}'

We ran the previous command before importing a large amount of data, which gave us an indexing performance boost. However, remember to turn flushing back on when the import is done.
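Turning flushing back on mirrors the earlier command, only with the value set to false:

```shell
# Re-enable automatic flushing after the import completes
curl -XPUT localhost:9200/test/_settings -d '{
 "index" : {
   "translog.disable_flush" : false
  }
}'
```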

Near real-time GET

The transaction log gives us one more feature for free: a real-time GET operation, which returns the latest version of a given document, including uncommitted changes. The real-time GET operation fetches data from the index, but first it checks whether a newer version of the document is available in the transaction log. If a not-yet-flushed version exists there, the data from the index is ignored and the newer version of the document, the one from the transaction log, is returned.

In order to see how it works, you can replace the search operation in our example with the following command:

curl -XGET localhost:9200/test/test/1?pretty

Elasticsearch should return a result similar to the following:

{
     "_index" : "test",
     "_type" : "test",
     "_id" : "1",
     "_version" : 2,
     "exists" : true, "_source" : { "title": "test2" }
}

If you look at the result, you will see that, again, the result is just as we expected, and no trick with refresh was required to obtain the newest version of the document.
