Chapter 8. Improving Performance

In the previous chapter, we looked at the configuration of the discovery and recovery modules. We configured these modules, learned why they are important, and saw additional discovery implementations available through plugins. We used the human-friendly Cat API to get information about the cluster in a readable form, backed up our data to external cloud storage, and discussed tribe nodes: a federated search functionality that allows you to connect several Elasticsearch clusters together. By the end of this chapter, you will have learned the following things:

  • How doc values can help with queries that rely on the field data cache
  • How the garbage collector works
  • How to benchmark your queries and fix performance problems before going to production
  • What the Hot Threads API is and how it can help you diagnose problems
  • How to scale Elasticsearch and what to look at when doing that
  • Preparing Elasticsearch for high querying throughput use cases
  • Preparing Elasticsearch for high indexing throughput use cases

Using doc values to optimize your queries

In the Understanding Elasticsearch caching section of Chapter 6, Low-level Index Control, we described caching: one of many ways to improve Elasticsearch's performance. Unfortunately, caching is not a silver bullet and, sometimes, it is better to avoid it. If your data changes rapidly and your queries are unique and not repeatable, then caching won't really help and can sometimes even make performance worse.

The problem with field data cache

Every cache is based on a simple principle. The main assumption is that, to improve performance, it is worth storing some part of the data in memory instead of fetching it from slow sources such as spinning disks, or to save the system from having to recalculate some processed data. However, caching is not free and has its price: in Elasticsearch, the cost of caching is mostly memory. Depending on the cache type, you may only need to store recently used data, but again, that's not always possible. Sometimes, it is necessary to hold all the information at once, because otherwise the cache is just useless. Take, for example, the field data cache used for sorting or aggregations: to make this functionality work, all values for a given field must be uninverted by Elasticsearch and placed in this cache. If we have a large number of documents and our shards are very large, we can be in trouble. The signs of such trouble may look similar to the following response returned by Elasticsearch when running queries:

{
  "error": "ReduceSearchPhaseException[Failed to execute phase  [fetch], [reduce] ; shardFailures {[vWD3FNVoTy- 64r2vf6NwAw][dvt1][1]: ElasticsearchException[Java heap space];  nested: OutOfMemoryError[Java heap space]; }{[vWD3FNVoTy- 64r2vf6NwAw][dvt1][2]: ElasticsearchException[Java heap space];  nested: OutOfMemoryError[Java heap space]; }]; nested:  OutOfMemoryError[Java heap space]; ",
  "status": 500
}

Other indications of memory-related problems may be present in the Elasticsearch logs and look as follows:

[2014-11-29 23:21:32,991][DEBUG][action.search.type       ] [Abigail Brand] [dvt1][2], node[vWD3FNVoTy-64r2vf6NwAw], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@49d609d3] lastShard [true]
org.elasticsearch.ElasticsearchException: Java heap space
  at org.elasticsearch.ExceptionsHelper.convertToRuntime(ExceptionsHelper.java:46)
  at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:304)
  at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:231)
  at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:228)
  at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:559)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.OutOfMemoryError: Java heap space
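If changing the mapping is not immediately possible, the memory used by the field data cache can at least be bounded with the indices.fielddata.cache.size setting. The following is a minimal sketch of such a setting in elasticsearch.yml; the 30% value is only an illustrative assumption and needs to be tuned to your own heap size and query patterns:

indices.fielddata.cache.size: 30%

Limiting the cache causes the least recently used entries to be evicted, which costs performance when those entries are needed again, and it does not remove the underlying cost of uninverting the data.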

This is where doc values can help us. Doc values are column-oriented data structures in Lucene; instead of storing the data in the inverted index, they keep it in a document-oriented structure that is written to disk and built at indexing time. Because of this, doc values allow us to avoid keeping uninverted data in the field data cache: the values are read from the index itself and, since Elasticsearch 1.4.0, they are as fast as the in-memory field data cache.

An example of doc values usage

To show you the difference in memory consumption between the doc values-based approach and the field data cache-based approach, we indexed some simple documents into Elasticsearch. We indexed the same data to two indices: dvt1 and dvt2. Their structure is identical; the only difference is the doc_values setting shown in the following mapping:

{
  "t": {
    "properties": {
      "token": {
        "type": "string",
        "index": "not_analyzed",
        "doc_values": true
      }
    }
  }
}

The dvt2 index uses doc_values, while dvt1 doesn't, so queries run against dvt1 (if they use sorting or aggregations) will rely on the field data cache.
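The full index creation commands are not shown here, but a minimal sketch of how two such indices could be created may look as follows; the index names and the single t type follow the example, while everything else is just an assumption made for this illustration:

curl -XPUT 'localhost:9200/dvt1' -d '{
  "mappings": {
    "t": {
      "properties": {
        "token": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}'

curl -XPUT 'localhost:9200/dvt2' -d '{
  "mappings": {
    "t": {
      "properties": {
        "token": { "type": "string", "index": "not_analyzed", "doc_values": true }
      }
    }
  }
}'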

Note

For the purpose of the tests, we've set the JVM heap lower than the default values given to Elasticsearch. The example Elasticsearch instance was run using:

bin/elasticsearch -Xmx16m -Xms16m

This may seem somewhat insane at first sight, but who said that we can't run Elasticsearch on an embedded device? The other way to simulate this problem is, of course, to index far more data. However, for the purpose of the test, keeping the memory low is more than enough.

Let's now see how Elasticsearch behaves when hitting our example indices. The query does not look complicated but shows the problem very well. We will try to sort our data on the basis of the single field in our documents: the token field. As we know, sorting requires uninverted data, so it will use either the field data cache or doc values if they are available. The query itself looks as follows:

{
  "sort": [
    {
      "token": {
        "order": "desc"
      }
    }
  ]
}

It is a simple sort, but it is sufficient to take down our server when we try to search in the dvt1 index. At the same time, a query run against the dvt2 index returns the expected results without any sign of problems.
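If you want to reproduce this behavior, the sort query can be sent to each index separately; a minimal sketch, assuming a default local instance listening on port 9200, may look like this:

curl -XGET 'localhost:9200/dvt1/_search?pretty' -d '{
  "sort": [
    {
      "token": {
        "order": "desc"
      }
    }
  ]
}'

Running the same command against dvt2 instead of dvt1 lets you compare the behavior of both indices.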

The difference in memory usage is significant. To compare it, we restarted Elasticsearch without the memory limit in the startup parameters, ran the query against both dvt1 and dvt2, and then used the following command to check the field data cache usage:

curl -XGET 'localhost:9200/dvt1,dvt2/_stats/fielddata?pretty'

The response returned by Elasticsearch in our case was as follows:

{
  "_shards" : {
    "total" : 20,
    "successful" : 10,
    "failed" : 0
  },
  "_all" : {
    "primaries" : {
      "fielddata" : {
        "memory_size_in_bytes" : 17321304,
        "evictions" : 0
      }
    },
    "total" : {
      "fielddata" : {
        "memory_size_in_bytes" : 17321304,
        "evictions" : 0
      }
    }
  },
  "indices" : {
    "dvt2" : {
      "primaries" : {
        "fielddata" : {
          "memory_size_in_bytes" : 0,
          "evictions" : 0
        }
      },
      "total" : {
        "fielddata" : {
          "memory_size_in_bytes" : 0,
          "evictions" : 0
        }
      }
    },
    "dvt1" : {
      "primaries" : {
        "fielddata" : {
          "memory_size_in_bytes" : 17321304,
          "evictions" : 0
        }
      },
      "total" : {
        "fielddata" : {
          "memory_size_in_bytes" : 17321304,
          "evictions" : 0
        }
      }
    }
  }
}

The most interesting parts are the fielddata sections reported for each index. As we can see, the dvt1 index, which doesn't use doc_values, needs 17321304 bytes (about 16 MB) of memory for the field data cache. At the same time, dvt2 uses nothing: exactly no RAM is used to store the uninverted data.

Of course, as with most optimizations, doc values are not free when it comes to resources. The first drawback is speed: doc values are slightly slower than the field data cache. The second drawback is the additional disk space needed for doc_values. For example, in our simple test case, the index with doc values was 41 MB, while the index without doc values was 34 MB. This gives us a bit more than a 20 percent increase in the index size, although the exact overhead depends on the data in your index. However, remember that if you have memory problems related to queries and the field data cache, you may want to turn on doc values, reindex your data, and stop worrying about out-of-memory exceptions caused by the field data cache.
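If you would like to check the size difference on your own indices, the store statistics can be compared in the same way as we compared the field data statistics; the following command is a sketch that assumes our two example indices:

curl -XGET 'localhost:9200/dvt1,dvt2/_stats/store?pretty'

The size_in_bytes value reported in the store section of each index shows how much disk space that index occupies, which is where the additional cost of doc values becomes visible.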
