Chapter 8. Dealing with Problems

In the previous chapter, we looked at the possibilities of monitoring cluster health and state by using the ElasticSearch API as well as third-party tools. We learned what the discovery module is and how to configure it. In addition to that, we learned how to control shard and replica allocation and how to install additional plugins for our ElasticSearch instances. We also saw what each gateway module is responsible for and which configuration options we can use.

In this chapter, we will take a look at how to efficiently fetch a large amount of data from ElasticSearch. We will discuss the ability to control cluster shard rebalancing and how to validate our ElasticSearch queries. In addition to that, we will see how to use the new warming up functionality to improve the performance of our queries. By the end of this chapter, you will have learned the following:

  • How to use scrolling for fetching a large number of results efficiently
  • How to control cluster rebalancing
  • How to validate your queries
  • How to use the warming up functionality

Why are results on later pages slow?

Let's imagine that we have an index with several million documents. We already know how to build our queries, when to use filters, and so on. However, looking at the query logs, we see that particular kinds of queries are significantly slower than the others. These queries use paging, and the from parameter shows that the offsets have large values. From the application side, this means that users are going through an enormous number of result pages. Often this doesn't make sense: if users don't find the desired results on the first few pages, they give up. Because such activity can also indicate something bad (possible data theft), many applications limit paging to a few dozen pages. In our case, let's assume that we face a different scenario and we have to provide this functionality.

What is the problem?

When ElasticSearch generates a response, it must determine the order of the documents forming the result. If we are on the first page, this is not a big problem. ElasticSearch just finds the matching documents and collects the first ones, let's say 20 documents. But if we are on the tenth page, ElasticSearch has to collect all the documents for pages 1 to 10, that is, the top 200 hits, and then discard the 180 that belong to pages 1 to 9. The deeper the page, the more work is wasted. The problem is not ElasticSearch-specific; a similar situation can be found in database systems, for example.
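To illustrate, a request for the tenth page, with 20 results per page, could look like the following (the match_all query is just a placeholder). To answer it, ElasticSearch must gather the top 200 hits only to return the last 20 of them:

curl 'localhost:9200/library/_search?pretty' -d '{
  "from" : 180,
  "size" : 20,
  "query" : { "match_all" : {} }
}'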

Scrolling to the rescue

The solution is simple. Since ElasticSearch has to perform some operation (determining the documents for the previous pages) for each request, we can ask it to store this information for subsequent queries. The drawback is that this information cannot be stored forever because of limited resources. ElasticSearch therefore lets us declare how long we need this information to be available. Let's see how it works in practice. First of all, we query ElasticSearch as we usually do, but in addition to all the usual parameters, we add one more: the scroll parameter, which tells ElasticSearch that we want to use scrolling and suggests how long it should keep the information about the results:

curl 'localhost:9200/library/_search?pretty&scroll=5m' -d @query.json
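The query.json file holds an ordinary query body. For illustration only, it could contain something as simple as the following match_all query, where the size parameter controls how many documents each batch returns:

{
  "size" : 20,
  "query" : { "match_all" : {} }
}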

The actual content of this query is irrelevant to scrolling. The important thing is how ElasticSearch modifies the reply. Look at the first few lines of the response returned by ElasticSearch:

{
  "_scroll_id" : 
  "cXVlcnlUaGVuRmV0Y2g7NTsxMDI6dklNMlkzTG1RTDJ2b25oTDNENmJzZzsxMD
  U6dklNMlkzTG1RTDJ2b25oTDNENmJzZzsxMDQ6dklNMlkzTG1RTDJ2b25oTDNEN
  mJzZzsxMDE6dklNMlkzTG1RTDJ2b25oTDNENmJzZzsxMDM6dklNMlkzTG1RTDJ
  2b25oTDNENmJzZzswOw==",
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1341211,
    …

The new part is _scroll_id. This is a handle that we will use in subsequent queries. ElasticSearch provides a special endpoint for this. Let's look at the following example:

curl -XGET 'localhost:9200/_search/scroll?scroll=5m&pretty&scroll_id=cXVlcnlUaGVuRmV0Y2g7NTsxMjg6dklNlkzTG1RTDJ2b25oTDNENmJzZzsxMjk6dklNMlkzTG1RTDJ2b25oTDNENmJzZzsxMzA6dklNMlkzTG1RTDJ2b25oTDNENmJzZzsxMjc6dklNMlkzTG1RTDJ2b25oTDNENmJzZzsxMjY6dklNMlkzTG1RTDJ2b25oTDNENmJzZzswOw=='
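Depending on the ElasticSearch version, the scroll identifier can also be sent as the request body instead of a URL parameter, which is convenient because the identifiers can get quite long. Assuming such support, the preceding call could be rewritten as follows:

curl -XGET 'localhost:9200/_search/scroll?scroll=5m&pretty' -d 'cXVlcnlUaGVuRmV0Y2g7NTsxMjg6dklNlkzTG1RTDJ2b25oTDNENmJzZzsxMjk6dklNMlkzTG1RTDJ2b25oTDNENmJzZzsxMzA6dklNMlkzTG1RTDJ2b25oTDNENmJzZzsxMjc6dklNMlkzTG1RTDJ2b25oTDNENmJzZzsxMjY6dklNMlkzTG1RTDJ2b25oTDNENmJzZzswOw=='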

Now, every call to this endpoint with the scroll_id parameter returns the next page of results. Note that each response carries its own _scroll_id, and you should always pass the most recently returned one to the next call. Also remember that the handle is only valid until the defined time-out passes. After that time, you will see an error response similar to the following:

{
  "_scroll_id" : 
  "cXVlcnlUaGVuRmV0Y2g7NTsxMjg6dklNMlkzTG1RTDJ2b25oTDNENmJzZzsxMj
  k6dklNMlkzTG1RTDJ2b25oTDNENmJzZzsxMzA6dklNMlkzTG1RTDJ2b25oTDNEN
  mJzZzsxMjc6dklNMlkzTG1RTDJ2b25oTDNENmJzZzsxMjY6dklNMlkzTG1RTDJ2
  b25oTDNENmJzZzswOw==",
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 0,
    "failed" : 5,
    "failures" : [ {
      "status" : 500,
      "reason" : "SearchContextMissingException[No search context 
      found for id [128]]"
    }, {
      "status" : 500,
      "reason" : "SearchContextMissingException[No search context 
      found for id [126]]"
    }, {
      "status" : 500,
      "reason" : "SearchContextMissingException[No search context 
      found for id [127]]"
    }, {
      "status" : 500,
      "reason" : "SearchContextMissingException[No search context 
      found for id [130]]"
    }, {
      "status" : 500,
      "reason" : "SearchContextMissingException[No search context 
      found for id [129]]"
    } ]
  },
  "hits" : {
    "total" : 0,
    "max_score" : 0.0,
    "hits" : [ ]
  }
}

As you may suspect, this solution is not ideal. It is not suited to cases where users request random pages of various result sets or where the time between consecutive requests is hard to predict. However, you can use it with success, for example, when implementing data transfer between several systems, as the sketch that follows shows.
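What follows is a minimal sketch of such a transfer, assuming a POSIX shell with curl and the jq JSON processor available; query.json is the same file as before and the output file name is arbitrary. Each iteration processes one batch and then fetches the next one using the most recently returned identifier, stopping when ElasticSearch hands back an empty page:

#!/bin/sh
# Start scrolling; ask ElasticSearch to keep the search context alive for 5 minutes.
RESPONSE=$(curl -s 'localhost:9200/library/_search?scroll=5m' -d @query.json)

while true; do
  # Always extract the most recently returned identifier.
  SCROLL_ID=$(echo "$RESPONSE" | jq -r '._scroll_id')
  HITS=$(echo "$RESPONSE" | jq '.hits.hits | length')
  # An empty batch means that we have consumed all the results.
  if [ "$HITS" -eq 0 ]; then
    break
  fi
  # Process the current batch; here we simply append it to a file.
  echo "$RESPONSE" | jq '.hits.hits' >> exported_documents.json
  # Fetch the next batch of results.
  RESPONSE=$(curl -s "localhost:9200/_search/scroll?scroll=5m&scroll_id=$SCROLL_ID")
done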
