The Scroll API

Let's imagine that we have an index with several million documents. We already know how to build queries against it. However, when trying to fetch a large number of documents, we notice that the further we go into the pages of results, the slower the queries become, until they finally time out or run into memory issues.

The reason for this is that full-text search engines, especially distributed ones, don't handle deep paging very well. Of course, getting a few hundred pages of results is not a problem for Elasticsearch, but for going through all the indexed documents, or through a very large result set, a specialized API has been introduced.

Problem definition

When Elasticsearch generates a response, it must determine the order of the documents that form the result. If we are on the first page, this is not a big problem: Elasticsearch just finds the matching documents and collects the first ones; let's say, 20 documents. But if we are on the tenth page, Elasticsearch has to gather all the documents from pages one to ten and then discard the ones that belong to pages one to nine. This is even more complicated in a distributed environment, because we don't know in advance which nodes the results will come from. Because of that, each node needs to build its part of the response and keep it in memory for some time. The problem is not specific to Elasticsearch; a similar situation can be found in database systems and, generally, in every system that uses a so-called priority queue to produce sorted results.
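
To put some hypothetical numbers on this, the following standard paged query asks for the tenth page with 20 results per page (from is the number of results to skip). On a five-shard index, every shard has to find and sort its own top 200 documents, and the node handling the request has to merge the resulting 1,000 candidates only to discard all but 20 of them:

curl 'localhost:9200/library/_search?pretty' -d '{
  "from" : 180,
  "size" : 20,
  "query" : {
    "match_all" : { }
  }
}'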

Scrolling to the rescue

The solution is simple. Since Elasticsearch has to repeat some of the work (determining the documents for the previous pages) for every request, we can ask it to store this information for subsequent queries. The drawback is that this information cannot be kept forever, because resources are limited, so Elasticsearch lets us declare how long it should remain available. Let's see how it works in practice.

First of all, we query Elasticsearch as we usually do. However, in addition to all the well-known parameters, we add one more: the scroll parameter, which tells Elasticsearch that we want to use scrolling and suggests how long it should keep the information about the results. We can do this by sending a query as follows:

curl 'localhost:9200/library/_search?pretty&scroll=5m' -d '{
  "size" : 1,
  "query" : {
    "match_all" : { }
  }
}'

The content of this query doesn't matter here; the important thing is how Elasticsearch modifies the response. Look at the first few lines of the response returned by Elasticsearch:

{
  "_scroll_id" : "cXVlcnlUaGVuRmV0Y2g7NTsxNjo1RDNrYnlfb1JTeU1sX20yS0NRSUZ3OzE3OjVEM2tieV9vUlN5TWxfbTJLQ1FJRnc7MTg6NUQza2J5X29SU3lNbF9tMktDUUlGdzsxOTo1RDNrYnlfb1JTeU1sX20yS0NRSUZ3OzIwOjVEM2tieV9vUlN5TWxfbTJLQ1FJRnc7MDs=",
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    ...

The new part is the _scroll_id field. It is a handle that we will use in the queries that follow. Elasticsearch has a special endpoint for this: the _search/scroll endpoint. Let's look at the following example:

curl -XGET 'localhost:9200/_search/scroll?pretty' -d '{
  "scroll" : "5m",
  "scroll_id" : "cXVlcnlUaGVuRmV0Y2g7NTsyNjo1RDNrYnlfb1JTeU1sX20yS0NRSUZ3OzI3OjVEM2tieV9vUlN5TWxfbTJLQ1FJRnc7Mjg6NUQza2J5X29SU3lNbF9tMktDUUlGdzsyOTo1RDNrYnlfb1JTeU1sX20yS0NRSUZ3OzMwOjVEM2tieV9vUlN5TWxfbTJLQ1FJRnc7MDs="
}'

Now, every call to this endpoint with the scroll_id returns the next page of results. Remember that the handle is only valid for the declared period of inactivity: each scroll request resets the timer.
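
Keeping a scroll context alive consumes resources on the cluster, so if we finish before the declared timeout, it is worth releasing the handle explicitly instead of waiting for it to expire. A minimal sketch using the clear scroll API (recent Elasticsearch versions accept a JSON body as shown here; older versions expect the raw scroll identifier as the request body):

curl -XDELETE 'localhost:9200/_search/scroll?pretty' -d '{
  "scroll_id" : ["<the _scroll_id from the last response>"]
}'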

Of course, this solution is not ideal, and it is not appropriate when requests jump randomly between pages of different result sets or when the time between requests is hard to predict. However, it works very well for use cases where you need to process a large result set sequentially, such as transferring data between systems.
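
To make this concrete, here is a minimal sketch of such a transfer loop as a shell script. It rests on two assumptions not taken from the text above: the jq JSON processor is installed for parsing responses, and we simply print each hit where a real application would store it somewhere. Newer Elasticsearch versions also require a -H 'Content-Type: application/json' header on each curl call.

#!/bin/bash
# Sketch: scroll through every document in the library index.
RESPONSE=$(curl -s 'localhost:9200/library/_search?scroll=5m' -d '{
  "size" : 100,
  "query" : { "match_all" : { } }
}')

while true ; do
  HITS=$(echo "$RESPONSE" | jq '.hits.hits | length')
  [ "$HITS" -eq 0 ] && break                # an empty page means we are done
  echo "$RESPONSE" | jq -c '.hits.hits[]'   # process the current page here
  # The scroll identifier may change between calls, so always use the latest one.
  SCROLL_ID=$(echo "$RESPONSE" | jq -r '._scroll_id')
  RESPONSE=$(curl -s 'localhost:9200/_search/scroll' -d "{
    \"scroll\" : \"5m\",
    \"scroll_id\" : \"$SCROLL_ID\"
  }")
done

When the loop ends before the timeout, the last scroll context can be released with the DELETE request shown earlier.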
