One of the most requested features in Elasticsearch has always been document folding, also called document grouping. It was the issue that received the most +1 marks for Elasticsearch, which is not surprising at all. It is often convenient to show a list of documents grouped by a particular value, especially when the number of results is very large. In this case, instead of showing all the documents one by one, we return only one (or a few) documents from every group. For example, in our library, we could prepare a query returning all the documents about wildlife sorted by publishing date, but limit the list to two books from every year. Another use case where grouping is very handy is counting and showing distinct values in a field. For example, we could return only a single entry for a book that had many editions.
The top_hits aggregation was introduced in Elasticsearch 1.3 along with the changes to scripting, which we will discuss in the Scripting changes section later in this chapter. What is interesting is that we can force Elasticsearch to provide grouping functionality with this aggregation. In fact, it seems that document folding is more or less a side effect and only one of the possible uses of the top_hits aggregation. In this section, we will focus only on how this particular aggregation works, and we assume that you already know the basic rules that govern the Elasticsearch aggregation framework.
If you don't have any experience with this Elasticsearch functionality, please consider looking at Elasticsearch Server Second Edition published by Packt Publishing or reading the Elasticsearch documentation page available at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations.html.
The idea behind the top_hits aggregation is simple. Every document assigned to a particular bucket can also be remembered. By default, only three documents per bucket are remembered. Let's see how it works using our example library index.
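The idea can be sketched in plain Python before involving Elasticsearch at all. The following is only an illustration of the concept, not how Elasticsearch implements it: the function name top_hits_per_bucket, the sample documents, their years, and their scores are all invented for this sketch. Documents are placed into buckets of 100 years, and each bucket remembers at most size of its best-scoring documents (three by default, as described above).

```python
from collections import defaultdict

# Hypothetical sample documents; years and scores are invented for illustration.
DOCS = [
    {"title": "Crime and Punishment", "year": 1886, "score": 1.0},
    {"title": "The Complete Sherlock Holmes", "year": 1936, "score": 0.7},
    {"title": "Sample Book A", "year": 1929, "score": 0.9},
    {"title": "Sample Book B", "year": 1961, "score": 0.4},
    {"title": "Sample Book C", "year": 1999, "score": 0.8},
]


def top_hits_per_bucket(docs, interval=100, size=3):
    """Group docs into year buckets and remember the `size` best-scoring docs of each."""
    buckets = defaultdict(list)
    for doc in docs:
        # Lower bound of the bucket, for example 1886 -> 1800 for interval=100.
        key = (doc["year"] // interval) * interval
        buckets[key].append(doc)

    result = {}
    for key in sorted(buckets):
        ranked = sorted(buckets[key], key=lambda d: d["score"], reverse=True)
        # doc_count covers the whole bucket; hits keeps only the remembered documents.
        result[key] = {"doc_count": len(ranked), "hits": ranked[:size]}
    return result


if __name__ == "__main__":
    for key, bucket in top_hits_per_bucket(DOCS).items():
        print(key, bucket["doc_count"], [d["title"] for d in bucket["hits"]])
```

Note that the bucket for 1900 contains four sample documents but remembers only three of them, which mirrors the difference between doc_count and the number of returned hits in a real response.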
To show you a potential use case that leverages the top_hits aggregation, we decided to use the following query:
curl -XGET "http://127.0.0.1:9200/library/_search?pretty" -d '{
  "size": 0,
  "aggs": {
    "when": {
      "histogram": {
        "field": "year",
        "interval": 100
      },
      "aggs": {
        "book": {
          "top_hits": {
            "_source": {
              "include": [ "title", "available" ]
            },
            "size": 1
          }
        }
      }
    }
  }
}'
In the preceding example, we ran the histogram aggregation on year ranges, so one bucket is created for every 100 years. The nested top_hits aggregation will remember the single document with the highest score from each bucket (because the size property is set to 1). We added the include option only to keep the results simple, so that only the title and available fields are returned for every aggregated document. The response returned by Elasticsearch should be similar to the following one:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "when": {
      "buckets": [
        {
          "key_as_string": "1800",
          "key": 1800,
          "doc_count": 1,
          "book": {
            "hits": {
              "total": 1,
              "max_score": 1,
              "hits": [
                {
                  "_index": "library",
                  "_type": "book",
                  "_id": "4",
                  "_score": 1,
                  "_source": {
                    "title": "Crime and Punishment",
                    "available": true
                  }
                }
              ]
            }
          }
        },
        {
          "key_as_string": "1900",
          "key": 1900,
          "doc_count": 3,
          "book": {
            "hits": {
              "total": 3,
              "max_score": 1,
              "hits": [
                {
                  "_index": "library",
                  "_type": "book",
                  "_id": "3",
                  "_score": 1,
                  "_source": {
                    "title": "The Complete Sherlock Holmes",
                    "available": false
                  }
                }
              ]
            }
          }
        }
      ]
    }
  }
}
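To use a response like the preceding one as a grouped result list, the remembered documents have to be pulled out of each bucket. The following is a minimal Python sketch of that step, assuming the response has already been fetched and is available as JSON text. The helper name titles_per_bucket is hypothetical, and RESPONSE is a trimmed copy of the preceding response that keeps only the fields the helper reads.

```python
import json

# Trimmed copy of the preceding response; only the fields read below are kept.
RESPONSE = json.loads("""
{
  "aggregations": {
    "when": {
      "buckets": [
        {
          "key": 1800,
          "doc_count": 1,
          "book": {"hits": {"total": 1, "hits": [
            {"_id": "4", "_source": {"title": "Crime and Punishment", "available": true}}
          ]}}
        },
        {
          "key": 1900,
          "doc_count": 3,
          "book": {"hits": {"total": 3, "hits": [
            {"_id": "3", "_source": {"title": "The Complete Sherlock Holmes", "available": false}}
          ]}}
        }
      ]
    }
  }
}
""")


def titles_per_bucket(response, agg="when", sub_agg="book"):
    """Map every histogram bucket key to the titles remembered by top_hits."""
    grouped = {}
    for bucket in response["aggregations"][agg]["buckets"]:
        hits = bucket[sub_agg]["hits"]["hits"]
        grouped[bucket["key"]] = [hit["_source"]["title"] for hit in hits]
    return grouped


if __name__ == "__main__":
    for century, titles in titles_per_bucket(RESPONSE).items():
        print(century, titles)
```

The aggregation names when and book are the ones chosen in the query; with different names, the agg and sub_agg arguments would have to change accordingly.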
The interesting part of the response is the book section of each bucket. We can see that because of the top_hits aggregation, the highest scoring document from each bucket is included in the response. In our particular case, the query was the match_all one and all the documents have the same score, so the top scoring document for every bucket is more or less random. Elasticsearch used the match_all query because we didn't specify any query at all; this is the default behavior. If we want custom sorting, this is not a problem for Elasticsearch. For example, we can return the first book from a given century. All we need to do is add a proper sorting option, just like in the following query:
curl -XGET 'http://127.0.0.1:9200/library/_search?pretty' -d '{
  "size": 0,
  "aggs": {
    "when": {
      "histogram": {
        "field": "year",
        "interval": 100
      },
      "aggs": {
        "book": {
          "top_hits": {
            "sort": {
              "year": "asc"
            },
            "_source": {
              "include": [ "title", "available" ]
            },
            "size": 1
          }
        }
      }
    }
  }
}'
Please take a look at the sort section of the preceding query. We've added sorting to the top_hits aggregation, so the results are sorted on the basis of the year field. This means that the first document will be the one with the lowest value in that field, and this is the document that is going to be returned for each bucket.
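When such requests are built by a program rather than typed by hand, it can be easier to assemble the body as a data structure and serialize it than to maintain a quoted JSON string. The following is a minimal Python sketch under that assumption. The helper name build_top_hits_body and its parameters are hypothetical; the body it produces follows the structure of the queries shown in this section, including the when and book aggregation names.

```python
import json


def build_top_hits_body(field="year", interval=100, size=1,
                        sort=None, source_fields=None):
    """Build a histogram + top_hits request body shaped like the queries in this section."""
    top_hits = {"size": size}
    if sort is not None:
        # For example {"year": "asc"} to get the earliest book of each bucket.
        top_hits["sort"] = sort
    if source_fields:
        # Limits the returned _source to the listed fields.
        top_hits["_source"] = {"include": list(source_fields)}

    return {
        "size": 0,  # we are interested only in the aggregation results
        "aggs": {
            "when": {
                "histogram": {"field": field, "interval": interval},
                "aggs": {"book": {"top_hits": top_hits}},
            }
        },
    }


if __name__ == "__main__":
    body = build_top_hits_body(sort={"year": "asc"},
                               source_fields=["title", "available"])
    # The printed JSON can be passed to curl with the -d option.
    print(json.dumps(body, indent=2))
```

Leaving sort as None omits the sort section, which corresponds to the first query of this section, where the documents are ordered by score.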
However, sorting and field inclusion are not everything that we can do inside the top_hits aggregation. Elasticsearch allows using several other functionalities related to document retrieval. We don't want to discuss them all in detail because you should be familiar with most of them if you are familiar with the Elasticsearch aggregation module. However, for the purpose of this chapter, let's look at the following example:
curl -XGET 'http://127.0.0.1:9200/library/_search?pretty' -d '{
  "query": {
    "filtered": {
      "query": {
        "match": {
          "_all": "quiet"
        }
      },
      "filter": {
        "term": {
          "copies": 1,
          "_name": "copies_filter"
        }
      }
    }
  },
  "size": 0,
  "aggs": {
    "when": {
      "histogram": {
        "field": "year",
        "interval": 100
      },
      "aggs": {
        "book": {
          "top_hits": {
            "highlight": {
              "fields": {
                "title": {}
              }
            },
            "explain": true,
            "version": true,
            "_source": {
              "include": [ "title", "available" ]
            },
            "fielddata_fields": [ "title" ],
            "script_fields": {
              "century": {
                "script": "(doc[\"year\"].value / 100).intValue()"
              }
            },
            "size": 1
          }
        }
      }
    }
  }
}'
As you can see, our query contains the following functionalities:
Named filters (in our example, the filter is named copies_filter)