Document grouping

One of the most requested features in Elasticsearch has always been document folding, also called document grouping. It was the most +1 marked issue for Elasticsearch, which is not surprising at all. It is often very convenient to show a list of documents grouped by a particular value, especially when the number of results is very large. In this case, instead of showing all the documents one by one, we would return only one (or a few) documents from every group. For example, in our library, we could prepare a query returning all the documents about wildlife sorted by publishing date, but limit the list to two books from every year. Another use case where grouping comes in handy is counting and showing distinct values in a field. An example of such behavior is returning only a single entry for a book that had many editions.

Top hits aggregation

The top_hits aggregation was introduced in Elasticsearch 1.3 along with the changes to scripting, which we will talk about in the Scripting changes section later in this chapter. What is interesting is that we can force Elasticsearch to provide grouping functionality with this aggregation. In fact, it seems that document folding is more or less a side effect and only one of the possible usage examples of the top_hits aggregation. In this section, we will focus only on how this particular aggregation works, and we assume that you already know the basics of the Elasticsearch aggregation framework.

Note

If you don't have any experience with this Elasticsearch functionality, please consider looking at Elasticsearch Server Second Edition published by Packt Publishing or reading the Elasticsearch documentation page available at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations.html.

The idea behind the top_hits aggregation is simple. Every document that is assigned to a particular bucket can also be remembered. By default, only three documents per bucket are remembered. Let's see how it works using our example library index.
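To make the idea concrete before turning to the real query, here is a minimal in-memory sketch in Python. It is not how Elasticsearch implements the aggregation; it only mimics the visible behavior described above: documents are assigned to buckets, and each bucket remembers at most three of its best-scoring documents. The function name top_hits_per_bucket, the genre field, and the sample documents are invented for illustration and do not come from the library index.

```python
from collections import defaultdict


def top_hits_per_bucket(docs, bucket_key, size=3):
    """Group docs into buckets and keep at most `size` best-scoring docs per bucket.

    This only imitates the observable effect of top_hits; the default of 3
    mirrors the default number of remembered documents mentioned in the text.
    """
    buckets = defaultdict(list)
    for doc in docs:
        buckets[bucket_key(doc)].append(doc)
    return {
        key: sorted(members, key=lambda d: d["_score"], reverse=True)[:size]
        for key, members in buckets.items()
    }


# Hypothetical documents, invented for this sketch.
docs = [
    {"_id": "a", "genre": "novel", "_score": 0.2},
    {"_id": "b", "genre": "novel", "_score": 0.9},
    {"_id": "c", "genre": "novel", "_score": 0.5},
    {"_id": "d", "genre": "novel", "_score": 0.7},
    {"_id": "e", "genre": "poetry", "_score": 0.1},
]

result = top_hits_per_bucket(docs, lambda d: d["genre"])
novel_ids = [d["_id"] for d in result["novel"]]
poetry_ids = [d["_id"] for d in result["poetry"]]
```

The novel bucket holds four documents, but only the three with the highest scores are kept; the poetry bucket has a single document, so it is returned as is.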

An example

To show you a potential use case that leverages the top_hits aggregation, we decided to use the following query:

curl -XGET "http://127.0.0.1:9200/library/_search?pretty" -d'
{
  "size": 0,
  "aggs": {
    "when": {
      "histogram": {
        "field": "year",
        "interval": 100
      },
      "aggs": {
        "book": {
          "top_hits": {
            "_source": {
              "include": [
                "title",
                "available"
              ]
            },
            "size": 1
          }
        }
      }
    }
  }
}'

In the preceding example, we ran a histogram aggregation on the year field, creating one bucket for every 100 years. The nested top_hits aggregation will remember a single document with the greatest score from each bucket (because of the size property set to 1). We added the include option only to keep the results simple, so that only the title and available fields are returned for every aggregated document. The response returned by Elasticsearch should be similar to the following one:

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 4,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "when": {
         "buckets": [
            {
               "key_as_string": "1800",
               "key": 1800,
               "doc_count": 1,
               "book": {
                  "hits": {
                     "total": 1,
                     "max_score": 1,
                     "hits": [
                        {
                           "_index": "library",
                           "_type": "book",
                           "_id": "4",
                           "_score": 1,
                           "_source": {
                              "title": "Crime and Punishment",
                              "available": true
                           }
                        }
                     ]
                  }
               }
            },
            {
               "key_as_string": "1900",
               "key": 1900,
               "doc_count": 3,
               "book": {
                  "hits": {
                     "total": 3,
                     "max_score": 1,
                     "hits": [
                        {
                           "_index": "library",
                           "_type": "book",
                           "_id": "3",
                           "_score": 1,
                           "_source": {
                              "title": "The Complete Sherlock Holmes",
                              "available": false
                           }
                        }
                     ]
                  }
               }
            }
         ]
      }
   }
}

The interesting parts of the response are the book sections nested in each bucket. We can see that because of the top_hits aggregation, the highest-scoring document from each bucket is included in the response. In our particular case, the query was the match_all one and all the documents have the same score, so the top-scoring document for every bucket is more or less random. Elasticsearch used the match_all query because we didn't specify any query at all; this is the default behavior. Custom sorting is not a problem for Elasticsearch either. For example, we can return the first book from a given century. All we need to do is add a proper sorting option, just like in the following query:

curl -XGET 'http://127.0.0.1:9200/library/_search?pretty' -d '{
  "size": 0,
  "aggs": {
    "when": {
      "histogram": {
        "field": "year",
        "interval": 100
      },
      "aggs": {
        "book": {
          "top_hits": {
            "sort": {
                "year": "asc"
            },
            "_source": {
              "include": [
                "title",
                "available"
              ]
            },
            "size": 1
          }
        }
      }
    }
  }
}'

Please take a look at the sort section of the preceding query. We've added sorting to the top_hits aggregation, so the documents in each bucket are sorted on the basis of the year field. This means that the first document will be the one with the lowest value in that field, and this is the document that is going to be returned for each bucket.
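The combined effect of the histogram and the sorted top_hits can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bucket key is taken to be the field value rounded down to a multiple of the interval (which matches the 1800 and 1900 keys in the response shown earlier), and the helper names, titles, and years below are hypothetical placeholders rather than the real contents of the library index.

```python
import math
from collections import defaultdict


def histogram_key(value, interval):
    """Round a value down to the start of its bucket, e.g. 1886 -> 1800 for interval 100."""
    return math.floor(value / interval) * interval


def first_per_bucket(docs, field, interval, size=1):
    """Bucket docs by `field`, then keep the `size` docs with the lowest `field` value."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[histogram_key(doc[field], interval)].append(doc)
    return {
        key: sorted(members, key=lambda d: d[field])[:size]  # ascending sort, like "asc"
        for key, members in sorted(buckets.items())
    }


# Hypothetical documents, invented for this sketch.
books = [
    {"title": "Book C", "year": 1936},
    {"title": "Book A", "year": 1886},
    {"title": "Book D", "year": 1961},
    {"title": "Book B", "year": 1929},
]

result = first_per_bucket(books, field="year", interval=100)
bucket_keys = list(result.keys())
first_1900 = result[1900][0]["title"]
```

With these placeholder documents, the 1900 bucket contains three books, and the ascending sort on year means the earliest one is the single hit that is kept.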

Additional parameters

However, sorting and field inclusion are not everything that we can do inside the top_hits aggregation. Elasticsearch allows us to use several other features related to document retrieval. We don't want to discuss them all in detail because you should be familiar with most of them if you are familiar with the Elasticsearch aggregation module. However, for the purpose of this chapter, let's look at the following example:

curl -XGET 'http://127.0.0.1:9200/library/_search?pretty' -d '{
   "query": {
      "filtered": {
         "query": {
            "match": {
               "_all": "quiet"
            }
         },
         "filter": {
            "term": {
               "copies": 1,
               "_name": "copies_filter"
            }
         }
      }
   },
   "size": 0,
   "aggs": {
      "when": {
         "histogram": {
            "field": "year",
            "interval": 100
         },
         "aggs": {
            "book": {
               "top_hits": {
                  "highlight": {
                     "fields": {
                        "title": {}
                     }
                  },
                  "explain": true,
                  "version": true,
                  "_source": {
                     "include": [
                        "title",
                        "available"
                     ]
                  },
                  "fielddata_fields" : ["title"],
                  "script_fields": {
                     "century": {
                        "script": "(doc[\"year\"].value / 100).intValue()"
                     }
                  },
                  "size": 1
               }
            }
         }
      }
   }
}'

As you can see, our query uses the following features:

  • Named filters and queries (in our example the filter is named copies_filter)
  • Document version inclusion
  • Document source filtering (choosing fields that should be returned)
  • Using field-data fields and script fields
  • Inclusion of explain information that tells us why a given document was matched and included
  • Highlighting usage
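
One practical detail of the query above is that the script field value itself contains double quotes, and these must be escaped when they appear inside a JSON string. A way to avoid escaping by hand, sketched here as a suggestion rather than as the book's method, is to build the request body as a data structure and let a JSON encoder serialize it. The Python snippet below does this for the top_hits part of the query only; it does not contact any server, and the variable names are our own.

```python
import json

# The script text exactly as Elasticsearch should receive it,
# including the inner double quotes around the field name.
century_script = '(doc["year"].value / 100).intValue()'

# The top_hits options from the query above, expressed as a Python dict.
top_hits_body = {
    "highlight": {"fields": {"title": {}}},
    "explain": True,
    "version": True,
    "_source": {"include": ["title", "available"]},
    "fielddata_fields": ["title"],
    "script_fields": {"century": {"script": century_script}},
    "size": 1,
}

# json.dumps escapes the inner quotes, producing valid JSON.
encoded = json.dumps(top_hits_body, indent=2)

# Decoding gives back the original structure, script text included.
decoded = json.loads(encoded)
```

The encoded text can then be embedded in the full request body under the top_hits key, with the surrounding query and histogram sections unchanged.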