Aggregation types

Elasticsearch 2.x allows us to use three types of aggregation: metrics, buckets, and pipeline. The metrics aggregations return a metric, just like the stats aggregation we used for the stats field. The bucket aggregations return buckets, the key and the number of documents sharing the same values, ranges, and so on, just like the terms aggregation we used for the copies field. Finally, the pipeline aggregations introduced in Elasticsearch 2.0 aggregate the output of the other aggregations and their metrics, which allows us to do even more sophisticated data analysis. Knowing all that, let's now look at all the aggregations we can use in Elasticsearch 2.x.

Metrics aggregations

We will start with the metrics aggregations, which aggregate values from documents into a single metric. This is always the case with metrics aggregations – you can expect them to return a single metric calculated on the basis of the data. Let's now take a look at the metrics aggregations available in Elasticsearch 2.x.

Minimum, maximum, average, and sum

The first group of metrics aggregations that we want to show you is the one that calculates the basic value from the given documents. These aggregations are:

  • min: This calculates the minimum value from the given numeric field in the returned documents
  • max: This calculates the maximum value from the given numeric field in the returned documents
  • avg: This calculates an average from the given numeric field in the returned documents
  • sum: This calculates the sum from the given numeric field in the returned documents

As you can see, the aggregations mentioned above are pretty self-explanatory. So, let's try to calculate the average value on our data. For example, let's assume that we want to calculate the average number of copies for our books. The query to do that will look as follows:

{
 "aggs" : {
  "avg_copies" : {
   "avg" : {
    "field" : "copies"
   }
  }
 }
}

The results returned by Elasticsearch after running the preceding query will be as follows:

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "avg_copies" : {
      "value" : 1.75
    }
  }
}

So, we have an average of 1.75 copies per book. It is very easy to calculate – (6 + 0 + 1 + 0) / 4 is equal to 1.75. Seems that we got it right.

Missing values

The nice thing about the previously mentioned aggregations is that we can control what value Elasticsearch should use if the field we've specified doesn't have a value in a document. For example, if we wanted Elasticsearch to use 0 as the value for the copies field in our previous example, we would add the missing property to our query and set it to 0. For example:

{
 "aggs" : {
  "avg_copies" : {
   "avg" : {
    "field" : "copies",
    "missing" : 0
   }
  }
 }
}

Using scripts

The input values can also be generated by a script. For example, if we want to find the minimum value from all the values in the year field, but we want to subtract 1000 from those values, we will send an aggregation like the following one:

{
 "aggs": {
  "min_year": {
   "min": {
    "script": "doc['year'].value - 1000"
   }
  }
 }
}

Note

Note that the preceding query requires inline scripts to be allowed. This means that the query requires the script.inline property set to on in the elasticsearch.yml file.

In this case, the value the aggregations will use will be the original year field value reduced by 1000.

We can also use the value script capabilities of Elasticsearch. For example, to achieve the same as the previous script, we can use the following query:

{
 "aggs": {
  "min_year": {
   "min": {
    "field" : "year",
    "script" : {
     "inline" : "_value - factor",
     "params" : {
      "factor" : 1000
     }
    }
   }
  }
 }
}

If you are not familiar with Elasticsearch scripting capabilities, you can read more about it in the Scripting capabilities of Elasticsearch section of Chapter 6, Make Your Search Better.

One thing worth remembering is that using the command line may require proper escaping of the values in the doc array. For example, the command that executes the first scripted query would look as follows:

curl -XGET 'localhost:9200/library/_search?size=0&pretty' -d '{
 "aggs": {
  "min_year": {
   "min": {
    "script": "doc["year"].value - 1000"
   }
  }
 }
}'

Field value statistics and extended statistics

The next aggregations we will discuss are the ones that provide us with the statistical information about the numeric field we are running the aggregation on: the stats and extended_stats aggregations.

For example, the following query provides extended statistics for the year field:

{
 "aggs" : {
  "extended_statistics" : {
   "extended_stats" : {
    "field" : "year"
   }
  }
 }
}

The response to the preceding query will be as follows:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "extended_statistics" : {
      "count" : 4,
      "min" : 1886.0,
      "max" : 1961.0,
      "avg" : 1928.0,
      "sum" : 7712.0,
      "sum_of_squares" : 1.4871654E7,
      "variance" : 729.5,
      "std_deviation" : 27.00925767213901,
      "std_deviation_bounds" : {
        "upper" : 1982.018515344278,
        "lower" : 1873.981484655722
      }
    }
  }
}

As you can see, in the response we got information about the number of documents with value in the year field, the minimum value, the maximum value, the average, and the sum. These are the values that we will get if we run the stats aggregation instead of extended_stats. The extended_stats aggregation provides additional information, such as the sum of squares, variance, and standard deviation. Elasticsearch provides two types of aggregations because extended_stats is slightly more expensive when it comes to processing power.
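
For illustration (this exact query does not appear elsewhere in this chapter), the stats variant of the preceding request differs only in the aggregation type and returns just the count, minimum, maximum, average, and sum:

{
 "aggs" : {
  "statistics" : {
   "stats" : {
    "field" : "year"
   }
  }
 }
}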

Note

The stats and extended_stats aggregations, similar to the min, max, avg, and sum aggregations, support scripting and allow us to specify which value should be used for the fields that don't have value in the specified field.

Value count

The value_count aggregation is a simple aggregation which allows counting values in aggregated documents. This is quite useful when used with nested aggregations. We are not focusing on that topic right now, but it is something to keep in mind. For example, to use the value_count aggregation on the copies field, we will run the following query:

{
 "aggs" : {
  "count" : {
   "value_count" : {
    "field" : "copies"
   }
  }
 }
}

Note

The value_count aggregation allows us to use scripts, as discussed earlier in this chapter when we described the min, max, avg, and sum aggregations. Please refer to the Metrics aggregations section earlier in this chapter for further reference.

Field cardinality

The cardinality aggregation calculates the count of distinct values in a given field and is one of the aggregations that allows us to control how resource-hungry it will be by controlling its precision. However, one thing needs to be remembered: the calculated count is an approximation, not an exact value. Elasticsearch uses the HyperLogLog++ algorithm (http://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/40671.pdf) to calculate the value.

This aggregation has a wide variety of use cases, such as showing the number of distinct values in a field that holds the status codes of your indexed Apache access logs. One query, and you know the approximate count of the distinct values in that field.

For example, we can request the cardinality for our title field:

{
 "aggs" : {
  "card_title" : {
   "cardinality" : {
    "field" : "title"
   }
  }
 }
}

To control the precision of the cardinality calculation, we can specify the precision_threshold property – the higher the value, the more precise the aggregation will be and the more resources it will need. The current maximum precision_threshold value is 40000 and the default depends on the parent aggregation. An example query using the precision_threshold property looks as follows:

{
 "aggs" : {
  "card_title" : {
   "cardinality" : {
    "field" : "title",
    "precision_threshold" : 1000
   }
  }
 }
}

Percentiles

The percentiles aggregation is another aggregation in Elasticsearch that uses an algorithmic approximation approach to provide us with results. It uses the T-Digest algorithm (https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf) from Ted Dunning and Otmar Ertl and allows us to calculate percentiles: metrics that show the value below which a given percentage of the observed values fall. For example, the 99th percentile shows us the value that is greater than 99 percent of the other values.

Let's go into an example and look at a query that will calculate percentiles for the year field in our data:

{
 "aggs" : {
  "copies_percentiles" : {
   "percentiles" : {
    "field" : "year"
   }
  }
 }
}

The results returned by Elasticsearch for the preceding request will look as follows:

{
  "took" : 26,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "copies_percentiles" : {
      "values" : {
        "1.0" : 1887.2899999999997,
        "5.0" : 1892.4499999999998,
        "25.0" : 1918.25,
        "50.0" : 1932.5,
        "75.0" : 1942.25,
        "95.0" : 1957.25,
        "99.0" : 1960.25
      }
    }
  }
}

As you can see, the value that is higher than 99 percent of the values is 1960.25.

You may wonder why such an aggregation is important. It is very useful for performance metrics, where we usually look at averages for some period of time. Imagine that the average response time of our queries for the last hour is 50 milliseconds, which is not bad. However, if the 95th percentile showed 2 seconds, that would mean that about 5 percent of the users had to wait two or more seconds for the search results, which is not that good.

By default, the percentiles aggregation calculates seven percentiles: 1, 5, 25, 50, 75, 95, and 99. We can control this by using the percents property and specify which percentiles we are interested in. For example, if we want to get only the 95th and the 99th percentile, we change our query to the following one:

{
 "aggs" : {
  "copies_percentiles" : {
   "percentiles" : {
    "field" : "year",
    "percents" : [ "95", "99" ]
   }
  }
 }
}

Note

Similar to the min, max, avg, and sum aggregations, the percentiles aggregation supports scripting and allows us to specify which value should be used for the fields that don't have value in the specified field.

We've mentioned earlier that the percentiles aggregation uses an algorithmic approach and is an approximation. As with all approximations, we can control the precision and memory usage of the algorithm. We do that by using the compression property, which defaults to 100. It is an internal property of Elasticsearch and its implementation details may change between versions. It is worth knowing that setting the compression property to a value higher than 100 increases the algorithm's precision at the cost of memory usage.
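
As a sketch only – assuming the compression property is accepted directly in the percentiles body, as the description above suggests – such a query could look as follows:

{
 "aggs" : {
  "copies_percentiles" : {
   "percentiles" : {
    "field" : "year",
    "compression" : 200
   }
  }
 }
}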

Percentile ranks

The percentile_ranks aggregation is similar to the percentiles one that we just discussed, but it works the other way round: it shows which percentile a given value falls into. For example, to check which percentiles the years 1932 and 1960 fall into, we run the following query:

{
 "aggs" : {
  "copies_percentile_ranks" : {
   "percentile_ranks" : {
    "field" : "year",
    "values" : [ "1932", "1960" ]
   }
  }
 }
}

The response returned by Elasticsearch will be as follows:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "copies_percentile_ranks" : {
      "values" : {
        "1932.0" : 49.5,
        "1960.0" : 61.5
      }
    }
  }
}

Top hits aggregation

The top_hits aggregation keeps track of the most relevant document being aggregated. This doesn't sound very appealing, but it allows us to implement one of the most desired functionalities in Elasticsearch, called document grouping, field collapsing, or document folding. Such functionality is very useful in some use cases – for example, when we want to show a book catalog, but display only one book from each publisher. To do that without the top_hits aggregation, we would have to run multiple queries. With the top_hits aggregation, we need only a single query.

The top_hits aggregation was introduced in Elasticsearch 1.3. In fact, the mentioned document folding is more or less a side effect and only one of the possible usage examples of the top_hits aggregation.

The idea behind the top_hits aggregation is simple. Every document that is assigned to a particular bucket can also be remembered. By default, only three documents per bucket are remembered.

Note

Note that, in order to show the full potential of the top_hits aggregation, we decided to use one of the bucketing aggregations as well and nest them to show the document grouping functionality implementation. The bucketing aggregations are described in detail later in this chapter.

To show you a potential use case that leverages the top_hits aggregation, we have decided to use a simple example. We would like to get the most relevant book published every 100 years. To do that we use the following query:

{
 "aggs": {
  "when": {
   "histogram": {
    "field": "year",
    "interval": 100
   },
   "aggs": {
    "book": {
     "top_hits": {
      "_source": {
       "include": [ "title", "available" ]
      },
      "size": 1
     }
    }
   }
  }
 }
}

In the preceding example, we ran the histogram aggregation on year ranges. A bucket was created for every one hundred years. The nested top_hits aggregation remembers the single highest-scoring document from each bucket (because the size property is set to 1). We added the include option only to simplify the results, so that only the title and available fields are returned for every aggregated document. The response returned by Elasticsearch will be as follows:

{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "when" : {
      "buckets" : [ {
        "key" : 1800,
        "doc_count" : 1,
        "book" : {
          "hits" : {
            "total" : 1,
            "max_score" : 1.0,
            "hits" : [ {
              "_index" : "library",
              "_type" : "book",
              "_id" : "4",
              "_score" : 1.0,
              "_source" : {
                "available" : true,
                "title" : "Crime and Punishment"
              }
            } ]
          }
        }
      }, {
        "key" : 1900,
        "doc_count" : 3,
        "book" : {
          "hits" : {
            "total" : 3,
            "max_score" : 1.0,
            "hits" : [ {
              "_index" : "library",
              "_type" : "book",
              "_id" : "2",
              "_score" : 1.0,
              "_source" : {
                "available" : false,
                "title" : "Catch-22"
              }
            } ]
          }
        }
      } ]
    }
  }
}

We can see that, because of the top_hits aggregation, we have the highest-scoring document (from each bucket) included in the response. In our particular case, the query was the match_all one and all the documents had the same score, so the top-scoring document for every bucket was more or less random. However, you need to remember that this is the default behavior. If we want custom sorting, this is not a problem for Elasticsearch. We just need to add the sort property to our top_hits aggregator. For example, we can return the earliest book from each century:

{
 "aggs": {
  "when": {
   "histogram": {
    "field": "year",
    "interval": 100
   },
   "aggs": {
    "book": {
     "top_hits": {
      "sort": {
       "year": "asc"
      },
      "_source": {
       "include": [ "title", "available" ]
      },
      "size": 1
     }
    }
   }
  }
 }
}

We added sorting to the top_hits aggregation, so the results are sorted on the basis of the year field. This means that the first document will be the one with the lowest value in that field and this is the document that is going to be returned for each bucket.

Additional parameters

Sorting and field inclusion are not everything that we can do inside the top_hits aggregation. Because this aggregation returns documents, we can also use functionalities such as:

  • highlighting
  • explain
  • scripting
  • fielddata fields (the uninverted representation of field values)
  • version

We just need to include an appropriate section in the top_hits aggregation body, similar to what we do when we construct a query. For example:

{
 "aggs": {
  "when": {
   "histogram": {
    "field": "year",
    "interval": 100
   },
   "aggs": {
    "book": {
     "top_hits": {
      "highlight": {
       "fields": {
       "title": {}
       }
      },
      "explain": true,
      "version": true,
      "_source": {
       "include": [ "title", "available" ]
      },
      "fielddata_fields" : ["title"],
      "script_fields": {
       "century": {
        "script": "(doc["year"].value / 100).intValue()"
       }
      },
      "size": 1
     }
    }
   }
  }
 }
}

Note

Note that the preceding query requires the inline scripts to be allowed. This means that the query requires the script.inline property set to on in the elasticsearch.yml file.

Geo bounds aggregation

The geo_bounds aggregation is a simple aggregation that allows us to compute the bounding box that includes all the geo_point type field values from the aggregated documents.

Note

If you are interested in spatial searches, the section dedicated to it is called Geo and is included in Chapter 8, Beyond Full-text Searching.

We only need to provide the field (using the field property; it needs to be of the geo_point type). We can also provide wrap_longitude (values true or false; it defaults to true) to specify whether the bounding box is allowed to overlap the international date line. In response, we get the latitude and longitude of the top-left and bottom-right corners of the bounding box. An example query using this aggregation looks as follows (using the hypothetical location field):

{
 "aggs" : {
  "box" : {
   "geo_bounds" : {
    "field" : "location"
   }
  }
 }
}

Scripted metrics aggregation

The last metric aggregation we want to discuss is the scripted_metric aggregation, which allows us to define our own aggregation calculation using scripts. For this aggregation, we can provide the following scripts (map_script is the only required one, the rest are optional):

  • init_script: This script is run during initialization and allows us to set up an initial state of the calculation.
  • map_script: This is the only required script. It is executed once for every collected document and stores its calculation state in an object called _agg.
  • combine_script: This script is executed once on each shard after Elasticsearch finishes document collection on that shard.
  • reduce_script: This script is executed once on the node that is coordinating a particular query execution. This script has access to the _aggs variable, which is an array of the values returned by combine_script.

For example, we can use the scripted_metric aggregation to calculate all the copies of all the books we have in our library by running the following request (we show the whole request to show how the names are escaped):

curl -XGET 'localhost:9200/library/_search?size=0&pretty' -d '{
 "aggs" : {
  "all_copies" : {
   "scripted_metric" : {
    "init_script" : "_agg["all_copies"] = 0",
    "map_script" : "_agg.all_copies += doc.copies.value",
    "combine_script" : "return _agg.all_copies",
    "reduce_script" : "sum = 0; for (number in _aggs) { sum += number }; return sum"
   }
  }
 }
}'

Of course, the preceding script is just a simple sum and we could use sum aggregation, but we just wanted to show you a simple example of what you can do with the scripted_metric aggregation.

Note

Note that the preceding query requires inline scripts to be allowed. This means that the query requires the script.inline property set to on in the elasticsearch.yml file.

As you can see, the init_script part of the aggregation is used to initialize the all_copies variable. Next, we have map_script, which is executed once for every document and we just add the value of the copies field to the earlier initialized variable. The combine_script part, executed once on each shard, tells Elasticsearch to return the calculated variable. Finally, the reduce_script part, executed once for the whole query on the aggregator node, will run a for loop, which will go through all the returned values that are stored in the _aggs array and return the sum of those. The final result returned by Elasticsearch for the preceding query looks as follows:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "all_copies" : {
      "value" : 7
    }
  }
}

Buckets aggregations

The second type of aggregations that we will discuss is the buckets aggregations. In comparison to metrics aggregations, bucket aggregations return data not as a single metric, but as key-value pairs called buckets. For example, the terms aggregation returns the number of documents associated with each term in a given field. The very powerful thing about buckets aggregations is that they can have sub-aggregations, which means that we can nest other aggregations inside the aggregations that return buckets (we will discuss this at the end of the buckets aggregations discussion). Let's now look at the bucket aggregations provided by Elasticsearch.

Filter aggregation

The filter aggregation is a simple bucketing aggregation that allows us to filter the results to a single bucket. For example, let's assume that we want to get a count and the average copies count of all the books that are novels, which means they have the term novel in the tags field. The query that will return such results looks as follows:

{
 "aggs" : {
  "novels_count" : {
   "filter" : {
    "term": {
     "tags": "novel"
    }
   },
   "aggs" : {
    "avg_copies" : {
     "avg" : {
      "field" : "copies"
     }
    }
   }
  }
 }
}

As you can see, we defined the filter in the filter section of the aggregation definition and we defined a second nested aggregation. The nested aggregation is the one that will be run on the filtered documents.

The response returned by Elasticsearch looks as follows:

{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "novels_count" : {
      "doc_count" : 2,
      "avg_copies" : {
        "value" : 3.5
      }
    }
  }
}

In the returned bucket, we have information about the number of documents (represented by the doc_count property) and the average number of copies, which is all we wanted.

Filters aggregation

The second bucket aggregation we want to show you is the filters aggregation. While the previously discussed filter aggregation resulted in a single bucket, the filters aggregation returns multiple buckets – one for each of the defined filters. Let's extend our previous example and assume that, in addition to the average number of copies for the novels, we also want to know the average number of copies for the books that are available. The query that will get us this information will use the filters aggregation and will look as follows:

{
 "aggs" : {
  "count" : {
   "filters" : {
    "filters" : {
     "novels" : {
      "term" : {
       "tags" : "novel"
      }
     },
     "available" : {
      "term" : {
       "available" : true
      }
     }
    }
   },
   "aggs" : {
    "avg_copies" : {
     "avg" : {
      "field" : "copies"
     }
    }
   }
  }
 }
}

Let's stop here and look at the definition of the aggregation. As you can see, we defined two filters using the filters section of the filters aggregation. Each filter has a name and the actual Elasticsearch filter; the first is called novels and the second is called available. Elasticsearch will use these names in the returned response. The thing to remember is that Elasticsearch will create a bucket for each defined filter and will calculate the nested aggregation that we defined – in our case, the one that calculates the average number of copies.

Note

The filters aggregation allows us to return one more bucket in addition to the defined ones – a bucket with all the documents that didn't match the filters. In order to calculate such a bucket, we need to add the other_bucket property to the body of the aggregation and set it to true.

The results returned by Elasticsearch are as follows:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "count" : {
      "buckets" : {
        "novels" : {
          "doc_count" : 2,
          "avg_copies" : {
            "value" : 3.5
          }
        },
        "available" : {
          "doc_count" : 2,
          "avg_copies" : {
            "value" : 0.5
          }
        }
      }
    }
  }
}

As you can see, we got two buckets, which is what we expected.
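
As a sketch of the other_bucket option mentioned in the note above (it is not used in the original example), the same aggregation could be extended as follows; documents matching neither filter would then be counted in an additional bucket:

{
 "aggs" : {
  "count" : {
   "filters" : {
    "other_bucket" : true,
    "filters" : {
     "novels" : {
      "term" : { "tags" : "novel" }
     },
     "available" : {
      "term" : { "available" : true }
     }
    }
   }
  }
 }
}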

Terms aggregation

One of the most commonly used bucket aggregations is the terms aggregation. It allows us to get information about the terms and the count of documents having those terms. For example, one of the simplest uses is getting the count of the books that are available and not available. We can do that by running the following query:

{
 "aggs" : {
  "counts" : {
   "terms" : {
    "field" : "available"
   }
  }
 }
}

In the response, we will get two buckets (because the Boolean field can only have two values – true and false). Here, this will look as follows:

{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "counts" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : 0,
        "key_as_string" : "false",
        "doc_count" : 2
      }, {
        "key" : 1,
        "key_as_string" : "true",
        "doc_count" : 2
      } ]
    }
  }
}

By default, the data is sorted on the basis of document count, which means that the most common terms will be placed on top of the aggregation results. Of course, we can control this behavior by specifying the order property and providing the order just like we usually do when sorting by arbitrary field values. Elasticsearch allows us to sort by the document count (using the _count static value) and by the term (using the _term static value). For example, if we want to sort our preceding aggregation results by descending term, we can run the following query:

{
 "aggs" : {
  "counts" : {
   "terms" : {
    "field" : "available",
    "order" : { "_term" : "desc" }   }
  }
 }
}

However, that's not all when it comes to sorting. We can also sort by the results of the nested aggregations that were included in the query.

Note

The terms aggregation, similar to the min, max, avg, and sum aggregations discussed in the Metrics aggregations section of this chapter, supports scripting and allows us to specify which value should be used for documents that don't have a value in the specified field.

Counts are approximate

The thing to remember when discussing the terms aggregation is that the counts are approximate. This is because each shard provides its own counts and returns that aggregated information to the coordinating node. The coordinating node combines the partial information it received and returns the final counts to the client. Because of that, depending on the data and how it is distributed between the shards, some information about the counts may be lost and the counts will not be exact. Of course, when dealing with low-cardinality fields the approximation will be closer to the exact numbers, but this is still something that should be considered when using the terms aggregation.

We can control how much information is returned from each of the shards to the coordinating node. We can do this by specifying the size and the shard_size properties. The size property specifies how many buckets will be returned at most. The higher the size property, the more accurate the calculation will be. However, that will cost us additional memory and CPU cycles, which means that the calculation will be more expensive and will put more pressure on the hardware. This is because the results returned to the coordinating node from each shard will be larger and the result merging process will be harder.

The shard_size property can be used to minimize the work that needs to be done by the coordinating node. When set, the coordinating node will fetch (from each shard) the number of buckets determined by the shard_size property. This allows us to increase the precision of the aggregation while avoiding the additional overhead on the coordinating node. Remember that the shard_size property cannot be smaller than the size property.
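
For example, a sketch of a terms aggregation on the tags field that returns at most 10 buckets while collecting the top 100 buckets from each shard could look as follows (the exact values are purely illustrative):

{
 "aggs" : {
  "tag_counts" : {
   "terms" : {
    "field" : "tags",
    "size" : 10,
    "shard_size" : 100
   }
  }
 }
}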

Finally, the size property can be set to 0, which will tell Elasticsearch not to limit the number of returned buckets. It is usually not wise to set the size property to 0 as it can result in high resource consumption. Also, avoid setting the size property to 0 for high cardinality fields as this will likely make your Elasticsearch cluster explode.

Minimum document count

Elasticsearch provides us with two additional properties, which can be useful in certain situations: min_doc_count and shard_min_doc_count. The min_doc_count property defaults to 1 and specifies how many documents must match a term to be included in the aggregation results. One thing to remember is that setting the min_doc_count property to 0 will result in returning all the terms, no matter if they have a matching document or not. This can result in a very large result set for aggregation results. For example, if we want to return terms matched by 5 or more documents, we will run the following query:

{
 "aggs" : {
  "counts" : {
   "terms" : {
    "field" : "available",
    "min_doc_count" : 5   }
  }
 }
}

The shard_min_doc_count property is very similar and defines how many documents must match a term to be included in the aggregation's results, but on the shard level.
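
A sketch combining both thresholds could look as follows (the values are illustrative): a term would have to match at least 2 documents on a shard to be returned from that shard, and at least 5 documents in total to appear in the final results:

{
 "aggs" : {
  "counts" : {
   "terms" : {
    "field" : "tags",
    "min_doc_count" : 5,
    "shard_min_doc_count" : 2
   }
  }
 }
}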

Range aggregation

The range aggregation allows us to define one or more ranges and Elasticsearch calculates buckets for them. For example, if we want to check how many books were published in a given period of time, we create the following query:

{
 "aggs": {
  "years": {
   "range": {
    "field": "year",
    "ranges": [
     { "to" : 1850 },
     { "from": 1851, "to": 1900 },
     { "from": 1901, "to": 1950 },
     { "from": 1951, "to": 2000 },
     { "from": 2001 }
    ]
   }
  }
 }
}

We specify the field we want the aggregation to be calculated on and the array of ranges. Each range is defined by one or both of the to and from properties, similar to the range queries we already discussed.

The result returned by Elasticsearch for our data looks as follows:

{
  "took" : 23,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "years" : {
      "buckets" : [ {
        "key" : "*-1850.0",
        "to" : 1850.0,
        "to_as_string" : "1850.0",
        "doc_count" : 0
      }, {
        "key" : "1851.0-1900.0",
        "from" : 1851.0,
        "from_as_string" : "1851.0",
        "to" : 1900.0,
        "to_as_string" : "1900.0",
        "doc_count" : 1
      }, {
        "key" : "1901.0-1950.0",
        "from" : 1901.0,
        "from_as_string" : "1901.0",
        "to" : 1950.0,
        "to_as_string" : "1950.0",
        "doc_count" : 2
      }, {
        "key" : "1951.0-2000.0",
        "from" : 1951.0,
        "from_as_string" : "1951.0",
        "to" : 2000.0,
        "to_as_string" : "2000.0",
        "doc_count" : 1
      }, {
        "key" : "2001.0-*",
        "from" : 2001.0,
        "from_as_string" : "2001.0",
        "doc_count" : 0
      } ]
    }
  }
}

For example, between 1901 and 1950 we had two books released.

Note

The range aggregation, similar to the min, max, avg, and sum aggregations discussed in the metrics aggregations section of this chapter, supports scripting and allows us to specify which value should be used for the fields that don't have a value in the specified field.

Keyed buckets

One thing we should mention when it comes to the range aggregation is that we can give the defined ranges names. For example, let's assume that we want to use the name Before 18th century for the books released before 1799, 18th century for the books released between 1800 and 1899, 19th century for the books released between 1900 and 1999, and After 19th century for the books released after 2000. We can do this by adding the key property to each defined range, giving it the name, and setting the keyed property to true. Setting the keyed property to true associates a unique string key with each bucket, and the key property defines the name that will be used for that bucket. A query that does that will look as follows:

{
 "aggs": {
  "years": {
   "range": {
    "field": "year",
    "keyed": true,
    "ranges": [
     { "key": "Before 18th century", "to": 1799 },
     { "key": "18th century", "from": 1800, "to": 1899 },
     { "key": "19th century", "from": 1900, "to": 1999 },
     { "key": "After 19th century", "from": 2000 }
    ]
   }
  }
 }
}

The response returned by Elasticsearch in such a case will look as follows:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "years" : {
      "buckets" : {
        "Before 18th century" : {
          "to" : 1799.0,
          "to_as_string" : "1799.0",
          "doc_count" : 0
        },
        "18th century" : {
          "from" : 1800.0,
          "from_as_string" : "1800.0",
          "to" : 1899.0,
          "to_as_string" : "1899.0",
          "doc_count" : 1
        },
        "19th century" : {
          "from" : 1900.0,
          "from_as_string" : "1900.0",
          "to" : 1999.0,
          "to_as_string" : "1999.0",
          "doc_count" : 3
        },
        "After 19th century" : {
          "from" : 2000.0,
          "from_as_string" : "2000.0",
          "doc_count" : 0
        }
      }
    }
  }
}

Note

An important and quite useful point about the range aggregation is that the defined ranges need not be disjoint. In such a case, Elasticsearch will properly count a document in multiple buckets.
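
As a quick illustration (this query is not part of the original examples), the following sketch defines overlapping ranges; a book published in 1886 would then be counted in both buckets:

{
 "aggs": {
  "years": {
   "range": {
    "field": "year",
    "ranges": [
     { "to" : 1900 },
     { "from": 1850, "to": 1950 }
    ]
   }
  }
 }
}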

Date range aggregation

The date_range aggregation is similar to the previously discussed range aggregation but it is designed for fields that use date-based types. However, in the library index, the documents have years, but the field is a number, not a date. For the purpose of showing how this aggregation works, let's imagine that we want to extend our library index to support newspapers. To do this we will create a new index (called library2) by using the following command:

curl -XPOST localhost:9200/_bulk --data-binary '{ "index": {"_index": "library2", "_type": "book", "_id": "1"}}
{ "title": "Fishing news", "published": "2010/12/03 10:00:00", "copies": 3, "available": true }
{ "index": {"_index": "library2", "_type": "book", "_id": "2"}}
{ "title": "Knitting magazine", "published": "2010/11/07 11:32:00", "copies": 1, "available": true }
{ "index": {"_index": "library2", "_type": "book", "_id": "3"}}
{ "title": "The guardian", "published": "2009/07/13 04:33:00", "copies": 0, "available": false }
{ "index": {"_index": "library2", "_type": "book", "_id": "4"}}
{ "title": "Hadoop World", "published": "2012/01/01 04:00:00", "copies": 6, "available": true }
'

For the purpose of this example, we will leave the mappings definition to Elasticsearch – the automatically created mappings are sufficient in this case. Let's start with the first query using the date_range aggregation:

{
 "aggs": {
  "years": {
   "date_range": {
    "field": "published",
    "ranges": [
     { "to" : "2009/12/31" },
     { "from": "2010/01/01", "to": "2010/12/31" },
     { "from": "2011/01/01" }
    ]
   }
  }
 }
}

Compared with the ordinary range aggregation, the only thing that changed is the aggregation type, which is now date_range. The dates can be passed as a string in a form recognized by Elasticsearch or as a number value (number of milliseconds since 1970-01-01). The response returned by Elasticsearch for the preceding query looks as follows:

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "years" : {
      "buckets" : [ {
        "key" : "*-2009/12/31 00:00:00",
        "to" : 1.2622176E12,
        "to_as_string" : "2009/12/31 00:00:00",
        "doc_count" : 1
      }, {
        "key" : "2010/01/01 00:00:00-2010/12/31 00:00:00",
        "from" : 1.262304E12,
        "from_as_string" : "2010/01/01 00:00:00",
        "to" : 1.2937536E12,
        "to_as_string" : "2010/12/31 00:00:00",
        "doc_count" : 2
      }, {
        "key" : "2011/01/01 00:00:00-*",
        "from" : 1.29384E12,
        "from_as_string" : "2011/01/01 00:00:00",
        "doc_count" : 1
      } ]
    }
  }
}

As you can see, the response is no different when compared to the response returned by the range aggregation. For each bucket we have the from and to attributes, which represent the number of milliseconds since 1970-01-01. The from_as_string and to_as_string properties present the same information, but in a human-readable form. Of course, the keyed parameter and the key property work in the date_range aggregation in the way already described.

Elasticsearch also allows us to define the format of the presented dates using the format attribute. In our previous example, the dates were presented with day and time resolution, which was more detail than we needed. If we want to show only the month names, we can send a query such as the following one:

{
 "aggs": {
  "years": {
   "date_range": {
    "field": "published",
    "format": "MMMM YYYY",
    "ranges": [
     { "to" : "December 2009" },
     { "from": "January 2010", "to": "December 2010" },
     { "from": "January 2011" }
    ]
   }
  }
 }
}

Note that the dates in the to and from parameters also need to be provided in the specified format. One of the returned ranges looks as follows:

{
 "key" : "January 2010-December 2010",
 "from" : 1.262304E12,
 "from_as_string" : "January 2010",
 "to" : 1.2911616E12,
 "to_as_string" : "December 2010",
 "doc_count" : 1
}

Note

The available formats we can use in format are defined in the Joda Time library. The full list is available at http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html.

There is one more thing about the date_range aggregation that we want to mention. Sometimes we may want to build an aggregation that changes with time. For example, we may want to see how many newspapers were published in the last 3, 6, 9, and 12 months. This is possible without the need to adjust the query every time, as we can use expressions such as now-9M. The following example shows this:

{
 "aggs": {
  "years": {
   "date_range": {
    "field": "published",
    "format": "dd-MM-YYYY",
    "ranges": [
     { "to" : "now-9M/M"  },
     { "to" : "now-9M"  },
     { "from": "now-6M/M", "to": "now-9M/M" },
     { "from": "now-3M/M" }
    ]
   }
  }
 }
}

The key here is expressions such as now-9M. Elasticsearch does the date math and generates the appropriate value. You can use y (year), M (month), w (week), d (day), h (hour), m (minute), and s (second). For example, the expression now+3d means three days from now. The /M in our example rounds the date down to months, so we only count full months. The second advantage of rounding is that the calculated date is more cache-friendly – without rounding, the date changes every millisecond, which makes any cache based on such a range basically useless.

IPv4 range aggregation

A very interesting aggregation is the ip_range one, as it works on Internet addresses. It works on fields defined with the ip type and allows us to define ranges either by giving both ends explicitly or by using CIDR notation (http://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing). An example usage of the ip_range aggregation looks as follows:

{
 "aggs": {
  "access": {
   "ip_range": {
    "field": "ip",
    "ranges": [
     { "from": "192.168.0.1", "to": "192.168.0.254" },
     { "mask": "192.168.1.0/24" }
    ]
   }
  }
 }
}

The response to the preceding query is as follows:

      "access": {
         "buckets": [
            {
               "from": 3232235521,
               "from_as_string": "192.168.0.1",
               "to": 3232235774,
               "to_as_string": "192.168.0.254",
               "doc_count": 0
            },
            {
               "key": "192.168.1.0/24",
               "from": 3232235776,
               "from_as_string": "192.168.1.0",
               "to": 3232236032,
               "to_as_string": "192.168.2.0",
               "doc_count": 4
            }
         ]
      }

Similar to the range aggregation, we can define each bracket either by providing both of its ends or by providing a mask. The rest is done by Elasticsearch itself.

Missing aggregation

The missing aggregation allows us to create a bucket and see how many documents have no value in a specified field. For example, we can check how many of our books in the library index don't have the original title defined – the otitle field. To do this, we run the following query:

{
 "aggs": {
  "missing_original_title": {
   "missing": {
    "field": "otitle"
   }
  }
 }
}

The response returned by Elasticsearch in this case will look as follows:

{
  "took" : 15,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "missing_original_title" : {
      "doc_count" : 2
    }
  }
}

As we can see, we have two documents without the otitle field.

Histogram aggregation

The histogram aggregation is an interesting one because of its automation. This aggregation defines buckets itself. We are only responsible for defining the field and the interval, and the rest is done automatically. The simplest form of a query that uses this aggregation looks as follows:

{
 "aggs": {
  "years": {
   "histogram": {
    "field" : "year",
    "interval": 100
   }
  }
 }
}

The new information we need to provide is interval, which defines the length of every range that will be used to create a bucket. We set the interval to 100, which in our case will result in buckets that are 100 years wide. The response to the preceding query sent to our library index is as follows:

{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "years" : {
      "buckets" : [ {
        "key" : 1800,
        "doc_count" : 1
      }, {
        "key" : 1900,
        "doc_count" : 3
      } ]
    }
  }
}

Similar to the range aggregation, the histogram aggregation allows us to use the keyed property to define named buckets. The other available option is min_doc_count, which allows us to specify the minimum number of documents required to create a bucket. If we set the min_doc_count property to zero, Elasticsearch will also include buckets with the document count of zero. We can also use the missing property to specify the value Elasticsearch should use when a document doesn't have a value in the specified field.
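
A sketch combining these options could look as follows (the missing value of 1900 is purely illustrative):

{
 "aggs": {
  "years": {
   "histogram": {
    "field" : "year",
    "interval": 100,
    "keyed": true,
    "min_doc_count" : 0,
    "missing" : 1900
   }
  }
 }
}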

Date histogram aggregation

As the date_range aggregation is a specialized form of the range aggregation, date_histogram is an extension of the histogram aggregation that works on dates. For the purpose of this example, we will again use the data we indexed when discussing the date_range aggregation. This means that we will run our queries against the index called library2. An example query using the date_histogram aggregation looks as follows:

{
 "aggs": {
  "years": {
   "date_histogram": {
    "field" : "published",
    "format" : "yyyy-MM-dd HH:mm",
    "interval" : "10d", 
    "min_doc_count" : 1   }
  }
 }
}

The difference between the histogram and date_histogram aggregations is the interval property. The value of this property is now a string describing the time interval, which in our case is 10 days. Of course, we can set it to anything we want. It uses the same suffixes we discussed when talking about the date math expressions in the date_range aggregation. It is worth mentioning that the number can be a float value. For example, 1.5m means that the length of the bucket will be one and a half minutes. The format attribute is the same as in the date_range aggregation. Thanks to it, Elasticsearch can add a human-readable date according to the defined format. Of course, the format attribute is not required, but it is useful. In addition to that, similar to the other range aggregations, the keyed and min_doc_count attributes still work.

Time zones

Elasticsearch stores all the dates in the UTC time zone. You can define the time zone to be used by Elasticsearch by using the time_zone attribute. By setting this property, we basically tell Elasticsearch which time zone should be used to perform the calculations. There are three notations with which to set this attribute (a short sketch follows the list below):

  • We can set the hours offset; for example, time_zone:5
  • We can use the time format; for example, time_zone:"-04:30"
  • We can use the name of the time zone; for example, time_zone:"Europe/Warsaw"

    Note

    Look at http://joda-time.sourceforge.net/timezones.html to see the available time zones.
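
For illustration, the earlier date_histogram query against the library2 index could specify a time zone as follows (the chosen zone is just an example):

{
 "aggs": {
  "years": {
   "date_histogram": {
    "field" : "published",
    "interval" : "10d",
    "time_zone" : "Europe/Warsaw"
   }
  }
 }
}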

Geo distance aggregations

The next two aggregations are connected with maps and spatial searches. We will talk about geo types and queries in the Elasticsearch spatial capabilities section of Chapter 8, Beyond Full-text Searching, so feel free to skip these two topics now and return to them later.

Look at the following query:

{
 "aggs": {
  "neighborhood": {
   "geo_distance": {
    "field": "location",
    "origin": [-0.1275, 51.507222],
    "ranges": [
     { "to": 1200 },
     { "from": 1201 }
    ]
   }
  }
 }
}

You can see that the query is similar to the range aggregation. The preceding aggregation will calculate the number of documents that fall into two buckets: one closer than 1200 km and the second one further than 1200 km from the geographical point defined by the origin property (in the preceding case, the origin is London). The aggregation section of the response returned by Elasticsearch looks as follows:

      "neighborhood": {
         "buckets": [
            {
               "key": "*-1200.0",
               "from": 0,
               "to": 1200,
               "doc_count": 1
            },
            {
               "key": "1201.0-*",
               "from": 1201,
               "doc_count": 4
            }
         ]
      }

The keyed and the key attributes work in the geo_distance aggregation as well, so we can easily modify the response to our needs and create named buckets.

The geo_distance aggregation supports a few additional parameters that are shown in the following query:

{
 "aggs": {
  "neighborhood": {
   "geo_distance": {
    "field": "location",
    "origin": { "lon": -0.1275, "lat": 51.507222},
    "unit": "m",
    "distance_type" : "plane",
    "ranges": [
     { "to": 1200 },
     { "from": 1201 }
    ]
   }
  }
 }
}

There are three things to note in the preceding query. The first change is how we defined the origin point. This time we specified the location by providing the latitude and longitude explicitly.

The second change is the unit attribute. It defines the units used in the ranges array. The possible values are: km (the default, kilometers), mi (miles), in (inches), yd (yards), m (meters), cm (centimeters), and mm (millimeters).

The last attribute, distance_type, specifies how Elasticsearch calculates the distance. The possible values are (from the fastest but least accurate to the slowest but the most accurate): plane, sloppy_arc (the default), and arc.

Geohash grid aggregation

The second aggregation related to geographical analysis is based on grids and is called geohash_grid. It organizes areas into grids and assigns every location to a cell in such a grid. To do this efficiently, Elasticsearch uses Geohash (http://en.wikipedia.org/wiki/Geohash), which encodes a location into a string. The longer the string is, the more accurate the description of a particular location. For example, a single character is enough to describe a box of about 5,000 by 5,000 kilometers, while five characters increase the accuracy to about 5 by 5 kilometers. Let's look at the following query:

{
 "aggs": {
  "neighborhood": {
   "geohash_grid": {
    "field": "location",
    "precision": 5
   }
  }
 }
}

We defined the geohash_grid aggregation with buckets that have a precision of about five kilometers (the precision attribute defines the number of characters used in the geohash string). The table of resolutions versus geohash length can be found at https://www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations-bucket-geohashgrid-aggregation.html.

Of course, the more accurate we want the aggregation to be, the more resources Elasticsearch will consume, because of the number of buckets that the aggregation has to calculate. By default, Elasticsearch does not generate more than 10,000 buckets. You can change this behavior by using the size attribute, but keep in mind that the performance may suffer for very wide queries consisting of thousands of buckets.
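
For example, a sketch that lowers this limit to 1,000 buckets (an illustrative value) would look as follows:

{
 "aggs": {
  "neighborhood": {
   "geohash_grid": {
    "field": "location",
    "precision": 5,
    "size": 1000
   }
  }
 }
}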

Global aggregation

The global aggregation is an aggregation that defines a single bucket containing all the documents from a given index and type, and it is not influenced by the query itself. The thing that differentiates the global aggregation from all the others is that the global aggregation has an empty body. For example, look at the following query:

{
 "query" : {
  "term" : {
   "available" : "true"
  }
 },
 "aggs": {
  "all_books" : {
   "global" : {}
  }
 }
}

In our library index, we only have two available books, but the response to the preceding query looks as follows:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "all_books" : {
      "doc_count" : 4
    }
  }
}

As you can see, the global aggregation is not bound by the query. Because the result of the global aggregation is a single bucket containing all the documents (not narrowed down by the query itself), it is a perfect candidate for use as a top-level parent aggregation for nesting aggregations.
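
For instance, a sketch that nests the avg aggregation used earlier under the global one would report the average number of copies across all the books, regardless of the query:

{
 "query" : {
  "term" : {
   "available" : "true"
  }
 },
 "aggs": {
  "all_books" : {
   "global" : {},
   "aggs" : {
    "avg_copies" : {
     "avg" : {
      "field" : "copies"
     }
    }
   }
  }
 }
}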

Significant terms aggregation

The significant_terms aggregation allows us to get the terms that are relevant and probably the most significant for a given query. The good thing is that it doesn't just show the top terms from the results of the given query, but the ones that appear to be the most significant for it.

The use cases for this aggregation type can vary from finding the most troublesome server working in your application environment, to suggesting nicknames from text. Whenever Elasticsearch sees a significant change in the popularity of a term, such a term is a candidate for being significant.

Note

Remember that the significant_terms aggregation is very expensive when it comes to resources and running against large indices. Work is being done to provide a lightweight version of that aggregation; as a result, the API for significant_terms aggregation may change in the future.

The best way to describe the significant_terms aggregation type is to use an example. Let's start with indexing 12 simple documents, which represent reviews of work done by interns:

curl -XPOST 'localhost:9200/interns/review/1' -d '{"intern" : "Richard", "grade" : "bad", "type" : "grade"}'
curl -XPOST 'localhost:9200/interns/review/2' -d '{"intern" : "Ralf", "grade" : "perfect", "type" : "grade"}'
curl -XPOST 'localhost:9200/interns/review/3' -d '{"intern" : "Richard", "grade" : "bad", "type" : "grade"}'
curl -XPOST 'localhost:9200/interns/review/4' -d '{"intern" : "Richard", "grade" : "bad", "type" : "review"}'
curl -XPOST 'localhost:9200/interns/review/5' -d '{"intern" : "Richard", "grade" : "good", "type" : "grade"}'
curl -XPOST 'localhost:9200/interns/review/6' -d '{"intern" : "Ralf", "grade" : "good", "type" : "grade"}'
curl -XPOST 'localhost:9200/interns/review/7' -d '{"intern" : "Ralf", "grade" : "perfect", "type" : "review"}'
curl -XPOST 'localhost:9200/interns/review/8' -d '{"intern" : "Richard", "grade" : "medium", "type" : "review"}'
curl -XPOST 'localhost:9200/interns/review/9' -d '{"intern" : "Monica", "grade" : "medium", "type" : "grade"}'
curl -XPOST 'localhost:9200/interns/review/10' -d '{"intern" : "Monica", "grade" : "medium", "type" : "grade"}'
curl -XPOST 'localhost:9200/interns/review/11' -d '{"intern" : "Ralf", "grade" : "good", "type" : "grade"}'
curl -XPOST 'localhost:9200/interns/review/12' -d '{"intern" : "Ralf", "grade" : "good", "type" : "grade"}'

Of course, to show the real power of the significant_terms aggregation, we should use a much larger data set. However, for the purpose of this book, we will concentrate on this example, so it is easier to illustrate how this aggregation works.

Now let's try finding the most significant grade for Richard. To do this we will use the following query:

curl -XGET 'localhost:9200/interns/_search?size=0&pretty' -d '{
 "query" : {
  "match" : {
   "intern" : "Richard"
  }
 },
 "aggregations" : {
  "description" : {
   "significant_terms" : {
    "field" : "grade"
   }
  }
 }
}'

The result of the preceding query looks as follows:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 5,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "description" : {
      "doc_count" : 5,
      "buckets" : [ {
        "key" : "bad",
        "doc_count" : 3,
        "score" : 0.84,
        "bg_count" : 3
      } ]
    }
  }
}

As you can see, for our query Elasticsearch informed us that the most significant grade for Richard is bad. Maybe it wasn't the best internship for him; who knows.

Choosing significant terms

To calculate significant terms, Elasticsearch looks for terms that report a significant change in their popularity between two sets of data: the foreground set and the background set. The foreground set is the data returned by our query, while the background set is the data in our index (or indices, depending on how we run our queries). If a term exists in 10 documents out of one million indexed, but appears in 5 of the 10 documents returned, then such a term is definitely significant and worth concentrating on.

Let's get back to our preceding example now and analyze it a bit. Richard got three different grades from the reviewers – bad three times, medium once, and good once. Of these, the bad grade appeared in three out of the five documents matching the query. In general, the bad grade appeared in three documents (the bg_count property) out of the 12 documents in the index (this is our background set), which gives us 25 percent of the indexed documents. On the other hand, the bad grade appeared in three out of the five documents matching the query (this is our foreground set), which gives us 60 percent of the documents. As you can see, the change in popularity is significant for the bad grade and that's why Elasticsearch has returned it in the significant_terms aggregation results.

Multiple value analysis

The significant_terms aggregation can be nested, which provides us with nice data analysis capabilities connecting multiple sets of data. For example, let's try to find a significant grade for each of the interns we have information about. To do this, we will nest the significant_terms aggregation inside the terms aggregation. The query that does that looks as follows:

curl -XGET 'localhost:9200/interns/_search?size=0&pretty' -d '{
 "aggregations" : {
  "grades" : {
   "terms" : {
    "field" : "intern"
   },
   "aggregations" : {
    "significantGrades" : {
     "significant_terms" : {
      "field" : "grade"
     }
    }
   }
  }
 }
}'

The results returned by Elasticsearch for the preceding query are as follows:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 12,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "grades" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : "ralf",
        "doc_count" : 5,
        "significantGrades" : {
          "doc_count" : 5,
          "buckets" : [ {
            "key" : "good",
            "doc_count" : 3,
            "score" : 0.48,
            "bg_count" : 4
          } ]
        }
      }, {
        "key" : "richard",
        "doc_count" : 5,
        "significantGrades" : {
          "doc_count" : 5,
          "buckets" : [ {
            "key" : "bad",
            "doc_count" : 3,
            "score" : 0.84,
            "bg_count" : 3
          } ]
        }
      }, {
        "key" : "monica",
        "doc_count" : 2,
        "significantGrades" : {
          "doc_count" : 2,
          "buckets" : [ ]
        }
      } ]
    }
  }
}

Sampler aggregation

The sampler aggregation is one of the experimental aggregations in Elasticsearch. It allows us to limit the processing of sub-aggregations to a sample of the top-scoring documents. This allows filtering out, and potentially removing, garbage from the data. It is a very good candidate for a top-level aggregation that limits the amount of data the significant_terms aggregation runs on. The simplest example of using this aggregation is as follows:

{
 "aggs": {
  "sampler_example" : {
   "sampler" : {
    "field" : "tags",
    "max_docs_per_value" : 1,
    "shard_size" : 10
   },
   "aggs" : {
    "best_terms" : {
     "terms" : {
      "field" : "title"
     }
    }
   }
  }
 }
}

To see the real power of sampling, we would have to play with it on a larger data set, but for now we will discuss the preceding example. The sampler aggregation was defined with three properties: field, max_docs_per_value, and shard_size. The first two properties allow us to control the diversity of the sampling. They tell Elasticsearch how many documents at most (the value of the max_docs_per_value property) can be collected on a shard with the same value in the defined field (the value of the field property).

The shard_size property tells Elasticsearch how many documents (at most) to collect from each shard.
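For example, to limit the amount of data that our earlier significant_terms aggregation works on, we could wrap it in a sampler aggregation. The following is only a sketch of the idea, run against the interns index; the shard_size value of 100 is just an assumption chosen for illustration:

curl -XGET 'localhost:9200/interns/_search?size=0&pretty' -d '{
 "query" : {
  "match" : {
   "intern" : "Richard"
  }
 },
 "aggregations" : {
  "sample" : {
   "sampler" : {
    "shard_size" : 100
   },
   "aggregations" : {
    "significant_grades" : {
     "significant_terms" : {
      "field" : "grade"
     }
    }
   }
  }
 }
}'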

Children aggregation

The children aggregation is a single-bucket aggregation that creates a bucket containing all the children of the specified type. Let's get back to the Using the parent-child relationship section in Chapter 5, Extending Your Index Structure, and use the shop index created there. To create a bucket of all the child documents of the variation type in the shop index, we run the following query:

{
 "aggs": {
  "variation_children" : {
   "children" : {
    "type" : "variation"
   }
  }
 }
}

The response returned by Elasticsearch is as follows:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "variation_children" : {
      "doc_count" : 2
    }
  }
}

Note

Because the children aggregation uses parent–child functionality, it relies on the _parent field, which needs to be present.
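The children aggregation usually becomes interesting when we add sub-aggregations to it, because they run in the context of the child documents. The following sketch groups the variation children by a size field; note that this field name is only an assumption here and has to match the mappings of the child type from Chapter 5:

{
 "aggs": {
  "variation_children" : {
   "children" : {
    "type" : "variation"
   },
   "aggs" : {
    "sizes" : {
     "terms" : {
      "field" : "size"
     }
    }
   }
  }
 }
}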

Nested aggregation

In the Using nested objects section of Chapter 5, Extending Your Index Structure, we learned about nested documents. Let's use that data to look into the next type of aggregation – the nested one. Let's create the simplest working query, which looks like this (we use the shop_nested index created in the mentioned chapter):

{
 "aggs": {
  "variations": {
   "nested": {
    "path": "variation"
   }
  }
 }
}

The preceding query is similar in structure to any other aggregation. However, instead of providing the name of the field on which the aggregation should be calculated, it contains a single parameter, path, which points to the nested documents. In the response, we get the number of nested documents:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "variations" : {
      "doc_count" : 2
    }
  }
}

The preceding response means that we have two nested documents in the index under the provided path, variation.
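Just like other single-bucket aggregations, the nested aggregation becomes really useful once we add sub-aggregations to it. The sub-aggregations run in the context of the nested documents, so the fields have to be referenced by their full path. A minimal sketch could look as follows:

{
 "aggs": {
  "variations": {
   "nested": {
    "path": "variation"
   },
   "aggs" : {
    "sizes" : {
     "terms" : {
      "field" : "variation.size"
     }
    }
   }
  }
 }
}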

Reverse nested aggregation

The reverse_nested aggregation is a special, single-bucket aggregation that allows aggregation on the parent documents from within the nested documents. Similar to the global aggregation, the reverse_nested aggregation doesn't take a body. Sounds quite complicated, but it is not. Let's look at the following query, which we run against the shop_nested index created in the Using nested objects section of Chapter 5, Extending Your Index Structure:

{
 "aggs": {
  "variations": {
   "nested": {
    "path": "variation"
   },
   "aggs" : {
    "sizes" : {
     "terms" : {
      "field" : "variation.size"
     },
     "aggs" : {
      "product_name_terms" : {
       "reverse_nested" : {},
       "aggs" : {
        "product_name_terms_per_size" : {
         "terms" : {
          "field" : "name"
         }
        }
       }
      }
     }
    }
   }
  }
 }
}

We start with the top-level aggregation, which is the same nested aggregation that we used when discussing the nested aggregation. However, we include a sub-aggregation that uses reverse_nested to be able to show terms from the name field of the parent documents for each size returned by the top-level nested aggregation. This is possible because, when the reverse_nested aggregation is used, Elasticsearch calculates the data on the basis of the parent documents instead of the nested documents.

Note

Remember that the reverse_nested aggregation must be used inside the nested aggregation.

The response to the preceding query will look as follows:

{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "variations" : {
      "doc_count" : 2,
      "sizes" : {
        "doc_count_error_upper_bound" : 0,
        "sum_other_doc_count" : 0,
        "buckets" : [ {
          "key" : "XL",
          "doc_count" : 1,
          "product_name_terms" : {
            "doc_count" : 1,
            "product_name_terms_per_size" : {
              "doc_count_error_upper_bound" : 0,
              "sum_other_doc_count" : 0,
              "buckets" : [ {
                "key" : "shirt",
                "doc_count" : 1
              }, {
                "key" : "test",
                "doc_count" : 1
              } ]
            }
          }
        }, {
          "key" : "XXL",
          "doc_count" : 1,
          "product_name_terms" : {
            "doc_count" : 1,
            "product_name_terms_per_size" : {
              "doc_count_error_upper_bound" : 0,
              "sum_other_doc_count" : 0,
              "buckets" : [ {
                "key" : "shirt",
                "doc_count" : 1
              }, {
                "key" : "test",
                "doc_count" : 1
              } ]
            }
          }
        } ]
      }
    }
  }
}

Nesting aggregations and ordering buckets

When talking about bucket aggregations, we need to get back to the topic of nesting aggregations. This is a very powerful technique, because it allows you to further process the data for the documents in each bucket. For example, the terms aggregation will return a bucket for each term, and a nested stats aggregation can show us statistics for the documents in each of those buckets. Let's look at the following query:

{
 "aggs": {
  "copies" : {
   "terms" : {
    "field" : "copies"
   },
   "aggs" : {
    "years" : {
     "stats" : {
      "field" : "year"
     }
    }
   }
  }
 }
}

This is an example of nested aggregations. The terms aggregation will return buckets for each term from the copies field (three buckets in the case of our data), and the stats aggregation will calculate statistics for the year field for the documents falling into each bucket returned by the top aggregation. The response from Elasticsearch for the preceding query looks as follows:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "copies" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : 0,
        "doc_count" : 2,
        "years" : {
          "count" : 2,
          "min" : 1886.0,
          "max" : 1936.0,
          "avg" : 1911.0,
          "sum" : 3822.0
        }
      }, {
        "key" : 1,
        "doc_count" : 1,
        "years" : {
          "count" : 1,
          "min" : 1929.0,
          "max" : 1929.0,
          "avg" : 1929.0,
          "sum" : 1929.0
        }
      }, {
        "key" : 6,
        "doc_count" : 1,
        "years" : {
          "count" : 1,
          "min" : 1961.0,
          "max" : 1961.0,
          "avg" : 1961.0,
          "sum" : 1961.0
        }
      } ]
    }
  }
}

This is a powerful feature that allows us to build very complex data processing pipelines. Of course, we are not limited to a single nested aggregation: we can nest multiple of them, and even nest an aggregation inside another nested aggregation. For example:

{
 "aggs": {
  "popular_tags" : {
   "terms" : {
    "field" : "copies"
   },
   "aggs" : {
    "years" : {
     "terms" : {
      "field" : "year"
     },
     "aggs" : {
      "available_by_year" : {
       "stats" : {
        "field" : "available"
       }
      }
     }
    },
    "available" : {
     "stats" : {
      "field" : "available"
     }
    }
   }
  }
 }
}

As you can see, the possibilities are almost unlimited, if you have enough memory and CPU power to handle very complicated aggregations.

Buckets ordering

There is one more feature related to nested aggregations: the ordering of aggregation results. Elasticsearch can use values from nested aggregations to sort the parent buckets. For example, let's look at the following query:

{
 "aggs": {
  "availability": {
   "terms": {
    "field": "copies",
    "order": { "numbers.avg": "desc" }
   },
   "aggs": {
    "numbers": { "stats" : { "field" : "year" } }
   }
  }
 }
}

In the preceding example, the order of the buckets in the availability aggregation is based on the average value calculated by the numbers aggregation (a stats aggregation, which needs a field to work on; here we use the year field). The numbers.avg notation is required in this case, because stats is a multi-valued aggregation that returns several metrics, and we are interested in the average. If it were a single-valued aggregation, such as sum, the name of the aggregation alone would be sufficient.
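To illustrate the single-valued case, a minimal sketch ordering the buckets by an avg sub-aggregation calculated on the year field could look as follows; the avg_year name is just one we chose for this example:

{
 "aggs": {
  "availability": {
   "terms": {
    "field": "copies",
    "order": { "avg_year": "desc" }
   },
   "aggs": {
    "avg_year": { "avg" : { "field" : "year" } }
   }
  }
 }
}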
