Similar to metric aggregations, bucket aggregations are also categorized into two forms: single-bucket aggregations, which contain only one bucket in the response, and multi-bucket aggregations, which contain more than one bucket in the response.
The following sections cover the most important aggregations used to create buckets. Bucket aggregation responses are formatted differently from metric aggregation responses; the response of a bucket aggregation usually comes in the following format:
"aggregations": {
  "aggregation_name": {
    "buckets": [
      {
        "key": value,
        "doc_count": value
      },
      ......
    ]
  }
}
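In Python, such a response is simply a nested dict, so the buckets can be walked generically. The following is a minimal sketch; the response dict here is hypothetical sample data, not output from a live cluster:

```python
# A sample bucket-aggregation response body (hypothetical data)
response = {
    "aggregations": {
        "aggregation_name": {
            "buckets": [
                {"key": "python", "doc_count": 12},
                {"key": "java", "doc_count": 7},
            ]
        }
    }
}

# Walk the buckets and collect (key, doc_count) pairs
buckets = response["aggregations"]["aggregation_name"]["buckets"]
pairs = [(b["key"], b["doc_count"]) for b in buckets]
print(pairs)  # [('python', 12), ('java', 7)]
```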
All the bucket aggregations can be created in Java using the AggregationBuilder and AggregationBuilders classes. You need to have the following classes imported inside your code for the same:

import org.elasticsearch.search.aggregations.AggregationBuilder;
import org.elasticsearch.search.aggregations.AggregationBuilders;
Also, all the aggregation queries can be executed with the following code snippet:

SearchResponse response = client.prepareSearch(indexName).setTypes(docType)
    .setQuery(QueryBuilders.matchAllQuery())
    .addAggregation(aggregation)
    .execute().actionGet();
The setQuery() method can take any type of Elasticsearch query, whereas the addAggregation() method takes the aggregation built using AggregationBuilder.
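The same scoping applies in Python: the aggregation runs only over the documents matched by the query portion of the request body. Here is a minimal sketch of combining a query with an aggregation; the match text is arbitrary, and the field paths follow the Twitter mapping used in this chapter:

```python
# Aggregation scoped to a query: only tweets matching "elasticsearch"
# in their text contribute to the terms buckets.
query = {
    "query": {
        "match": {"text": "elasticsearch"}
    },
    "aggs": {
        "top_hashtags": {
            "terms": {"field": "entities.hashtags.text"}
        }
    },
    "size": 0  # we only want the aggregation, not the search hits
}
```

This request body can be passed to es.search() just like the other examples in this section.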
Terms aggregation is the most widely used aggregation type. It returns buckets that are built dynamically, one per unique value of the field.
Let's see how to find the top 10 hashtags used in our Twitter index in descending order.
Python example
query = {
"aggs": {
"top_hashtags": {
"terms": {
"field": "entities.hashtags.text",
"size": 10,
"order": {
"_term": "desc"
}
}
}
}
}
In the preceding example, the size parameter controls how many buckets are to be returned (defaults to 10) and the order parameter controls the sorting of the bucket terms (defaults to asc):
res = es.search(index='twitter', doc_type='tweets', body=query)
The response would look like this:
"aggregations": {
  "top_hashtags": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 44,
    "buckets": [
      {
        "key": "politics",
        "doc_count": 2
      },
      ......
    ]
  }
}
Java example
Terms aggregation can be built as follows:
AggregationBuilder aggregation =
AggregationBuilders.terms("agg").field(fieldName)
.size(10);
Here, agg is the aggregation bucket name and fieldName is the field on which the aggregation is performed.
To parse the terms aggregation response, you need to import the following class:
import org.elasticsearch.search.aggregations.bucket.terms.Terms;
Then, the response can be parsed with the following code snippet:
Terms screen_names = response.getAggregations().get("agg");
for (Terms.Bucket entry : screen_names.getBuckets()) {
    entry.getKey();      // Term
    entry.getDocCount(); // Doc count
}
With range aggregation, a user can specify a set of ranges, where each range represents a bucket. Elasticsearch will put the document sets into the correct buckets by extracting the value from each document and matching it against the specified ranges.
Python example
query = {
    "aggs": {
        "status_count_ranges": {
            "range": {
                "field": "user.statuses_count",
                "ranges": [
                    {
                        "to": 50
                    },
                    {
                        "from": 50,
                        "to": 100
                    }
                ]
            }
        }
    },
    "size": 0
}
res = es.search(index='twitter', doc_type='tweets', body=query)
The response for the preceding query request would look like this:
"aggregations": {
  "status_count_ranges": {
    "buckets": [
      {
        "key": "*-50.0",
        "to": 50,
        "to_as_string": "50.0",
        "doc_count": 3
      },
      {
        "key": "50.0-100.0",
        "from": 50,
        "from_as_string": "50.0",
        "to": 100,
        "to_as_string": "100.0",
        "doc_count": 3
      }
    ]
  }
}
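Note that in a range bucket the from value is inclusive and the to value is exclusive. The following plain-Python sketch illustrates that bucketing rule (no cluster needed; the key format is a simplified approximation of what Elasticsearch returns):

```python
# Range buckets: "from" is inclusive, "to" is exclusive.
ranges = [
    {"to": 50},
    {"from": 50, "to": 100},
]

def bucket_key(value, ranges):
    """Return the key of the first range bucket the value falls into."""
    for r in ranges:
        lo = r.get("from", float("-inf"))
        hi = r.get("to", float("inf"))
        if lo <= value < hi:
            return "%s-%s" % (r.get("from", "*"), r.get("to", "*"))
    return None

print(bucket_key(49, ranges))   # '*-50'   (below the first "to")
print(bucket_key(50, ranges))   # '50-100' ("from" is inclusive)
```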
Building range aggregation:
AggregationBuilder aggregation =
    AggregationBuilders
        .range("agg")
        .field(fieldName)
        .addUnboundedTo(1)      // from -infinity to 1 (excluded)
        .addRange(1, 100)       // from 1 to 100 (excluded)
        .addUnboundedFrom(100); // from 100 to +infinity
Here, agg is the aggregation bucket name and fieldName is the field on which the aggregation is performed. The addUnboundedTo method is used when you do not specify the from parameter, and the addUnboundedFrom method is used when you do not specify the to parameter.
Parsing the response
To parse the range aggregation response, you need to import the following class:
import org.elasticsearch.search.aggregations.bucket.range.Range;
Then, the response can be parsed with the following code snippet:
Range agg = response.getAggregations().get("agg");
for (Range.Bucket entry : agg.getBuckets()) {
    String key = entry.getKeyAsString();    // Range as key
    Number from = (Number) entry.getFrom(); // Bucket from
    Number to = (Number) entry.getTo();     // Bucket to
    long docCount = entry.getDocCount();    // Doc count
}
The date range aggregation is dedicated to date fields and is similar to the range aggregation. The only difference between the two is that the latter allows you to use date math expressions inside the from and to fields. The following table shows examples of using date math operations in Elasticsearch. The supported time units for the math operations are: y (year), M (month), w (week), d (day), h (hour), m (minute), and s (second):
Operation | Description
---|---
now | Current time
now+1h | Current time plus 1 hour
now-1M | Current time minus 1 month
now+1h+1m | Current time plus 1 hour plus 1 minute
now+1h/d | Current time plus 1 hour, rounded to the nearest day
2016-01-01\|\|+1M/d | 2016-01-01 plus 1 month, rounded to the nearest day
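These date math expressions can be used directly inside the from and to fields of a date range aggregation. Here is a sketch of such a request body; the bucket name recent_tweets is arbitrary, and the field follows the Twitter mapping used in this chapter:

```python
# Tweets bucketed relative to the current time using date math:
# one bucket for tweets older than a month, one for the last month.
query = {
    "aggs": {
        "recent_tweets": {
            "date_range": {
                "field": "created_at",
                "ranges": [
                    {"to": "now-1M"},                 # older than one month
                    {"from": "now-1M", "to": "now"}   # within the last month
                ]
            }
        }
    },
    "size": 0
}
```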
Python example
query = {
    "aggs": {
        "tweets_creation_interval": {
            "date_range": {
                "field": "created_at",
                "format": "yyyy",
                "ranges": [
                    {
                        "to": "2000"
                    },
                    {
                        "from": "2000",
                        "to": "2005"
                    },
                    {
                        "from": "2005"
                    }
                ]
            }
        }
    },
    "size": 0
}
res = es.search(index='twitter', doc_type='tweets', body=query)
print res
Building date range aggregation:
AggregationBuilder aggregation =
AggregationBuilders
.dateRange("agg")
.field(fieldName)
.format("yyyy")
.addUnboundedTo("2000") // from -infinity to 2000 (excluded)
.addRange("2000", "2005") // from 2000 to 2005 (excluded)
.addUnboundedFrom("2005"); // from 2005 to +infinity
Here, agg is the aggregation bucket name and fieldName is the field on which the aggregation is performed. The addUnboundedTo method is used when you do not specify the from parameter, and the addUnboundedFrom method is used when you do not specify the to parameter.
Parsing the response:
To parse the date range aggregation response, you need to import the following classes:
import org.elasticsearch.search.aggregations.bucket.range.Range;
import org.joda.time.DateTime;
Then, the response can be parsed with the following code snippet:
Range agg = response.getAggregations().get("agg");
for (Range.Bucket entry : agg.getBuckets()) {
    String key = entry.getKeyAsString();              // Date range as key
    DateTime fromAsDate = (DateTime) entry.getFrom(); // Bucket from as a date
    DateTime toAsDate = (DateTime) entry.getTo();     // Bucket to as a date
    long docCount = entry.getDocCount();              // Doc count
}
A histogram aggregation works on numeric values extracted from documents and creates fixed-size buckets based on those values. Let's see an example of creating buckets of users' favorite tweet counts:
Python example
query = {
"aggs": {
"favorite_tweets": {
"histogram": {
"field": "user.favourites_count",
"interval": 20000
}
}
},"size": 0
}
res = es.search(index='twitter', doc_type='tweets', body=query)
for bucket in res['aggregations']['favorite_tweets']['buckets']:
print bucket['key'], bucket['doc_count']
The response for the preceding query will look like the following, which says that 114 users have favorite tweet counts between 0 and 20000, and 8 users have more than 20000:
"aggregations": {
  "favorite_tweets": {
    "buckets": [
      {
        "key": 0,
        "doc_count": 114
      },
      {
        "key": 20000,
        "doc_count": 8
      }
    ]
  }
}
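The bucket keys in this response follow the histogram bucketing rule: each value is placed in the bucket whose key is the value rounded down to the nearest multiple of the interval. A quick illustration in plain Python:

```python
def histogram_key(value, interval):
    # For non-negative values, Elasticsearch computes the bucket key as
    # the value rounded down to the nearest multiple of the interval:
    # key = value - (value % interval)
    return value - (value % interval)

interval = 20000
print(histogram_key(15000, interval))  # 0     -> first bucket
print(histogram_key(20000, interval))  # 20000 -> second bucket
print(histogram_key(35001, interval))  # 20000 -> also the second bucket
```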
Building histogram aggregation:
AggregationBuilder aggregation =
AggregationBuilders
.histogram("agg")
.field(fieldName)
.interval(5);
Here, agg is the aggregation bucket name and fieldName is the field on which the aggregation is performed. The interval method is used to pass the interval for generating the buckets.
Parsing the response:
To parse the histogram aggregation response, you need to import the following class:
import org.elasticsearch.search.aggregations.bucket.histogram.Histogram;
Then, the response can be parsed with the following code snippet:
Histogram agg = response.getAggregations().get("agg");
for (Histogram.Bucket entry : agg.getBuckets()) {
    Long key = (Long) entry.getKey();    // Key
    long docCount = entry.getDocCount(); // Doc count
}
A date histogram is similar to the histogram aggregation, but it can only be applied to date fields. The difference between the two is that the date histogram allows you to specify intervals using date/time expressions.
The following values can be used for intervals: year, quarter, month, week, day, hour, minute, and second. You can also specify time values, such as 1h (1 hour), 1m (1 minute), and so on.
Date histograms are mostly used to generate time-series graphs in many applications.
Python example
query = {
"aggs": {
"tweet_histogram": {
"date_histogram": {
"field": "created_at",
"interval": "hour"
}
}
}, "size": 0
}
The preceding aggregation will generate an hourly tweet timeline on the created_at field:
res = es.search(index='twitter', doc_type='tweets', body=query)
for bucket in res['aggregations']['tweet_histogram']['buckets']:
    print bucket['key'], bucket['key_as_string'], bucket['doc_count']
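Since date histograms are mostly used for time-series graphs, the buckets are typically flattened into (timestamp, count) pairs before plotting. A minimal sketch over a hypothetical response fragment (sample data, not from a live cluster):

```python
# Hypothetical date_histogram response fragment
res = {
    "aggregations": {
        "tweet_histogram": {
            "buckets": [
                {"key": 1462060800000,
                 "key_as_string": "2016-05-01T00:00:00.000Z",
                 "doc_count": 3},
                {"key": 1462064400000,
                 "key_as_string": "2016-05-01T01:00:00.000Z",
                 "doc_count": 5},
            ]
        }
    }
}

# Flatten the buckets into a time series ready for plotting
series = [(b["key_as_string"], b["doc_count"])
          for b in res["aggregations"]["tweet_histogram"]["buckets"]]
print(series)
```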
Building date histogram aggregation:
AggregationBuilder aggregation =
    AggregationBuilders
        .dateHistogram("agg")
        .field(fieldName)
        .interval(DateHistogramInterval.YEAR);
Here, agg is the aggregation bucket name and fieldName is the field on which the aggregation is performed. The interval method is used to pass the interval to generate buckets. For an interval in days, you can use DateHistogramInterval.days(10).
Parsing the response:
To parse the date histogram aggregation response, you need to import the following classes:

import org.elasticsearch.search.aggregations.bucket.histogram.Histogram;
import org.elasticsearch.search.aggregations.bucket.histogram.DateHistogramInterval;
import org.joda.time.DateTime;
The response can be parsed with this code snippet:
Histogram agg = response.getAggregations().get("agg");
for (Histogram.Bucket entry : agg.getBuckets()) {
    DateTime key = (DateTime) entry.getKey();      // Key
    String keyAsString = entry.getKeyAsString();   // Key as String
    long docCount = entry.getDocCount();           // Doc count
}
Elasticsearch allows filters to be used as aggregations too. Filters preserve their behavior in the aggregation context as well and are usually used to narrow down the current aggregation context to a specific set of documents. You can use any filter, such as range, term, geo, and so on.
To get the count of all the tweets done by the user d_bharvi, use the following code:
Python example
query = {
"aggs": {
"screename_filter": {
"filter": {
"term": {
"user.screen_name": "d_bharvi"
}
}
}
},"size": 0
}
In the preceding request, we have used a term filter to narrow down the bucket of tweets done by a particular user. Note that a filter aggregation creates a single bucket, so the response contains a doc_count directly rather than a buckets array:

res = es.search(index='twitter', doc_type='tweets', body=query)
print res['aggregations']['screename_filter']['doc_count']
The response would look like this:
"aggregations": {
  "screename_filter": {
    "doc_count": 100
  }
}
Building filter-based aggregation:
AggregationBuilder aggregation =
    AggregationBuilders
        .filter("agg")
        .filter(QueryBuilders.termQuery("user.screen_name", "d_bharvi"));
Here, agg is the aggregation bucket name passed to the first filter method, while the second filter method takes a query to apply as the filter.
Parsing the response:
To parse a filter-based aggregation response, you need to import the following class:

import org.elasticsearch.search.aggregations.bucket.filter.Filter;
The response can be parsed with the following code snippet:
Filter agg = response.getAggregations().get("agg");
agg.getDocCount(); // Doc count