Elasticsearch is a search engine at its core, but much of its usefulness comes from its ability to perform complex data analytics in a simple way. The volume of data is growing rapidly, and companies want to analyze it in real time. Whether the data is logs, a real-time stream, or a static dataset, Elasticsearch's aggregation capabilities make it easy to summarize.
In this chapter, we will cover the following topics:
The aggregation functionality is completely different from search and enables you to ask sophisticated questions of the data. The use cases of aggregation vary from building analytical reports to getting real-time analysis of data and taking quick actions.
Despite being different in functionality, aggregations can operate alongside the usual search requests. You can therefore search or filter your data and, in the same request, run aggregations on the documents matched by the search or filter criteria. A simple example is finding the maximum number of hashtags used in tweets that have crime in the text field. Aggregations let you calculate and summarize data about the current query on the fly, and they can be used for all sorts of tasks, from dynamically counting result values to building a histogram.
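As a minimal sketch of combining a query and an aggregation in one request, the following Python snippet builds such a request body as a plain dictionary. The field names text and hashtag_count and the aggregation name max_hashtags are assumptions made for illustration; they do not come from any particular index mapping.

```python
import json

# Hypothetical field names: "text" holds the tweet body and
# "hashtag_count" holds the number of hashtags in each tweet.
request_body = {
    # The query restricts the document set to tweets mentioning "crime".
    "query": {"match": {"text": "crime"}},
    # The aggregation runs only on the documents matched by the query.
    "aggs": {
        "max_hashtags": {"max": {"field": "hashtag_count"}}
    },
}

# Serialize the body as it would be sent to the search endpoint.
print(json.dumps(request_body, indent=2))
```

The serialized dictionary is what would be sent as the body of a search request; the query and the aggregation travel together, so no second round trip is needed.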
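Aggregations come in two flavors: metrics, which compute a value such as a sum or an average over a set of documents, and buckets, which group documents that meet some criterion.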
Elasticsearch offers a wide variety of buckets to categorize documents in many ways such as by days, age range, popular terms, or locations. However, all of them work on the same principle: document categorization based on some criteria.
The most interesting part is that bucket aggregations can be nested within each other. This means that a bucket can contain other buckets within it. Since each of the buckets defines a set of documents, one can create another aggregation on that bucket, which will be executed in the context of its parent bucket. For example, a country-wise bucket can include a state-wise bucket, which can further include a city-wise bucket.
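The idea of a bucket being computed in the context of its parent can be sketched in plain Python. This is only a conceptual model of nested bucketing, not how Elasticsearch implements it, and the sample documents, field names, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical documents; the field names are illustrative only.
docs = [
    {"country": "India", "state": "Karnataka", "city": "Bengaluru"},
    {"country": "India", "state": "Karnataka", "city": "Mysuru"},
    {"country": "India", "state": "Kerala", "city": "Kochi"},
    {"country": "USA", "state": "Texas", "city": "Austin"},
]


def bucket_by(documents, field):
    """Group documents into buckets keyed by the value of one field."""
    buckets = defaultdict(list)
    for doc in documents:
        buckets[doc[field]].append(doc)
    return dict(buckets)


def nested_buckets(documents, fields):
    """Bucket by the first field, then repeat inside each bucket."""
    if not fields:
        return {}
    result = {}
    for key, members in bucket_by(documents, fields[0]).items():
        bucket = {"doc_count": len(members)}
        # The child buckets only ever see the parent bucket's documents.
        children = nested_buckets(members, fields[1:])
        if children:
            bucket["buckets"] = children
        result[key] = bucket
    return result


tree = nested_buckets(docs, ["country", "state", "city"])
print(tree["India"]["doc_count"])                          # 3
print(tree["India"]["buckets"]["Karnataka"]["doc_count"])  # 2
```

Each level receives only the documents of the bucket above it, which is why the Karnataka count is taken over the three Indian documents rather than over the whole set.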
Aggregations use the following syntax:
"aggregations" : {
    "<aggregation_name>" : {
        "<aggregation_type>" : {
            <aggregation_body>
        }
        [,"aggregations" : { [<sub_aggregation>]+ } ]?
    }
    [,"<aggregation_name_2>" : { ... } ]*
}
Let's understand how the preceding structure works:
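aggregations: The aggregations object (which can also be written as aggs) in the preceding structure holds the aggregations that have to be computed. There can be more than one aggregation inside this object.
<aggregation_name>: A user-defined logical name for each aggregation (for example, avg_age). These logical names will also be used to uniquely identify the aggregations in the response.
<aggregation_type>: The type of the aggregation, such as terms, sum, avg, min, and so on.
<aggregation_body>: The body specific to the aggregation type (for example, an avg aggregation on a specific field will define the field on which the average will be calculated).
The following JSON shows a simpler view of the same structure: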
{ "aggs": { "NAME1": { "AGG_TYPE": {}, "aggs": { "NAME": { "AGG_TYPE": {} } } }, "NAME2": { "AGG_TYPE": {} } } }
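To make the placeholders concrete, here is a hedged sketch that fills in the skeleton with one possible set of values. The aggregation names (by_country, avg_age, max_age) and the fields (country, age) are invented for illustration.

```python
import json

# Hypothetical names and fields chosen for illustration only.
request_body = {
    "aggs": {
        # NAME1: a bucket aggregation with a nested sub-aggregation.
        "by_country": {
            "terms": {"field": "country"},
            "aggs": {
                # NAME: computed separately inside every country bucket.
                "avg_age": {"avg": {"field": "age"}}
            },
        },
        # NAME2: a sibling aggregation over the whole document set.
        "max_age": {"max": {"field": "age"}},
    }
}

# Serialize the body as it would be sent to the search endpoint.
print(json.dumps(request_body, indent=2))
```

The top-level names by_country and max_age are the keys under which the results would appear in the response.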
Aggregations typically work on the values extracted from the aggregated document set. These values can be extracted either from a specific field using the field key inside the aggregation body or can also be extracted using a script.
While it's easy to define a field to be used to aggregate data, the syntax of using scripts needs some special understanding. The benefit of using scripts is that one can combine the values from more than one field to use as a single value inside an aggregation.
The following are examples of extracting values using a script:
Extracting a value from a single field:
{ "script" : "doc['field_name'].value" }
Extracting and combining values from more than one field:
{ "script": "doc['author.first_name'].value + ' ' + doc['author.last_name'].value" }
Scripts also support parameters, which are passed through the params key. For example:
{ "avg": { "field": "price", "script": { "inline": "_value * correction", "params": { "correction": 1.5 } } } }
The preceding aggregation calculates the average price after multiplying each value of the price field by 1.5, which is passed to the inline script as the correction parameter.
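The arithmetic performed by this aggregation can be mirrored in a few lines of Python. The price values below are made up, and the snippet only imitates the effect of the script; it does not call Elasticsearch.

```python
# Hypothetical values of the price field in the matched documents.
prices = [10.0, 20.0, 30.0]
params = {"correction": 1.5}

# Mirror of the script "_value * correction": scale each value first...
corrected = [value * params["correction"] for value in prices]

# ...then average the scaled values, as the avg aggregation would.
avg_price = sum(corrected) / len(corrected)
print(avg_price)  # 30.0
```

Keeping the multiplier in params rather than hard-coding it in the script text means the same script can be reused with different correction values.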
By default, Elasticsearch computes aggregations over the complete set of documents using the match_all query, and returns 10 documents along with the aggregation results.
If you do not want to include the documents in the response, set the size parameter to 0 in your request. You do not need to use the from parameter in this case. Setting size to 0 is useful because it avoids document relevancy calculation and the inclusion of documents in the response, so only the aggregated data is returned.
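A request that returns only aggregation results might look like the following sketch. The aggregation name avg_price and the price field are assumptions for illustration.

```python
import json

# Hypothetical aggregation name and field, for illustration only.
request_body = {
    "size": 0,  # return no documents, only the aggregation results
    "query": {"match_all": {}},
    "aggs": {"avg_price": {"avg": {"field": "price"}}},
}

# Serialize the body as it would be sent to the search endpoint.
print(json.dumps(request_body))
```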