Chapter 7. Aggregations for Data Analysis

In the previous chapter, we returned to the querying side of Elasticsearch. We learned how the Lucene TF/IDF algorithm works and how to use Elasticsearch scripting capabilities. We handled multilingual data and influenced document scores with boosts. We used synonyms to match words that have the same meaning, and we used the Elasticsearch Explain API to see how document scores were calculated. By the end of this chapter, you will have learned the following topics:

  • What aggregations are
  • How the Elasticsearch aggregation engine works
  • How to use metrics aggregations
  • How to use buckets aggregations
  • How to use pipeline aggregations

Aggregations

Introduced in Elasticsearch 1.0, aggregations are the heart of data analytics in Elasticsearch. Highly flexible and performant, aggregations brought Elasticsearch 1.0 to a new position as a full-featured analysis engine. Extended through the life of Elasticsearch 1.x, in 2.x they are yet more powerful, less memory demanding, and faster. With this framework, you can use Elasticsearch as the analysis engine for data extraction and visualization. Let's see how that functionality works and what we can achieve by using it.

General query structure

To use aggregations, we need to add another section to our query. In general, queries with aggregations look like this:

{
   "query": { … },
   "aggs" : {
     "aggregation_name" : {
       "aggregation_type" : {
         ...
       }
     }
   }
}

In the aggs property (you can also spell it out as aggregations; aggs is just an abbreviation), you can define any number of aggregations. Each aggregation is defined by its name and by one of the aggregation types provided by Elasticsearch. Remember that the key defines the name of the aggregation; you will need it to tell particular aggregations apart in the server response. Let's take our library index and create our first query that uses aggregations. A command sending such a query looks like this:

curl 'localhost:9200/library/_search?search_type=query_then_fetch&size=0&pretty' -d '{
   "aggs": {
      "years": {
         "stats": {
            "field": "year"
         }
      },
      "words": {
         "terms": {
            "field": "copies"
         }
      }
   }
}'

This query defines two aggregations. The aggregation named years shows statistics for the year field. The words aggregation contains information about the terms used in the copies field.
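If you prefer to script such requests, the same body can be assembled programmatically. The following is a minimal Python sketch, assuming a node at localhost:9200 as in the curl command; the helper name build_request is ours and purely illustrative, and the request is only constructed here, not sent.

```python
import json
import urllib.request

# The same two aggregations as in the curl command.
body = {
    "aggs": {
        "years": {"stats": {"field": "year"}},
        "words": {"terms": {"field": "copies"}},
    }
}


def build_request(host="localhost:9200", index="library"):
    """Build (but do not send) the search request. Hypothetical helper."""
    url = "http://%s/%s/_search?size=0&pretty" % (host, index)
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request()
print(request.full_url)
print(json.dumps(body, indent=2))

# To run it against a live node, uncomment the following lines:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Keeping the body as a plain dictionary makes it easy to add or remove named aggregations before serializing it.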

Note

In our examples we assumed that we perform aggregation in addition to searching. If we don't need the matching documents themselves, it is better to set the size parameter to 0. This omits some unnecessary work and is more efficient. In such a case, the endpoint should be /library/_search?size=0. You can read more about search types in Chapter 3, Understanding the Querying Process.

Let's now look at the response returned by Elasticsearch for the preceding query:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "words" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : 0,
        "doc_count" : 2
      }, {
        "key" : 1,
        "doc_count" : 1
      }, {
        "key" : 6,
        "doc_count" : 1
      } ]
    },
    "years" : {
      "count" : 4,
      "min" : 1886.0,
      "max" : 1961.0,
      "avg" : 1928.0,
      "sum" : 7712.0
    }
  }
}

As you can see, both aggregations (years and words) were returned. The first aggregation we defined in our query (years) returned general statistics for the given field, gathered across all the documents that matched our query. The second aggregation (words) was a bit different. It created several sets called buckets, calculated on the returned documents, and each of the aggregated values fell into one of these sets. As you can see, there are multiple aggregation types available and they return different results. We will see the differences later in this section.
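To make the difference between the two result shapes concrete, here is a small Python sketch that reads the aggregations part of the preceding response. The JSON is the one shown earlier, only reformatted for brevity; the variable names are ours.

```python
import json

# The "aggregations" part of the response, reformatted for brevity.
response_text = """
{
  "aggregations" : {
    "words" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        { "key" : 0, "doc_count" : 2 },
        { "key" : 1, "doc_count" : 1 },
        { "key" : 6, "doc_count" : 1 }
      ]
    },
    "years" : {
      "count" : 4, "min" : 1886.0, "max" : 1961.0,
      "avg" : 1928.0, "sum" : 7712.0
    }
  }
}
"""

aggregations = json.loads(response_text)["aggregations"]

# stats result: a single flat set of numbers keyed by statistic name.
years = aggregations["years"]
print("years: min=%s max=%s avg=%s" % (years["min"], years["max"], years["avg"]))

# terms result: a list of buckets, one per distinct value of the copies field.
copies_to_docs = {b["key"]: b["doc_count"] for b in aggregations["words"]["buckets"]}
for copies, docs in sorted(copies_to_docs.items()):
    print("copies=%s -> %s document(s)" % (copies, docs))
```

Note that the results are looked up by the names we chose in the query (years and words), which is why each aggregation needs a distinct name.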

The great thing about the aggregation engine is that it allows you to have multiple aggregations and that aggregations can be nested. This means that you can nest aggregations to an arbitrary depth and define any number of them in general. The extended structure of the query is shown next:

{
   "query": { … },
   "aggs" : {
     "first_aggregation_name" : {
       "aggregation_type" : {
         ...
       },
       "aggregations" : {
         "first_nested_aggregation" : {
         ...
         },
         .
         .
         .
         "nth_nested_aggregation" : {
         ...
         }
       }
     },
     .
     .
     .
     "nth_aggregation_name" : {
     ...
     }
   }
}
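As a concrete instance of this structure, the following Python sketch builds a body with one level of nesting: a terms aggregation on the copies field with a stats aggregation on the year field inside each bucket. It is an illustration only; the helper aggregation() and the names copies_buckets and year_stats are hypothetical, and the query part is omitted.

```python
import json


def aggregation(agg_type, params, nested=None):
    """Return one aggregation definition.

    `nested` optionally maps names to sub-aggregation definitions.
    Hypothetical helper, not part of any Elasticsearch client.
    """
    definition = {agg_type: params}
    if nested:
        definition["aggs"] = nested
    return definition


body = {
    "aggs": {
        "copies_buckets": aggregation(
            "terms",
            {"field": "copies"},
            nested={"year_stats": aggregation("stats", {"field": "year"})},
        )
    }
}

print(json.dumps(body, indent=2))
```

Because aggregation() accepts definitions it produced itself as nested values, deeper levels are built the same way.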

Inside the aggregations engine

Aggregations work on the basis of results returned by the query. This is very handy as we get the information that we are interested in, both from the query as well as the data analysis perspective. So what does Elasticsearch do when we include the aggregation part of the query in the request that we send to Elasticsearch? First of all, the aggregation is executed on each relevant shard and the results are returned to the node that is responsible for running that query. That node waits for the partial results to be calculated; after it gets all the results, it merges the results, producing the final results.
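The following Python sketch models this two-step flow for a stats aggregation: each shard computes partial results, and the coordinating node merges them. It is a simplified model, not Elasticsearch's actual implementation. The individual year values and their placement on three shards are assumptions, chosen only to be consistent with the totals in the response shown earlier.

```python
def shard_stats(values):
    """Partial stats computed locally on one shard (assumes a non-empty shard)."""
    return {
        "count": len(values),
        "min": float(min(values)),
        "max": float(max(values)),
        "sum": float(sum(values)),
    }


def merge_stats(partials):
    """Merge step performed by the node that coordinates the query."""
    count = sum(p["count"] for p in partials)
    total = sum(p["sum"] for p in partials)
    return {
        "count": count,
        "min": min(p["min"] for p in partials),
        "max": max(p["max"] for p in partials),
        "avg": total / count,
        "sum": total,
    }


# Hypothetical year values and shard placement (not taken from the index).
shard_values = [[1886, 1929], [1936], [1961]]

partials = [shard_stats(values) for values in shard_values]
merged = merge_stats(partials)
print(merged)
```

Count, minimum, maximum, and sum can each be combined from partial results without loss, and the average follows from the merged sum and count, so the merged statistics in this model are exact.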

This approach is nothing new when it comes to distributed systems and how they work and communicate, but it can cause issues when it comes to the precision of the results. In most cases this is not a problem, but you should be aware of what to expect. Let's imagine the following example:

[Figure: Inside the aggregations engine]

The preceding image shows a simplified view of three shards, each containing documents that hold only the terms Elasticsearch and Solr. Now imagine that we are interested in the single top term for our index. The terms aggregation, when run with size=1, returns a single term: the most frequent one (limited, of course, to the documents matching our query). In this simplified view, each shard reports only its own top term, so the aggregator node sees partial results telling it that Elasticsearch is present in 19 documents in Shard 1 and that Solr is present in 10 documents in each of Shard 2 and Shard 3. Added up, that is 20 documents for Solr against 19 for Elasticsearch, so the top term appears to be Solr, which is not true, because the Elasticsearch counts from Shard 2 and Shard 3 were never reported. This is an extreme case, but there are use cases (such as accounting) where precision is key, and you should be aware of such situations.
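The effect can be reproduced with a few lines of Python. This is a simplified model of the merge, not Elasticsearch's actual implementation. The counts 19, 10, and 10 come from the example; the counts for the other term on each shard are hypothetical, and the shard_size argument is a parameter of this sketch that controls how many terms each shard reports.

```python
from collections import Counter

# Per-shard document counts for each term.
shards = [
    Counter({"Elasticsearch": 19, "Solr": 5}),  # Solr count is hypothetical
    Counter({"Solr": 10, "Elasticsearch": 9}),  # Elasticsearch count is hypothetical
    Counter({"Solr": 10, "Elasticsearch": 9}),  # Elasticsearch count is hypothetical
]


def merged_top_terms(shards, size, shard_size):
    """Each shard reports its top `shard_size` terms; the coordinator sums them."""
    merged = Counter()
    for shard in shards:
        merged.update(dict(shard.most_common(shard_size)))
    return merged.most_common(size)


def exact_top_terms(shards, size):
    """Ground truth: sum every term count from every shard."""
    total = Counter()
    for shard in shards:
        total.update(shard)
    return total.most_common(size)


print(merged_top_terms(shards, size=1, shard_size=1))  # [('Solr', 20)]
print(exact_top_terms(shards, size=1))                 # [('Elasticsearch', 37)]
print(merged_top_terms(shards, size=1, shard_size=2))  # [('Elasticsearch', 37)]
```

In this model, asking each shard for more terms than the final result needs is enough to recover the correct answer, at the cost of moving more data between nodes.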

Note

Compared to queries, aggregations are heavier for Elasticsearch in terms of both CPU cycles and memory consumption. We will discuss this in more detail in the Caching Aggregations section of this chapter.
