One of the aggregations introduced after the release of Elasticsearch 1.0 is the significant_terms aggregation, which we can use starting from release 1.1. It allows us to get the terms that are relevant and probably the most significant for a given query. The good thing is that it doesn't simply show the top terms from the results of the given query; instead, it shows the terms that are significantly more common in those results than in the rest of the data.
The use cases for this aggregation type vary from finding the most troublesome server in your application environment to suggesting nicknames from text. Whenever Elasticsearch sees a significant change in the popularity of a term, such a term is a candidate for being significant.
The best way to describe the significant_terms aggregation type is through an example. Let's start with indexing 12 simple documents that represent reviews of work done by interns (the commands are also provided in a significant.sh script for easier execution on Linux-based systems):
curl -XPOST 'localhost:9200/interns/review/1' -d '{"intern" : "Richard", "grade" : "bad", "type" : "grade"}'
curl -XPOST 'localhost:9200/interns/review/2' -d '{"intern" : "Ralf", "grade" : "perfect", "type" : "grade"}'
curl -XPOST 'localhost:9200/interns/review/3' -d '{"intern" : "Richard", "grade" : "bad", "type" : "grade"}'
curl -XPOST 'localhost:9200/interns/review/4' -d '{"intern" : "Richard", "grade" : "bad", "type" : "review"}'
curl -XPOST 'localhost:9200/interns/review/5' -d '{"intern" : "Richard", "grade" : "good", "type" : "grade"}'
curl -XPOST 'localhost:9200/interns/review/6' -d '{"intern" : "Ralf", "grade" : "good", "type" : "grade"}'
curl -XPOST 'localhost:9200/interns/review/7' -d '{"intern" : "Ralf", "grade" : "perfect", "type" : "review"}'
curl -XPOST 'localhost:9200/interns/review/8' -d '{"intern" : "Richard", "grade" : "medium", "type" : "review"}'
curl -XPOST 'localhost:9200/interns/review/9' -d '{"intern" : "Monica", "grade" : "medium", "type" : "grade"}'
curl -XPOST 'localhost:9200/interns/review/10' -d '{"intern" : "Monica", "grade" : "medium", "type" : "grade"}'
curl -XPOST 'localhost:9200/interns/review/11' -d '{"intern" : "Ralf", "grade" : "good", "type" : "grade"}'
curl -XPOST 'localhost:9200/interns/review/12' -d '{"intern" : "Ralf", "grade" : "good", "type" : "grade"}'
Of course, to show the real power of the significant_terms aggregation, we should use a much larger dataset. However, for the purpose of this book, we will concentrate on this example, as it makes it easier to illustrate how this aggregation works.
Now let's try finding the most significant grade for Richard. To do that, we will use the following query:
curl -XGET 'localhost:9200/interns/_search?pretty' -d '{
  "query" : {
    "match" : { "intern" : "Richard" }
  },
  "aggregations" : {
    "description" : {
      "significant_terms" : { "field" : "grade" }
    }
  }
}'
The result of the preceding query looks as follows:
{ "took" : 2, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 5, "max_score" : 1.4054651, "hits" : [ { "_index" : "interns", "_type" : "review", "_id" : "4", "_score" : 1.4054651, "_source":{"intern" : "Richard", "grade" : "bad"} }, { "_index" : "interns", "_type" : "review", "_id" : "3", "_score" : 1.0, "_source":{"intern" : "Richard", "grade" : "bad"} }, { "_index" : "interns", "_type" : "review", "_id" : "8", "_score" : 1.0, "_source":{"intern" : "Richard", "grade" : "medium"} }, { "_index" : "interns", "_type" : "review", "_id" : "1", "_score" : 1.0, "_source":{"intern" : "Richard", "grade" : "bad"} }, { "_index" : "interns", "_type" : "review", "_id" : "5", "_score" : 1.0, "_source":{"intern" : "Richard", "grade" : "good"} } ] }, "aggregations" : { "description" : { "doc_count" : 5, "buckets" : [ { "key" : "bad", "doc_count" : 3, "score" : 0.84, "bg_count" : 3 } ] } } }
As you can see, for our query, Elasticsearch informed us that the most significant grade for Richard is bad. Maybe it wasn't the best internship for him, who knows.
To calculate significant terms, Elasticsearch looks for terms whose popularity changes significantly between two sets of data: the foreground set and the background set. The foreground set is the data returned by our query, while the background set is the data in our index (or indices, depending on how we run our queries). If a term exists in 10 documents out of 1 million indexed documents, but appears in five of the 10 documents returned by a query, such a term is definitely significant and worth concentrating on.
Let's get back to our preceding example now and analyze it a bit. Richard got three distinct grades from the reviewers: bad three times, medium one time, and good one time. Of those, the bad value appears in three out of the five documents matching the query. In general, the bad grade appears in three documents (the bg_count property) out of the 12 documents in the index (this is our background set), which gives us 25 percent of the indexed documents. On the other hand, the bad grade appears in three out of the five documents matching the query (this is our foreground set), which gives us 60 percent of the documents. As you can see, the change in popularity is significant for the bad grade, and that's why Elasticsearch has chosen it to be returned in the significant_terms aggregation results.
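The 0.84 score from the response can be reproduced by hand. The sketch below assumes a JLH-style heuristic that multiplies the absolute change in popularity by the relative change; treat it as an illustration of the idea rather than the exact internal implementation:

```shell
# Reproduce the score for the "bad" grade from the example above.
# Assumed formula (JLH-style): (fg% - bg%) * (fg% / bg%)
score=$(awk 'BEGIN {
  fg = 3 / 5    # "bad" appears in 3 of the 5 documents matching the query
  bg = 3 / 12   # "bad" appears in 3 of the 12 documents in the index
  printf "%.2f", (fg - bg) * (fg / bg)
}')
echo "significance score: $score"   # prints 0.84, matching the response
```

The absolute change (60 percent minus 25 percent) rewards terms that became much more common, while the relative factor (60 divided by 25) boosts terms that were rare in the background set.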
Of course, the significant_terms aggregation can be nested, which provides us with nice data analysis capabilities that combine multiple sets of data. For example, let's try to find a significant grade for each of the interns that we have information about. To do that, we will nest the significant_terms aggregation inside the terms aggregation. The query that does that looks as follows:
curl -XGET 'localhost:9200/interns/_search?size=0&pretty' -d '{
  "aggregations" : {
    "grades" : {
      "terms" : { "field" : "intern" },
      "aggregations" : {
        "significantGrades" : {
          "significant_terms" : { "field" : "grade" }
        }
      }
    }
  }
}'
The results returned by Elasticsearch for that query are as follows:
{ "took" : 71, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 12, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "grades" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "ralf", "doc_count" : 5, "significantGrades" : { "doc_count" : 5, "buckets" : [ { "key" : "good", "doc_count" : 3, "score" : 0.21000000000000002, "bg_count" : 4 } ] } }, { "key" : "richard", "doc_count" : 5, "significantGrades" : { "doc_count" : 5, "buckets" : [ { "key" : "bad", "doc_count" : 3, "score" : 0.6, "bg_count" : 3 } ] } }, { "key" : "monica", "doc_count" : 2, "significantGrades" : { "doc_count" : 2, "buckets" : [ ] } } ] } } }
As you can see, we got the results for interns Ralf (the key property equals ralf) and Richard (the key property equals richard). We didn't get information for Monica, though. That's because there wasn't a significant change for any term in the grade field associated with the monica value in the intern field.
Of course, the significant_terms aggregation can also be used on full text search fields, which is practically useful for identifying text keywords. The thing is that running this aggregation on analyzed fields may require a large amount of memory, because Elasticsearch will attempt to load every term into memory.
For example, we could run the significant_terms aggregation against the title field in our library index like the following:
curl -XGET 'localhost:9200/library/_search?size=0&pretty' -d '{
  "query" : {
    "term" : { "available" : true }
  },
  "aggregations" : {
    "description" : {
      "significant_terms" : { "field" : "title" }
    }
  }
}'
However, the results wouldn't bring us any useful insight in this case:
{ "took" : 2, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 4, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "description" : { "doc_count" : 4, "buckets" : [ { "key" : "the", "doc_count" : 3, "score" : 1.125, "bg_count" : 3 } ] } } }
The reason for this is that we don't have a large enough dataset for the results to be meaningful. However, from a purely statistical point of view, the the term is significant for the title field.
We could stop here and let you play with the significant_terms aggregation, but we will not. Instead, we will show you a few of the vast configuration options available for this aggregation type so that you can configure the internal calculations and adjust them to your needs.
Elasticsearch allows us to choose how many buckets at most we want to have returned in the results. We can control this by using the size property. However, the final bucket list may contain more buckets than we set the size property to. This is the case when the number of unique terms is larger than the specified size property.
If you want to have even more control over the number of returned buckets, you can use the shard_size property. This property specifies how many candidates for significant terms will be returned by each shard. The thing to consider is that usually the low-frequency terms turn out to be the most interesting ones, but Elasticsearch can't see that before merging the results on the aggregating node. Because of this, it is good to keep the shard_size property value higher than the value of the size property.
There is one more thing to remember: if you set the shard_size property lower than the size property, Elasticsearch will replace the shard_size property value with the value of the size property.
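Putting the two properties together, the aggregation part of a request body could look like the following fragment (the values 2 and 10 are just illustrative choices, not recommendations):

```
"significant_terms" : {
  "field" : "grade",
  "size" : 2,
  "shard_size" : 10
}
```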
If you remember, we said that the background set of term frequencies used by the significant_terms aggregation is the whole index or indices. We can alter that behavior by using a filter (with the background_filter property) to narrow down the background set. This is useful when we want to find significant terms in a given context.
For example, if we would like to narrow down the background set from our first example to only the documents that are real grades, not reviews, we would add the following term filter to our query:
curl -XGET 'localhost:9200/interns/_search?pretty&size=0' -d '{
  "query" : {
    "match" : { "intern" : "Richard" }
  },
  "aggregations" : {
    "description" : {
      "significant_terms" : {
        "field" : "grade",
        "background_filter" : {
          "term" : { "type" : "grade" }
        }
      }
    }
  }
}'
If you look more closely at the results, you will notice that Elasticsearch calculated the significant terms against a smaller number of documents:
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
  "hits" : { "total" : 5, "max_score" : 0.0, "hits" : [ ] },
  "aggregations" : {
    "description" : {
      "doc_count" : 5,
      "buckets" : [
        { "key" : "bad", "doc_count" : 3, "score" : 1.02, "bg_count" : 2 }
      ]
    }
  }
}
Notice that bg_count is now 2 instead of 3 as in the initial example. That's because there are only two documents that have the bad value in the grade field and match our filter specified in background_filter.
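The new 1.02 score is also consistent with a JLH-style hand calculation (an assumption, not the guaranteed internal formula), once you note that nine of the twelve indexed documents have the grade type, so the filtered background set contains nine documents:

```shell
# Reproduce the score with the narrowed background set.
# Assumed formula (JLH-style): (fg% - bg%) * (fg% / bg%)
score=$(awk 'BEGIN {
  fg = 3 / 5   # "bad" still appears in 3 of the 5 matching documents
  bg = 2 / 9   # "bad" appears in 2 of the 9 documents with type "grade"
  printf "%.2f", (fg - bg) * (fg / bg)
}')
echo "significance score: $score"   # prints 1.02, matching the response
```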
A good thing about the significant_terms aggregation is that we can control the minimum number of documents a term needs to be present in to be included as a bucket. We do that by adding the min_doc_count property with the count of our choice.
For example, let's add this parameter to the query that returned the significant grades for each of our interns. Let's lower the default value of 3 that the min_doc_count property is set to and set it to 2. Our modified query would look as follows:
curl -XGET 'localhost:9200/interns/_search?size=0&pretty' -d '{
  "aggregations" : {
    "grades" : {
      "terms" : { "field" : "intern" },
      "aggregations" : {
        "significantGrades" : {
          "significant_terms" : {
            "field" : "grade",
            "min_doc_count" : 2
          }
        }
      }
    }
  }
}'
The results of the preceding query would be as follows:
{ "took" : 3, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 12, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "grades" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "ralf", "doc_count" : 5, "significantGrades" : { "doc_count" : 5, "buckets" : [ { "key" : "perfect", "doc_count" : 2, "score" : 0.3200000000000001, "bg_count" : 2 }, { "key" : "good", "doc_count" : 3, "score" : 0.21000000000000002, "bg_count" : 4 } ] } }, { "key" : "richard", "doc_count" : 5, "significantGrades" : { "doc_count" : 5, "buckets" : [ { "key" : "bad", "doc_count" : 3, "score" : 0.6, "bg_count" : 3 } ] } }, { "key" : "monica", "doc_count" : 2, "significantGrades" : { "doc_count" : 2, "buckets" : [ { "key" : "medium", "doc_count" : 2, "score" : 1.0, "bg_count" : 3 } ] } } ] } } }
As you can see, the results differ from the original example. This is because the constraints on the significant terms have been loosened. Of course, that also means that our results may be worse now. Setting this parameter to 1 may result in typos and strange words being included in the results and is generally not advised.
There is one thing to remember when it comes to using the min_doc_count property. During the first phase of the aggregation calculation, Elasticsearch collects the highest-scoring terms on each shard included in the process. However, because a single shard doesn't have information about the global term frequencies, the decision about a term being a candidate for the significant terms list is based on the shard term frequencies. The min_doc_count property is applied during the final stage of the query, once all the results have been merged from the shards. Because of this, it may happen that high-frequency terms are missing from the significant terms list and the list is populated by high-scoring terms instead. To avoid this, you can increase the shard_size property, at the cost of memory consumption and higher network usage.
Elasticsearch allows us to specify the execution mode that should be used to calculate the significant_terms aggregation. Depending on the situation, we can set the execution_hint property either to map or to ordinal. The first execution type tells Elasticsearch to aggregate the data per bucket using the values themselves. The second tells Elasticsearch to use the ordinals of the values instead of the values themselves. In most situations, setting the execution_hint property to ordinal should result in slightly faster execution, but the data we are working on must expose the ordinals. However, if the field you calculate the significant_terms aggregation on is a high-cardinality one (that is, if it contains a high number of unique terms), then using map is, in most cases, a better choice.
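For illustration, setting the hint on our example aggregation could look like the following fragment (note that execution_hint is just a hint; Elasticsearch may ignore it when the requested mode is not applicable):

```
"significant_terms" : {
  "field" : "grade",
  "execution_hint" : "map"
}
```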
Because Elasticsearch is constantly being developed and changed, we decided not to include all the options that are possible to set. We also omitted the options that we think are very rarely used by the users so that we are able to write in further detail about more commonly used features. See the full list of options at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html.
While we were working on this book, there were a few limitations when it comes to the significant_terms aggregation. Of course, none of them is a showstopper that will force you to totally forget about this aggregation, but it is useful to know about them.
Because the significant_terms aggregation works on indexed values, it needs to load all the unique terms into memory to be able to do its job. Because of this, you have to be careful when using this aggregation on large indices and on fields that are analyzed. In addition to this, we can't lower the memory consumption by using doc values fields, because the significant_terms aggregation doesn't support them.
The significant_terms aggregation shouldn't be used as a top-level aggregation whenever you are using the match_all query, a query equivalent to it that returns all the documents, or no query at all. In such cases, the foreground and background sets will be the same, and Elasticsearch won't be able to calculate the differences in frequencies. This means that no significant terms will be found.
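One way to see why: if we assume a JLH-style scoring formula, (fg% - bg%) * (fg% / bg%), then equal foreground and background frequencies make every term score zero:

```shell
# With match_all, each term's foreground frequency equals its background frequency.
# Assumed formula (JLH-style): (fg% - bg%) * (fg% / bg%)
score=$(awk 'BEGIN {
  fg = 3 / 12   # the foreground set is the whole index
  bg = 3 / 12   # ...and so is the background set
  printf "%.2f", (fg - bg) * (fg / bg)
}')
echo "significance score: $score"   # prints 0.00 - no term is significant
```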
Elasticsearch approximates the counts of how many documents contain a term based on the information returned for each shard. You have to be aware of that because this means that those counts can be miscalculated in certain situations (for example, count can be approximated too low when shards didn't include data for a given term in the top samples returned). As the documentation states, this was a design decision to allow faster execution at the cost of potentially small inaccuracies.