Significant terms aggregation

One of the aggregations introduced after the release of Elasticsearch 1.0 is the significant_terms aggregation, which we can use starting from release 1.1. It allows us to get the terms that are relevant and probably the most significant for a given query. The good thing is that it doesn't just show the top terms from the results of the given query, but the terms that seem to be the most important ones.

The use cases for this aggregation type vary from finding the most troublesome server in your application environment to suggesting nicknames from text. Whenever Elasticsearch sees a significant change in the popularity of a term, that term is a candidate for being significant.

Note

Please remember that the significant_terms aggregation is marked as experimental and can change or even be removed in future versions of Elasticsearch.

An example

The best way to describe the significant_terms aggregation type is through an example. Let's start by indexing 12 simple documents that represent reviews of work done by interns (the commands are also provided in the significant.sh script for easier execution on Linux-based systems):

curl -XPOST 'localhost:9200/interns/review/1' -d '{"intern" : "Richard", "grade" : "bad", "type" : "grade"}'
curl -XPOST 'localhost:9200/interns/review/2' -d '{"intern" : "Ralf", "grade" : "perfect", "type" : "grade"}'
curl -XPOST 'localhost:9200/interns/review/3' -d '{"intern" : "Richard", "grade" : "bad", "type" : "grade"}'
curl -XPOST 'localhost:9200/interns/review/4' -d '{"intern" : "Richard", "grade" : "bad", "type" : "review"}'
curl -XPOST 'localhost:9200/interns/review/5' -d '{"intern" : "Richard", "grade" : "good", "type" : "grade"}'
curl -XPOST 'localhost:9200/interns/review/6' -d '{"intern" : "Ralf", "grade" : "good", "type" : "grade"}'
curl -XPOST 'localhost:9200/interns/review/7' -d '{"intern" : "Ralf", "grade" : "perfect", "type" : "review"}'
curl -XPOST 'localhost:9200/interns/review/8' -d '{"intern" : "Richard", "grade" : "medium", "type" : "review"}'
curl -XPOST 'localhost:9200/interns/review/9' -d '{"intern" : "Monica", "grade" : "medium", "type" : "grade"}'
curl -XPOST 'localhost:9200/interns/review/10' -d '{"intern" : "Monica", "grade" : "medium", "type" : "grade"}'
curl -XPOST 'localhost:9200/interns/review/11' -d '{"intern" : "Ralf", "grade" : "good", "type" : "grade"}'
curl -XPOST 'localhost:9200/interns/review/12' -d '{"intern" : "Ralf", "grade" : "good", "type" : "grade"}'

Of course, to show the real power of the significant_terms aggregation, we would need a much larger dataset. However, for the purposes of this book, we will concentrate on this example, as it makes it easier to illustrate how this aggregation works.

Now let's try finding the most significant grade for Richard. To do that, we will use the following query:

curl -XGET 'localhost:9200/interns/_search?pretty' -d '{
 "query" : {
  "match" : {
   "intern" : "Richard"
  }
 },
 "aggregations" : {
  "description" : {
   "significant_terms" : {
    "field" : "grade"
   }
  }
 }
}'

The result of the preceding query looks as follows:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 5,
    "max_score" : 1.4054651,
    "hits" : [ {
      "_index" : "interns",
      "_type" : "review",
      "_id" : "4",
      "_score" : 1.4054651,
      "_source":{"intern" : "Richard", "grade" : "bad"}
    }, {
      "_index" : "interns",
      "_type" : "review",
      "_id" : "3",
      "_score" : 1.0,
      "_source":{"intern" : "Richard", "grade" : "bad"}
    }, {
      "_index" : "interns",
      "_type" : "review",
      "_id" : "8",
      "_score" : 1.0,
      "_source":{"intern" : "Richard", "grade" : "medium"}
    }, {
      "_index" : "interns",
      "_type" : "review",
      "_id" : "1",
      "_score" : 1.0,
      "_source":{"intern" : "Richard", "grade" : "bad"}
    }, {
      "_index" : "interns",
      "_type" : "review",
      "_id" : "5",
      "_score" : 1.0,
      "_source":{"intern" : "Richard", "grade" : "good"}
    } ]
  },
  "aggregations" : {
    "description" : {
      "doc_count" : 5,
      "buckets" : [ {
        "key" : "bad",
        "doc_count" : 3,
        "score" : 0.84,
        "bg_count" : 3
      } ]
    }
  }
}

As you can see, for our query, Elasticsearch informed us that the most significant grade for Richard is bad. Maybe it wasn't the best internship for him, who knows.

Choosing significant terms

To calculate significant terms, Elasticsearch looks for terms whose popularity changes significantly between two sets of data: the foreground set and the background set. The foreground set is the data returned by our query, while the background set is the data in our index (or indices, depending on how we run our queries). If a term exists in 10 documents out of 1 million indexed documents, but appears in five out of the 10 documents returned by a query, then such a term is definitely significant and worth concentrating on.

Let's get back to our preceding example now and analyze it a bit. Richard received three distinct grades from the reviewers: bad three times, medium once, and good once. The bad grade appears in three documents (the bg_count property) out of the 12 documents in the index (this is our background set), which gives us 25 percent of the indexed documents. On the other hand, the bad grade appears in three out of the five documents matching the query (this is our foreground set), which gives us 60 percent of those documents. As you can see, the change in popularity is significant for the bad grade, and that's why Elasticsearch has chosen it to be returned in the significant_terms aggregation results.
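It is also worth seeing where the score value of 0.84 comes from. As a minimal sanity check, assuming the JLH heuristic that, to our knowledge, Elasticsearch uses by default for this aggregation (and noting that per-shard calculations can affect the exact values), the score is the difference between the foreground and background percentages multiplied by their ratio:

score = (0.6 - 0.25) * (0.6 / 0.25) = 0.35 * 2.4 = 0.84

This heuristic favors terms that are not only more frequent in the foreground set, but also many times more frequent there than in the background set.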

Multiple values analysis

Of course, the significant_terms aggregation can be nested, which provides us with nice data analysis capabilities connecting multiple sets of data. For example, let's try to find a significant grade for each of the interns we have information about. To do that, we will nest the significant_terms aggregation inside the terms aggregation. The query that does that looks as follows:

curl -XGET 'localhost:9200/interns/_search?size=0&pretty' -d '{
 "aggregations" : {
  "grades" : {
   "terms" : {
    "field" : "intern"
   },
   "aggregations" : {
    "significantGrades" : {
     "significant_terms" : {
      "field" : "grade"
     }
    }
   }
  }
 }
}'

The results returned by Elasticsearch for that query are as follows:

{
  "took" : 71,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 12,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "grades" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : "ralf",
        "doc_count" : 5,
        "significantGrades" : {
          "doc_count" : 5,
          "buckets" : [ {
            "key" : "good",
            "doc_count" : 3,
            "score" : 0.21000000000000002,
            "bg_count" : 4
          } ]
        }
      }, {
        "key" : "richard",
        "doc_count" : 5,
        "significantGrades" : {
          "doc_count" : 5,
          "buckets" : [ {
            "key" : "bad",
            "doc_count" : 3,
            "score" : 0.6,
            "bg_count" : 3
          } ]
        }
      }, {
        "key" : "monica",
        "doc_count" : 2,
        "significantGrades" : {
          "doc_count" : 2,
          "buckets" : [ ]
        }
      } ]
    }
  }
}

As you can see, we got results for the interns Ralf (the key property equals ralf) and Richard (the key property equals richard). We didn't get information for Monica, though. That's because there wasn't a significant change in popularity for any term in the grade field of the documents with the monica value in the intern field.

Significant terms aggregation and full text search fields

Of course, the significant_terms aggregation can also be used on full text search fields, which is practically useful for identifying text keywords. The thing is that running this aggregation on analyzed fields may require a large amount of memory, because Elasticsearch will attempt to load every term into memory.

For example, we could run the significant_terms aggregation against the title field in our library index as follows:

curl -XGET 'localhost:9200/library/_search?size=0&pretty' -d '{
 "query" : {
  "term" : {
   "available" : true
  }
 },
 "aggregations" : {
  "description" : {
   "significant_terms" : {
    "field" : "title"
   }
  }
 }
}'

However, the results wouldn't bring us any useful insight in this case:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "description" : {
      "doc_count" : 4,
      "buckets" : [ {
        "key" : "the",
        "doc_count" : 3,
        "score" : 1.125,
        "bg_count" : 3
      } ]
    }
  }
}

The reason for this is that we don't have a large enough dataset for the results to be meaningful. However, from a logical point of view, the term the is significant for the title field.

Additional configuration options

We could stop here and let you play with the significant_terms aggregation, but we will not. Instead, we will show you a few of the many configuration options available for this aggregation type so that you can tune its internal calculations and adjust it to your needs.

Controlling the number of returned buckets

Elasticsearch allows us to control how many buckets, at most, we want returned in the results. We can do this by using the size property. However, the final bucket list may contain more buckets than we set the size property to. This is the case when the number of unique terms is larger than the specified size property.

If you want to have even more control over the number of returned buckets, you can use the shard_size property. This property specifies how many candidates for significant terms will be returned by each shard. The thing to consider is that the low-frequency terms are usually the ones that turn out to be the most interesting, but Elasticsearch can't see that before merging the results on the aggregating node. Because of this, it is good to keep the shard_size property value higher than the value of the size property.

There is one more thing to remember: if you set the shard_size property lower than the size property, then Elasticsearch will replace the shard_size property with the value of the size property.
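For example, a query asking for at most two significant grades, with each shard returning up to ten candidate terms, could look like the following sketch (the exact values are arbitrary and chosen only to illustrate the two properties on our tiny dataset):

curl -XGET 'localhost:9200/interns/_search?size=0&pretty' -d '{
 "query" : {
  "match" : {
   "intern" : "Richard"
  }
 },
 "aggregations" : {
  "description" : {
   "significant_terms" : {
    "field" : "grade",
    "size" : 2,
    "shard_size" : 10
   }
  }
 }
}'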

Note

Please note that starting from Elasticsearch 1.2.0, if the size or shard_size property is set to 0, Elasticsearch will change that and set it to Integer.MAX_VALUE.

Background set filtering

If you remember, we said that the background set of term frequencies used by the significant_terms aggregation is the whole index or indices. We can alter that behavior by using a filter (the background_filter property) to narrow down the background set. This is useful when we want to find significant terms in a given context.

For example, if we wanted to narrow down the background set from our first example to only the documents that are real grades, not reviews, we would add the following term filter to our query:

curl -XGET 'localhost:9200/interns/_search?pretty&size=0' -d '{
 "query" : {
  "match" : {
   "intern" : "Richard"
  }
 },
 "aggregations" : {
  "description" : {
   "significant_terms" : {
    "field" : "grade",
    "background_filter" : {
     "term" : {
      "type" : "grade"
     }
    }
   }
  }
 }
}'

If you look more closely at the results, you will notice that Elasticsearch calculated the significant terms against a smaller number of documents:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 5,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "description" : {
      "doc_count" : 5,
      "buckets" : [ {
        "key" : "bad",
        "doc_count" : 3,
        "score" : 1.02,
        "bg_count" : 2
      } ]
    }
  }
}

Notice that bg_count is now 2 instead of 3, as in the initial example. That's because there are only two documents that have the bad value in the grade field and also match the filter specified in background_filter.

Minimum document count

A good thing about the significant_terms aggregation is that we can control the minimum number of documents a term needs to appear in to be included as a bucket. We do that by adding the min_doc_count property with a count of our choice.

For example, let's add this parameter to the query that returned the significant grades for each of our interns. Let's lower the min_doc_count property from its default value of 3 to 2. Our modified query looks as follows:

curl -XGET 'localhost:9200/interns/_search?size=0&pretty' -d '{
 "aggregations" : {
  "grades" : {
   "terms" : {
    "field" : "intern"
   },
   "aggregations" : {
    "significantGrades" : {
     "significant_terms" : {
      "field" : "grade",
      "min_doc_count" : 2
     }
    }
   }
  }
 }
}'

The results of the preceding query would be as follows:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 12,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "grades" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : "ralf",
        "doc_count" : 5,
        "significantGrades" : {
          "doc_count" : 5,
          "buckets" : [ {
            "key" : "perfect",
            "doc_count" : 2,
            "score" : 0.3200000000000001,
            "bg_count" : 2
          }, {
            "key" : "good",
            "doc_count" : 3,
            "score" : 0.21000000000000002,
            "bg_count" : 4
          } ]
        }
      }, {
        "key" : "richard",
        "doc_count" : 5,
        "significantGrades" : {
          "doc_count" : 5,
          "buckets" : [ {
            "key" : "bad",
            "doc_count" : 3,
            "score" : 0.6,
            "bg_count" : 3
          } ]
        }
      }, {
        "key" : "monica",
        "doc_count" : 2,
        "significantGrades" : {
          "doc_count" : 2,
          "buckets" : [ {
            "key" : "medium",
            "doc_count" : 2,
            "score" : 1.0,
            "bg_count" : 3
          } ]
        }
      } ]
    }
  }
}

As you can see, the results differ from the original example because the constraints on the significant terms have been lowered. Of course, that also means that our results may be of lower quality now. Setting this parameter to 1 may cause typos and strange words to be included in the results and is generally not advised.

There is one thing to remember when it comes to using the min_doc_count property. During the first phase of the aggregation calculation, Elasticsearch collects the highest-scoring terms on each shard included in the process. However, because a shard doesn't have information about global term frequencies, the decision about a term being a candidate for the significant terms list is based on shard-local term frequencies. The min_doc_count property is applied only during the final stage of the query, once all the results from the shards have been merged. Because of this, it may happen that high-frequency terms are missing from the significant terms list while the list is populated with high-scoring terms instead. To avoid this, you can increase the shard_size property, at the cost of higher memory consumption and network usage.
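For example, a sketch of our nested query with shard_size increased (the value of 100 is arbitrary and used only for illustration on our small dataset) could look as follows:

curl -XGET 'localhost:9200/interns/_search?size=0&pretty' -d '{
 "aggregations" : {
  "grades" : {
   "terms" : {
    "field" : "intern"
   },
   "aggregations" : {
    "significantGrades" : {
     "significant_terms" : {
      "field" : "grade",
      "min_doc_count" : 2,
      "shard_size" : 100
     }
    }
   }
  }
 }
}'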

Execution hint

Elasticsearch allows us to specify the execution mode that should be used to calculate the significant_terms aggregation. Depending on the situation, we can set the execution_hint property either to map or to ordinals. The first execution type tells Elasticsearch to aggregate the data per bucket using the values themselves. The second tells Elasticsearch to use the ordinals of the values instead of the values themselves. In most situations, setting the execution_hint property to ordinals should result in slightly faster execution, but the data we are working on must expose ordinals. However, if the field we calculate the significant_terms aggregation on is a high-cardinality one (that is, it contains a high number of unique terms), then using map is, in most cases, the better choice.

Note

Please note that Elasticsearch will ignore the execution_hint property if it can't be applied.
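As an illustration, a sketch forcing the map execution mode on our grade field could look like the following:

curl -XGET 'localhost:9200/interns/_search?size=0&pretty' -d '{
 "query" : {
  "match" : {
   "intern" : "Richard"
  }
 },
 "aggregations" : {
  "description" : {
   "significant_terms" : {
    "field" : "grade",
    "execution_hint" : "map"
   }
  }
 }
}'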

More options

Because Elasticsearch is constantly being developed and changed, we decided not to include all the options that can be set. We also omitted the options that we think are very rarely used, so that we can describe the more commonly used features in greater detail. See the full list of options at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html.

There are limits

While we were working on this book, there were a few limitations when it came to the significant_terms aggregation. Of course, none of them are showstoppers that will force you to totally forget about this aggregation, but it is useful to know about them.

Memory consumption

Because the significant_terms aggregation works on indexed values, it needs to load all the unique terms into memory to be able to do its job. Because of this, you have to be careful when using this aggregation on large indices and on analyzed fields. In addition, we can't lower the memory consumption by using doc values, because the significant_terms aggregation doesn't support them.

Shouldn't be used as top-level aggregation

The significant_terms aggregation shouldn't be used as a top-level aggregation when you are using the match_all query, an equivalent query returning all the documents, or no query at all. In such cases, the foreground and background sets will be the same, and Elasticsearch won't be able to calculate the differences in frequencies. This means that no significant terms will be found.
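For example, a request like the following sketch, which uses no query and places significant_terms at the top level, compares the index against itself, and we would expect an empty buckets array in the response:

curl -XGET 'localhost:9200/interns/_search?size=0&pretty' -d '{
 "aggregations" : {
  "description" : {
   "significant_terms" : {
    "field" : "grade"
   }
  }
 }
}'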

Counts are approximated

Elasticsearch approximates the counts of how many documents contain a term based on the information returned by each shard. You have to be aware of this because it means that those counts can be miscalculated in certain situations (for example, a count can be approximated too low when a shard didn't include data for a given term in the top samples it returned). As the documentation states, this was a design decision to allow faster execution at the cost of potentially small inaccuracies.

Floating point fields are not allowed

Fields based on floating point types are not allowed as the subject of the significant_terms aggregation calculation. You can use long or integer based fields, though.
