In the previous chapter we looked at how to index tree-like structures, how to use nested objects, and how to use the parent-child relationship. We also discussed fetching data from external systems using the river plugins and batch processing to speed up indexing, and we saw how to use the index update API. In this chapter, we will learn how to use faceting, which will allow us to get aggregated data about our search results. We will also see how to get similar documents with the "more like this" functionality and how to use prospective search to store queries, not documents. By the end of this chapter you will have learned:
ElasticSearch is a full-text search engine that aims to provide search results on the basis of our queries. However, sometimes we would like to get more. For example, we would like to get aggregated data that is calculated on the result set we get, such as the number of documents priced between 100 and 200 dollars or the most common tags in the result documents. In order to do that, ElasticSearch provides a faceting module that is responsible for providing such data. In this chapter we will discuss the different faceting methods provided by ElasticSearch.
For the purpose of discussing faceting, we'll use a very simple index structure for our documents. It will contain the identifier of the document, document date, a multivalued field that can hold words describing our document (the tags field), and a field holding numeric information (the total field). Our mappings could look like this:
{ "mappings" : { "doc" : { "properties" : { "id" : { "type" : "long", "store" : "yes" }, "date" : { "type" : "date", "store" : "no" }, "tags" : { "type" : "string", "store" : "no", "index" : "not_analyzed" }, "total" : { "type" : "long", "store" : "no" } } } } }
Before we get into how to run faceting, let's take a look at what to expect from ElasticSearch as a result of faceting requests. In most cases you'll only be interested in the data specific to the faceting type. However, for most faceting types, in addition to information specific to a given faceting type, you'll get the following as well:
_type: This specifies the faceting type used and will be provided for each faceting type
missing: This specifies the number of documents that didn't have enough data (for example, a missing field) to calculate faceting
total: This specifies the number of tokens in the facet calculation
other: This specifies the number of facet values (for example, terms in terms faceting) not included in the returned counts
In addition to these types, you'll get an array of calculated facets such as count for your terms, queries, or spatial distance. For example, this is what the usual faceting results look like:
{ . . . "facets" : { "tags" : { "_type" : "terms", "missing" : 54715, "total" : 151266, "other" : 143140, "terms" : [ { "term" : "test", "count" : 1119 }, { "term" : "personal", "count" : 1063 }, (...) ] } } }
As you can see in the results, our faceting was run against the tags field. We've got a total number of 151,266 tokens processed by the faceting module, and 143,140 that were not included in the results. We also have 54,715 documents that didn't have the value in the tags field. The term "test" appeared in 1,119 documents and the term "personal" appeared in 1,063 documents. This is what you can expect from a faceting response.
Query is one of the simplest faceting types that allows us to get the number of documents that match the query in the faceting results. The query itself can be expressed using the ElasticSearch query language, which we discussed in Chapter 2, Searching Your Data. For example, faceting that would return the number of documents for a simple term query could look like this:
{ "query" : { "match_all" : {} }, "facets" : { "my_query_facet" : { "query" : { "term" : { "tags" : "personal" } } } } }
As you can see, we've included the query type faceting with a simple term query.
A sample response for the preceding query could look like this:
{ . . . "facets" : { "my_query_facet" : { "_type" : "query", "count" : 1081 } } }
As you can see, in the response, we've got the faceting type and the count of the documents that matched the facet query, and of course, the main query.
A filter is a simple faceting type that allows us to get the number of documents that match the filter. The filter itself can be expressed using the ElasticSearch query language. For example, faceting that would return the number of documents for a simple term filter could look like this:
{ "query" : { "match_all" : {} }, "facets" : { "my_filter_facet" : { "filter" : { "term" : { "tags" : "personal" } } } } }
As you can see, we've included the filter type faceting with a simple term filter. When talking about performance, filter facets are faster than query facets or filter facets that wrap queries.
An example response for the preceding query could look like this:
{ . . . "facets" : { "my_filter_facet" : { "_type" : "filter", "count" : 1081 } } }
As you can see in the response, we've got the faceting type and the count of the documents that matched the facet filter, and of course, the main query.
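Since the query and filter facets differ only in the key that wraps the clause, both request bodies can be produced by one small helper. This is a hedged sketch: count_facet and its use_filter flag are hypothetical names introduced for illustration, and the bodies simply mirror the two requests shown earlier.

```python
import json

def count_facet(name, clause, use_filter=True):
    """Build a request body with a single filter or query facet."""
    facet_type = "filter" if use_filter else "query"
    return {
        "query": {"match_all": {}},
        "facets": {name: {facet_type: clause}},
    }

# The same term clause is used for both facet types, as in the examples above.
term_clause = {"term": {"tags": "personal"}}
filter_body = count_facet("my_filter_facet", term_clause, use_filter=True)
query_body = count_facet("my_query_facet", term_clause, use_filter=False)

print(json.dumps(filter_body))
print(json.dumps(query_body))
```

Defaulting to the filter variant reflects the performance remark made above; the query variant remains available when the clause has to be a query.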
Terms faceting allows specifying a field, and ElasticSearch will return the most frequent terms for that field. For example, if we want to calculate terms faceting for the tags field, we could run the following query:
{ "query" : { "match_all" : {} }, "facets" : { "tags_facet_result" : { "terms" : { "field" : "tags" } } } }
The following faceting response will be returned by ElasticSearch for the preceding query:
{ . . . "facets" : { "tags_facet_result" : { "_type" : "terms", "missing" : 54716, "total" : 151266, "other" : 143140, "terms" : [ { "term" : "test", "count" : 1119 }, { "term" : "personal", "count" : 1063 }, { "term" : "feel", "count" : 982 }, { "term" : "hot", "count" : 923 }, (...) ] } } }
As you can see, our terms faceting results were returned in the tags_facet_result section and we've got the information that was already described.
There are a few additional parameters we can use for terms faceting:
size: This specifies how many most frequent terms should be returned at most. The documents with subsequent terms will be included in the count of the other field in the result.
order: This specifies the faceting ordering. The possible values are:
    count (the default order by frequency, starting from the most frequent)
    term (alphabetical order, ascending)
    reverse_count (order by frequency, starting from the least frequent)
    reverse_term (alphabetical order, descending)
all_terms: This parameter, when set to true, will return all the terms in the result.
exclude: This is an array of terms that should be excluded from facet calculation.
regex: This is a regular expression that will control the terms to be included in the calculation.
script: This specifies the script that will be used to process the term used in facet calculation.
fields: This is an array that allows specifying multiple fields for faceting calculation (which should be used instead of the field property). ElasticSearch will return aggregation across multiple fields.
_script_field: This specifies the script that will provide the actual term for calculation. For example, any term based on the _source field may be used.
Range faceting allows us to get the number of documents for a defined set of ranges, and in addition to that, get data aggregated for the specified field. For example, if we wanted to get the number of documents that have the value in the total field falling into the ranges (lower bound inclusive, upper exclusive) up to 90, 90 to 180, and above 180, we would send the following query:
{ "query" : { "match_all" : {} }, "facets" : { "ranges_facet_result" : { "range" : { "field" : "total", "ranges" : [ { "to" : 90 }, { "from" : 90, "to" : 180 }, { "from" : 180 } ] } } } }
As you can see in the preceding query, we've defined the name of the field by using the field property and the array of ranges using the ranges property. Each range can be defined using the to or from properties or both at the same time.
The response for the preceding query could look like this:
{ . . . "facets" : { "ranges_facet_result" : { "_type" : "range", "ranges" : [ { "to" : 90.0, "count" : 18210, "min" : 0.0, "max" : 89.0, "total_count" : 18210, "total" : 39848.0, "mean" : 2.1882482152663374 }, { "from" : 90.0, "to" : 180.0, "count" : 159, "min" : 90.0, "max" : 178.0, "total_count" : 159, "total" : 19897.0, "mean" : 125.13836477987421 }, { "from" : 180.0, "count" : 274, "min" : 182.0, "max" : 57676.0, "total_count" : 274, "total" : 585961.0, "mean" : 2138.543795620438 } ] } } }
As you can see, because we've defined three ranges in our query for the range faceting, we've got those in response. For each range the following statistics were returned:
from: The left boundary of the range
to: The right boundary of the range
min: The minimum field value of the field used for faceting in the given range
max: The maximum field value of the field used for faceting in the given range
count: The number of documents with a value of the defined field that falls into the specified range
total_count: The total number of values in the defined field that fall into the specified range (should be the same as count for single-valued fields and can be different for fields with multiple values)
total: The sum of all the values in the defined field that fall into the specified range
mean: The mean value calculated for the values in the given field used for range faceting calculation that fall into the specified range
If we want to calculate the aggregated data statistics for a different field than the one we calculate the ranges for, we can use two properties: key_field and value_field (or key_script and value_script, which allow for script usage). The key_field property specifies which field value should be used to check whether the value falls into a given range, and the value_field property specifies which field value should be used for aggregation calculation.
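To make the meaning of the range statistics concrete, here is a small local emulation in Python. It is only a sketch of the semantics described above (lower bound inclusive, upper bound exclusive) for a single-valued field; range_facet and sample_totals are made-up names and data, and nothing here calls ElasticSearch.

```python
def range_facet(values, ranges):
    """Emulate range facet statistics for a single-valued numeric field.

    Each range is a dict with an optional "from" (inclusive) and "to" (exclusive).
    """
    results = []
    for bounds in ranges:
        lower = bounds.get("from", float("-inf"))
        upper = bounds.get("to", float("inf"))
        hits = [v for v in values if lower <= v < upper]
        entry = dict(bounds)
        entry["count"] = len(hits)
        if hits:
            # Statistics are computed only over the values inside the range.
            entry["min"] = min(hits)
            entry["max"] = max(hits)
            entry["total"] = sum(hits)
            entry["mean"] = sum(hits) / len(hits)
        results.append(entry)
    return results

# Hypothetical values of the total field in a handful of documents.
sample_totals = [0, 45, 89, 90, 120, 178, 180, 500]
buckets = range_facet(
    sample_totals,
    [{"to": 90}, {"from": 90, "to": 180}, {"from": 180}],
)
for bucket in buckets:
    print(bucket)
```

Note how the value 90 lands in the second range and 180 in the third, which is the inclusive lower bound and exclusive upper bound behavior mentioned earlier.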
Histogram faceting allows us to build a histogram of values across intervals of the field value (for numerical and date-based fields). For example, if we wanted to see how many documents fall into intervals of 1000 in our total field, we would run the following query:
{ "query" : { "match_all" : {} }, "facets" : { "total_histogram" : { "histogram" : { "field" : "total", "interval" : 1000 } } } }
As you can see, we've used the histogram facet type, and in addition to the field property, we've included the interval property, which defines the interval we want to use.
A sample response for the preceding query could look like this:
{ . . . "facets" : { "total_histogram" : { "_type" : "histogram", "entries" : [ { "key" : 0, "count" : 18565 }, { "key" : 1000, "count" : 33 }, { "key" : 2000, "count" : 14 }, { "key" : 3000, "count" : 5 }, (...) ] } } }
As you can see in these results for the first bracket of 0 to 1000, we have 18,565 documents; for the second bracket of 1000 to 2000 we have 33 documents, and so on.
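The way values are assigned to brackets can be sketched in a few lines of Python. This is an illustrative emulation, assuming non-negative integer values and that each bracket is identified by its lower boundary, as the keys 0, 1000, 2000 in the sample response suggest; histogram_facet is a hypothetical helper, not an ElasticSearch call.

```python
from collections import Counter

def histogram_facet(values, interval):
    """Group values into brackets keyed by the lower boundary of each bracket."""
    counts = Counter((value // interval) * interval for value in values)
    return [{"key": key, "count": counts[key]} for key in sorted(counts)]

# Hypothetical values of the total field.
entries = histogram_facet([0, 15, 999, 1000, 1999, 2500], 1000)
print(entries)
```

With an interval of 1000, the value 999 still belongs to the bracket with key 0, while 1000 opens the next one.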
In addition to the histogram facet type, which can be used on numerical fields, ElasticSearch allows us to use the date_histogram faceting type, which can be used on date-based fields. The date_histogram type allows us to use constants such as year, month, week, day, hour, or minute as the value of the interval property. For example, one could send the following query:
{ "query" : { "match_all" : {} }, "facets" : { "date_histogram_test" : { "date_histogram" : { "field" : "date", "interval" : "day" } } } }
The statistical faceting allows us to compute statistical data for a numeric field type. In return, we get the count, total, sum of squares, average, minimum, maximum, variance, and standard deviation. For example, if we wanted to compute statistics for our total field, we would run the following query:
{
  "query" : { "match_all" : {} },
  "facets" : {
    "statistical_test" : {
      "statistical" : {
        "field" : "total"
      }
    }
  }
}
As a result we would get the following response:
{ . . . "facets" : { "statistical_test" : { "_type" : "statistical", "count" : 18643, "total" : 645706.0, "min" : 0.0, "max" : 57676.0, "mean" : 34.63530547658639, "sum_of_squares" : 1.2490405256E10, "variance" : 668778.6853747752, "std_deviation" : 817.7889002516329 } } }
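The values in the preceding response are related to one another, and checking those relations is a quick way to understand them. The sketch below, which assumes the population formulas (variance as the mean of squares minus the square of the mean), recomputes the mean, variance, and standard deviation from count, total, and sum_of_squares; the stats dictionary is copied from the response above.

```python
import math

# Values copied from the sample statistical facet response.
stats = {
    "count": 18643,
    "total": 645706.0,
    "mean": 34.63530547658639,
    "sum_of_squares": 1.2490405256e10,
    "variance": 668778.6853747752,
    "std_deviation": 817.7889002516329,
}

# mean = total / count
mean = stats["total"] / stats["count"]
# variance = (sum of squares / count) - mean^2  (population variance, an assumption)
variance = stats["sum_of_squares"] / stats["count"] - mean ** 2
# the standard deviation is the square root of the variance
std_deviation = math.sqrt(variance)

print(mean, variance, std_deviation)
```

The recomputed numbers agree with the reported ones up to floating-point rounding, which suggests the facet reports population rather than sample statistics.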
These are the statistics that were returned:
_type: The faceting type
count: The number of documents with the specified value in the defined field
total: The sum of all the values in the defined field
min: The minimum field value
max: The maximum field value
mean: The mean value calculated for the values in the specified field
sum_of_squares: The sum of squares calculated for the values in the specified field
variance: The variance value calculated for the values in the specified field
std_deviation: The standard deviation value calculated for the values in the specified field
The terms_stats faceting combines both the statistical and terms faceting types, as it provides the ability to compute statistics on a field for values obtained from another field. For example, if we wanted the faceting for the total field but divided on the basis of the values in the tags field, we would run the following query:
{ "query" : { "match_all" : {} }, "facets" : { "total_tags_terms_stats" : { "terms_stats" : { "key_field" : "tags", "value_field" : "total" } } } }
We've specified the key_field property, which holds the name of the field that provides the terms, and the value_field property, which holds the name of the field with numerical data values. Here is a portion of the results we got from ElasticSearch:
{ . . . "facets" : { "total_tags_terms_stats" : { "_type" : "terms_stats", "missing" : 54715, "terms" : [ { "term" : "personal", "count" : 1063, "total_count" : 254, "min" : 0.0, "max" : 322.0, "total" : 707.0, "mean" : 2.783464566929134 }, { "term" : "me", "count" : 715, "total_count" : 218, "min" : 0.0, "max" : 138.0, "total" : 710.0, "mean" : 3.256880733944954 } (...) ] } } }
As you can see, the faceting results were divided on a per-term basis. Please note that the same set of statistics was returned for each term as the ones that are returned for the range faceting (please refer to the Range subsection in the Faceting section in this chapter for an explanation of what those values mean). This is because we've used a numerical field (total) to calculate the facet values for each term.
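Per-term statistics are convenient to post-process on the client side. As a rough Python sketch, the entries below are copied from the sample terms_stats response (the elided entries are left out), and the code orders the terms by their mean; the observation that mean equals total divided by total_count is read off these two sample entries only.

```python
import math

# The two entries visible in the sample terms_stats response.
terms = [
    {"term": "personal", "count": 1063, "total_count": 254,
     "min": 0.0, "max": 322.0, "total": 707.0, "mean": 2.783464566929134},
    {"term": "me", "count": 715, "total_count": 218,
     "min": 0.0, "max": 138.0, "total": 710.0, "mean": 3.256880733944954},
]

# Order the terms by the mean of the value field, highest first.
by_mean = sorted(terms, key=lambda entry: entry["mean"], reverse=True)

# In this sample, each mean matches total / total_count.
recomputed = {entry["term"]: entry["total"] / entry["total_count"] for entry in terms}

print([entry["term"] for entry in by_mean])
print(recomputed)
```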
The last faceting calculation type we would like to discuss is the geo_distance type. It allows us to get the information about the number of documents that fall into distance ranges from a given location. For example, let's assume that we have a location field in our documents in the index that stores the geographical point. And now imagine that we would like to get information about how many documents fall into the bracket of 10 kilometers from the 10.0,10.0 spatial point, how many fall into the bracket of 10 to 100 kilometers, and how many into that of more than 100 kilometers. In order to do that we will run the following query:
{ "query" : { "match_all" : {} }, "facets" : { "spatial_test" : { "geo_distance" : { "location" : { "lat" : 10.0, "lon" : 10.0 }, "ranges" : [ { "to" : 10 }, { "from" : 10, "to" : 100 }, { "from" : 100 } ] } } } }
In the preceding query we've defined the latitude (the lat property) and the longitude (the lon property) of the point from which we want to calculate the distance. The object to which we pass the lat and lon properties is named after the field that holds the location (location in our case). The second thing is the ranges array, which specifies the brackets; each range can be defined using the to or from properties or both at the same time.
In addition to the previously mentioned properties, we are also allowed to set the unit property (km, the default, for distance in kilometers, or mi for distance in miles) and the distance_type property (arc, the default, for better precision, or plane for faster execution).
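The geo_distance facet definition can be assembled programmatically, which also gives a place to validate the optional properties. The following Python sketch uses a hypothetical geo_distance_facet helper; placing unit and distance_type next to the ranges array inside the geo_distance object is an assumption that should be checked against the ElasticSearch documentation for your version.

```python
import json

def geo_distance_facet(field, lat, lon, ranges, unit="km", distance_type="arc"):
    """Build a geo_distance facet definition for the given location field."""
    if unit not in ("km", "mi"):
        raise ValueError("unit must be 'km' or 'mi'")
    if distance_type not in ("arc", "plane"):
        raise ValueError("distance_type must be 'arc' or 'plane'")
    return {
        "geo_distance": {
            # The key is the name of the field that stores the geographical point.
            field: {"lat": lat, "lon": lon},
            "ranges": ranges,
            # Assumed placement: siblings of the ranges array.
            "unit": unit,
            "distance_type": distance_type,
        }
    }

body = {
    "query": {"match_all": {}},
    "facets": {
        "spatial_test": geo_distance_facet(
            "location", 10.0, 10.0,
            [{"to": 10}, {"from": 10, "to": 100}, {"from": 100}],
        )
    },
}
print(json.dumps(body))
```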
As we mentioned before, the filters you include in your queries don't narrow the faceting results, so facets are calculated for all the documents matching your query. However, you may include the filters you want in your faceting definition. Basically, any filter we've discussed in Chapter 2, Searching Your Data, can be used with faceting. All you need to do is include an additional section under the facet name.
For example, if we want our query to match all documents but have facets calculated for the multivalued tags field only for those documents that have the term fashion in the tags field, we could run the following query:
{ "query" : { "match_all" : {} }, "facets" : { "tags" : { "terms" : { "field" : "tags" }, "facet_filter" : { "term" : { "tags" : "fashion" } } } } }
As you can see, there is an additional facet_filter section on the same level as the type of facet calculation (which is terms in the preceding query). You just need to remember that the facet_filter section is constructed with the same logic as any filter described in Chapter 2.
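Because facet_filter sits beside the facet type rather than inside it, it can be attached to any existing facet definition. A minimal Python sketch, with add_facet_filter as a hypothetical helper that rebuilds the request from the preceding example:

```python
import json

def add_facet_filter(facet, facet_filter):
    """Return a copy of a facet definition with facet_filter next to the facet type."""
    restricted = dict(facet)
    restricted["facet_filter"] = facet_filter
    return restricted

tags_facet = {"terms": {"field": "tags"}}
body = {
    "query": {"match_all": {}},
    "facets": {
        "tags": add_facet_filter(tags_facet, {"term": {"tags": "fashion"}}),
    },
}
print(json.dumps(body))
```

Copying the definition keeps the original facet reusable, for example to send the filtered and unfiltered variants in the same request under different names.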
Imagine a situation where you would like to calculate facets, not on the information from the parent documents but instead using the information present in nested documents. In order to do that, ElasticSearch provides the scope and nested properties, allowing us to define what documents are seen by our facets.
In order to illustrate how to use those properties, let's recall the clothing store mappings that we used in Chapter 5, Combining Indexing, Analysis, and Search, when we talked about nested objects (in the Using nested objects section):
{ "cloth" : { "properties" : { "name" : {"type" : "string", "store" : "yes", "index" : "analyzed"}, "variation" : { "type" : "nested", "properties" : { "size" : {"type" : "string", "store" : "yes", "index" : "not_analyzed"}, "color" : {"type" : "string", "store" : "yes", "index" : "not_analyzed"} } } } } }
The simplest way to calculate faceting on all the nested documents matching the parent documents that were returned by the query is to define the nested property and set its value to the name of the nested documents we are interested in. For example, let's look at the following query:
{ "query": { "match_all": {} }, "facets": { "size": { "terms" : { "field" : "size" }, "nested": "variation" } } }
As you can see, we want to calculate the terms facets on the size field. However, because this is a nested object, we've specified the nested property and have set its value to the name of the nested document we are interested in, which in our case was variation. The shortened response to the previous query (which shows the facets calculation) is as follows:
{ . . . "facets" : { "size" : { "_type" : "terms", "missing" : 0, "total" : 2, "other" : 0, "terms" : [ { "term" : "XXL", "count" : 1 }, { "term" : "XL", "count" : 1 } ] } } }
The method we've just used is good when we are interested in calculating facets for all the nested documents of a certain parent type that matched the query. However, sometimes we may be only interested in documents that match a certain part of the query. This is where the scope property comes into play. Let's look at the following query:
{ "query" : { "nested" : { "path" : "variation", "query" : { "term" : { "variation.size" : "XL" } } } } }
And now, let's introduce the scope. There are two places where we need to add it: first, in the query itself (where we need to introduce the _scope property with the name of our choice) and then in the facet calculation part (where we need to specify the scope we are interested in with the use of the scope property). This also means that we can have multiple _scope properties in different parts of the query and calculate different facets for different scopes. But let's get back to our modified query, which would look something like this:
{ "query" : { "nested" : { "_scope": "es_book_scope", "path" : "variation", "query" : { "term" : { "variation.size" : "XL" } } } }, "facets": { "size": { "terms" : { "field" : "size" }, "scope": "es_book_scope" } } }
The facets results for the preceding query should only be calculated on the nested documents that match the XL value in the size field; in fact, ElasticSearch returned what we expected:
{ . . . "facets" : { "size" : { "_type" : "terms", "missing" : 0, "total" : 1, "other" : 0, "terms" : [ { "term" : "XL", "count" : 1 } ] } } }
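The difference between the nested and scope properties can be illustrated with a small local emulation. This Python sketch does not talk to ElasticSearch; it assumes a single hypothetical parent document with two variations (the name and color values are made up) and counts sizes either across all nested documents or only across those matching the nested query.

```python
from collections import Counter

# One hypothetical parent document with two nested variations.
clothes = [
    {
        "name": "Test shirt",
        "variation": [
            {"size": "XXL", "color": "red"},
            {"size": "XL", "color": "black"},
        ],
    }
]

def nested_size_counts(parents, size=None):
    """Count sizes in nested variations.

    With size=None every nested document of each parent is counted, which
    mimics the nested property. With a size given, only the nested documents
    matching it are counted, which mimics a facet bound to a query scope.
    """
    counts = Counter()
    for parent in parents:
        for variation in parent["variation"]:
            if size is None or variation["size"] == size:
                counts[variation["size"]] += 1
    return dict(counts)

all_nested = nested_size_counts(clothes)
scoped = nested_size_counts(clothes, size="XL")
print(all_nested)
print(scoped)
```

The first result corresponds to the response with both XXL and XL, and the second to the scoped response that contains XL only.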
Faceting can be memory-intensive, especially with large amounts of data in the indices and many distinct values. This is because ElasticSearch needs to load the data into the so-called field data cache in order to calculate faceting values. In the case of large amounts of data, you may be forced to change your index structure, for example, by lowering the cardinality of your fields: using less precise dates, not analyzed string fields, or types such as short, integer, or float instead of long and double when possible. If that doesn't help, you may need to give ElasticSearch more heap memory or even add more servers and divide your index into more shards.