Chapter 6. Beyond Searching

In the previous chapter we looked at how to index tree-like structures, how to use nested objects, and how to use the parent-child relationship. We also discussed fetching data from external systems using the river plugins, batch processing to speed up indexing, and the index update API. In this chapter, we will learn how to use faceting, which will allow us to get aggregated data about our search results. We will also see how to get similar documents with the "more like this" functionality and how to use prospective search to store queries, not documents. By the end of this chapter you will have learned:

  • How to use faceting
  • How to use the "more like this" REST endpoint
  • What a percolator is and how to use it

Faceting

ElasticSearch is a full text search engine that aims to provide search results on the basis of our queries. However, sometimes we would like to get more. For example, we would like to get aggregated data that is calculated on the result set we get, such as the number of documents priced between 100 and 200 dollars or the most common tags in the result documents. In order to do that, ElasticSearch provides a faceting module that is responsible for providing such data. In this section we will discuss the different faceting methods provided by ElasticSearch.

Document structure

For the purpose of discussing faceting, we'll use a very simple index structure for our documents. It will contain the identifier of the document, document date, a multivalued field that can hold words describing our document (the tags field), and a field holding numeric information (the total field). Our mappings could look like this:

{
 "mappings" : {
  "doc" : {
   "properties" : {                
    "id" : { "type" : "long", "store" : "yes" },
    "date" : { "type" : "date", "store" : "no" },
    "tags" : { "type" : "string", "store" : "no", "index" : "not_analyzed" },
    "total" : { "type" : "long", "store" : "no" }
   }
  }
 }
}

Note

Keep in mind that when dealing with string fields you should avoid doing faceting on analyzed fields. Such results may not be human readable, especially when using stemming or any other heavy processing analyzers or filters.

Returned results

Before we get into how to run faceting, let's take a look at what to expect from ElasticSearch as a result of faceting requests. In most cases you'll only be interested in the data specific to the faceting type. However, for most faceting types you'll get the following as well:

  • _type: This specifies the faceting type used and will be provided for each faceting type
  • missing: This specifies the number of documents that didn't have enough data (for example, a missing field) to calculate faceting
  • total: This specifies the number of tokens in the facet calculation
  • other: This specifies the number of facet values (for example, terms in terms faceting) not included in the returned counts

In addition to these properties, you'll get an array of calculated facets such as counts for your terms, queries, or spatial distances. For example, this is what the usual faceting results look like:

{
  .
  .
  .
  "facets" : {
    "tags" : {
      "_type" : "terms",
      "missing" : 54715,
      "total" : 151266,
      "other" : 143140,
      "terms" : [ {
        "term" : "test",
        "count" : 1119
      }, {
        "term" : "personal",
        "count" : 1063
      },
      (...) 
      ]
    }
  }
}

As you can see in the results, our faceting was run against the tags field. We've got a total number of 151,266 tokens processed by the faceting module, and 143,140 that were not included in the results. We also have 54,715 documents that didn't have the value in the tags field. The term "test" appeared in 1,119 documents and the term "personal" appeared in 1,063 documents. This is what you can expect from a faceting response.
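
To make the structure of such a response more tangible, here is a minimal Python sketch that reads it on the client side. The summarize_terms_facet function and the response_text variable are names invented for this illustration; the JSON is a trimmed copy of the sample above with the elided entries left out.

```python
import json

# A trimmed copy of the sample response shown above; the elided
# entries are simply left out.
response_text = """
{
  "facets" : {
    "tags" : {
      "_type" : "terms",
      "missing" : 54715,
      "total" : 151266,
      "other" : 143140,
      "terms" : [
        { "term" : "test", "count" : 1119 },
        { "term" : "personal", "count" : 1063 }
      ]
    }
  }
}
"""


def summarize_terms_facet(response, facet_name):
    """Return the (term, count) pairs and the bookkeeping counters of a terms facet."""
    facet = response["facets"][facet_name]
    pairs = [(entry["term"], entry["count"]) for entry in facet["terms"]]
    counters = {key: facet[key] for key in ("missing", "total", "other")}
    return pairs, counters


pairs, counters = summarize_terms_facet(json.loads(response_text), "tags")
for term, count in pairs:
    print(f"{term}: {count}")
print(counters)
```

The facet name used as the lookup key (tags here) is whatever name we gave the facet in the request, not necessarily the name of the field.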

Query

Query is one of the simplest faceting types. It returns the number of documents that match a given query. The query itself can be expressed using the ElasticSearch query language, which we discussed in Chapter 2, Searching Your Data. For example, faceting that would return the number of documents for a simple term query could look like this:

{
 "query" : { "match_all" : {} },
 "facets" : {
  "my_query_facet" : {
   "query" : {
    "term" : { "tags" : "personal" }
   }
  }
 }
}

As you can see, we've included the query type faceting with a simple term query.

A sample response for the preceding query could look like this:

{
  .
  .
  .
  "facets" : {
    "my_query_facet" : {
      "_type" : "query",
      "count" : 1081
    }
  }
}

As you can see, in the response we've got the faceting type and the count of the documents that matched both the facet query and the main query.

Filter

A filter is a simple faceting type that allows us to get the number of documents that match the filter. The filter itself can be expressed using the ElasticSearch query language. For example, faceting that would return the number of documents for a simple term filter could look like this:

{
 "query" : { "match_all" : {} },
 "facets" : {
  "my_filter_facet" : {
   "filter" : {
    "term" : { "tags" : "personal" }
   }
  }
 }
}

As you can see, we've included the filter type faceting with a simple term filter. When talking about performance, filter facets are faster than query facets or filter facets that wrap queries.

An example response for the preceding query could look like this:

{
  .
  .
  .
  "facets" : {
    "my_filter_facet" : {
      "_type" : "filter",
      "count" : 1081
    }
  }
}

As you can see in the response, we've got the faceting type and the count of the documents that matched both the facet filter and the main query.

Terms

Terms faceting allows specifying a field, and ElasticSearch will return the most frequent terms for that field. For example, if we want to calculate terms faceting for the tags field, we could run the following query:

{
 "query" : { "match_all" : {} },
 "facets" : {
  "tags_facet_result" : {
   "terms" : {
    "field" : "tags"
   }
  }
 }
}

The following faceting response will be returned by ElasticSearch for the preceding query:

{
  .
  .
  .
  "facets" : {
    "tags_facet_result" : {
      "_type" : "terms",
      "missing" : 54716,
      "total" : 151266,
      "other" : 143140,
      "terms" : [ {
        "term" : "test",
        "count" : 1119
      }, {
        "term" : "personal",
        "count" : 1063
      }, {
        "term" : "feel",
        "count" : 982
      }, {
        "term" : "hot",
        "count" : 923
      },
      (...)
      ]
    }
  }
}

As you can see, our terms faceting results were returned in the tags_facet_result section and we've got the information that was already described.

There are a few additional parameters we can use for terms faceting:

  • size: This specifies the maximum number of most frequent terms to return. Terms beyond this limit are counted in the other field of the result.
  • order: This specifies the faceting ordering. The possible values are:
    • count (the default order by frequency, starting from the most frequent)
    • term (alphabetical order, ascending)
    • reverse_count (order by frequency, starting from the least frequent)
    • reverse_term (alphabetical order, descending)
  • all_terms: This parameter, when set to true, will return all the terms in the result.
  • exclude: This is an array of terms that should be excluded from facet calculation.
  • regex: This is a regular expression that will control the terms to be included in the calculation.
  • script: This specifies the script that will be used to process the term used in facet calculation.
  • fields: This is an array that allows specifying multiple fields for faceting calculation (which should be used instead of the field property). ElasticSearch will return aggregation across multiple fields.
  • script_field: This specifies the script that will provide the actual term for calculation. For example, a term taken from the _source field may be used.
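
As an illustration of how a few of these parameters fit into a request, here is a small Python sketch that builds the request body. The terms_facet_request helper is a name made up for this example, and only the size, order, and exclude parameters from the list above are used.

```python
import json

# Ordering values listed above for the terms facet.
VALID_ORDERS = {"count", "term", "reverse_count", "reverse_term"}


def terms_facet_request(facet_name, field, size=None, order=None, exclude=None):
    """Build a match_all request body with a single terms facet."""
    if order is not None and order not in VALID_ORDERS:
        raise ValueError(f"unsupported order: {order!r}")
    terms = {"field": field}
    # Optional parameters are only added when the caller provides them.
    if size is not None:
        terms["size"] = size
    if order is not None:
        terms["order"] = order
    if exclude:
        terms["exclude"] = list(exclude)
    return {
        "query": {"match_all": {}},
        "facets": {facet_name: {"terms": terms}},
    }


body = terms_facet_request(
    "tags_facet_result", "tags", size=5, order="term", exclude=["test"]
)
print(json.dumps(body, indent=1))
```

The printed JSON can be sent as the body of a search request, just like the handwritten queries in this chapter.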

Range

Range faceting allows us to get the number of documents for a defined set of ranges, and in addition to that, get data aggregated for the specified field. For example, if we wanted to get the number of documents that have the value in the total field falling into the ranges (lower bound inclusive, upper exclusive) up to 90, 90 to 180, and above 180, we would send the following query:

{
 "query" : { "match_all" : {} },
 "facets" : {
  "ranges_facet_result" : {
   "range" : {
    "field" : "total",
    "ranges" : [
     { "to" : 90 },
     { "from" : 90, "to" : 180 },
     { "from" : 180 }
    ]
   }
  }
 }
}

As you can see in the preceding query, we've defined the name of the field by using the field property and the array of ranges using the ranges property. Each range can be defined using the to or from properties or both at the same time.

The response for the preceding query could look like this:

{
  .
  .
  .
  "facets" : {
    "ranges_facet_result" : {
      "_type" : "range",
      "ranges" : [ {
        "to" : 90.0,
        "count" : 18210,
        "min" : 0.0,
        "max" : 89.0,
        "total_count" : 18210,
        "total" : 39848.0,
        "mean" : 2.1882482152663374
      }, {
        "from" : 90.0,
        "to" : 180.0,
        "count" : 159,
        "min" : 90.0,
        "max" : 178.0,
        "total_count" : 159,
        "total" : 19897.0,
        "mean" : 125.13836477987421
      }, {
        "from" : 180.0,
        "count" : 274,
        "min" : 182.0,
        "max" : 57676.0,
        "total_count" : 274,
        "total" : 585961.0,
        "mean" : 2138.543795620438
      } ]
    }
  }
}

As you can see, because we've defined three ranges in our query for the range faceting, we've got those in response. For each range the following statistics were returned:

  • from: The left boundary of the range
  • to: The right boundary of the range
  • min: The minimum field value of the field used for faceting in the given range
  • max: The maximum field value of the field used for faceting in the given range
  • count: The number of documents with a value of the defined field that falls into the specified range
  • total_count: The total number of values in the defined field that fall into the specified range (should be the same as count for single-valued fields and can be different for fields with multiple values)
  • total: The sum of all the values in the defined field that fall into the specified range
  • mean: The mean value calculated for the values in the given field used for range faceting calculation that falls into the specified range
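
To see how these statistics relate to each other, the following plain-Python sketch recomputes them for a handful of made-up values. It is only an illustration of the semantics described above for a single-valued field, not a description of how ElasticSearch calculates facets internally; the function names and sample values are assumptions of this example.

```python
def in_range(value, bounds):
    """Lower bound inclusive, upper bound exclusive, as described above."""
    if "from" in bounds and value < bounds["from"]:
        return False
    if "to" in bounds and value >= bounds["to"]:
        return False
    return True


def range_facet(values, ranges):
    """Compute per-range statistics for a single-valued numeric field."""
    entries = []
    for bounds in ranges:
        matched = [v for v in values if in_range(v, bounds)]
        entry = dict(bounds)
        entry["count"] = len(matched)
        # With one value per document, count and total_count coincide.
        entry["total_count"] = len(matched)
        entry["total"] = float(sum(matched))
        if matched:
            entry["min"] = float(min(matched))
            entry["max"] = float(max(matched))
            entry["mean"] = entry["total"] / len(matched)
        entries.append(entry)
    return entries


# Made-up values of the total field, one per document.
sample_totals = [10, 50, 89, 90, 120, 200]
sample_ranges = [{"to": 90}, {"from": 90, "to": 180}, {"from": 180}]
range_entries = range_facet(sample_totals, sample_ranges)
for entry in range_entries:
    print(entry)
```

Note how the value 90 lands in the second range and not in the first one, because the upper bound of each range is exclusive.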

Choosing different fields for aggregated data calculation

If we want to calculate the aggregated data statistics for a different field than the one we calculate the ranges for, we can use two properties: key_field and value_field (or key_script and value_script, which allow for script usage). The key_field property specifies which field value should be used to check whether the value falls into a given range, and the value_field property specifies which field value should be used for the aggregation calculation.
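
A request using these two properties could be put together as in the following Python sketch. The helper name is made up, and shipping_cost is a hypothetical numeric field that is not part of the mappings shown earlier; it is used here only to have a second field to aggregate.

```python
import json


def range_facet_with_value_field(facet_name, key_field, value_field, ranges):
    """Build a range facet that buckets on one field and aggregates another."""
    return {
        "query": {"match_all": {}},
        "facets": {
            facet_name: {
                "range": {
                    "key_field": key_field,      # decides which range a document falls into
                    "value_field": value_field,  # provides the values that are aggregated
                    "ranges": ranges,
                }
            }
        },
    }


# shipping_cost is a hypothetical field, not present in our sample mappings.
key_value_body = range_facet_with_value_field(
    "ranges_facet_result",
    "total",
    "shipping_cost",
    [{"to": 90}, {"from": 90, "to": 180}, {"from": 180}],
)
print(json.dumps(key_value_body, indent=1))
```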

Numerical and date histogram

Histogram faceting allows us to build a histogram of values across intervals of the field value (for numerical and date-based fields). For example, if we wanted to see how many documents fall into intervals of 1000 in our total field, we would run the following query:

{
 "query" : { "match_all" : {} },
 "facets" : {
  "total_histogram" : {
   "histogram" : {
    "field" : "total",
    "interval" : 1000
   }
  }
 }
}

As you can see, we've used the histogram facet type, and in addition to the field property, we've included the interval property, which defines the interval we want to use.

A sample response for the preceding query could look like this:

{
  .
  .
  .
  "facets" : {
    "total_histogram" : {
      "_type" : "histogram",
      "entries" : [ {
        "key" : 0,
        "count" : 18565
      }, {
        "key" : 1000,
        "count" : 33
      }, {
        "key" : 2000,
        "count" : 14
      }, {
        "key" : 3000,
        "count" : 5
      }, 
      (...)
      ]
    }
  }
}

As you can see in these results, we have 18,565 documents in the first bracket of 0 to 1000, 33 documents in the second bracket of 1000 to 2000, and so on.
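
The bracket keys in such a response are the lower bounds of the intervals. The following plain-Python sketch shows one way to reproduce that grouping for non-negative integer values. It is an illustration with made-up data and a made-up function name, not ElasticSearch's own implementation.

```python
from collections import Counter


def histogram_facet(values, interval):
    """Count values per bucket; each bucket is keyed by its lower bound."""
    counts = Counter((value // interval) * interval for value in values)
    return [{"key": key, "count": counts[key]} for key in sorted(counts)]


# Made-up, non-negative values of the total field.
histogram_values = [0, 999, 1000, 1500, 2000, 3500]
histogram_entries = histogram_facet(histogram_values, 1000)
for entry in histogram_entries:
    print(entry)
```

With an interval of 1000, the values 0 and 999 share the bucket with the key 0, while 1000 already belongs to the next bucket.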

Date histogram

In addition to the histogram facets type, which can be used on numerical fields, ElasticSearch allows us to use the date_histogram faceting type, which can be used on date-based fields. The date_histogram type allows us to use constants such as year, month, week, day, hour, or minute as the value of the interval property. For example, one could send the following query:

{
 "query" : { "match_all" : {} },
 "facets" : {
  "date_histogram_test" : {
   "date_histogram" : {
    "field" : "date",
    "interval" : "day"
   }
  }
 }
}

Note

In both numerical and date histogram faceting, we can use the key_field, value_field, key_script, and value_script properties, which we discussed when talking about range faceting earlier in this chapter.

Statistical

The statistical faceting allows us to compute statistical data for a numeric field type. In return, we get the count, total, sum of squares, average, minimum, maximum, variance, and standard deviation. For example, if we wanted to compute statistics for our total field, we would run the following query:

{
 "query" : { "match_all" : {} },
 "facets" : {
  "statistical_test" : {
   "statistical" : {
    "field" : "total"
   }
  }
 }
}

As a result we would get the following response:

{
  .
  .
  .
  "facets" : {
    "statistical_test" : {
      "_type" : "statistical",
      "count" : 18643,
      "total" : 645706.0,
      "min" : 0.0,
      "max" : 57676.0,
      "mean" : 34.63530547658639,
      "sum_of_squares" : 1.2490405256E10,
      "variance" : 668778.6853747752,
      "std_deviation" : 817.7889002516329
    }
  }
}

These are the statistics that were returned:

  • _type: The faceting type
  • count: The number of documents with a value in the defined field
  • total: The sum of all the values in the defined field
  • min: The minimum field value
  • max: The maximum field value
  • mean: The mean value calculated for the values in the specified field
  • sum_of_squares: The sum of squares calculated for the values in the specified field
  • variance: The variance value calculated for the values in the specified field
  • std_deviation: The standard deviation value calculated for the values in the specified field
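
The values in the sample response are related to each other: the mean, variance, and standard deviation can all be derived from count, total, and sum_of_squares. The following Python sketch checks this against the numbers shown above, assuming the population (not sample) form of the variance, which is what those numbers are consistent with. The function name is made up for this illustration.

```python
import math


def derived_statistics(count, total, sum_of_squares):
    """Derive mean, variance, and standard deviation from the basic sums."""
    mean = total / count
    # Population variance: the mean of the squares minus the square of the mean.
    variance = sum_of_squares / count - mean * mean
    return {
        "mean": mean,
        "variance": variance,
        "std_deviation": math.sqrt(variance),
    }


# count, total, and sum_of_squares taken from the sample response above.
stats = derived_statistics(18643, 645706.0, 1.2490405256e10)
for name, value in stats.items():
    print(name, value)
```

The printed values agree with the mean, variance, and std_deviation entries of the response up to floating-point rounding.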

Note

Please note that we are also allowed to use the script and fields properties in statistical faceting just like in terms faceting.

Terms statistics

The terms_stats faceting combines the statistical and terms faceting types: it computes statistics on one field for each value obtained from another field. For example, if we wanted statistics for the total field, divided on the basis of the values in the tags field, we would run the following query:

{
 "query" : { "match_all" : {} },
 "facets" : {
  "total_tags_terms_stats" : {
   "terms_stats" : {
    "key_field" : "tags",
    "value_field" : "total"
   }
  }
 }
}

We've specified the key_field property, which holds the name of the field that provides the terms, and the value_field property, which holds the name of the field with numerical data values. Here is a portion of the results we got from ElasticSearch:

{
  .
  .
  .
  "facets" : {
    "total_tags_terms_stats" : {
      "_type" : "terms_stats",
      "missing" : 54715,
      "terms" : [ {
        "term" : "personal",
        "count" : 1063,
        "total_count" : 254,
        "min" : 0.0,
        "max" : 322.0,
        "total" : 707.0,
        "mean" : 2.783464566929134
      }, {
        "term" : "me",
        "count" : 715,
        "total_count" : 218,
        "min" : 0.0,
        "max" : 138.0,
        "total" : 710.0,
        "mean" : 3.256880733944954
      }
      (...)
      ]
    }
  }
}

As you can see, the faceting results were divided on a per-term basis. Please note that the same set of statistics was returned for each term as the ones that are returned for the ranges faceting (please refer to the Range subsection in the Faceting section in this chapter for an explanation of what those values mean). This is because we've used a numerical field (total) to calculate the facet values for each term.

Spatial

The last faceting calculation type we would like to discuss is the geo_distance type. It allows us to get the information about the number of documents that fall into distance ranges from a given location. For example, let's assume that we have a location field in our documents in the index that stores the geographical point. And now imagine that we would like to get information about how many documents fall into the bracket of 10 kilometers from the 10.0,10.0 spatial point, how many fall into the bracket of 10 to 100 kilometers, and how many into that of more than 100 kilometers. In order to do that we will run the following query:

{
 "query" : { "match_all" : {} },
 "facets" : {
  "spatial_test" : {
   "geo_distance" : {
    "location" : {
     "lat" : 10.0,
     "lon" : 10.0
    },
    "ranges" : [
     { "to" : 10 },
     { "from" : 10, "to" : 100 },
     { "from" : 100 }
    ]
   }
  }
 }
}

In the preceding query we've defined the latitude (the lat property) and the longitude (the lon property) of the point from which we want to calculate the distance. The name of the object that holds the lat and lon properties (location in our case) is the name of the field that stores the geographical point. The second thing is the ranges array, which specifies the brackets; each range can be defined using the to or from properties or both at the same time.

In addition to the previously mentioned properties, we are also allowed to set the unit property (km, the default, for distance in kilometers, or mi for distance in miles) and the distance_type property (arc, the default, for better precision, or plane for faster execution).
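
To get a feeling for what the distance brackets mean, here is a plain-Python sketch that assigns a few made-up points to the ranges from the preceding query. It uses the haversine great-circle formula with a mean Earth radius of 6371 km and treats the lower bound of each range as inclusive and the upper bound as exclusive; all of these are assumptions of this illustration and not a description of ElasticSearch's arc or plane calculations.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def geo_distance_facet(origin, points, ranges):
    """Count the points falling into each distance range (in kilometers)."""
    counts = [0] * len(ranges)
    for lat, lon in points:
        distance = haversine_km(origin[0], origin[1], lat, lon)
        for i, bounds in enumerate(ranges):
            lower = bounds.get("from", 0.0)
            upper = bounds.get("to", math.inf)
            if lower <= distance < upper:
                counts[i] += 1
    return counts


# Made-up (lat, lon) points: the origin itself, one roughly 55 km away, one far away.
geo_points = [(10.0, 10.0), (10.5, 10.0), (20.0, 20.0)]
geo_ranges = [{"to": 10}, {"from": 10, "to": 100}, {"from": 100}]
geo_counts = geo_distance_facet((10.0, 10.0), geo_points, geo_ranges)
print(geo_counts)
```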

Filtering faceting results

As we mentioned before, the filters you include in your queries don't narrow the faceting results; facets are calculated on all the documents matching your query. However, you can include the filters you want in the faceting definition itself. Basically, any filter we discussed in Chapter 2, Searching Your Data, can be used with faceting. All you need to do is include an additional section under the facet name.

For example, if we want our query to match all documents but have facets calculated for the multivalued tags field—only for those documents that have the term fashion in the tags field—we could run the following query:

{
 "query" : { "match_all" : {} },
 "facets" : {
  "tags" : {
   "terms" : { "field" : "tags" },
   "facet_filter" : {
    "term" : { "tags" : "fashion" }
   }
  }
 }
}

As you can see, there is an additional facet_filter section on the same level as the type of facet calculation (which is terms in the preceding query). You just need to remember that the facet_filter section is constructed with the same logic as any filter described in Chapter 2.
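
The effect of such a facet filter can be illustrated with a few in-memory documents. The following plain-Python sketch counts the tags twice, once for all documents and once only for the documents that contain the fashion tag. The documents, the terms_facet function, and the has_fashion predicate are all made up for this example.

```python
from collections import Counter


def terms_facet(documents, field, facet_filter=None):
    """Count terms of a multivalued field, optionally restricted by a filter."""
    counts = Counter()
    for doc in documents:
        # Documents rejected by the facet filter do not contribute to the facet.
        if facet_filter is not None and not facet_filter(doc):
            continue
        counts.update(doc.get(field, []))
    return counts.most_common()


def has_fashion(doc):
    """Stand-in for the term filter on the tags field."""
    return "fashion" in doc.get("tags", [])


sample_documents = [
    {"id": 1, "tags": ["fashion", "personal"]},
    {"id": 2, "tags": ["fashion"]},
    {"id": 3, "tags": ["test"]},
]

unfiltered_counts = terms_facet(sample_documents, "tags")
filtered_counts = terms_facet(sample_documents, "tags", facet_filter=has_fashion)
print(unfiltered_counts)
print(filtered_counts)
```

Because tags is a multivalued field, the filtered facet still contains terms other than fashion: every tag of a document that passes the filter is counted.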

Scope of your faceting calculation

Imagine a situation where you would like to calculate facets, not on the information from the parent documents but instead using the information present in nested documents. In order to do that, ElasticSearch provides the scope and nested properties, allowing us to define what documents are seen by our facets.

In order to illustrate how to use those properties, let's recall our clothing store mappings, which we used in Chapter 5, Combining Indexing, Analysis, and Search, when we talked about nested objects (in the Using nested objects section). So let's recall it:

{
 "cloth" : {
  "properties" : {
   "name" : {"type" : "string", "store" : "yes", "index" : "analyzed"},
   "variation" : {
    "type" : "nested",
    "properties" : {
     "size" : {"type" : "string", "store" : "yes", "index" : "not_analyzed"},
     "color" : {"type" : "string", "store" : "yes", "index" : "not_analyzed"}
    }
   }
  }
 }
}

Facet calculation on all nested documents

The simplest way to calculate faceting on all the nested documents matching the parent documents that were returned by the query is to define the nested property and set its value to the name of the nested documents we are interested in. For example, let's look at the following query:

{
 "query": { "match_all": {} },
 "facets": {
  "size": {
   "terms" : { "field" : "size" },
   "nested": "variation"
  }
 }
}

As you can see, we want to calculate the terms facets on the size field. However, because this is a nested object, we've specified the nested property and have set its value to the name of the nested document we are interested in, which in our case was variation. The shortened response to the previous query (which shows the facets calculation) is as follows:

{
  .
  .
  .
  "facets" : {
    "size" : {
      "_type" : "terms",
      "missing" : 0,
      "total" : 2,
      "other" : 0,
      "terms" : [ {
        "term" : "XXL",
        "count" : 1
      }, {
        "term" : "XL",
        "count" : 1
      } ]
    }
  }
}

Facet calculation on nested documents that match a query

The method we've just used is good when we are interested in calculating facets for all the nested documents of a certain parent type that matched the query. However, sometimes we may be only interested in documents that match a certain part of the query. This is where the scope property comes into play. Let's look at the following query:

{
 "query" : {
  "nested" : {
   "path" : "variation",
   "query" : {
    "term" : { "variation.size" : "XL" }
   }
  }
 }
}

And now, let's introduce the scope. There are two places where we need to add it: first, in the query itself (where we need to introduce the _scope property with the name of our choice) and then in the facet calculation part (where we need to specify the scope we are interested in with the use of the scope property). This also means that we can have multiple _scope properties in different parts of the query and calculate different facets for different scopes. But let's get back to our modified query, which would look something like this:

{
 "query" : {
  "nested" : {
   "_scope": "es_book_scope",
   "path" : "variation",
   "query" : { "term" : { "variation.size" : "XL" } }
  }
 }, 
 "facets": {
  "size": {
   "terms" : {
    "field" : "size"
   },
   "scope": "es_book_scope"
  }
 }
}

The facets results for the preceding query should only be calculated on the nested documents that match the XL value in the size field; in fact, ElasticSearch returned what we expected:

{
  .
  .
  .  
  "facets" : {
    "size" : {
      "_type" : "terms",
      "missing" : 0,
      "total" : 1,
      "other" : 0,
      "terms" : [ {
        "term" : "XL",
        "count" : 1
      } ]
    }
  }
}

Faceting memory considerations

Faceting can be memory-intensive, especially with large amounts of data in the indices and many distinct values. This is because ElasticSearch needs to load the data into the so-called field data cache in order to calculate faceting values. In the case of large amounts of data, you may be forced to change your index structure, for example, by lowering the cardinality of your fields: using less precise dates, not analyzed string fields, or types such as short, integer, or float instead of long and double when possible. If that doesn't help, you may need to give ElasticSearch more heap memory or even add more servers and divide your index into more shards.
