Handling filters and why it matters

Let's have a look at the filtering functionality provided by Elasticsearch. At first it may seem like a redundant functionality because almost all the filters have their query counterpart present in Elasticsearch Query DSL. But there must be something special about those filters because they are commonly used and they are advised when it comes to query performance. This section will discuss why filtering is important, how filters work, and what type of filtering is exposed by Elasticsearch.

Filters and query relevance

The first difference when comparing queries to filters is the influence on the document score. Let's compare queries and filters to see what to expect. We will start with the following query:

curl -XGET "http://127.0.0.1:9200/library/_search?pretty" -d'
{
    "query": {
        "term": {
           "title": {
              "value": "front"
           }
        }
    }
}'

The results for that query are as follows:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.11506981,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "1",
      "_score" : 0.11506981,
      "_source":{ "title": "All Quiet on the Western  Front","otitle": "Im Westen nichts Neues","author": "Erich  Maria Remarque","year": 1929,"characters": ["Paul Bäumer",  "Albert Kropp", "Haie Westhus", "Fredrich Müller",  "Stanislaus Katczinsky", "Tjaden"],"tags":  ["novel"],"copies": 1,
      "available": true, "section" : 3}
    } ]
  }
}

There is nothing special about the preceding query. Elasticsearch will return all the documents having the front value in the title field. What's more, each document matching the query will have its score calculated and the top scoring documents will be returned as the search results. In our case, the query returned one document with the score equal to 0.11506981. This is normal behavior when it comes to querying.

Now let's compare a query and a filter. In case of both query and filter cases, we will add a fragment narrowing the documents to the ones having a single copy (the copies field equal to 1). The query that doesn't use filtering looks as follows:

curl -XGET "http://127.0.0.1:9200/library/_search?pretty" -d'
{
   "query": {
      "bool": {
         "must": [
            {
               "term": {
                  "title": {
                     "value": "front"
                  }
               }
            },
            {
               "term": {
                  "copies": {
                     "value": "1"
                  }
               }
            }
         ]
      }
   }
}'

The results returned by Elasticsearch are very similar and look as follows:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.98976034,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "1",
      "_score" : 0.98976034,
      "_source":{ "title": "All Quiet on the Western  Front","otitle": "Im Westen nichts Neues","author": "Erich  Maria Remarque","year": 1929,"characters": ["Paul Bäumer",  "Albert Kropp", "Haie Westhus", "Fredrich Müller",  "Stanislaus Katczinsky", "Tjaden"],"tags":  ["novel"],"copies": 1,
      "available": true, "section" : 3}
    } ]
  }
}

The bool query in the preceding code is built of two term queries, which have to be matched in the document for it to be a match. In the response we again have the same document returned, but the score of the document is 0.98976034 now. This is exactly what we suspected after reading the Default Apache Lucene scoring explained section of this chapter—both terms influenced the score calculation.

Now let's look at the second case—the query for the value front in the title field and a filter for the copies field:

curl -XGET "http://127.0.0.1:9200/library/_search?pretty" -d'
{
    "query": {
        "term": {
           "title": {
              "value": "front"
           }
        }
    },
    "post_filter": {
        "term": {
           "copies": {
              "value": "1"
           }
        }
    }
}'

Now we have the simple term query, but in addition we are using the term filter. The results are the same when it comes to the documents returned, but the score is different now, as we can look in the following code:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.11506981,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "1",
      "_score" : 0.11506981,
      "_source":{ "title": "All Quiet on the Western  Front","otitle": "Im Westen nichts Neues","author": "Erich  Maria Remarque","year": 1929,"characters": ["Paul Bäumer",  "Albert Kropp", "Haie Westhus", "Fredrich Müller",  "Stanislaus Katczinsky", "Tjaden"],"tags":  ["novel"],"copies": 1,
      "available": true, "section" : 3}
    } ]
  }
}

Our single document has got a score of 0.11506981 now—exactly as the base query we started with. This leads to the main conclusion—filtering does not affect the score.

Note

Please note that previous Elasticsearch versions were using filter for the filters section instead of the post_filter used in the preceding query. In the 1.x versions of Elasticsearch, both versions can be used, but please remember that filter can be removed in the future.

In general, there is a single main difference between how queries and filters work. The only purpose of filters is to narrow down results with certain criteria. The queries not only narrow down the results, but also care about their score, which is very important when it comes to relevancy, but also has a cost—the CPU cycles required to calculate the document score. Of course, you should remember that this is not the only difference between them, and the rest of this section will focus on how filters work and what is the difference between different filtering methods available in Elasticsearch.

How filters work

We already mentioned that filters do not affect the score of the documents they match. This is very important because of two reasons. The first reason is performance. Applying a filter to a set of documents hold in the index is simple and can be very efficient. The only significant information filter holds about the document is whether the document matches the filter or not—a simple flag.

Filters provide this information by returning a structure called DocIdSet (org.apache.lucene.search.DocIdSet). The purpose of this structure is to provide the view of the index segment with the filter applied on the data. It is possible by providing implementation of the Bits interface (org.apache.lucene.util.Bits), which is responsible for random access to information about documents in the filter (basically allows to check whether the document inside a segment matches the filter or not). The Bits structure is very effective because CPU can perform filtering using bitwise operations (and there is a dedicated CPU piece to handle such operations, you can read more about circular shifts at http://en.wikipedia.org/wiki/Circular_shift). We can also use the DocIdSetIterator on an ordered set of internal document identifiers, also provided by the DocIdSet.

The following figure shows how the classes using the Bits work:

How filters work

Lucene (and Elasticsearch) have various implementation of DocIdSet suitable for various cases. Each of the implementations differs when it comes to performance. However, choosing the correct implementation is the task of Lucene and Elasticsearch and we don't have to care about it, unless we extend the functionality of them.

Note

Please remember that not all filters use the Bits structure. The filters that don't do that are numeric range filters, script ones, and the whole group of geographical filters. Instead, those filters put data into the field data cache and iterate over documents filtering as they operate on a document. This means that the next filter in the chain will only get documents allowed by the previous filters. Because of this, those filters allow optimizations, such as putting the heaviest filters on the end of the filters, execution chain.

Bool or and/or/not filters

We talked about filters in Elasticsearch Server Second Edition, but we wanted to remind you about one thing. You should remember that and, or, and not filters don't use Bits, while the bool filter does. Because of that you should use the bool filter when possible. The and, or, and not filters should be used for scripts, geographical filtering, and numeric range filters. Also, remember that if you nest any filter that is not using Bits inside the and, or, or not filter, Bits won't be used.

Basically, you should use the and, or, and not filters when you combine filters that are not using Bits with other filters. And if all your filters use Bits, then use the bool filter to combine them.

Performance considerations

In general, filters are fast. There are multiple reasons for this—first of all, the parts of the query handled by filters don't need to have a score calculated. As we have already said, scoring is strongly connected to a given query and the set of indexed documents.

Note

There is one thing when it comes to filtering. With the release of Elasticsearch 1.4.0, the bitsets used for nested queries execution are loaded eagerly by default. This is done to allow faster nested queries execution, but can lead to memory problems. To disable this behavior we can set the index.load_fixed_bitset_filters_eagerly to false. The size of memory used for fixed bitsets can be checked by using the curl -XGET 'localhost:9200/_cluster/stats?human&pretty' command and looking at the fixed_bit_set_memory_in_bytes property in the response.

When using a filter, the result of the filter does not depend on the query, so the result of the filter can be easily cached and used in the subsequent queries. What's more, the filter cache is stored as per Lucene segment, which means that the cache doesn't have to be rebuilt with every commit, but only on segment creation and segment merge.

Note

Of course, as with everything, there are also downsides of using filters. Not all filters can be cached. Think about filters that depend on the current time, caching them wouldn't make much sense. Sometimes caching is not worth it because of too many unique values that can be used and poor cache hit ratio, an example of this can be filters based on geographical location.

Post filtering and filtered query

If someone would say that the filter will be quicker comparing to the same query, it wouldn't be true. Filters have fewer things to care about and can be reused between queries, but Lucene is already highly optimized and the queries are very fast, even considering that scoring has to be performed. Of course, for a large number of results, filter will be faster, but there is always something we didn't tell you yet. Sometimes, when using post_filter, the query sent to Elasticsearch won't be as fast and efficient as we would want it to be. Let's assume that we have the following query:

curl -XGET 'http://127.0.0.1:9200/library/_search?pretty' -d '{
 "query": {
  "terms": {
   "title": [ "crime", "punishment", "complete", "front" ]
  }
 },
 "post_filter" : {
  "term": {
   "available": {
    "value": true,
    "_cache": true
   }
  }
 }
}'

The following figure shows what is going on during query execution:

Post filtering and filtered query

Of course, filtering matters for higher amounts of data, but for the purpose of this example, we've used our data. In the preceding figure, our index contains four documents. Our example terms query matches three documents: Doc1, Doc3, and Doc4. Each of them is scored and ordered on the basis of the calculated score. After that, our post_filter starts its work. From all of our documents in the whole index, it passes only two of them—Doc1 and Doc4. As you can see from the three documents passed to the filter, only two of them were returned as the search result. So why are we bothering about calculating the score for the Doc3? In this case, we lost some CPU cycles for scoring a document that are not valid in terms of query. For a large number of documents returned, this can become a performance problem.

Note

Please note that in the example we've used the term filter, which was cached by default until Elasticsearch 1.5. That behavior changed starting with Elasticsearch 1.5 (see https://github.com/elasticsearch/elasticsearch/pull/7583). Because of that, we decided to use the term filter in the example, but with forced caching.

Let's modify our query and let's filter the documents before the Scorer calculates the score for each document. The query that does that looks as follows:

curl -XGET 'http://127.0.0.1:9200/library/_search?pretty' -d '{
 "query": {
  "filtered": {
   "query": {
    "terms": {
     "title": [ "crime", "punishment", "complete", "front" ]
    }
   },
   "filter": {
    "term": {
     "available": {
      "value": true,
      "_cache": true
     }
    }
   }
  }
 }
}'

In the preceding example, we have used the filtered query. The results returned by the preceding query will be exactly the same, but the execution of the query will be a little bit different, especially when it comes to filtering. Let's look at the following figure showing the logical execution of the query:

Post filtering and filtered query

Now the initial work is done by the term filter. If it was already used, it will be loaded from the cache, and the whole document set will be narrowed down to only two documents. Finally, those documents are scored, but now the scoring mechanism has less work to do. Of course, in the example, our query matches the documents returned by the filter, but this is not always true.

Technically, our filter is wrapped by query, and internally Lucene library collects results only from documents that meet the enclosed filter criteria. And, of course, only the documents matching the filter are forwarded to the main query. Thanks to filter, the scoring process has fewer documents to look at.

Choosing the right filtering method

If you read the preceding explanations, you may think that you should always use the filtered query and run away from post filtering. Such statement will be true for most use cases, but there are exceptions to this rule. The rule of thumb says that the most expensive operations should be moved to the end of query processing. If the filter is fast, cheap, and easily cacheable, then the situation is simple—use filtered query. On the other hand, if the filter is slow, CPU-intensive, and hard to cache (i.e., because of too many distinct values), use post filtering or try to optimize the filter by simplifying it and making it more cache friendly, for example by reducing the resolution in case of time-based filters.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset