We already know how to build queries and searches by using different criteria. We know how scoring works, which document is more important for a given query, and how input text can affect ordering. But sometimes, we want to choose only a subset of our index, and the chosen criterion should not have an influence on scoring. This is the place where filters should be used.
In fact, we should use filters whenever possible. If a given part of the query does not affect scoring, it is a good candidate to turn into a filter. Score calculation is comparatively expensive, whereas filtering is a simple match or no-match decision for each document. Because a filter is evaluated against the index contents, its result does not depend on which documents the query found or on the relationships between them. For the same reason, filters can easily be cached, which further increases the overall performance of filtered queries.
To use a filter in any search, just add a filter attribute next to the query attribute. Let's take a sample query and add a filter to it:
{ "query" : { "field" : { "title" : "Catch-22" } }, "filter" : { "term" : { "year" : 1961 } } }
This would return all the documents with the given title, but that result would be narrowed down to only books published in 1961. This query can be rewritten as follows:
{ "query": { "filtered" : { "query" : { "field" : { "title" : "Catch-22" } }, "filter" : { "term" : { "year" : 1961 } } } } }
If you run both queries by sending the curl -XGET localhost:9200/library/book/_search?pretty -d @query.json command, you will see that both responses are exactly the same (except perhaps the response time):
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.2712221, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "2", "_score" : 0.2712221, "_source" : { "title": "Catch-22","author": "Joseph Heller","year": 1961,"characters": ["John Yossarian", "Captain Aardvark", "Chaplain Tappman", "Colonel Cathcart", "Doctor Daneeka"],"tags": ["novel"],"copies": 6, "available" : false} } ] } }
Identical responses might suggest that both forms are equivalent. They are not, because the filter and the query are applied in a different order. In the first case, the filter is applied to all the documents found by the query. In the second case, the documents are filtered before the query runs, so the query has fewer documents to process. Because filters are fast, the filtered query is the more efficient form. We will return to this in the Faceting section in Chapter 6, Beyond Searching.
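The difference in evaluation order can be illustrated without a running cluster. The following is a minimal Python sketch, not ElasticSearch code: the small in-memory BOOKS list and the helper names matches_query, matches_filter, post_filter, and pre_filter are hypothetical, and scoring is left out entirely. It only shows that both orders give the same hits, while the second one hands fewer documents to the query.

```python
# Toy stand-in for an index; the documents are illustrative only.
BOOKS = [
    {"id": 1, "title": "All Quiet on the Western Front", "year": 1929},
    {"id": 2, "title": "Catch-22", "year": 1961},
    {"id": 3, "title": "Crime and Punishment", "year": 1886},
]

def matches_query(doc):
    # Stand-in for the (expensive, scored) query part.
    return doc["title"] == "Catch-22"

def matches_filter(doc):
    # Stand-in for the (cheap, unscored) term filter on year.
    return doc["year"] == 1961

def post_filter(docs):
    """Top-level filter: run the query on everything, then narrow the results."""
    found = [d for d in docs if matches_query(d)]
    hits = [d["id"] for d in found if matches_filter(d)]
    return hits, len(docs)          # the query examined every document

def pre_filter(docs):
    """Filtered query: narrow the documents first, then run the query."""
    candidates = [d for d in docs if matches_filter(d)]
    hits = [d["id"] for d in candidates if matches_query(d)]
    return hits, len(candidates)    # the query examined only the candidates

hits_a, examined_a = post_filter(BOOKS)
hits_b, examined_b = pre_filter(BOOKS)
print(hits_a, examined_a)
print(hits_b, examined_b)
```

In this toy data set the hits are identical, but the query part looks at three documents in the first form and only one in the second.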
A range filter allows us to limit searching to only documents where the value of a field is between the given boundaries. For example, to construct a filter that allows only books published between 1930 and 1990, we would use the following query part:
{ "filter" : { "range" : { "year" : { "from": 1930, "to": 1990 } } } }
By default, both the lower and the upper boundaries are inclusive. If you want to exclude one or both of the bounds, you can set the include_lower and/or include_upper parameters to false. For example, if we would like to have documents from 1930 (including the ones with that value) to 1990 (excluding that value), we would construct the following filter:
{ "filter" : { "range" : { "year" : { "from": 1930, "to": 1990, "include_lower" : true, "include_upper" : false } } } }
The other option is to use gt (greater than), lt (less than), gte (greater than or equal to), and lte (less than or equal to) in place of the to and from parameters. So, the preceding example may be rewritten as follows:
{ "filter" : { "range" : { "year" : { "gte": 1930, "lt": 1990 } } } }
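The correspondence between the two notations can be spelled out in a few lines. This is a minimal Python sketch that only mirrors the behavior described above; the function names normalize_range and in_range are hypothetical helpers, not part of any ElasticSearch client.

```python
def normalize_range(spec):
    """Rewrite from/to/include_lower/include_upper into gt/gte/lt/lte form."""
    bounds = {k: v for k, v in spec.items() if k in ("gt", "gte", "lt", "lte")}
    if "from" in spec:
        # Bounds are inclusive unless include_lower is explicitly false.
        key = "gte" if spec.get("include_lower", True) else "gt"
        bounds[key] = spec["from"]
    if "to" in spec:
        # Bounds are inclusive unless include_upper is explicitly false.
        key = "lte" if spec.get("include_upper", True) else "lt"
        bounds[key] = spec["to"]
    return bounds

def in_range(value, spec):
    """Check a single value against a range specification in either notation."""
    b = normalize_range(spec)
    return (
        ("gt" not in b or value > b["gt"])
        and ("gte" not in b or value >= b["gte"])
        and ("lt" not in b or value < b["lt"])
        and ("lte" not in b or value <= b["lte"])
    )

half_open = {"from": 1930, "to": 1990, "include_lower": True, "include_upper": False}
print(normalize_range(half_open))
print(in_range(1990, {"from": 1930, "to": 1990}))  # default bounds are inclusive
print(in_range(1990, half_open))                   # upper bound excluded
```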
There is also a second variant of this filter, numeric_range. It is a specialized version designed for filtering ranges where field values are numerical. This filter is faster, but at the cost of the additional memory needed to hold the field values. Note that sometimes these values are loaded anyway, independently of the range filter, for example, when the same data is used in faceting or sorting (we'll discuss it in greater detail in the coming chapters). In those cases, there is no reason not to use this filter.
The exists filter is very simple. It passes only those documents that have the given field defined, for example:
{ "filter" : { "exists" : { "field": "year" } } }
The missing filter is the opposite of the exists filter. However, it has a few additional features. Besides selecting the documents where the specified field is missing, we can define what ElasticSearch should treat as empty. This helps in situations where input data contains tokens such as null, EMPTY, and not-defined. Let's change the preceding example to find all documents without the year field defined (or the ones that have 0 as the value of the year field). The modified filter would look like the following code:
{ "filter" : { "missing" : { "field": "year", "null_value": 0, "existence": true } } }
In the preceding example, you see two parameters in addition to the previous ones: existence, which tells ElasticSearch that it should check the documents with a value existing in the specified field, and the null_value key, which defines the additional value to be treated as empty. If you don't define null_value, existence is set by default, so you can omit existence in this case.
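The relationship between the two filters, as described in this section, can be sketched in plain Python. This is only an illustration of the description above, not of ElasticSearch internals: the functions exists and missing are hypothetical, the existence switch is left out, and null_value is modeled simply as one extra value that counts as empty.

```python
def exists(doc, field):
    """True when the document has the field defined."""
    return doc.get(field) is not None

def missing(doc, field, null_value=None):
    """True when the field is absent, or equals the value treated as empty."""
    value = doc.get(field)
    if value is None:
        return True
    return null_value is not None and value == null_value

docs = [
    {"title": "Catch-22", "year": 1961},
    {"title": "Unknown date"},           # no year at all
    {"title": "Placeholder", "year": 0},  # 0 used as an "empty" marker
]

print([d["title"] for d in docs if exists(d, "year")])
print([d["title"] for d in docs if missing(d, "year")])
print([d["title"] for d in docs if missing(d, "year", null_value=0)])
```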
Sometimes, we want to filter our documents by a computed value. A good example for our case is selecting only the books that were published more than a century ago. The script compares each document's year with a now value that we pass in through params. To do that, our filter must look like the following code:
{ "filter" : { "script" : { "script" : "now - doc['year'].value > 100", "params" : { "now" : 2013 } } } }
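What the script computes for each document can be reproduced in Python. The sketch below is an assumption-laden illustration: the function older_than_a_century and the sample books are hypothetical, and the expression is simply the script from the filter with doc['year'].value replaced by a dictionary lookup.

```python
def older_than_a_century(doc, params):
    """Python equivalent of the script: now - doc['year'].value > 100."""
    return params["now"] - doc["year"] > 100

books = [
    {"title": "Crime and Punishment", "year": 1886},
    {"title": "All Quiet on the Western Front", "year": 1929},
    {"title": "Exactly one hundred years", "year": 1913},  # boundary case
]

params = {"now": 2013}  # the same parameter the filter passes to the script
old_books = [b["title"] for b in books if older_than_a_century(b, params)]
print(old_books)
```

Note that the comparison is strict, so a book that is exactly 100 years old is not selected.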
A type filter is a simple filter that returns all the documents of a given type. This filter is useful when a query is directed to several indices or an index with a large number of types. The following is an example of such a filter that limits the documents to the book type:
{ "filter" : { "type": { "value" : "book" } } }
The limit filter limits the number of documents returned by a shard for a given query. This should not be confused with the size parameter. Let's look at the following filter:
{ "filter" : { "limit" : { "value" : 1 } } }
With the default number of shards, the preceding filter will return up to five documents. Why? This is connected with the number of shards (the index.number_of_shards setting). Each shard is queried separately, and each shard may return at most one document.
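The per-shard behavior is easy to see in a small simulation. The following Python sketch is hypothetical throughout: shards are modeled as plain lists of document IDs and limited_search is an illustrative helper, not an ElasticSearch API.

```python
def limited_search(shards, limit):
    """Collect at most `limit` documents from every shard separately."""
    hits = []
    for shard_docs in shards:
        hits.extend(shard_docs[:limit])
    return hits

# Five shards (the default number mentioned above) with uneven contents.
shards = [
    ["a1", "a2"],
    ["b1"],
    [],                  # a shard with no matching documents
    ["d1", "d2", "d3"],
    ["e1"],
]

hits = limited_search(shards, limit=1)
print(hits, len(hits))
```

With a limit of 1, the result can never exceed the number of shards, but it may be smaller when some shards hold no matching documents.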
The IDs filter helps in cases when we have to select several specific documents by their identifiers. For example, if we need to limit the results to the document that has 1 as the identifier, the filter would look like this:
{ "filter": { "ids" : { "type": ["book"], "values": [1] } } }
Note that the type parameter is not required. It is only useful when we search among several indices to specify a type we are interested in.
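When such filters are generated by application code, the optional type parameter can simply be left out. The helper below is a minimal Python sketch; ids_filter is a hypothetical function that only builds the request body shown above and sends nothing to ElasticSearch.

```python
import json

def ids_filter(values, types=None):
    """Build an ids filter body; the type list is added only when given."""
    body = {"values": list(values)}
    if types:
        body["type"] = list(types)
    return {"filter": {"ids": body}}

with_type = ids_filter([1], types=["book"])
without_type = ids_filter([1, 4])

print(json.dumps(with_type))
print(json.dumps(without_type))
```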
We have shown several examples for filters. But this is only the tip of the iceberg. You can wrap any query into a filter. For example, check out the following query:
{ "query" : { "multi_match" : { "query" : "novel erich", "fields" : [ "tags", "author" ] } } }
This query can be rewritten as a filter, thus:
{ "filter" : { "query" : { "multi_match" : { "query" : "novel erich", "fields" : [ "tags", "author" ] } } } }
Of course, the only difference in the result will be in the scoring. Every document returned by a filter will have a score of 1.0. Note that ElasticSearch has a few dedicated filters that mirror queries (for example, the term query and the term filter). So, you don't always have to use the wrapped query syntax. In fact, you should use the dedicated version wherever possible.
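The wrapping itself is a purely mechanical transformation of the request body. As a minimal Python sketch (wrap_query_as_filter is a hypothetical helper that works on plain dictionaries), it could look like this:

```python
import json

def wrap_query_as_filter(search_body):
    """Move the query of a search body under filter.query."""
    return {"filter": {"query": search_body["query"]}}

search_body = {
    "query": {
        "multi_match": {
            "query": "novel erich",
            "fields": ["tags", "author"],
        }
    }
}

wrapped = wrap_query_as_filter(search_body)
print(json.dumps(wrapped, indent=2))
```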
Now it's time to combine some filters together. The first option is to use the bool filter, which can group filters on the same basis as described previously for the bool query. The second option is to use the and, or, and not filters. The first two take an array of filters and return every document that matches all of them in the case of the and filter (or at least one of them in the case of the or filter). In the case of the not filter, returned documents are the ones that were not matched by the enclosed filter. Of course, all these filters may be nested as shown in the following example:
{ "filter": { "not": { "and": [ { "term": { "title": "Catch-22" } }, { "or": [ { "range": { "year": { "from": 1930, "to": 1990 } } }, { "term": { "available": true } } ] } ] } } }
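The way nested and, or, and not filters combine can be traced with a small recursive evaluator. This Python sketch is an illustration only: the matches function is hypothetical, it supports just the filter types used in the example, and it compares term values literally, ignoring text analysis.

```python
def matches(flt, doc):
    """Evaluate a nested and/or/not/term/range filter against one document."""
    (kind, body), = flt.items()
    if kind == "and":
        return all(matches(f, doc) for f in body)
    if kind == "or":
        return any(matches(f, doc) for f in body)
    if kind == "not":
        return not matches(body, doc)
    if kind == "term":
        (field, value), = body.items()
        actual = doc.get(field)
        return value in actual if isinstance(actual, list) else actual == value
    if kind == "range":
        (field, bounds), = body.items()
        actual = doc.get(field)
        return actual is not None and bounds["from"] <= actual <= bounds["to"]
    raise ValueError("unsupported filter: " + kind)

nested = {
    "not": {
        "and": [
            {"term": {"title": "Catch-22"}},
            {"or": [
                {"range": {"year": {"from": 1930, "to": 1990}}},
                {"term": {"available": True}},
            ]},
        ]
    }
}

catch22 = {"title": "Catch-22", "year": 1961, "available": False}
all_quiet = {"title": "All Quiet on the Western Front", "year": 1929, "available": True}

print(matches(nested, catch22))    # inner "and" matches, so "not" rejects it
print(matches(nested, all_quiet))  # title term fails, so "not" accepts it
```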
Looking at how complicated filter definitions may get, it would sometimes be useful to know which filters were used to determine that a document should be returned by a query. Fortunately, it is possible to give every filter a name. This name will be returned with any document matched during the query. Let's check how that works. The following query will return every book that is available and tagged as novel, or every book from the nineteenth century:
{ "query": { "filtered" : { "query": { "matchAll" : {} }, "filter" : { "or" : [ { "and" : [ { "term": { "available" : true } }, { "term": { "tags" : "novel" } } ]}, { "range" : { "year" : { "from": 1800, "to" : 1899 } } } ] } } } }
We are using the "filtered" version of the query because this is the only version where ElasticSearch can add information about filters that were used. Let's rewrite this query and name each filter:
{ "query": { "filtered" : { "query": { "matchAll" : {} }, "filter" : { "or" : { "filters" : [ { "and" : { "filters" : [ { "term": { "available" : true, "_name" : "avail" } }, { "term": { "tags" : "novel", "_name" : "tag" } } ], "_name" : "and" } }, { "range" : { "year" : { "from": 1800, "to" : 1899 }, "_name" : "year" } } ], "_name" : "or" } } } } }
It's much longer, isn't it? We've added the _name element to every filter. In the case of the and and or filters, we needed to change the syntax; we wrapped the enclosed filters in an additional object to make the JSON format correct. After sending a query to ElasticSearch, we should get a response similar to the following one:
{ "took" : 2, "timed_out" : false, "_shards" : { "total" : 2, "successful" : 2, "failed" : 0 }, "hits" : { "total" : 2, "max_score" : 1.0, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "1", "_score" : 1.0, "_source" : { "title": "All Quiet on the Western Front","otitle": "Im Westen nichts Neues","author": "Erich Maria Remarque","year": 1929,"characters": ["Paul Bäumer", "Albert Kropp", "Haie Westhus", "Fredrich Müller", "Stanislaus Katczinsky", "Tjaden"],"tags": ["novel"],"copies": 1, "available": true}, "matched_filters" : [ "or", "tag", "avail", "and" ] }, { "_index" : "library", "_type" : "book", "_id" : "4", "_score" : 1.0, "_source" : { "title": "Crime and Punishment","otitle": "Преступлéние и наказáние","author": "Fyodor Dostoevsky","year": 1886,"characters": ["Raskolnikov", "Sofia Semyonovna Marmeladova"],"tags": [],"copies": 0, "available" : true}, "matched_filters" : [ "or", "year", "avail" ] } ] } }
You can see that every document, in addition to the standard information, also contains an array with the names of the filters that were matched for that document.
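How such a list of names could be assembled is shown in the following Python sketch. It is a hypothetical evaluator, not a description of how ElasticSearch computes matched_filters: the evaluate function understands only the named and, or, term, and range forms used in this example, checks every enclosed filter without short-circuiting, and records the name of each filter that matched.

```python
def evaluate(flt, doc, matched):
    """Evaluate a named filter tree and record the names of matching filters."""
    (kind, body), = flt.items()
    body = dict(body)                 # copy so the filter definition is untouched
    name = body.pop("_name", None)
    if kind in ("and", "or"):
        # Evaluate every child so that all matching names are collected.
        results = [evaluate(f, doc, matched) for f in body["filters"]]
        ok = all(results) if kind == "and" else any(results)
    elif kind == "term":
        (field, value), = body.items()
        actual = doc.get(field)
        ok = value in actual if isinstance(actual, list) else actual == value
    elif kind == "range":
        (field, bounds), = body.items()
        actual = doc.get(field)
        ok = actual is not None and bounds["from"] <= actual <= bounds["to"]
    else:
        raise ValueError("unsupported filter: " + kind)
    if ok and name:
        matched.append(name)
    return ok

named = {"or": {"filters": [
    {"and": {"filters": [
        {"term": {"available": True, "_name": "avail"}},
        {"term": {"tags": "novel", "_name": "tag"}},
    ], "_name": "and"}},
    {"range": {"year": {"from": 1800, "to": 1899}, "_name": "year"}},
], "_name": "or"}}

all_quiet = {"year": 1929, "tags": ["novel"], "available": True}
crime = {"year": 1886, "tags": [], "available": True}

names_1, names_4 = [], []
evaluate(named, all_quiet, names_1)
evaluate(named, crime, names_4)
print(sorted(names_1))
print(sorted(names_4))
```

For the two sample documents, the collected names agree with the matched_filters arrays in the response above, although the order within each array may differ.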
The last thing about filters that we want to mention in this chapter is caching. Caching speeds up queries that use filters, but at the cost of memory and of extra query time during the first execution of such a filter. Because of this, the best candidates for caching are filters that are reused frequently with the same parameter values.
Caching can be turned on for the and, bool, and or filters (but it is usually a better idea to cache the enclosed filters instead). In this case, the required syntax is the same as described for the named filters.
Some filters don't need the _cache parameter because their results are cached by default. This is true for the exists, missing, range, term, and terms filters; where a filter accepts the _cache parameter, this default can be changed and caching can be turned off. Caching also doesn't make sense for the ids, matchAll, and limit filters.
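Why reuse with identical parameter values matters can be shown with a toy cache. The Python sketch below is an assumption, not the ElasticSearch implementation: FilterCache and term_matches are hypothetical names, only term filters are supported, and the cache key is simply the filter definition serialized as JSON.

```python
import json

def term_matches(flt, doc):
    """Evaluate a simple term filter against one document."""
    (field, value), = flt["term"].items()
    return doc.get(field) == value

class FilterCache:
    """Remember the matching document IDs for each distinct filter definition."""

    def __init__(self):
        self._results = {}
        self.evaluations = 0   # how many times a filter was really executed

    def matching_ids(self, flt, docs):
        # The key includes the parameter values, so year=1961 and year=1929
        # are cached as two separate entries.
        key = json.dumps(flt, sort_keys=True)
        if key not in self._results:
            self.evaluations += 1
            self._results[key] = frozenset(
                d["id"] for d in docs if term_matches(flt, d)
            )
        return self._results[key]

docs = [{"id": 1, "year": 1929}, {"id": 2, "year": 1961}, {"id": 4, "year": 1886}]
cache = FilterCache()

first = cache.matching_ids({"term": {"year": 1961}}, docs)   # executed and stored
second = cache.matching_ids({"term": {"year": 1961}}, docs)  # served from the cache
other = cache.matching_ids({"term": {"year": 1929}}, docs)   # new value, new entry
print(sorted(first), sorted(second), sorted(other), cache.evaluations)
```

The first call pays for the evaluation and the memory of the stored result; only the repeated call with the same value benefits from it.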