Filtering your results

We already know how to build queries and searches by using different criteria. We know how scoring works, which document is more important for a given query, and how input text can affect ordering. But sometimes, we want to choose only a subset of our index, and the chosen criterion should not have an influence on scoring. This is the place where filters should be used.

We should use filters whenever possible. If a given part of the query does not affect scoring, it is a good candidate to turn into a filter. Score calculation complicates things, whereas filtering is a relatively simple operation: each document either matches or it doesn't. Because filtering is done on all index contents, the result of a filter is independent of the documents found by the query and of the relationships between them. Filters can therefore be cached easily, which further increases the overall performance of filtered queries.

Using filters

To use a filter in any search, just add a filter section next to the query attribute. Let's take a sample query and add a filter to it:

{
 "query" : {
  "field" : { "title" : "Catch-22" }
 },
 "filter" : {
  "term" : { "year" : 1961 }
 }
}

This would return all the documents with the given title, but that result would be narrowed down to only books published in 1961. This query can be rewritten as follows:

{
 "query": {
  "filtered" : {
   "query" : {
    "field" : { "title" : "Catch-22" }
   },
   "filter" : {
    "term" : { "year" : 1961 }
   }
  }
 }
}

If you run both queries by sending the curl -XGET localhost:9200/library/book/_search?pretty -d @query.json command, you will see that both responses are exactly the same (except perhaps the response time):

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2712221,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "2",
      "_score" : 0.2712221, "_source" : { "title": "Catch-22","author": "Joseph Heller","year": 1961,"characters": ["John Yossarian", "Captain Aardvark", "Chaplain Tappman", "Colonel Cathcart", "Doctor Daneeka"],"tags": ["novel"],"copies": 6, "available" : false}
    } ]
  }
}

This might suggest that both forms are equivalent. They are not, because the filter and the query are applied in a different order. In the first case, the filter is applied to all the documents found by the query. In the second case, the documents are filtered before the query runs, which yields better performance. As we said earlier, filters are fast, so a filtered query is more efficient. We will return to this in the Faceting section in Chapter 6, Beyond Searching.
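The two orders of execution can be pictured with a toy model in Python. This is only a sketch under loud assumptions: the BOOKS list, the matches_query and matches_filter helpers, and both function names are hypothetical stand-ins, and ElasticSearch itself works on index structures rather than Python lists. The point is only that both orders select the same documents while narrowing the candidates at different moments.

```python
# Toy model of the two execution orders; not how ElasticSearch is implemented.
BOOKS = [
    {"id": 1, "title": "All Quiet on the Western Front", "year": 1929},
    {"id": 2, "title": "Catch-22", "year": 1961},
    {"id": 4, "title": "Crime and Punishment", "year": 1886},
]


def matches_query(doc):
    # Hypothetical stand-in for the scored query on the title field.
    return doc["title"] == "Catch-22"


def matches_filter(doc):
    # Hypothetical stand-in for the term filter on the year field.
    return doc["year"] == 1961


def query_then_filter(docs):
    """Top-level filter: run the query first, then narrow its hits."""
    found = [d for d in docs if matches_query(d)]
    return [d["id"] for d in found if matches_filter(d)]


def filter_then_query(docs):
    """Filtered query: narrow the documents first, then run the query."""
    kept = [d for d in docs if matches_filter(d)]
    return [d["id"] for d in kept if matches_query(d)]
```

Calling query_then_filter(BOOKS) and filter_then_query(BOOKS) selects the same single book; only the amount of work done by the more expensive query step differs.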

Range filters

A range filter allows us to limit searching to only documents where the value of a field is between the given boundaries. For example, to construct a filter that allows only books published between 1930 and 1990, we would use the following query part:

{
 "filter" : {
  "range" : {
   "year" : {
    "from": 1930,
    "to": 1990
   }
  }
 }
}

By default, the lower and upper bounds of the range are inclusive. If you want to exclude one or both of the bounds, you can set the include_lower and/or include_upper parameters to false. For example, if we would like to have documents from 1930 (including the ones with that value) to 1990 (excluding that value), we would construct the following filter:

{
 "filter" : {
  "range" : {       
   "year" : {
    "from": 1930,
    "to": 1990,
    "include_lower" : true,
    "include_upper" : false
   }
  }
 }
}

The other option is to use gt (greater than), lt (less than), gte (greater than or equal to), and lte (less than or equal to) in place of the from and to parameters. So, the preceding example may be rewritten as follows:

{
 "filter" : {
  "range" : {       
   "year" : {
    "gte": 1930,
    "lt": 1990
   }
  }
 }
}
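To check that the two notations describe the same range, here is a minimal Python sketch of how the bounds could be interpreted. The normalize and in_range functions are hypothetical helpers written for this illustration; they model only the parameters described above and are not ElasticSearch code.

```python
def normalize(spec):
    """Turn either notation into (lower, include_lower, upper, include_upper)."""
    if "gt" in spec:
        lower, include_lower = spec["gt"], False
    elif "gte" in spec:
        lower, include_lower = spec["gte"], True
    else:
        # from/to notation: bounds are inclusive unless stated otherwise.
        lower, include_lower = spec.get("from"), spec.get("include_lower", True)

    if "lt" in spec:
        upper, include_upper = spec["lt"], False
    elif "lte" in spec:
        upper, include_upper = spec["lte"], True
    else:
        upper, include_upper = spec.get("to"), spec.get("include_upper", True)

    return lower, include_lower, upper, include_upper


def in_range(value, spec):
    """Return True if value falls inside the range described by spec."""
    lower, include_lower, upper, include_upper = normalize(spec)
    if lower is not None:
        if value < lower or (value == lower and not include_lower):
            return False
    if upper is not None:
        if value > upper or (value == upper and not include_upper):
            return False
    return True


FROM_TO = {"from": 1930, "to": 1990, "include_lower": True, "include_upper": False}
GTE_LT = {"gte": 1930, "lt": 1990}
```

With these definitions, in_range(year, FROM_TO) and in_range(year, GTE_LT) agree for every year, including the two boundary values.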

There is also a second variant of this filter, numeric_range. It is a specialized version designed for filtering ranges where the field values are numerical. This filter is faster, but at the cost of the additional memory needed to load the field values. Note that sometimes these values are loaded anyway, independently of the range filter, for example, when the data is used in faceting or sorting (we'll discuss this in greater detail in the coming chapters). In those cases, there is no reason not to use this filter.

Exists

This filter is very simple. It passes only those documents that have the given field defined, for example:

{
 "filter" : {
     "exists" : { "field": "year" }
 }
}

Missing

The missing filter is the opposite of the exists filter. However, it has a few additional features. Besides selecting the documents where the specified field is missing, we can define what ElasticSearch should treat as empty. This helps in situations where the input data contains tokens such as null, EMPTY, and not-defined. Let's change the preceding example to find all documents without the year field defined (or the ones that have 0 as the value of the year field). The modified filter would look like the following code:

{
 "filter" : {
  "missing" : {
   "field": "year",
   "null_value": 0,
   "existence": true
  }
 }
}

In the preceding example, you see two parameters in addition to the previous ones: existence, which tells ElasticSearch to check whether a value exists in the specified field, and null_value, which defines an additional value to be treated as empty. If you don't define null_value, existence is set by default, so you can omit existence in this case.
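The relationship between the two filters, as described above, can be sketched in a few lines of Python. The exists and missing functions below are hypothetical helpers that model only the behavior explained in this section (a field that is absent, or equal to the value given as null_value, counts as missing); they are an assumption-laden illustration, not ElasticSearch's implementation.

```python
def exists(doc, field):
    """True when the document has a value defined for the field."""
    return doc.get(field) is not None


def missing(doc, field, null_value=None):
    """True when the field is absent or holds the value treated as empty."""
    value = doc.get(field)
    if value is None:
        return True
    # Assumption: null_value names one extra value to be treated as empty.
    return null_value is not None and value == null_value


WITH_YEAR = {"title": "Catch-22", "year": 1961}
ZERO_YEAR = {"title": "Unknown date", "year": 0}
NO_YEAR = {"title": "No date at all"}
```

Without null_value, missing is simply the negation of exists; with null_value set to 0, the ZERO_YEAR document is also treated as missing.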

Script

Sometimes, we want to filter our documents by a computed value. A good example for our case is selecting only the books that were published more than a century ago. In order to do that, our filter must look like the following code:

{
 "filter" : {
  "script" : {
   "script" : "now - doc['year'].value > 100",
   "params" : {
    "now" : 2013
   }
  }
 }
}
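To see which books such an expression selects, we can evaluate the same arithmetic in Python. This is only an illustration: the function name and the YEARS mapping are hypothetical, and the years are taken from the sample books used in this chapter.

```python
def published_over_a_century_ago(year, now=2013):
    """Mirror of the script expression: now - doc['year'].value > 100."""
    return now - year > 100


# Publication years of sample books from the library index.
YEARS = {
    "All Quiet on the Western Front": 1929,
    "Catch-22": 1961,
    "Crime and Punishment": 1886,
}

OLD_BOOKS = [title for title, year in YEARS.items()
             if published_over_a_century_ago(year)]
```

Only Crime and Punishment (2013 - 1886 = 127) passes the test; the other two books are 84 and 52 years old in this model.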

Type

A type filter is a simple filter that returns all the documents of a given type. This filter is useful when a query is directed to several indices or to an index with a large number of types. The following is an example of such a filter, which limits the documents to the book type:

  {
   "filter" : {
    "type": {
     "value" : "book"
    }
   }
  }

Limit

This filter limits the number of documents returned by a shard for a given query. This should not be confused with the size parameter. Let's look at the following filter:

{
 "filter" : {
  "limit" : {
   "value" : 1
  }
 }
}

With the default number of shards (five), the preceding filter will return up to five documents. Why? This is connected with the number of shards (the index.number_of_shards setting). Each shard is queried separately, and each shard may return at most one document.
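The per-shard behavior can be pictured with a small Python sketch. The limit_per_shard function and the SHARDS layout are hypothetical; the sketch assumes only what is stated above, namely that the limit is applied on each shard separately before the results are merged.

```python
def limit_per_shard(shards, value):
    """Keep at most `value` documents from every shard, then merge them."""
    hits = []
    for shard_docs in shards:
        hits.extend(shard_docs[:value])
    return hits


# Five shards (the default number), each holding some document IDs.
SHARDS = [["b1", "b2"], ["b3"], ["b4", "b5", "b6"], ["b7"], ["b8"]]

ONE_PER_SHARD = limit_per_shard(SHARDS, 1)
```

Here ONE_PER_SHARD holds five documents, one from every shard, even though the limit value is 1. The size parameter, in contrast, applies to the final merged result.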

IDs

The IDs filter helps in cases when we have to select several specific documents. For example, if we need only the document that has 1 as the identifier, the filter would look like this:

{
 "filter": {
  "ids" : {
   "type": ["book"],
   "values": [1]
  }
 }
}

Note that the type parameter is not required. It is only useful when we search among several indices to specify a type we are interested in.

If this is not enough

We have shown several examples for filters. But this is only the tip of the iceberg. You can wrap any query into a filter. For example, check out the following query:

{
 "query" : {
  "multi_match" : {
   "query" : "novel erich",
   "fields" : [ "tags", "author" ]
  }
 }
}

This query can be rewritten as a filter, thus:

{
 "filter" : {
  "query" : {
   "multi_match" : {
    "query" : "novel erich",
    "fields" : [ "tags", "author" ]
   }
  }
 }
}

Of course, the only difference in the result will be in the scoring. Every document returned by a filter will have a score of 1.0. Note that ElasticSearch has dedicated filters that mirror some queries (for example, the term filter corresponds to the term query), so you don't always have to use the wrapped query syntax. In fact, you should use the dedicated version wherever possible.

bool, and, or, not filters

Now it's time to combine some filters. The first option is to use the bool filter, which groups filters in the same way as described previously for the bool query. The second option is to use the and, or, and not filters. The and and or filters take an array of filters: and returns every document that matches all of them, and or returns every document that matches at least one of them. The not filter returns the documents that were not matched by the enclosed filter. Of course, all these filters may be nested, as shown in the following example:

{
 "filter": {
  "not": {
   "and": [
    {
     "term": {
      "title": "Catch-22"
     }    
    },  
    {
     "or": [
      {
       "range": {
        "year": {
         "from": 1930,
         "to": 1990
        }       
       }      
      },    
      {
       "term": {
        "available": true
       }      
      }     
     ]    
    }   
   ]  
  }
 }
}
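To follow how such a nested filter is resolved, here is a minimal Python evaluator for the and, or, not, term, and range filters. It is a sketch under stated assumptions: the matches function and the BOOKS list are hypothetical, term is modeled as a plain equality check (ignoring text analysis), and range is modeled with inclusive from and to bounds only.

```python
def matches(doc, flt):
    """Evaluate a nested and/or/not/term/range filter against one document."""
    (kind, body), = flt.items()
    if kind == "and":
        return all(matches(doc, f) for f in body)
    if kind == "or":
        return any(matches(doc, f) for f in body)
    if kind == "not":
        return not matches(doc, body)
    if kind == "term":
        (field, wanted), = body.items()
        value = doc.get(field)
        return wanted in value if isinstance(value, list) else value == wanted
    if kind == "range":
        (field, bounds), = body.items()
        return bounds["from"] <= doc[field] <= bounds["to"]
    raise ValueError("unsupported filter: " + kind)


# The nested filter from the example above, written as a Python dict.
NESTED = {"not": {"and": [
    {"term": {"title": "Catch-22"}},
    {"or": [
        {"range": {"year": {"from": 1930, "to": 1990}}},
        {"term": {"available": True}},
    ]},
]}}

BOOKS = [
    {"id": 1, "title": "All Quiet on the Western Front", "year": 1929, "available": True},
    {"id": 2, "title": "Catch-22", "year": 1961, "available": False},
    {"id": 4, "title": "Crime and Punishment", "year": 1886, "available": True},
]

KEPT = [b["id"] for b in BOOKS if matches(b, NESTED)]
```

In this model, Catch-22 satisfies both the title term and the year range, so the enclosing not filter rejects it and the other two books are kept.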

Named filters

Filter setups can get complicated, so it is sometimes useful to know which filters caused a document to be returned by a query. Fortunately, it is possible to give every filter a name. The names of the filters that matched are returned with every document found by the query. Let's check how that works. The following query will return every book that is available and tagged as novel, or every book from the nineteenth century:

{
 "query": {
  "filtered" : {
   "query": { "matchAll" : {} },
   "filter" : {
    "or" : [
     { "and" : [
      { "term": { "available" : true } },
      { "term": { "tags" : "novel" } }
      ]},  
     { "range" : { "year" : { "from": 1800, "to" : 1899 } } }
    ]   
   }  
  }
 }
}

We are using the "filtered" version of the query because this is the only version where ElasticSearch can add information about filters that were used. Let's rewrite this query and name each filter:

{
 "query": {
  "filtered" : {
   "query": { "matchAll" : {} },
   "filter" : {
    "or" : {
     "filters" : [
      {
       "and" : {
        "filters" : [
         {
          "term": {
           "available" : true,
           "_name" : "avail"
          }
         },
         {
          "term": {
           "tags" : "novel",
           "_name" : "tag"
          }
         }
        ],
        "_name" : "and"
       }
      },
      {
       "range" : {
        "year" : {
         "from": 1800,
         "to" : 1899
        },
        "_name" : "year"
       }
      }
     ],
     "_name" : "or"
    }
   }
  }
 }
}

It's much longer, isn't it? We've added the _name element to every filter. In the case of the and and or filters, we needed to change the syntax: we wrapped the enclosed filters in an additional object to keep the JSON format correct. After sending the query to ElasticSearch, we should get a response similar to the following one:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "1",
      "_score" : 1.0, "_source" : { "title": "All Quiet on the Western Front","otitle": "Im Westen nichts Neues","author": "Erich Maria Remarque","year": 1929,"characters": ["Paul Bäumer", "Albert Kropp", "Haie Westhus", "Fredrich Müller", "Stanislaus Katczinsky", "Tjaden"],"tags": ["novel"],"copies": 1, "available": true},
      "matched_filters" : [ "or", "tag", "avail", "and" ]
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 1.0, "_source" : { "title": "Crime and Punishment","otitle": "Преступлéние и наказáние","author": "Fyodor Dostoevsky","year": 1886,"characters": ["Raskolnikov", "Sofia Semyonovna Marmeladova"],"tags": [],"copies": 0, "available" : true},
      "matched_filters" : [ "or", "year", "avail" ]
    } ]
  }
}

You can see that, in addition to the standard information, every document also contains an array with the names of the filters that matched that document.
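The way the names end up next to each document can be modeled with a short Python sketch. The evaluate function is a hypothetical helper that walks the named syntax used above and records the name of every filter that matched; it says nothing about the order in which ElasticSearch lists the names, so the results are compared as sets.

```python
def evaluate(doc, flt, matched):
    """Evaluate a named filter tree, appending names of matching filters."""
    (kind, body), = flt.items()
    body = dict(body)
    name = body.pop("_name", None)
    if kind in ("and", "or"):
        # Evaluate every child so that all matching names are recorded.
        results = [evaluate(doc, child, matched) for child in body["filters"]]
        ok = all(results) if kind == "and" else any(results)
    elif kind == "term":
        (field, wanted), = body.items()
        value = doc.get(field)
        ok = wanted in value if isinstance(value, list) else value == wanted
    elif kind == "range":
        (field, bounds), = body.items()
        ok = bounds["from"] <= doc[field] <= bounds["to"]
    else:
        raise ValueError("unsupported filter: " + kind)
    if ok and name:
        matched.append(name)
    return ok


NAMED = {"or": {"filters": [
    {"and": {"filters": [
        {"term": {"available": True, "_name": "avail"}},
        {"term": {"tags": "novel", "_name": "tag"}},
    ], "_name": "and"}},
    {"range": {"year": {"from": 1800, "to": 1899}, "_name": "year"}},
], "_name": "or"}}


def matched_filters(doc):
    names = []
    evaluate(doc, NAMED, names)
    return set(names)


BOOK_1 = {"year": 1929, "tags": ["novel"], "available": True}
BOOK_4 = {"year": 1886, "tags": [], "available": True}
```

For BOOK_1 the sketch collects or, and, tag, and avail; for BOOK_4 it collects or, year, and avail, which are the same names as in the response shown above.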

Caching filters

The last thing about filters that we want to mention in this chapter is caching. Caching speeds up queries that use filters, but at the cost of memory and of extra query time during the first execution of such a filter. Because of this, the best candidates for caching are filters that are reused, that is, filters that we run frequently with the same parameter values.

Caching can be turned on for the and, bool, and or filters (but it is usually a better idea to cache the enclosed filters instead). In this case, the required syntax is the same as described for the named filters.

Some filters do not need the _cache parameter because their results are cached by default. This is true for the exists, missing, range, term, and terms filters, although this behavior can be modified and caching can be turned off. Caching also doesn't make sense for the ids, matchAll, and limit filters.
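The trade-off described in this section (extra work and memory on the first execution, cheap reuse afterwards) can be illustrated with a toy cache in Python. Everything here is a hypothetical model: the FilterCache class, its misses counter, and the term_matches helper are assumptions made for the illustration and do not reflect how ElasticSearch stores cached filter results.

```python
import json


def term_matches(doc, flt):
    """Plain equality check standing in for a term filter."""
    (field, wanted), = flt["term"].items()
    return doc.get(field) == wanted


class FilterCache:
    """Remember the matching document IDs for each filter definition."""

    def __init__(self):
        self._entries = {}
        self.misses = 0  # how many times a filter had to be computed

    def matching_ids(self, flt, docs):
        # The key includes the parameter values, so year 1961 and
        # year 1929 are two separate cache entries.
        key = json.dumps(flt, sort_keys=True)
        if key not in self._entries:
            self.misses += 1
            self._entries[key] = frozenset(
                d["id"] for d in docs if term_matches(d, flt)
            )
        return self._entries[key]


BOOKS = [
    {"id": 1, "year": 1929},
    {"id": 2, "year": 1961},
    {"id": 4, "year": 1886},
]
```

Running the same filter twice computes it once; changing the parameter value creates a new entry, which is why filters reused with the same values benefit most.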
