Chapter 4. Extending Your Querying Knowledge

In the previous chapter, we dived into the querying capabilities of Elasticsearch. We discussed in detail how to query Elasticsearch and learned how querying works. We now know the basic and compound queries of this search engine and the configuration options for each query type. We also learned when to use particular queries, and we went through a few use cases and the queries that can handle them. This chapter is dedicated to extending our querying knowledge. By the end of this chapter, you will have learned the following topics:

  • What filtering is and how to use it
  • What highlighting is and how to use it
  • What the highlighter types are and what benefits they bring
  • How to validate your queries
  • How to sort your query results
  • What query rewrite is and how to control it

Filtering your results

In the previous chapter, we talked about various types of queries. What they had in common was that we always wanted to get the best results first. This is the main difference from the standard database approach, where a document either matches the query or it does not. In the database world, we do not ask how good a document is; we are only interested in which results are returned. With full-text search engines this is different: we are interested not only in the results but also in their quality. The reason is simple: we are searching unstructured data, using text fields that go through language analysis, stemming, and so on. Because of that, the first version of a query usually gives results that are far from optimal. This is why, when we talk about searching, we talk about precision and recall.

On the other hand, sometimes we want to narrow the whole set of documents down to a chosen subset. For example, in a library, we may want to search only the available books, the rest being unimportant. In such a case the score calculated for the availability field has no meaning in terms of relevance and only distorts the overall score. This is where filters should be used: they limit the results of the query without interfering with the calculated score.

Prior to Elasticsearch 2.0, filters were entities independent of queries. In practice, almost every query had its counterpart among the filters. There was the term query and the term filter, the bool query and the bool filter, the range query and the range filter, and so on. From the user's point of view, the most important difference between queries and filters was scoring. A filter didn't calculate the score, which made it easy to cache and more efficient. However, having two parallel sets of constructs was inconvenient for users. With the release of Elasticsearch 2.0 and its move to Lucene 5.x, filters were deprecated, along with some types of queries that allowed us to use filters. Let's discuss how filtering works now and what we can do in Elasticsearch 2.0 to achieve the same or better performance than before.

The context is the key

In Elasticsearch 2.0, a query can either calculate the score or skip it and use a more efficient way of execution. In many cases this is decided automatically, based on the context in which the query is used. It applies to the parts of a query that act as filters, that is, parts that only remove documents based on some criteria. Such documents are not needed in the returned results and should be skipped as quickly as possible, without affecting the overall score. After discarding them, Elasticsearch can focus on the remaining documents only, calculating their scores and sorting them before returning them. A good example is the must_not clause of a bool query. A document that matches the must_not clause is removed from the result set, so calculating a score for the documents matched by this part of the bool query would be additional, unnecessary, and inefficient work.

The best thing about these changes is that we don't need to worry about whether filtering is used or not. Elasticsearch and the underlying Apache Lucene library take care of choosing the right execution method for us.
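
To make the must_not case more concrete, here is a minimal Python sketch that builds such a request body. The function name and the title search text are illustrative assumptions; the field names come from the sample library data used in this chapter. The printed JSON can be sent with curl in the same way as the other requests shown here.

```python
import json

def available_books_query(title_text):
    """Build a bool query that scores on title and drops unavailable books."""
    return {
        "query": {
            "bool": {
                # Scoring part: this clause contributes to _score.
                "must": {"match": {"title": title_text}},
                # Exclusion part: matching documents are removed from the
                # results, so no score is needed for this clause.
                "must_not": {"term": {"available": False}},
            }
        }
    }

body = available_books_query("crime")
print(json.dumps(body, indent=2))
```

Only the must clause needs a score here; the must_not clause just decides which documents are thrown away.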

Explicit filtering with bool query

As we mentioned in the Compound queries section in Chapter 3, Searching Your Data, the bool query in Elasticsearch 2.0 lets us filter explicitly: we add a filter section and put a query inside it. This is very convenient when a part of the query needs to match, but we are not interested in the score it would produce for those documents.
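
The general pattern can be sketched as a small Python helper that pairs a scoring query with a non-scoring filter. The helper name filtered and the match query on the title field are assumptions made for illustration only; the resulting JSON follows the bool query structure described above.

```python
import json

def filtered(scoring_query, filter_query):
    """Combine a scoring query with a non-scoring filter in one bool query."""
    return {
        "query": {
            "bool": {
                "must": scoring_query,    # affects which documents match and their score
                "filter": filter_query,   # affects only which documents match
            }
        }
    }

request_body = filtered(
    {"match": {"title": "front"}},
    {"term": {"available": True}},
)
print(json.dumps(request_body, indent=2))
```

Keeping the two parts as separate arguments makes it clear which criteria are meant to influence relevance and which only restrict the result set.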

Let's look at the following query:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "query" : {
    "term" : {
      "available" : true
    }
  }
}'

We see a simple query that should return all the books in our library available for borrowing, which means the documents with the available field set to true. Now let's compare it with the following query:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "query" : {
    "bool" : {
      "must" : {
        "match_all" : { }
      },
      "filter" : {
        "term" : {
          "available" : true
        }
      }
    }
  }
}'

This query matches all the books, but it also contains the filter section, which tells Elasticsearch that we are only interested in the available books. It returns the same documents as the previous query, in the same number. The difference is the score. For our example data, both queries return two books. The results returned for the first query look as follows:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 1.0,
      "_source" : {
        "title" : "Crime and Punishment",
        "otitle" : "Преступлéние и наказáние",
        "author" : "Fyodor Dostoevsky",
        "year" : 1886,
        "characters" : [ "Raskolnikov", "Sofia Semyonovna Marmeladova" ],
        "tags" : [ ],
        "copies" : 0,
        "available" : true
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "1",
      "_score" : 0.30685282,
      "_source" : {
        "title" : "All Quiet on the Western Front",
        "otitle" : "Im Westen nichts Neues",
        "author" : "Erich Maria Remarque",
        "year" : 1929,
        "characters" : [ "Paul Bäumer", "Albert Kropp", "Haie Westhus", "Fredrich Müller", "Stanislaus Katczinsky", "Tjaden" ],
        "tags" : [ "novel" ],
        "copies" : 1,
        "available" : true,
        "section" : 3
      }
    } ]
  }
}

The results for the second query look as follows:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 1.0,
      "_source" : {
        "title" : "Crime and Punishment",
        "otitle" : "Преступлéние и наказáние",
        "author" : "Fyodor Dostoevsky",
        "year" : 1886,
        "characters" : [ "Raskolnikov", "Sofia Semyonovna Marmeladova" ],
        "tags" : [ ],
        "copies" : 0,
        "available" : true
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "1",
      "_score" : 1.0,
      "_source" : {
        "title" : "All Quiet on the Western Front",
        "otitle" : "Im Westen nichts Neues",
        "author" : "Erich Maria Remarque",
        "year" : 1929,
        "characters" : [ "Paul Bäumer", "Albert Kropp", "Haie Westhus", "Fredrich Müller", "Stanislaus Katczinsky", "Tjaden" ],
        "tags" : [ "novel" ],
        "copies" : 1,
        "available" : true,
        "section" : 3
      }
    } ]
  }
}

If you look at the scores of the documents returned by each query, you'll notice the difference. For the simple term query, Elasticsearch (in fact, the Lucene library) calculated a score of 1.0 for the first document and 0.30685282 for the second one. This is not what we want, because the availability check is binary and it should not influence the score. That's why the second query is better in this case. With the bool query and filtering, no score is calculated for the filter element, and both documents get the same score, 1.0, which comes from the match_all query in the must section.
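
The uneven scores of the plain term query are consistent with term statistics being computed per shard. The following sketch is only one plausible reading, under several assumptions: that the classic TF-IDF similarity of Lucene 5 is in use, that for a single-term query with a term frequency of 1 and a field norm of 1 the score reduces to the idf value, and that the two documents live on different shards. The per-shard document counts are guesses chosen to fit the output; they are not given in this chapter.

```python
import math

def classic_idf(doc_freq, num_docs):
    """Inverse document frequency as defined by Lucene's classic TF-IDF similarity."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Assumed per-shard statistics (hypothetical, not stated in the text):
# shard A holds 2 documents, 1 of which has available=true
# shard B holds 1 document, and it has available=true
score_shard_a = classic_idf(doc_freq=1, num_docs=2)
score_shard_b = classic_idf(doc_freq=1, num_docs=1)
print(score_shard_a, score_shard_b)
```

Under these assumptions the first value is exactly 1.0 and the second is approximately 0.3069, which matches the two scores in the first response. It also illustrates why a binary criterion such as availability is better placed in the filter section, where no such statistics come into play.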
