Query rewrite

When debugging queries, it helps to know how they are actually executed. For that reason, this section covers how query rewrite works in Elasticsearch, why it is used, and how to control it. If you have ever used a prefix query, a wildcard query, or any other multiterm query (a query that expands to multiple terms), you have used query rewriting, even if you did not know about it. Elasticsearch rewrites queries for performance reasons: the rewrite process turns the original, expensive query into a set of queries that are far less expensive from the Apache Lucene point of view, which speeds up query execution.

Prefix query as an example

The best way to illustrate how the rewrite process is done internally is to look at an example and see which terms are used instead of the original query term. We will index three documents to our library_it index by using the following commands:

curl -XPOST 'localhost:9200/library_it/book/1' -d '{"title": "Solr 4 Cookbook"}'
curl -XPOST 'localhost:9200/library_it/book/2' -d '{"title": "Solr 3.1 Cookbook"}'
curl -XPOST 'localhost:9200/library_it/book/3' -d '{"title": "Mastering Elasticsearch"}'

What we would like is to find all the documents whose title starts with the letter s. To do that, we run the following query against our library_it index:

curl -XGET 'localhost:9200/library_it/_search?pretty' -d '{
 "query" : {
  "prefix" : {
   "title" : {
    "prefix" : "s",
    "rewrite" : "constant_score_boolean"
   }
  }
 }
}'

We've used a simple prefix query; we've said that we would like to find all the documents that have a term starting with the letter s in the title field. We've also used the rewrite property to specify the query rewrite method, but let's skip it for now, as we will discuss the possible values of this parameter later in this section.

As the response to the previous query, we get the following:

{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "library_it",
      "_type" : "book",
      "_id" : "2",
      "_score" : 1.0,
      "_source" : {
        "title" : "Solr 3.1 Cookbook"
      }
    }, {
      "_index" : "library_it",
      "_type" : "book",
      "_id" : "1",
      "_score" : 1.0,
      "_source" : {
        "title" : "Solr 4 Cookbook"
      }
    } ]
  }
}

As you can see, in response we got the two documents that had the contents of the title field starting with the desired character. We didn't specify the mappings explicitly, so we relied on Elasticsearch's ability to choose the mapping type for us. As we already know, for the text field, Elasticsearch uses the default analyzer. This means that the terms in our documents will be lowercased and, because of that, we used the lowercased letter in our prefix query (remember that the prefix query is not analyzed).
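The effect of analysis on a prefix query can be sketched in a few lines of Python. This is only an illustration: the analyze function is a rough stand-in for the default analyzer (it splits on whitespace and lowercases, which is an assumption and much simpler than the real thing), and prefix_match is a hypothetical helper, not an Elasticsearch API.

```python
def analyze(text):
    """Rough stand-in for the default analyzer: split on whitespace, lowercase."""
    return [token.lower() for token in text.split()]


titles = {
    1: "Solr 4 Cookbook",
    2: "Solr 3.1 Cookbook",
    3: "Mastering Elasticsearch",
}

# Terms stored for each document at index time.
indexed = {doc_id: analyze(title) for doc_id, title in titles.items()}


def prefix_match(prefix):
    """Compare the raw prefix against the indexed terms; the prefix is not analyzed."""
    return sorted(
        doc_id
        for doc_id, terms in indexed.items()
        if any(term.startswith(prefix) for term in terms)
    )


print(prefix_match("s"))  # [1, 2]
print(prefix_match("S"))  # []
```

Because the prefix is compared as-is with the lowercased terms, an uppercase S finds nothing in this sketch, which is why the query uses the lowercased letter.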

Getting back to Apache Lucene

Now let's take a step back and look at Apache Lucene again. If you recall what the Lucene inverted index is built from, you can tell that it contains a term, a count, and a document pointer (if you don't recall, refer to the Full text searching section in Chapter 1, Getting Started with Elasticsearch Cluster). So, let's see how a simplified view of the index may look for the data we've indexed into the library_it index:

[Figure: simplified view of the inverted index for the library_it index]

What you see in the column with the Term text is quite important. If you look at Elasticsearch and Apache Lucene internals, you can see that our prefix query was rewritten to the following Lucene query:

ConstantScore(title:solr)
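To make the rewrite step more concrete, here is a minimal Python sketch of the idea, not of Lucene's actual implementation. It builds a toy inverted index for our three titles, then expands the prefix into the terms found in the sorted term dictionary. The analyze, expand_prefix, and rewrite_prefix names are hypothetical, the analyzer is a simplified assumption (whitespace split plus lowercasing), the count is assumed to be the number of documents containing the term, and the output string for more than one matching term is only illustrative.

```python
from bisect import bisect_left
from collections import defaultdict


def analyze(text):
    """Simplified stand-in for the default analyzer (assumption)."""
    return [token.lower() for token in text.split()]


docs = {
    1: "Solr 4 Cookbook",
    2: "Solr 3.1 Cookbook",
    3: "Mastering Elasticsearch",
}

# Toy inverted index: term -> set of ids of the documents containing it.
postings = defaultdict(set)
for doc_id, title in docs.items():
    for term in analyze(title):
        postings[term].add(doc_id)

terms = sorted(postings)  # the term dictionary, kept in sorted order


def expand_prefix(prefix):
    """Collect every indexed term that starts with the given prefix."""
    matched = []
    for term in terms[bisect_left(terms, prefix):]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        matched.append(term)
    return matched


def rewrite_prefix(field, prefix):
    """Render the expanded terms in a form similar to the rewritten query."""
    clauses = " ".join(f"{field}:{term}" for term in expand_prefix(prefix))
    return f"ConstantScore({clauses})"


for term in terms:
    print(term, len(postings[term]), sorted(postings[term]))

print(rewrite_prefix("title", "s"))  # ConstantScore(title:solr)
```

Only one indexed term, solr, starts with s, so the prefix query collapses into a single cheap term lookup.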

We can inspect how the query was rewritten by using the Elasticsearch API. For example, we can use the Explain API by running the following command:

curl -XGET 'localhost:9200/library_it/book/1/_explain?pretty' -d '{
 "query" : {
  "prefix" : {
   "title" : {
    "prefix" : "s",
    "rewrite" : "constant_score_boolean"
   }
  }
 }
}'

The result will be as follows:

{
  "_index" : "library_it",
  "_type" : "book",
  "_id" : "1",
  "matched" : true,
  "explanation" : {
    "value" : 1.0,
    "description" : "sum of:",
    "details" : [ {
      "value" : 1.0,
      "description" : "ConstantScore(title:solr), product of:",
      "details" : [ {
        "value" : 1.0,
        "description" : "boost",
        "details" : [ ]
      }, {
        "value" : 1.0,
        "description" : "queryNorm",
        "details" : [ ]
      } ]
    }, {
      "value" : 0.0,
      "description" : "match on required clause, product of:",
      "details" : [ {
        "value" : 0.0,
        "description" : "# clause",
        "details" : [ ]
      }, {
        "value" : 1.0,
        "description" : "_type:book, product of:",
        "details" : [ {
          "value" : 1.0,
          "description" : "boost",
          "details" : [ ]
        }, {
          "value" : 1.0,
          "description" : "queryNorm",
          "details" : [ ]
        } ]
      } ]
    } ]
  }
}

We can see that Elasticsearch used a constant score query with the term solr against the title field.
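When the explanation tree is large, picking out the rewritten query by eye gets tedious. The following Python sketch walks an explanation and collects the description of every node; the iter_descriptions helper is hypothetical, and the explanation dictionary is an abridged copy of the response shown above rather than a live call to Elasticsearch.

```python
def iter_descriptions(node, depth=0):
    """Yield (depth, description) for a node and all of its nested details."""
    yield depth, node["description"]
    for child in node.get("details", []):
        yield from iter_descriptions(child, depth + 1)


# Abridged copy of the "explanation" object from the response above.
explanation = {
    "value": 1.0,
    "description": "sum of:",
    "details": [
        {
            "value": 1.0,
            "description": "ConstantScore(title:solr), product of:",
            "details": [
                {"value": 1.0, "description": "boost", "details": []},
                {"value": 1.0, "description": "queryNorm", "details": []},
            ],
        },
        {
            "value": 0.0,
            "description": "match on required clause, product of:",
            "details": [
                {"value": 0.0, "description": "# clause", "details": []},
            ],
        },
    ],
}

# Print the tree with indentation that mirrors the nesting.
for depth, description in iter_descriptions(explanation):
    print("  " * depth + description)

# Keep only the nodes that describe a constant score query.
rewritten = [
    description
    for _, description in iter_descriptions(explanation)
    if description.startswith("ConstantScore(")
]
print(rewritten)
```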

Query rewrite properties

We can control how the queries are rewritten internally. To do that, we place the rewrite parameter inside the JSON object responsible for the actual query. For example:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "query" : {
    "prefix" : {
      "title" : {
        "prefix" : "s",
        "rewrite" : "constant_score_boolean"
      }
    }
  }
}'

The rewrite property can take the following values:

  • scoring_boolean: This rewrite method translates each generated term into a Boolean should clause in a Boolean query. It causes the score to be calculated for each document, so it may be CPU demanding. Please also note that, for queries that expand to many terms, it may exceed the Boolean query limit, which is set to 1024 clauses by default. The limit can be changed by setting the index.query.bool.max_clause_count property in the elasticsearch.yml file. However, remember that the more Boolean clauses are produced, the lower the query performance may be.
  • constant_score: This rewrite method chooses constant_score_boolean or constant_score_filter depending on the query and taking performance into consideration. This is also the default behavior when the rewrite property is not set at all.
  • constant_score_boolean: This rewrite method is similar to the scoring_boolean rewrite method described previously, but less CPU demanding because the scoring is not computed and, instead of that, each term receives a score equal to the query boost (one by default, and which can be set using the boost property). Because this rewrite method also results in Boolean should clauses being created, similar to the scoring_boolean rewrite method, this method can also hit the maximum Boolean clauses limit.
  • top_terms_N: A rewrite method that translates each generated term into a Boolean should clause in a Boolean query and keeps the scores as computed by the query. However, unlike the scoring_boolean rewrite method, it only keeps an N number of top scoring terms to avoid hitting the maximum Boolean clauses limit and increase the final query performance.
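  • top_terms_blended_freqs_N: A rewrite method that translates each generated term into a Boolean should clause in a Boolean query, but treats all the terms as if they had the same term frequency. Like top_terms_N, it only keeps the N top scoring terms.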
  • top_terms_boost_N: A rewrite method similar to the top_terms_N one, but the scores are not computed. Instead, the documents are given a score equal to the value of the boost property (one by default).
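The differences between these methods can be sketched as a toy Python model. This is not how Lucene implements the rewrite: the rewrite function, the TooManyClauses exception, and the per-term weights are all made up for illustration, the weights stand in both for the computed score and for the ranking of terms, and the constant_score and top_terms_blended_freqs_N methods are deliberately left out.

```python
import re

MAX_CLAUSE_COUNT = 1024  # default Boolean clause limit cited in the list above


class TooManyClauses(Exception):
    """Hypothetical error raised when a rewrite exceeds the clause limit."""


def rewrite(method, term_weights, boost=1.0, max_clauses=MAX_CLAUSE_COUNT):
    """Return the (term, score) should clauses a rewrite method would produce.

    term_weights maps each matching term to a made-up weight that stands in
    for the score a real scorer would compute.
    """
    top = re.fullmatch(r"top_terms_(boost_)?(\d+)", method)
    if top:
        # Keep at most N clauses, chosen by the highest stand-in weight.
        n = int(top.group(2))
        best = sorted(term_weights, key=term_weights.get, reverse=True)[:n]
        constant = top.group(1) is not None
        return [(t, boost if constant else term_weights[t]) for t in best]
    if method not in ("scoring_boolean", "constant_score_boolean"):
        raise ValueError(f"rewrite method not modelled here: {method}")
    if len(term_weights) > max_clauses:
        raise TooManyClauses(f"{len(term_weights)} clauses > {max_clauses}")
    if method == "scoring_boolean":
        return list(term_weights.items())  # one scored clause per term
    return [(t, boost) for t in term_weights]  # one constant clause per term


weights = {"solr": 0.3, "search": 0.5, "shard": 0.2}
print(rewrite("scoring_boolean", weights))
print(rewrite("constant_score_boolean", weights))
print(rewrite("top_terms_2", weights))
print(rewrite("top_terms_boost_2", weights, boost=2.0))
```

In this model, only the two Boolean methods can run into the clause limit, because the top terms variants cap the number of clauses at N.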

For example, if we would like our example query to use top_terms_N with N equal to 2, our query would look like this:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "query" : {
    "prefix" : {
     "title" : {
      "prefix" :"s",
      "rewrite" : "top_terms_2"
     }
    }
  }
}'

If you look at the results returned by Elasticsearch, you'll notice that, unlike our initial query, the documents were given a score different than the default 1.0:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.15342641,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "3",
      "_score" : 0.15342641,
      "_source" : {
        "title" : "The Complete Sherlock Holmes",
        "author" : "Arthur Conan Doyle",
        "year" : 1936,
        "characters" : [ "Sherlock Holmes", "Dr. Watson", "G. Lestrade" ],
        "tags" : [ ],
        "copies" : 0,
        "available" : false,
        "section" : 12
      }
    } ]
  }
}

The score is different than the default 1.0 because we've used the top_terms_N rewrite type and this type of query rewrite keeps the score for N top scoring terms.
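Since N is part of the rewrite value itself, it can be convenient to build the request body programmatically. The following Python sketch does that with two hypothetical helpers, prefix_query and top_terms; it only produces the JSON body in the same shape as the request used above and does not send anything to Elasticsearch.

```python
import json


def top_terms(n, boost_only=False):
    """Build a top_terms_N or top_terms_boost_N rewrite value."""
    if n < 1:
        raise ValueError("N must be a positive integer")
    return f"top_terms_boost_{n}" if boost_only else f"top_terms_{n}"


def prefix_query(field, prefix, rewrite=None):
    """Build a prefix query body, optionally with a rewrite method."""
    inner = {"prefix": prefix}
    if rewrite is not None:
        inner["rewrite"] = rewrite
    return {"query": {"prefix": {field: inner}}}


body = prefix_query("title", "s", rewrite=top_terms(2))
print(json.dumps(body, indent=1))
```

The printed body can be passed to curl with the -d option, exactly like the hand-written request.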

Before we finish the Query rewrite section of this chapter, we should ask ourselves one last question: when should each rewrite type be used? The answer greatly depends on your use case, but, to summarize, if you can live with lower precision and relevancy in exchange for higher performance, you can go for the top N rewrite methods. If you need high precision and thus more relevant results, at the cost of lower performance, choose the Boolean approach.
