Query rewrite explained

We have already talked about scoring, which is valuable knowledge, especially when trying to improve the relevance of our queries. We also think that when debugging your queries, it is valuable to know how all the queries are executed; therefore, it is because of this we decided to include this section on how query rewrite works in Elasticsearch, why it is used, and how to control it.

If you have ever used queries, such as the prefix query and the wildcard query, basically any query that is said to be multiterm, you've probably heard about query rewriting. Elasticsearch does that because of performance reasons. The rewrite process is about changing the original, expensive query to a set of queries that are far less expensive from Lucene's point of view and thus speed up the query execution. The rewrite process is not visible to the client, but it is good to know that we can alter the rewrite process behavior. For example, let's look at what Elasticsearch does with a prefix query.

Prefix query as an example

The best way to illustrate how the rewrite process is done internally is to look at an example and see what terms are used instead of the original query term. Let's say we have the following data in our index:

curl -XPUT 'localhost:9200/clients/client/1' -d '{
  "id":"1", "name":"Joe"
}'
curl -XPUT 'localhost:9200/clients/client/2' -d '{
  "id":"2", "name":"Jane"
}'
curl -XPUT 'localhost:9200/clients/client/3' -d '{
  "id":"3", "name":"Jack"
}'
curl -XPUT 'localhost:9200/clients/client/4' -d '{
  "id":"4", "name":"Rob"
}'

We would like to find all the documents that start with the j letter. As simple as that, we run the following query against our clients index:

curl -XGET 'localhost:9200/clients/_search?pretty' -d '{
 "query" : {
  "prefix" : {
   "name" : {
    "prefix" : "j",
    "rewrite" : "constant_score_boolean"
   }
  }
 }
}'

We've used a simple prefix query; we've said that we would like to find all the documents with the j letter in the name field. We've also used the rewrite property to specify the query rewrite method, but let's skip it for now, as we will discuss the possible values of this parameter in the later part of this section.

As the response to the previous query, we've got the following:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "clients",
      "_type" : "client",
      "_id" : "3",
      "_score" : 1.0,
      "_source":{
  "id":"3", "name":"Jack"
}
    }, {
      "_index" : "clients",
      "_type" : "client",
      "_id" : "2",
      "_score" : 1.0,
      "_source":{
  "id":"2", "name":"Jane"
}
    }, {
      "_index" : "clients",
      "_type" : "client",
      "_id" : "1",
      "_score" : 1.0,
      "_source":{
  "id":"1", "name":"Joe"
}
    } ]
  }
}

As you can see, in response we've got the three documents that have the contents of the name field starting with the desired character. We didn't specify the mappings explicitly, so Elasticsearch has guessed the name field mapping and has set it to string-based and analyzed. You can check this by running the following command:

curl -XGET 'localhost:9200/clients/client/_mapping?pretty'

Elasticsearch response will be similar to the following code:

{
  "client" : {
    "properties" : {
      "id" : {
        "type" : "string"
      },
      "name" : {
        "type" : "string"
      }
    }
  }
}

Getting back to Apache Lucene

Now let's take a step back and look at Apache Lucene again. If you recall what Lucene inverted index is built of, you can tell that it contains a term, a count, and a document pointer (if you can't recall, please refer to the Introduction to Apache Lucene section in Chapter 1, Introduction to Elasticsearch). So, let's see how the simplified view of the index may look for the previous data we've put to the clients index, as shown in the following figure:

Getting back to Apache Lucene

What you see in the column with the term text is quite important. If we look at Elasticsearch and Apache Lucene internals, you can see that our prefix query was rewritten to the following Lucene query:

ConstantScore(name:jack name:jane name:joe)

We can check the portions of the rewrite using the Elasticsearch API. First of all, we can use the Explain API by running the following command:

curl -XGET 'localhost:9200/clients/client/1/_explain?pretty' -d '{
 "query" : {
  "prefix" : {
   "name" : {
    "prefix" : "j",
    "rewrite" : "constant_score_boolean"
   }
  }
 }
}'

The result would be as follows:

{
  "_index" : "clients",
  "_type" : "client",
  "_id" : "1",
  "matched" : true,
  "explanation" : {
    "value" : 1.0,
    "description" : "ConstantScore(name:joe), product of:",
    "details" : [ {
      "value" : 1.0,
      "description" : "boost"
    }, {
      "value" : 1.0,
      "description" : "queryNorm"
    } ]
  }
}

We can see that Elasticsearch used a constant score query with the joe term against the name field. Of course, this is on Lucene level; Elasticsearch actually used a cache to get the terms. We can see this by using the Validate Query API with a command that looks as follows:

curl -XGET 'localhost:9200/clients/client/_validate/query?explain&pretty' -d '{
 "query" : {
  "prefix" : {
   "name" : {
    "prefix" : "j",
    "rewrite" : "constant_score_boolean"
   }
  }
 }
}'

The result returned by Elasticsearch would look like the following:

{
  "valid" : true,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "explanations" : [ {
    "index" : "clients",
    "valid" : true,
    "explanation" : "filtered(name:j*)->cache(_type:client)"
  } ]
}

Query rewrite properties

Of course, the rewrite property of multiterm queries can take more than a single constant_score_boolean value. We can control how the queries are rewritten internally. To do that, we place the rewrite parameter inside the JSON object responsible for the actual query, for example, like the following code:

{
   "query" : {
    "prefix" : {
      "name" : "j",
      "rewrite" : "constant_score_boolean"
    }
  }
}

The rewrite property can take the following values:

  • scoring_boolean: This rewrite method translates each generated term into a Boolean should clause in a Boolean query. This rewrite method causes the score to be calculated for each document. Because of that, this method may be CPU demanding and for queries that many terms may exceed the Boolean query limit, which is set to 1024. The default Boolean query limit can be changed by setting the index.query.bool.max_clause_count property in the elasticsearch.yml file. However, please remember that the more Boolean queries are produced, the lower the query performance may be.
  • constant_score_boolean: This rewrite method is similar to the scoring_boolean rewrite method described previously, but is less CPU demanding because scoring is not computed, and instead of that, each term receives a score equal to the query boost (one by default and can be set using the boost property). Because this rewrite method also results in Boolean should clauses being created, similar to the scoring_boolean rewrite method, this method can also hit the maximum Boolean clauses limit.
  • constant_score_filter: As Apache Lucene Javadocs state, this rewrite method rewrites the query by creating a private filter by visiting each term in a sequence and marking all documents for that term. Matching documents are given a constant score equal to the query boost. This method is faster than the scoring_boolean and constant_score_boolean methods, when the number of matching terms or documents is not small.
  • top_terms_N: A rewrite method that translates each generated term into a Boolean should clause in a Boolean query and keeps the scores as computed by the query. However, unlike the scoring_boolean rewrite method, it only keeps the N number of top scoring terms to avoid hitting the maximum Boolean clauses limit and increase the final query performance.
  • top_terms_boost_N: It is a rewrite method similar to the top_terms_N one, but the scores are not computed, but instead the documents are given the score equal to the value of the boost property (one by default).

Note

When the rewrite property is set to constant_score_auto value or not set at all, the value of constant_score_filter or constant_score_boolean will be used depending on the query and how it is constructed.

For example, if we would like our example query to use the top_terms_N with N equal to 2, our query would look like the following:

{
  "query" : {
    "prefix" : {
     "name" : {
      "prefix" :"j",
      "rewrite" : "top_terms_2"
     }
    }
  }
}

If you look at the results returned by Elasticsearch, you'll notice that unlike our initial query, the documents were given a score different than the default 1.0:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.30685282,
    "hits" : [ {
      "_index" : "clients",
      "_type" : "client",
      "_id" : "3",
      "_score" : 0.30685282,
      "_source":{
  "id":"3", "name":"Jack"
}
    }, {
      "_index" : "clients",
      "_type" : "client",
      "_id" : "2",
      "_score" : 0.30685282,
      "_source":{
  "id":"2", "name":"Jane"
}
    }, {
      "_index" : "clients",
      "_type" : "client",
      "_id" : "1",
      "_score" : 0.30685282,
      "_source":{
  "id":"1", "name":"Joe"
}
    } ]
  }
}

This is because the top_terms_N keeps the score for N top scoring terms.

Before we finish the query rewrite section of this chapter, we should ask ourselves one last question: when to use which rewrite types? The answer to this question greatly depends on your use case, but to summarize, if you can live with lower precision and relevancy (but higher performance), you can go for the top N rewrite method. If you need high precision and thus more relevant queries (but lower performance), choose the Boolean approach.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset