In the previous chapter, we talked extensively about querying in Elasticsearch. We started by looking at how default Apache Lucene scoring works, moved on to how filtering works, and finished by looking at which query to use in a particular situation. In this chapter, we will continue with some of the Elasticsearch functionalities connected to both querying and data analysis. By the end of this chapter, we will have covered the following areas:
One of the great features provided by Elasticsearch is the ability to change the ordering of documents after they have been returned by a query. Actually, Elasticsearch does a simple trick: it recalculates the score of the top matching documents, so only part of the documents in the response are reordered. The reasons why we may want to do that vary. One of them may be performance; for example, calculating the target ordering may be very costly because scripts are used, and we would like to do this only on a subset of the documents returned by the original query. You can imagine that rescore gives us many great opportunities for business use cases. Now, let's look at this functionality and how we can benefit from using it.
Rescore in Elasticsearch is the process of recalculating the score for a defined number of documents returned by the query. This means that Elasticsearch first takes N documents for a given query (or the post_filter phase) and calculates their score using a provided rescore definition. For example, if we take a term query and ask for all the documents that are available, we can use rescore to recalculate the score for 100 documents only, not for all documents returned by the query. Please note that the rescore phase will not be executed when using a search_type of scan or count. This means that rescore won't be taken into consideration in such cases.
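The idea of rescoring only the first N documents can be sketched with a toy, single-shard model in Python. This is only an illustration under stated assumptions, not how Elasticsearch implements rescoring: the function name rescore_top_n, the document IDs, and all score values are hypothetical, and the second-pass score is simply added to the first-pass score.

```python
def rescore_top_n(hits, window_size, rescore_fn):
    """Toy single-shard model: rescore only the first `window_size` hits.

    hits: list of (doc_id, score) pairs, already sorted by score, descending.
    rescore_fn: maps doc_id -> second-pass score (hypothetical helper).
    """
    window, rest = hits[:window_size], hits[window_size:]
    # Add the second-pass score to the first-pass score for hits in the window.
    rescored = [(doc_id, score + rescore_fn(doc_id)) for doc_id, score in window]
    # Only the window is re-sorted; hits outside it keep their order and score.
    rescored.sort(key=lambda hit: hit[1], reverse=True)
    return rescored + rest


# Hypothetical first-pass results and second-pass scores.
first_pass = [("a", 3.0), ("b", 2.5), ("c", 2.0), ("d", 1.5)]
second_pass = {"a": 0.5, "b": 2.0, "c": 5.0, "d": 100.0}

result = rescore_top_n(first_pass, window_size=3, rescore_fn=second_pass.get)
print(result)
```

Note that document d keeps its original score and position even though its second-pass score would have been the highest, because it falls outside the window.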
Let's start with a simple query that looks as follows:
{ "fields" : ["title", "available"], "query" : { "match_all" : {} } }
It returns all the documents from the index the query is run against. Every document returned by the query will have the score equal to 1.0 because of the match_all query. This is enough to show how rescore affects our result set.
Let's now modify our query so that it uses the rescore functionality. Basically, let's assume that we want the score of the document to be equal to the value of the year field. The query that does that would look as follows:
{ "fields": ["title", "available"], "query": { "match_all": {} }, "rescore": { "query": { "rescore_query": { "function_score": { "query": { "match_all": {} }, "script_score": { "script": "doc['year'].value" } } } } } }
Please note that you need to specify the lang property with the groovy value in the preceding query if you are using Elasticsearch 1.4 or older. What's more, the preceding example uses dynamic scripting, which was enabled in Elasticsearch until versions 1.3.8 and 1.4.3 for Groovy and until 1.2 for MVEL. If you would like to use dynamic scripting with Groovy, you should add the script.groovy.sandbox.enabled property and set it to true in your elasticsearch.yml file. However, please remember that this is a security risk.
Let's now look at the preceding query in more detail. The first thing you may have noticed is the rescore object. This object holds the query that will affect the scoring of the documents returned by the original query. In our case, the logic is very simple: just assign the value of the year field as the score of the document. Please also note that when using curl you need to escape the script value, so doc['year'].value would look like doc[\"year\"].value.
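The escaped form can be reproduced with a JSON encoder. The following Python sketch assumes that the escaping in question is the standard JSON escape for double quotes inside a string; it uses only the standard json module.

```python
import json

# The script written with double quotes around the field name.
script = 'doc["year"].value'
encoded = json.dumps({"script": script})
print(encoded)  # {"script": "doc[\"year\"].value"}

# Single quotes need no escaping inside a JSON string.
single_quoted = json.dumps({"script": "doc['year'].value"})
print(single_quoted)
```

Decoding the encoded text with json.loads gives back the original script value, so the two spellings carry the same script.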
We can save this query in the query.json file and send it using the following command:
curl 'localhost:9200/library/book/_search?pretty' -d @query.json
The response that Elasticsearch returns should look as follows (please note that we've trimmed the response so that it is as simple as it can be):
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 6, "max_score" : 1962.0, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "2", "_score" : 1962.0, "fields" : { "title" : [ "Catch-22" ], "available" : [ false ] } }, { "_index" : "library", "_type" : "book", "_id" : "3", "_score" : 1937.0, "fields" : { "title" : [ "The Complete Sherlock Holmes" ], "available" : [ false ] } }, { "_index" : "library", "_type" : "book", "_id" : "1", "_score" : 1930.0, "fields" : { "title" : [ "All Quiet on the Western Front" ], "available" : [ true ] } }, { "_index" : "library", "_type" : "book", "_id" : "6", "_score" : 1905.0, "fields" : { "title" : [ "The Peasants" ], "available" : [ true ] } }, { "_index" : "library", "_type" : "book", "_id" : "4", "_score" : 1887.0, "fields" : { "title" : [ "Crime and Punishment" ], "available" : [ true ] } }, { "_index" : "library", "_type" : "book", "_id" : "5", "_score" : 1775.0, "fields" : { "title" : [ "The Sorrows of Young Werther" ], "available" : [ true ] } } ] } }
As we can see, Elasticsearch found all the documents from the original query. Now look at the score of the documents. Elasticsearch took the first N documents and applied the second query to them. As a result, the score of those documents is the sum of the scores from the first and second queries. For example, the score of 1962.0 for the top hit is consistent with a year value of 1961 added to the 1.0 assigned by the match_all query.
As you know, script execution can be demanding when it comes to performance. That's why we've used it in the rescore phase of the query. If our initial match_all query returned thousands of results, calculating script-based scoring for all of them could affect query performance. Rescore gave us the possibility to calculate such scoring only on the top N documents and thus reduce the performance impact.
In our example, we have only seen a single rescore definition. Since Elasticsearch 1.1.0, it has been possible to define multiple rescore queries for a single result set. Thanks to this, you can build multilevel queries in which the top N documents are reordered and this result is an input for the next reordering.
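A request with multiple rescore definitions passes an array under the rescore key. The following Python sketch builds such a request body and prints it as JSON. The window sizes and the first rescore query (a match query on the title field for the term crime) are hypothetical values chosen only for illustration; the second definition reuses the script-based query shown earlier.

```python
import json

body = {
    "fields": ["title", "available"],
    "query": {"match_all": {}},
    # An array of rescore definitions, applied one after another.
    "rescore": [
        {
            # First level: a wider window with a hypothetical match query.
            "window_size": 100,
            "query": {
                "rescore_query": {"match": {"title": "crime"}},
            },
        },
        {
            # Second level: a narrower window with the costly script.
            "window_size": 10,
            "query": {
                "rescore_query": {
                    "function_score": {
                        "query": {"match_all": {}},
                        "script_score": {"script": "doc['year'].value"},
                    }
                }
            },
        },
    ],
}

request_json = json.dumps(body, indent=2)
print(request_json)
```

Putting the cheaper query first with a larger window and the script-based one last with a smaller window keeps the expensive scoring limited to the fewest documents.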
Now let's see how to tune rescore functionality behavior and what parameters are available.
In the query under the rescore object, we are allowed to use the following parameters:
window_size (defaults to the sum of the from and size parameters): The number of documents used for rescoring on every shard
query_weight (defaults to 1): The resulting score of the original query will be multiplied by this value before adding the score generated by rescore
rescore_query_weight (defaults to 1): The resulting score of the rescore will be multiplied by this value before adding the score generated by the original query
To sum up, the target score for the document is equal to:
original_query_score * query_weight + rescore_query_score * rescore_query_weight
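The formula above can be checked with a few lines of Python. The function name rescored_score is hypothetical, and the input values assume a match_all score of 1.0 and a year value of 1961, which are illustrative numbers rather than values stated in the text.

```python
def rescored_score(original_query_score, rescore_query_score,
                   query_weight=1.0, rescore_query_weight=1.0):
    """Weighted sum of the original score and the rescore query score."""
    return (original_query_score * query_weight
            + rescore_query_score * rescore_query_weight)


# Default weights: the two scores are simply added.
print(rescored_score(1.0, 1961.0))  # 1962.0

# A query_weight of 0 makes the final score depend on the rescore query only.
print(rescored_score(1.0, 1961.0, query_weight=0.0))  # 1961.0

# Custom weights on both parts.
print(rescored_score(1.0, 1961.0, query_weight=2.0, rescore_query_weight=0.5))  # 982.5
```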
By default, the score from the original query part and the score from the rescored part are added together. However, we can control that by specifying the score_mode parameter. The available values for it are as follows:
total: Score values are added together (the default behavior)
multiply: Values are multiplied by each other
avg: The resulting score is the average of the two scores
max: The result is equal to the greater score value
min: The result is equal to the lower score value
Sometimes, we want to show results where the ordering of the first documents on the page is affected by some additional rules. Unfortunately, this cannot be achieved by the rescore functionality. The first idea points to the window_size parameter, but this parameter is, in fact, not connected with the first documents on the result list but with the number of results returned on every shard. In addition, the window_size value cannot be less than the page size (Elasticsearch will set window_size to the value of the size property when window_size is lower than size). Also, one very important thing: rescoring cannot be combined with sorting, because sorting is done before the changes to the documents' scores are made by rescoring, and thus sorting won't take the newly calculated score into consideration.
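The score_mode values and the window_size rules described in this section can be summarized in a short Python sketch. The function names combine and effective_window_size are hypothetical, and the sketch assumes that the weights are applied to each score before the two are combined; it models only what the text describes, not the Elasticsearch implementation.

```python
def combine(original, rescore, score_mode="total",
            query_weight=1.0, rescore_query_weight=1.0):
    """Combine two scores according to the given score_mode value."""
    # Assumption: each score is weighted first, then the mode is applied.
    primary = original * query_weight
    secondary = rescore * rescore_query_weight
    modes = {
        "total": lambda: primary + secondary,
        "multiply": lambda: primary * secondary,
        "avg": lambda: (primary + secondary) / 2,
        "max": lambda: max(primary, secondary),
        "min": lambda: min(primary, secondary),
    }
    if score_mode not in modes:
        raise ValueError("unknown score_mode: %s" % score_mode)
    return modes[score_mode]()


def effective_window_size(size, from_=0, window_size=None):
    """Window size as described in the text, applied on every shard."""
    if window_size is None:
        # Default: the sum of the from and size parameters.
        return from_ + size
    # A window smaller than the page size is raised to the page size.
    return max(window_size, size)


for mode in ("total", "multiply", "avg", "max", "min"):
    print(mode, combine(2.0, 10.0, score_mode=mode))

print(effective_window_size(size=10, window_size=5))
```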