Chapter 3. Not Only Full Text Search

In the previous chapter, we talked extensively about querying in Elasticsearch. We started by looking at how default Apache Lucene scoring works, moved on to how filtering works, and finished by looking at which query to use in a particular situation. In this chapter, we will continue discussing Elasticsearch functionality connected to both querying and data analysis. By the end of this chapter, we will have covered the following areas:

  • What query rescoring is and how you can use it to optimize your queries and recalculate the score for some documents
  • Controlling multimatch queries
  • Analyzing your data to get significant terms from it
  • Grouping your documents in buckets using Elasticsearch
  • Differences in relationship handling when using object, nested documents, and parent–child functionality
  • Extended information regarding Elasticsearch scripting such as Groovy usage and Lucene expressions

Query rescoring

One of the great features provided by Elasticsearch is the ability to change the ordering of documents after they have been returned by a query. Actually, Elasticsearch does a simple trick: it recalculates the score of the top matching documents, so only part of the documents in the response is reordered. The reasons for doing this can vary. One of them may be performance: for example, calculating the target ordering may be very costly because scripts are used, and we would like to do this only on a subset of the documents returned by the original query. You can imagine that rescore gives us many great opportunities for business use cases. Now, let's look at this functionality and how we can benefit from using it.

What is query rescoring?

Rescore in Elasticsearch is the process of recalculating the score for a defined number of documents returned by the query. This means that Elasticsearch first takes the top N documents for a given query (or the post_filter phase) and then recalculates their score using the provided rescore definition. For example, if we take a term query and ask for all the available documents, we can use rescore to recalculate the score for only 100 documents, not for all the documents returned by the query. Please note that the rescore phase will not be executed when using a search_type of scan or count, so rescore won't be taken into consideration in such cases.
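The idea of rescoring only the top N documents can be illustrated with a small toy model. This is a sketch, not Elasticsearch's implementation: the rescore_top_n function, the document IDs, and the extra scores are all hypothetical, and the rescore score is simply added to the original one.

```python
def rescore_top_n(hits, window_size, rescore_fn):
    """Toy model: recalculate the score of the first window_size hits only.

    hits is a list of (doc_id, score) pairs already sorted by score,
    highest first. Hits outside the window keep their score and position.
    """
    head = [(doc_id, score + rescore_fn(doc_id))
            for doc_id, score in hits[:window_size]]
    # Reorder only the rescored part of the result list.
    head.sort(key=lambda hit: hit[1], reverse=True)
    return head + hits[window_size:]


# Four hypothetical documents that all scored 1.0 in the original query.
hits = [("a", 1.0), ("b", 1.0), ("c", 1.0), ("d", 1.0)]
# Hypothetical scores that the (expensive) rescore query would produce.
extra = {"a": 5.0, "b": 9.0, "c": 2.0, "d": 100.0}

result = rescore_top_n(hits, 2, lambda doc_id: extra[doc_id])
print(result)
```

In this model, document d would get the highest rescore value, but it is never rescored because it falls outside the window of two documents.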

An example query

Let's start with a simple query that looks as follows:

{
 "fields" : ["title", "available"],
 "query" : {
  "match_all" : {}
 }
}

It returns all the documents from the index the query is run against. Every document returned by the query will have a score equal to 1.0 because of the match_all query. This is enough to show how rescore affects our result set.

Structure of the rescore query

Let's now modify our query so that it uses the rescore functionality. Basically, let's assume that we want the score of the document to be equal to the value of the year field. The query that does that would look as follows:

{
   "fields": ["title", "available"],
   "query": {
      "match_all": {}
   },
   "rescore": {
      "query": {
         "rescore_query": {
            "function_score": {
               "query": {
                  "match_all": {}
               },
               "script_score": {
                  "script": "doc['year'].value"
               }
            }
         }
      }
   }
}

Note

Please note that you need to specify the lang property with the groovy value in the preceding query if you are using Elasticsearch 1.4 or older. What's more, the preceding example uses dynamic scripting, which was enabled in Elasticsearch until versions 1.3.8 and 1.4.3 for groovy and until 1.2 for MVEL. If you would like to use dynamic scripting with groovy, you should add the script.groovy.sandbox.enabled property to your elasticsearch.yml file and set it to true. However, please remember that this is a security risk.

Let's now look at the preceding query in more detail. The first thing you may have noticed is the rescore object. This object holds the query that will affect the scoring of the documents returned by the original query. In our case, the logic is very simple: just assign the value of the year field as the score of the document. Please also note that when passing the query to curl inline, you need to escape the quotes in the script value, so doc['year'].value would look like doc[\"year\"].value.
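The escaping mentioned above can be checked with any JSON encoder. The following sketch uses Python's standard json module (an assumption, as the original works with curl only) to build the same rescore request with the field name in double quotes and print the encoded body:

```python
import json

# The script from the rescore example, written with double quotes
# around the field name instead of single quotes.
script = 'doc["year"].value'

body = {
    "fields": ["title", "available"],
    "query": {"match_all": {}},
    "rescore": {
        "query": {
            "rescore_query": {
                "function_score": {
                    "query": {"match_all": {}},
                    "script_score": {"script": script},
                }
            }
        }
    },
}

# The encoder escapes the inner double quotes with backslashes,
# so the script appears in the output as doc[\"year\"].value
encoded = json.dumps(body)
print(encoded)
```

Decoding the string again gives back the original script, so the backslashes are only part of the JSON text and not of the script itself.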

Note

In the preceding example, in the rescore object, you can see a query object. When this book was written, a query object was the only option, but in future versions, we may expect other ways to affect the resulting score.

Let's save this query in the query.json file and send it using the following command:

curl 'localhost:9200/library/book/_search?pretty' -d @query.json

Elasticsearch should return the following documents (please note that we've kept the response as simple as it can be):

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 6,
    "max_score" : 1962.0,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "2",
      "_score" : 1962.0,
      "fields" : {
        "title" : [ "Catch-22" ],
        "available" : [ false ]
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "3",
      "_score" : 1937.0,
      "fields" : {
        "title" : [ "The Complete Sherlock Holmes" ],
        "available" : [ false ]
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "1",
      "_score" : 1930.0,
      "fields" : {
        "title" : [ "All Quiet on the Western Front" ],
        "available" : [ true ]
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "6",
      "_score" : 1905.0,
      "fields" : {
        "title" : [ "The Peasants" ],
        "available" : [ true ]
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 1887.0,
      "fields" : {
        "title" : [ "Crime and Punishment" ],
        "available" : [ true ]
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "5",
      "_score" : 1775.0,
      "fields" : {
        "title" : [ "The Sorrows of Young Werther" ],
        "available" : [ true ]
      }
    } ]
  }
}

As we can see, Elasticsearch found all the documents matching the original query. Now look at the scores of the documents. Elasticsearch took the first N documents and applied the second query to them. As a result, the score of each of those documents is the sum of the scores from the first and second queries. For example, Catch-22 got 1.0 from the match_all query plus 1961 from its year field, which gives 1962.0.

As you know, script execution can be demanding when it comes to performance. That's why we've used it in the rescore phase of the query. If our initial match_all query returned thousands of results, calculating script-based scoring for all of them could affect query performance. Rescore gives us the possibility to calculate such scoring only on the top N documents and thus reduce the performance impact.

Note

In our example, we have only seen a single rescore definition. Since Elasticsearch 1.1.0, it is possible to define multiple rescore queries for a single result set. Thanks to this, you can build multilevel queries in which the top N documents are reordered and the result is the input for the next reordering.
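As a rough sketch of what such a multilevel definition could look like, the following Python snippet builds a request body in which rescore is an array of two definitions. The window sizes and the match query on the title field are hypothetical values chosen for illustration, and the exact syntax should be verified against the documentation of your Elasticsearch version:

```python
import json

body = {
    "query": {"match_all": {}},
    # An array instead of a single object: definitions run in order.
    "rescore": [
        {
            # First pass: a cheap query over a larger window (hypothetical).
            "window_size": 100,
            "query": {
                "rescore_query": {"match": {"title": "crime"}}
            },
        },
        {
            # Second pass: the costly script over a smaller window.
            "window_size": 10,
            "query": {
                "rescore_query": {
                    "function_score": {
                        "query": {"match_all": {}},
                        "script_score": {"script": "doc['year'].value"},
                    }
                }
            },
        },
    ],
}

print(json.dumps(body, indent=2))
```

Putting the cheaper query first and the script second keeps the expensive calculation limited to the smallest set of documents.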

Now let's see how to tune rescore functionality behavior and what parameters are available.

Rescore parameters

In the rescore object, we are allowed to use the following parameters (window_size is set directly in the rescore object, while the remaining ones are set in its query object):

  • window_size (defaults to the sum of the from and size parameters): The number of documents used for rescoring on every shard
  • query_weight (defaults to 1): The resulting score of the original query will be multiplied by this value before adding the score generated by rescore
  • rescore_query_weight (defaults to 1): The resulting score of the rescore will be multiplied by this value before adding the score generated by the original query

To sum up, the target score for the document is equal to:

original_query_score * query_weight + rescore_query_score * rescore_query_weight
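The formula can be written as a small helper to try out different weights. The combined_score function is a hypothetical name used only for this sketch. The input values come from the earlier example, where Catch-22 scored 1.0 in the match_all query and ended up with 1962.0, which implies a rescore value of 1961:

```python
def combined_score(original_query_score, rescore_query_score,
                   query_weight=1.0, rescore_query_weight=1.0):
    """Default (total) combination of the original and rescore scores."""
    return (original_query_score * query_weight
            + rescore_query_score * rescore_query_weight)


# With the default weights, we get the score seen in the response.
print(combined_score(1.0, 1961.0))  # 1962.0

# Halving the original score and doubling the rescore score.
print(combined_score(1.0, 1961.0,
                     query_weight=0.5, rescore_query_weight=2.0))  # 3922.5
```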

Choosing the scoring mode

By default, the score from the original query part and the score from the rescored part are added together. However, we can control that by specifying the score_mode parameter. The available values for it are as follows:

  • total: Score values are added together (the default behavior)
  • multiply: Values are multiplied by each other
  • avg: The resulting score is the average of the two scores
  • max: The resulting score is equal to the greater of the two scores
  • min: The resulting score is equal to the lower of the two scores
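The modes listed above can be compared side by side with a short sketch. The combine function is hypothetical, and it assumes that query_weight and rescore_query_weight are applied to each score before the two values are combined:

```python
def combine(original, rescore, score_mode="total",
            query_weight=1.0, rescore_query_weight=1.0):
    """Combine two scores according to score_mode (illustrative only)."""
    # Assumption: the weights are applied before the scores are combined.
    q = original * query_weight
    r = rescore * rescore_query_weight
    if score_mode == "total":
        return q + r
    if score_mode == "multiply":
        return q * r
    if score_mode == "avg":
        return (q + r) / 2
    if score_mode == "max":
        return max(q, r)
    if score_mode == "min":
        return min(q, r)
    raise ValueError("unknown score_mode: " + score_mode)


# An original score of 2.0 and a rescore score of 8.0 in every mode.
for mode in ("total", "multiply", "avg", "max", "min"):
    print(mode, combine(2.0, 8.0, mode))
```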

To sum up

Sometimes, we want to show results where the ordering of the first documents on the page is affected by some additional rules. Unfortunately, this cannot be achieved by the rescore functionality. The first idea points to the window_size parameter, but this parameter is, in fact, not connected with the first documents on the result list but with the number of results returned on every shard. In addition, the window_size value cannot be less than the page size (when window_size is lower than size, Elasticsearch will set it to the value of the size property). Also, one very important thing: rescoring cannot be combined with sorting, because sorting is done before rescoring changes the documents' scores, and thus sorting won't take the newly calculated score into consideration.
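The two points about window_size can be put into numbers with a small sketch. Both function names are hypothetical, and the calculation only restates the rules described above: the window is applied on every shard and is never smaller than the page size.

```python
def effective_window_size(window_size, size):
    """window_size is raised to size when it is lower than size."""
    return max(window_size, size)


def max_rescored_documents(window_size, size, shards):
    """Upper bound on rescored documents: the window applies per shard."""
    return effective_window_size(window_size, size) * shards


# A window of 5 with a page size of 10 is raised to 10.
print(effective_window_size(5, 10))  # 10

# On an index with 5 shards, up to 50 documents may be rescored.
print(max_rescored_documents(5, 10, 5))  # 50
```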
