Influencing scores with query boosts

At the beginning of this chapter, we learned what scoring is and how Elasticsearch uses the scoring formula. As an application grows, so does the need to improve the quality of search, which we call the search experience. We need to learn what is more important to the users and observe how they use the search functionality. This leads to various conclusions; for example, we see that some parts of the documents are more important than others or that particular queries emphasize one field at the cost of others. We need to include such information in our data and queries so that both sides of the scoring equation are closer to our business needs. This is where boosting can be used.

The boost

Boost is an additional value used in the process of scoring. We already know it can be applied to:

  • Query: When used, we inform the search engine that the given query is a part of a complex query and is more significant than the other parts.
  • Document: When used during indexing, we tell Elasticsearch that a document is more important than the others in the index. For example, when indexing blog posts, we are probably more interested in the posts themselves than ping backs or comments.

The values we assign to a query or a document are, as we know, not the only factors used to calculate the resulting score. We will now look at a few examples of query boosting.

Adding the boost to queries

Let's imagine that our index has two documents and we've used the following commands to index them:

curl -XPOST 'localhost:9200/messages/email/1' -d '{
  "id" : 1,
  "to" : "John Smith",
  "from" : "David Jones",
  "subject" : "Top secret!"
}'

curl -XPOST 'localhost:9200/messages/email/2' -d '{
  "id" : 2,
  "to" : "David Jones",
  "from" : "John Smith",
  "subject" : "John, read this document"
}'

This data is trivial, but it should describe our problem very well. Now let's assume we have the following query:

curl -XGET 'localhost:9200/messages/_search?pretty' -d '{
  "query" : {
    "query_string" : {
       "query" : "john",
       "use_dis_max" : false
    }
  }
}'

In this case, Elasticsearch will run the query against the _all field and find the documents that contain the desired words. By setting the use_dis_max parameter to false, we also said that we don't want the disjunction query to be used (if you don't remember what a disjunction query is, refer to The dis_max query section in Chapter 3, Searching Your Data). As we can easily guess, both our records will be returned. The record with the identifier 2 will be first because the word John occurs in it twice: once in the from field and once in the subject field. Let's check this out in the following result:

"hits" : {
    "total" : 2,
    "max_score" : 0.13561106,
    "hits" : [ {
      "_index" : "messages",
      "_type" : "email",
      "_id" : "2",
      "_score" : 0.13561106,
      "_source" : {
        "id" : 2,
        "to" : "David Jones",
        "from" : "John Smith",
        "subject" : "John, read this document"
      }
    }, {
      "_index" : "messages",
      "_type" : "email",
      "_id" : "1",
      "_score" : 0.11506981,
      "_source" : {
        "id" : 1,
        "to" : "John Smith",
        "from" : "David Jones",
        "subject" : "Top secret!"
      }
    } ]
  }

Is everything all right? Technically, yes. But we think that the second document (the one with identifier 1) should be positioned as the first one in the result list, because when searching for something, the most important factor (in many cases) is matching people rather than the subject of the message. You can disagree, but this is exactly why full-text search relevance is a difficult topic; sometimes it is hard to tell which ordering is better for a particular case. What can we do? First, let's rewrite our query to explicitly inform Elasticsearch which fields should be used for searching:

curl -XGET 'localhost:9200/messages/_search?pretty' -d '{
  "query" : {
    "query_string" : {
      "fields" : ["from", "to", "subject"],
      "query" : "john",
      "use_dis_max" : false
    }
  }
}'

This is not exactly the same query as the previous one. If we run it, we will get the same results (in our case). However, if you look carefully, you will notice differences in scoring. In the previous example, Elasticsearch only used one field, that is the default _all field. The query that we are using now is using three fields for matching. This means that several factors, such as field lengths, are changed. Anyway, this is not so important in our case. Elasticsearch under the hood generates a complex query made of three queries – one to each field. Of course, the score contributed by each query depends on the number of terms found in this field and the length of this field.

Let's introduce some differences between the fields and their importance. Compare the following query to the last one:

curl -XGET 'localhost:9200/messages/_search?pretty' -d '{
  "query" : {
    "query_string" : {
      "fields" : ["from^5", "to^10", "subject"],
      "query" : "john",
      "use_dis_max" : false
    }
  }
}'

Look at the ^5 and ^10 suffixes in the fields array. By using that notation (the ^ character followed by a number), we can inform Elasticsearch how important a given field is. We see that the most important field is the to field (because of the highest boost value). Next we have the from field, which is less important. The subject field has the default boost value, which is 1.0, and is the least important field when it comes to score calculation. Always remember that this value is only one of the various factors. You may be wondering why we chose 5 and not 1000 or 1.23. Well, this value depends on the effect we want to achieve, what query we have, and, most importantly, what data we have in our index. Typically, when the data changes in a meaningful way, we should check and tune our relevance once again.

In the end, let's look at a similar example, but using the bool query:

curl -XGET 'localhost:9200/messages/_search?pretty' -d '{
 "query" : {
  "bool" : {
   "should" : [
    { "term" : { "from": { "value" : "john", "boost" : 5 }}},
    { "term" : { "to": { "value" : "john", "boost" : 10  }}},
    { "term" : { "subject": { "value" : "john" }}}
   ]
  }
 }
}'

The preceding query will yield the same results, which means that the first document on the results list will be the one with the identifier 1, but the scores will be slightly different. This is because the Lucene queries made from the last two examples are slightly different and thus the scores are different.

Modifying the score

The preceding example shows how to affect the result list by boosting particular query components – the fields. Another technique is to run a query and affect the score of the matched documents. In the following sections, we will summarize the possibilities offered by Elasticsearch. In the examples, we will use our library data that we have already used in the previous chapters.

Constant score query

A constant_score query allows us to take any query and explicitly set the score that will be given to each matching document by using the boost parameter. When the boost parameter is omitted, each matching document gets a score of 1.0.

At first, this query doesn't seem to be practical. But when we think about building complex queries, this query allows us to control how much the documents matching this query can affect the total score. Look at the following example:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
  "query" : {
    "constant_score" : {
      "query": {
        "query_string" : {
          "query" : "available:false author:heller"
        }
      }
    }
  }
}'

In our data, we have two documents with the available field set to false. One of these documents also matches on the author field. If we used a different query, the document that also matches on the author field (a book with identifier 2) would be given a higher score, but, thanks to the constant score query, Elasticsearch will ignore that information during scoring. Both documents will be given a score equal to 1.0.

Boosting query

The next type of query that can be used with boosting is the boosting query. It allows us to define a part of the query that will cause matched documents to have their scores lowered. The following example returns all the available books (available field set to true), but the books written by E. M. Remarque will have a negative boost of 0.1 (which means that their score will be about ten times lower):

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
  "query" : {
    "boosting" : {
      "positive" : {
        "term" : {
          "available" : true
        }
      },
      "negative" : {
        "match" : {
          "author" : "remarque"
        }
      },
      "negative_boost" : 0.1
    }
  }
}'
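The effect of negative_boost can be modeled with a few lines of Python. This is only a sketch of the arithmetic, not Elasticsearch code: the boosting_score function and the sample scores are hypothetical, and the sketch assumes that a document matching the negative part keeps its positive-query score multiplied by negative_boost.

```python
def boosting_score(positive_score, matches_negative, negative_boost):
    """Model the final score of a document returned by a boosting query."""
    if matches_negative:
        # Documents matching the negative part are demoted, not excluded.
        return positive_score * negative_boost
    return positive_score


# Hypothetical positive-query score shared by two available books.
other_book = boosting_score(1.0, matches_negative=False, negative_boost=0.1)
remarque_book = boosting_score(1.0, matches_negative=True, negative_boost=0.1)
print(other_book, remarque_book)
```

Note that the demoted document still appears in the results; the boosting query only changes its position.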

The function score query

Till now we've seen two examples of queries that allowed us to alter the score of the returned documents. The third example we wanted to talk about, the function_score query, is way more complicated than the previously discussed queries. The function_score query is very useful when the score calculation is more complicated than giving a single boost to all the documents; boosting more recent documents is an example of a perfect use case for the function_score query.

Structure of the function query

The structure of the function query is quite simple and looks as follows:

{
 "query" : {
  "function_score" : {
   "query" : { ... },
   "functions" : [
     {
       "filter" : { ... },
       "FUNCTION" : { ... }
     }
   ],
   "boost_mode" : " ... ",
   "score_mode" : " ... ",
   "max_boost" : " ... ",
   "min_score" : " ... ",
   "boost" : " ... "
  }
 }
}

In general, the function score query can use a query, one or more functions, and additional parameters. Each function can have a filter defined to restrict the documents to which it is applied. If no filter is given for a function, it will be applied to all the documents.

The logic behind the function score query is quite simple. First, the functions are matched against the documents and their scores are combined according to score_mode. After that, the query score for the document is combined with that function score according to boost_mode.

Let's now discuss the parameters:

  • Boost mode: The boost_mode parameter allows us to define how the score computed by the functions will be combined with the score of the query. The following values are allowed:
    • multiply: The default behavior, which results in the query score being multiplied by the score computed from the functions
    • replace: The query score will be totally ignored and the document score will be equal to the score calculated by the functions
    • sum: The document score will be calculated as the sum of the query and the function scores
    • avg: The score of the document will be an average of the query score and the function score
    • max: The document will be given a maximum of query score and function score
    • min: The document will be given a minimum of query score and function score
  • Score mode: The score_mode parameter defines how the scores computed by the functions are combined together. The following score_mode parameter values are defined:
    • multiply: The default behavior which results in the scores returned by the functions being multiplied
    • sum: The scores returned by the defined functions are summed
    • avg: The score returned by the functions is an average of all the scores of the matching functions
    • first: The score of the first function with a filter matching the document is returned
    • max: The maximum score of the functions is returned
    • min: The minimum score of the functions is returned
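The two combination steps can be sketched in Python. This is a simplified model rather than Elasticsearch code: the function names combine_functions and combine_with_query are hypothetical, the avg mode is a plain unweighted average (Elasticsearch may weight it by each function's weight), and the sketch assumes that a document matched by no function gets a neutral function score of 1.

```python
from functools import reduce


def combine_functions(scores, score_mode="multiply"):
    """Combine the scores of the functions that matched a document."""
    if not scores:
        return 1.0  # assumption: no matching function leaves the score neutral
    if score_mode == "multiply":
        return reduce(lambda a, b: a * b, scores, 1.0)
    if score_mode == "sum":
        return sum(scores)
    if score_mode == "avg":
        return sum(scores) / len(scores)  # unweighted average
    if score_mode == "first":
        return scores[0]
    if score_mode == "max":
        return max(scores)
    if score_mode == "min":
        return min(scores)
    raise ValueError("unknown score_mode: %s" % score_mode)


def combine_with_query(query_score, function_score, boost_mode="multiply"):
    """Combine the query score with the combined function score."""
    if boost_mode == "multiply":
        return query_score * function_score
    if boost_mode == "replace":
        return function_score
    if boost_mode == "sum":
        return query_score + function_score
    if boost_mode == "avg":
        return (query_score + function_score) / 2.0
    if boost_mode == "max":
        return max(query_score, function_score)
    if boost_mode == "min":
        return min(query_score, function_score)
    raise ValueError("unknown boost_mode: %s" % boost_mode)


# Two functions scored a document 2.0 and 3.0; the query scored it 0.5.
function_score = combine_functions([2.0, 3.0], score_mode="multiply")
print(combine_with_query(0.5, function_score, boost_mode="multiply"))
```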

There is one thing to remember: we can limit the score computed by the functions by using the max_boost parameter in the function score query. The limit is applied before the function score is combined with the query score. By default, that parameter is set to Float.MAX_VALUE, which means the maximum float value.

The boost parameter allows us to set a query-wide boost for the documents.

We should also remember that the calculated score doesn't affect which documents match the query. Because of that, the min_score property has been introduced. It allows us to define the minimum score of the documents. Documents that have a score lower than the min_score property will be excluded from the results.
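The interplay of max_boost and min_score can be illustrated with a small Python model. It is a sketch under stated assumptions, not Elasticsearch code: the score_documents function and the sample scores are hypothetical, boost_mode is assumed to be the default multiply, and max_boost is assumed to cap the function score before it is combined with the query score.

```python
def score_documents(docs, max_boost, min_score):
    """docs maps a document ID to a (query_score, function_score) pair."""
    results = {}
    for doc_id, (query_score, function_score) in docs.items():
        capped = min(function_score, max_boost)  # max_boost caps the functions
        score = query_score * capped             # default boost_mode: multiply
        if score >= min_score:                   # lower scores are dropped
            results[doc_id] = score
    return results


# Hypothetical (query_score, function_score) pairs for three documents.
docs = {"1": (0.5, 20.0), "2": (0.5, 2.0), "3": (0.1, 2.0)}
results = score_documents(docs, max_boost=10.0, min_score=1.0)
print(results)
```

In this model, document 1 has its function score capped at 10, and document 3 is removed from the results because its final score falls below min_score.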

What we haven't talked about yet are the score functions that we can include in the functions section of our query. The currently available functions are:

  • weight factor
  • field value factor
  • script score
  • random
  • decay

The weight factor function

The weight factor function allows us to multiply the score of the document by a given value. The value of the weight parameter is not normalized and is taken as is. An example using the weight function, where we multiply the score of the document by 20, looks as follows:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "query" : {
  "function_score" : {
   "query" : {
    "term" : {
     "available" : true
    }
   },
   "functions" : [
    { "weight" : 20 }
   ]
  }
 }
}'

Field value factor function

The field_value_factor function allows us to influence the score of the document by using a value of the field in that document. For example, to multiply the score of the document by the value of the year field, we run the following query:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "query" : {
  "function_score" : {
   "query" : {
    "term" : {
     "available" : true
    }
   },
   "functions" : [
    { 
     "field_value_factor" : {
      "field" : "year",
      "missing" : 1
     } 
    }
   ]
  }
 }
}'

In addition to choosing the field whose value should be used, we can also control the behavior of the field value factor function by using the following properties:

  • factor: The multiplication factor that will be used along with the field value. It defaults to 1.
  • modifier: The modifier that will be applied to the field value. It defaults to none. It can take the value of log, log1p, log2p, ln, ln1p, ln2p, square, sqrt, and reciprocal.
  • missing: The value that should be used when a document doesn't have any value in the field specified in the field property.
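The way factor, modifier, and missing work together can be sketched in Python. This is a model of the arithmetic only, with a hypothetical function name; it assumes that the log modifiers use base 10, the ln modifiers use the natural logarithm, and that factor and modifier are also applied to the missing value.

```python
import math

# Each modifier is applied to (factor * field value).
MODIFIERS = {
    "none": lambda x: x,
    "log": math.log10,
    "log1p": lambda x: math.log10(x + 1),
    "log2p": lambda x: math.log10(x + 2),
    "ln": math.log,
    "ln1p": lambda x: math.log(x + 1),
    "ln2p": lambda x: math.log(x + 2),
    "square": lambda x: x * x,
    "sqrt": math.sqrt,
    "reciprocal": lambda x: 1.0 / x,
}


def field_value_factor(value, factor=1.0, modifier="none", missing=None):
    """Return the function score for a field value (None means no value)."""
    if value is None:
        if missing is None:
            raise ValueError("no field value and no 'missing' default given")
        value = missing
    return MODIFIERS[modifier](factor * value)


print(field_value_factor(1961))                    # raw year value
print(field_value_factor(99, modifier="log1p"))    # dampened with log10(1 + x)
print(field_value_factor(None, missing=1))         # document without the field
```

Multiplying by a raw value such as a year lets that value dominate the query score, which is why a dampening modifier such as log1p or sqrt is often worth trying.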

The script score function

The script_score function allows us to use a script to calculate the score that will be used as the score returned by a function (and thus will fall into behavior defined by the boost_mode parameter). An example of script_score usage is as follows (for the following example to work, inline scripting needs to be allowed, which means adding the script.inline property and setting it to on in elasticsearch.yml):

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "query" : {
  "function_score" : {
   "query" : {
    "term" : {
     "available" : true
    }
   },
   "functions" : [
    {
     "script_score" : {
      "script" : {
       "inline" : "_score * _source.copies * parameter1",
       "params" : {
        "parameter1" : 12
       }
      }
     }
    }
   ]
  }
 }
}'
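It is worth working through the arithmetic of the preceding query. The following Python sketch is a hypothetical model, not Elasticsearch code; it assumes the default boost_mode of multiply, and the sample query score and number of copies are made up.

```python
def script_score(query_score, copies, parameter1):
    """Mirror the inline script: _score * _source.copies * parameter1."""
    return query_score * copies * parameter1


def final_score(query_score, copies, parameter1=12, boost_mode="multiply"):
    """Combine the script result with the query score."""
    function_score = script_score(query_score, copies, parameter1)
    if boost_mode == "replace":
        return function_score
    # Default boost_mode: the query score is multiplied in a second time.
    return query_score * function_score


print(final_score(0.5, copies=2))                        # 0.5 * (0.5 * 2 * 12)
print(final_score(0.5, copies=2, boost_mode="replace"))  # 0.5 * 2 * 12
```

Because the script already uses _score, the query score enters the result twice under the default boost_mode. If that is not intended, setting boost_mode to replace makes the script result the final score.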

The random score function

By using the random_score function, we can generate a pseudorandom score by specifying a seed. Because the same seed always produces the same scores, we should specify a new seed every time we want a different ordering. The random number will be generated by using the _uid field and the provided seed. If a seed is not provided, the current timestamp will be used. An example of using this is as follows:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "query" : {
  "function_score" : {
   "query" : {
    "term" : {
     "available" : true
    }
   },
   "functions" : [
    {
     "random_score" : {
      "seed" : 12345
     }
    }
   ]
  }
 }
}'
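The idea of deriving a repeatable score from a seed and a document identifier can be sketched in Python. The hash used here (SHA-256) and the random_score function are assumptions made for illustration only; Elasticsearch uses its own internal hashing, so the actual values will differ.

```python
import hashlib


def random_score(uid, seed):
    """Derive a repeatable pseudorandom score in [0, 1) from seed and uid."""
    data = ("%s:%s" % (seed, uid)).encode("utf-8")
    digest = hashlib.sha256(data).digest()
    # Keep 24 bits of the hash and scale them into the [0, 1) range.
    return int.from_bytes(digest[:3], "big") / float(1 << 24)


# Hypothetical _uid values; the same seed always gives the same ordering.
uids = ["book#1", "book#2", "book#3", "book#4"]
first_run = sorted(uids, key=lambda uid: random_score(uid, 12345))
second_run = sorted(uids, key=lambda uid: random_score(uid, 12345))
print(first_run == second_run)
```

Keeping the seed constant for one user session gives that user a stable ordering across pages, while a different seed per session gives each user a different one.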

Decay functions

In addition to the earlier mentioned scoring functions, Elasticsearch exposes additional ones, called the decay functions. The difference from the previously described functions is that the score given by those functions lowers with distance. The distance is calculated on the basis of a single valued numeric field (such as a date, a geographical point, or a standard numeric field). The simplest example that comes to mind is boosting documents on the basis of distance from a given point or boosting on the basis of document date.

For example, let's assume that we have a point field that stores the location and we want our document's score to be affected by the distance from a point where the user stands (for example, our user sends a query from a mobile device). Assuming the user is at 52, 21, we could send the following query:

{
 "query" : {
  "function_score" : {
   "query" : {
    "term" : {
     "available" : true
    }
   },
   "functions" : [
    {
     "linear" : {
      "point" : {
       "origin" : "52, 21",
       "scale" : "1km",
       "offset" : 0,
       "decay" : 0.2
      }
     }
    }
   ]
  }
 }
}

In the preceding example, linear is the name of the decay function; the score decays linearly with the distance. The other possible values are gauss and exp. We've chosen the linear decay function because it sets the score to 0 once the distance from the origin is large enough, that is, beyond offset + scale / (1 - decay), which is twice the scale with the default decay of 0.5. This is useful when you want to lower the value of the documents that are too far away.

Note

Note that the geographical searching capabilities of Elasticsearch will be discussed in the Geo section of Chapter 8, Beyond Full-text Searching.

Now let's discuss the rest of the query structure. The point is the name of the field we want to use for score calculation. If the document doesn't have a value in the defined field, the decay function returns 1 for that document.

In addition to that, we've provided additional parameters. The origin and scale parameters are required. The origin parameter is the central point from which the distance is calculated, and scale is the distance from the origin (beyond the offset) at which the score drops to the value of the decay parameter. By default, the offset is set to 0. If defined, the decay is only computed for the documents whose distance from the origin is greater than the offset; closer documents keep the full function score. The decay parameter defines the score at the distance given by scale and is set to 0.5 by default. In our case, we've said that, at the distance of 1 kilometer, the function score should drop to 0.2, that is, to 20% of its maximum value.
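The three decay functions can be compared with a short Python sketch. It follows the formulas given in the Elasticsearch reference for the function_score query, but it is only a model: the function names are hypothetical, and the distance is passed in as a plain number (in the same unit as scale) instead of being computed from geographical points.

```python
import math


def linear_decay(distance, scale, decay=0.5, offset=0.0):
    """Score falls on a straight line and reaches 0 at scale / (1 - decay)."""
    d = max(0.0, distance - offset)
    s = scale / (1.0 - decay)
    return max((s - d) / s, 0.0)


def exp_decay(distance, scale, decay=0.5, offset=0.0):
    """Score falls exponentially and never quite reaches 0."""
    d = max(0.0, distance - offset)
    return math.exp(math.log(decay) / scale * d)


def gauss_decay(distance, scale, decay=0.5, offset=0.0):
    """Score follows a bell curve centered on the origin."""
    d = max(0.0, distance - offset)
    sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
    return math.exp(-d ** 2 / (2.0 * sigma_sq))


# Parameters of the preceding query: scale of 1 km, decay of 0.2, no offset.
for km in (0.0, 0.5, 1.0, 2.0):
    print(km,
          round(linear_decay(km, 1.0, 0.2), 3),
          round(exp_decay(km, 1.0, 0.2), 3),
          round(gauss_decay(km, 1.0, 0.2), 3))
```

All three functions return 1 at the origin and the decay value at the distance given by scale; they differ in how quickly the score falls before and after that point.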

Note

We expect the function_score query to be modified and extended with the next versions of Elasticsearch (just as it was with Elasticsearch version 1.x). We suggest following the official documentation and the page dedicated to the function_score query at https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html.
