Influencing scores with query boosts

In the previous chapter, we learned how to check why the search returns a given document and which factors influenced its position in the result list. As an application grows, so does the need to improve the quality of search, the so-called search experience. We need to learn what matters most to users and to observe how they use the search functionality. This leads to various conclusions; for example, we may see that some parts of the documents are more important than others, or that particular queries emphasize one field at the cost of others. This is where boosting can be used. In the previous chapters, we've seen some information about boosting. In this chapter, we'll summarize this knowledge and show how to use it in practice.

What is boost?

Boost is an additional value used in the process of scoring. We can apply this value to the following:

  • Query: This is a way to inform the search engine that a given query, which is part of a complex query, is more significant than the others.
  • Field: Some document fields are more important to the user than others. For example, searching e-mails for Bill should probably list those from Bill first, then those with Bill in the subject, and then e-mails mentioning Bill in the contents.
  • Document: Sometimes some documents are more important. In our e-mail searching example, e-mails from a friend are usually more important than e-mails from an unknown sender.

Values assigned by us to a query, field, or document are only one factor used when we calculate the resulting score. We will now look at a few examples of query boosting.

Adding boost to queries

Let's imagine that our index has two documents:

{
  "id" : 1,
  "to" : "John Smith",
  "from" : "David Jones",
  "subject" : "Top secret!"
}

And:

{
  "id" : 2,
  "to" : "David Jones",
  "from" : "John Smith",
  "subject" : "John, read this document"
}

This data is trivial, but it should describe our problem very well. Now let's assume we have the following query:

{
  "query" : {
    "query_string" : {
       "query" : "john",
       "use_dis_max" : false
    }
  }
}

In this case, ElasticSearch will run the query against the _all field and find documents that contain the desired words. We also said that we don't want a disjunction query to be used by setting the use_dis_max parameter to false (if you don't remember what a disjunction query is, please refer to the Explaining the query string section, dedicated to the query string query, in Chapter 2, Searching Your Data). As we can easily guess, both our records will be returned, and the record with the ID equal to 2 will be first because of the two occurrences of John, in the from and subject fields. Let's check this in the following result:

  "hits" : {
    "total" : 2,
    "max_score" : 0.16273327,
    "hits" : [ {
      "_index" : "messages",
      "_type" : "email",
      "_id" : "2",
      "_score" : 0.16273327,
      "_source" : { "to" : "David Jones", "from" : "John Smith", "subject" : "John, read this document" }
    }, {
      "_index" : "messages",
      "_type" : "email",
      "_id" : "1",
      "_score" : 0.11506981,
      "_source" : { "to" : "John Smith", "from" : "David Jones", "subject" : "Top secret!" }
    } ]
  }

Is everything all right? Technically, yes. But I think that the document returned in second place (ID 1) should be first in the result list, because when searching for something, the most important factor (in many cases) is matching people rather than the subject of the message. You can disagree, but this is exactly why full-text search relevance is a difficult topic: sometimes it is hard to tell which ordering is better for a particular case. What can we do? First, let's rewrite our query to explicitly inform ElasticSearch which fields should be used for searching:

{
  "query" : {
    "query_string" : {
      "fields" : ["from", "to", "subject"],
      "query" : "john",
      "use_dis_max" : false
    }
  }
}

This is not exactly the same query as the previous one. If we run it, we will get the same results (in our case), but if you look carefully, you will notice differences in scoring. In the previous example, ElasticSearch used only one field, _all. Now we are searching in three fields, which means that several factors, such as field lengths, change. This is not so important in our case. Under the hood, ElasticSearch generates a complex query made of three queries, one for each field, so the fields are treated equally. Of course, the score contributed by each query depends on the number of terms found in its field and on the length of that field. Let's introduce some differences between the fields. Compare the following query to the previous one:

{
  "query" : {
    "query_string" : {
      "fields" : ["from^5", "to^10", "subject"],
      "query" : "john",
      "use_dis_max" : false
    }
  }
}

Look at the ^5 and ^10 suffixes. In this way, we can tell ElasticSearch how important a given field is. We see that the most important field is to and that the from field is less important. The subject field has the default boost value, which is 1.0. Always remember that this value is only one of various factors. You may be wondering why we chose 5, not 1000 or 1.23. Well, this value depends on the effect we want to achieve, what query we have, and most importantly, what data we have in our index. This is the important part, because it means that when the data changes in a meaningful way, we should probably check and tune our relevance once again.
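To get a feeling for why these boosts can change the order of the two e-mails, here is a rough sketch in Python. It is not the real scoring formula: it assumes, purely for illustration, that each field containing the term contributes exactly its boost and that the contributions are summed (as when use_dis_max is false). The names toy_score and rank are hypothetical.

```python
# Toy model only: real scoring also depends on term frequency, field length, etc.
DOCS = {
    1: {"to": "John Smith", "from": "David Jones", "subject": "Top secret!"},
    2: {"to": "David Jones", "from": "John Smith",
        "subject": "John, read this document"},
}

def toy_score(doc, term, boosts):
    """Sum the boosts of every field whose text contains the term."""
    total = 0.0
    for field, boost in boosts.items():
        words = [w.strip(",.!").lower() for w in doc[field].split()]
        if term in words:
            total += boost
    return total

def rank(term, boosts):
    """Return document IDs ordered from the highest toy score to the lowest."""
    scored = {doc_id: toy_score(doc, term, boosts) for doc_id, doc in DOCS.items()}
    return sorted(scored, key=scored.get, reverse=True)

equal = {"from": 1.0, "to": 1.0, "subject": 1.0}
boosted = {"from": 5.0, "to": 10.0, "subject": 1.0}
print(rank("john", equal))    # document 2 comes first
print(rank("john", boosted))  # document 1 comes first
```

In this simplified model, document 2 wins with equal boosts because two of its fields match, while the boost of 10 on the to field is enough to move document 1 to the top.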

Finally, let's look at a similar example, but using the bool query:

{
 "query" : {
  "bool" : {
   "should" : [
    { "term" : { "from": { "value" : "john", "boost" : 5 }}},
    { "term" : { "to": { "value" : "john", "boost" : 10  }}},
    { "term" : { "subject": { "value" : "john" }}}
   ]
  }
 }
}
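When the list of fields and boosts comes from configuration, a query like the preceding one can be generated instead of written by hand. The following is a minimal sketch in Python; the helper name boosted_term_query is hypothetical, and the code only builds the request body without sending it anywhere.

```python
import json

def boosted_term_query(term, boosts):
    """Build a bool query with one should term clause per field."""
    should = []
    for field, boost in boosts.items():
        clause = {"value": term}
        if boost != 1:
            clause["boost"] = boost  # leave out the default boost of 1
        should.append({"term": {field: clause}})
    return {"query": {"bool": {"should": should}}}

query = boosted_term_query("john", {"from": 5, "to": 10, "subject": 1})
print(json.dumps(query, indent=1))
```

Keeping the boosts in a single mapping makes it easier to retune them when the data changes.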

Modifying the score

The preceding example shows how to affect the result list by boosting particular query components. Another technique is to run a query and affect the score of the documents returned by this query. In the following sections, we will summarize the possibilities offered by ElasticSearch. In the examples, we will use our library data from the second chapter.

Constant score query

A constant score query allows us to take any filter or query and explicitly set the score that will be given to each matching document.

At first, this query doesn't seem to be practical. But when we think about building complex queries, this query allows us to control how much the documents matching it can affect the total score. Look at the following example:

{
  "query" : {
    "constant_score" : {
      "query": {
        "query_string" : {
          "query" : "available:false author:heller"
        }
      },
      "boost": 5.0
    }
  }
}

In the library data that we have used, two documents have the available field set to false. One of these documents also has a matching value in the author field. But thanks to the constant score query, ElasticSearch will ignore that information. Both documents will be given a score equal to… 1.0. Strange? Not if we think about the normalization applied to queries. ElasticSearch adjusts the resulting score so that it is comparable with the other parts of the query, and it does so even if our query contains only a single part. Note that in our example we've used a query, but we can also use a filter, which is also better for performance reasons. For clarity, let's look at the following example with a filter:

{
  "query" : {
    "constant_score" : {
      "filter": {
        "term" : {
          "available" : false
        }
      },
      "boost": 5.0
    }
  }
}
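The score of 1.0 despite a boost of 5.0 can be illustrated with a small calculation. The sketch below is an assumption about the scoring internals, not something stated in this chapter: it supposes that clause weights are normalized by 1 / sqrt(sum of squared weights), and it ignores every other scoring factor. The function names are hypothetical.

```python
import math

def query_norm(boosts):
    """Assumed normalization factor: 1 / sqrt(sum of squared clause weights)."""
    return 1.0 / math.sqrt(sum(b * b for b in boosts))

def constant_scores(boosts):
    """Score contributed by each constant score clause after normalization."""
    norm = query_norm(boosts)
    return [b * norm for b in boosts]

single = constant_scores([5.0])        # one clause: the boost cancels out
double = constant_scores([5.0, 1.0])   # two clauses: the 5:1 ratio survives
print(single)
print(double[0] / double[1])
```

Under this assumption, the boost of a lone constant score query has no visible effect, but it still determines how strongly that clause weighs against its siblings inside a larger query.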

Custom boost factor query

This query is similar to the previous one. Let's start with an example:

{
  "query" : {
    "custom_boost_factor" : {
      "query": {
        "query_string" : {
          "query" : "available:false author:heller"
        }
      },
      "boost_factor": 5.0
    }
  }
}

In this case, the score of the wrapped query is multiplied by boost_factor. Unlike the previous query, this one accepts only a query; it doesn't support a filter in its place.
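The difference from the constant score query can be sketched in a few lines of Python. The input scores below are made up for illustration, and custom_boost_factor is a hypothetical helper that only models the multiplication described above.

```python
def custom_boost_factor(scores, boost_factor):
    """Multiply every score produced by the wrapped query by boost_factor."""
    return {doc_id: score * boost_factor for doc_id, score in scores.items()}

# Hypothetical scores returned by the wrapped query_string query.
wrapped = {"2": 0.8, "3": 0.2}
boosted = custom_boost_factor(wrapped, 5.0)
print(boosted)
```

Because every score is scaled by the same factor, the relative order of the documents is preserved, whereas a constant score query would give both documents the same score.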

Boosting query

The next type of query connected with boosting is the boosting query. The idea is to allow us to define an additional part of a query that decreases the score of every document it matches. The following example lists all the available books, but books written by E. M. Remarque will have a score 10 times lower:

{
  "query" : {
    "boosting" : {
      "positive" : {
        "term" : {
          "available" : true
        }
      },
      "negative" : {
        "term" : {
          "author" : "remarque"
        }
      },
      "negative_boost" : 0.1
    }
  }
}
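The effect of negative_boost can be sketched as follows. This is a simplified Python model, assuming that every available book gets a score of 1.0 from the positive part; the function name boosting_score and the author values are illustrative.

```python
def boosting_score(positive_score, matches_negative, negative_boost):
    """Demote, but never remove, documents that match the negative query."""
    if matches_negative:
        return positive_score * negative_boost
    return positive_score

# Two available books; the positive score of 1.0 is an assumption.
books = [
    {"title": "All Quiet on the Western Front", "author": "remarque"},
    {"title": "Crime and Punishment", "author": "dostoevsky"},
]
for book in books:
    score = boosting_score(1.0, book["author"] == "remarque", 0.1)
    print(book["title"], score)
```

Note that the demoted book is still returned; it is only moved down the result list.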

Custom score query

The custom score query gives us a simple way to set the score of all matched documents. It allows us to attach additional logic, defined by a script, to every matching document. Of course, this way of influencing a score is much slower, but sometimes it is the most convenient and the simplest way, so we should be aware of its existence and its possible usage. For example:

{
  "query" : {
    "custom_score" : {
      "query" : {
        "match_all" : {}
      },
      "script" : "doc['copies'].value * 0.5"
    }
  },
  "fields" : ["title", "copies"]
}

This query matches all documents in the index. We've wrapped this query with the custom_score element, and thanks to it, we can use an additional script for score calculation. In our example, we've used a very simple script that returns half of the copies field value as the score. For clarity, we use the fields element to get only the values of the title and copies fields. What we obtain in return is as follows:

  "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "2",
      "_score" : 3.0,
      "fields" : {
        "title" : "Catch-22",
        "copies" : 6
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "1",
      "_score" : 0.5,
      "fields" : {
        "title" : "All Quiet on the Western Front",
        "copies" : 1
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 0.0,
      "fields" : {
        "title" : "Crime and Punishment",
        "copies" : 0
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "3",
      "_score" : 0.0,
      "fields" : {
        "title" : "The Complete Sherlock Holmes",
        "copies" : 0
      }
    } ]

ElasticSearch did what we wanted, but look at what happens when the copies field is set to 0. The score is also 0! Normally, this would mean that the document doesn't match the query, yet these documents are still returned. We should remember that score manipulation doesn't allow us to reject any documents from the result.
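The scores in this result can be reproduced with a short Python stand-in for the script. The book data is taken from the result shown above; script_score is a hypothetical name, and the sketch only mimics the arithmetic, not the script execution itself.

```python
books = [
    {"id": "1", "title": "All Quiet on the Western Front", "copies": 1},
    {"id": "2", "title": "Catch-22", "copies": 6},
    {"id": "3", "title": "The Complete Sherlock Holmes", "copies": 0},
    {"id": "4", "title": "Crime and Punishment", "copies": 0},
]

def script_score(doc):
    """Python stand-in for the script doc['copies'].value * 0.5."""
    return doc["copies"] * 0.5

# Sort by the computed score, highest first; no document is dropped.
hits = sorted(books, key=script_score, reverse=True)
scores = [(book["id"], script_score(book)) for book in hits]
print(scores)
```

All four books stay in the list, including the two whose computed score is 0.0; the order of those two relative to each other is not meaningful in this sketch.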

Custom filters score query

The last query that we will discuss in this chapter is the custom filters score query. This query contains the base query and an array of filters. Each of these filters has a boost value defined. For example:

{
  "query" : {
    "custom_filters_score" : {
      "query" : {
        "match_all" : {}
      },
      "filters" : [
        {
          "filter" : { "term" : { "available" : true }},
          "boost" : 10
        },
        {
          "filter" : { "term" : { "copies" : 0 }},
          "boost" : 100
        }

      ]
    }
  },
  "fields" : ["title", "copies", "available"]
}

We have a base query that simply selects all documents. In addition to that, we've defined two filters. The first filter selects all available books and the second one selects all the books without copies. Let's look at the result:

  "hits" : {
    "total" : 4,
    "max_score" : 100.0,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "3",
      "_score" : 100.0,
      "fields" : {
        "title" : "The Complete Sherlock Holmes",
        "copies" : 0,
        "available" : false
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 10.0,
      "fields" : {
        "title" : "Crime and Punishment",
        "copies" : 0,
        "available" : true
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "1",
      "_score" : 10.0,
      "fields" : {
        "title" : "All Quiet on the Western Front",
        "copies" : 1,
        "available" : true
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "2",
      "_score" : 1.0,
      "fields" : {
        "title" : "Catch-22",
        "copies" : 6,
        "available" : false
      }
    } ]
  }

As you can see, ElasticSearch checks each document against the defined filters. When a filter matches, ElasticSearch takes its boost value and applies it to the resulting score. When a filter doesn't match, ElasticSearch moves on to the next filter. If none of them match, ElasticSearch returns the score from the base query. Note that, by default, only the first matching filter counts: Crime and Punishment matches both filters but receives a score of 10, the boost of the first one. Let's look at another example:

{
 "query" : {
  "custom_filters_score" : {
   "query" : {
    "match_all" : {}
   },
   "filters" : [
    {
     "filter" : { "term" : { "available" : true }},
     "boost" : 10
    },
    {
     "filter" : { "term" : { "copies" : 0 }},
     "boost" : 100
    }

   ],
   "score_mode" : "total"
  }
 },
 "fields" : ["title", "copies", "available"]
}

There is only a single change: the score_mode attribute. In the previous example, we used the default value, first. Now, the score_mode value of total tells ElasticSearch that all matching filters should be used and that their boost values should be summed. We've discussed different score modes in the Compound queries section in Chapter 2, Searching Your Data. However, let's check the results to illustrate this example:

  "hits" : {
    "total" : 4,
    "max_score" : 110.0,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 110.0,
      "fields" : {
        "title" : "Crime and Punishment",
        "copies" : 0,
        "available" : true
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "3",
      "_score" : 100.0,
      "fields" : {
        "title" : "The Complete Sherlock Holmes",
        "copies" : 0,
        "available" : false
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "1",
      "_score" : 10.0,
      "fields" : {
        "title" : "All Quiet on the Western Front",
        "copies" : 1,
        "available" : true
      }
    }, {
      "_index" : "library",
      "_type" : "book",
      "_id" : "2",
      "_score" : 1.0,
      "fields" : {
        "title" : "Catch-22",
        "copies" : 6,
        "available" : false
      }
    } ]
  }

The Crime and Punishment book is available and has no copies, so it matches both filters. The resulting score reflects this fact: 10 + 100 = 110.

The score_mode parameter gives us even more possibilities. In addition to the values already mentioned, it can also take the values min, max, avg, and multiply.
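To compare the score modes side by side, here is a simplified model in Python. It assumes that the boosts of the matching filters are combined according to score_mode and that the result multiplies the base query score (1.0 here, which is consistent with the results shown earlier). The names MODES, FILTERS, and filters_score are hypothetical, and the real implementation may differ in details.

```python
from functools import reduce

# How the boosts of all matching filters are combined in each mode (assumed).
MODES = {
    "first": lambda boosts: boosts[0],
    "total": sum,
    "min": min,
    "max": max,
    "avg": lambda boosts: sum(boosts) / len(boosts),
    "multiply": lambda boosts: reduce(lambda a, b: a * b, boosts),
}

# The two filters from the example queries, in the same order.
FILTERS = [
    (lambda doc: doc["available"], 10.0),
    (lambda doc: doc["copies"] == 0, 100.0),
]

def filters_score(doc, score_mode="first", base_score=1.0):
    """Apply the combined boost of the matching filters to the base score."""
    boosts = [boost for matches, boost in FILTERS if matches(doc)]
    if not boosts:
        return base_score  # no filter matched: keep the base query score
    return base_score * MODES[score_mode](boosts)

books = {
    "1": {"copies": 1, "available": True},
    "2": {"copies": 6, "available": False},
    "3": {"copies": 0, "available": False},
    "4": {"copies": 0, "available": True},
}
for mode in ("first", "total"):
    print(mode, {book_id: filters_score(b, mode) for book_id, b in books.items()})
```

With first and total, this model reproduces the scores from the two result lists above; the remaining modes only differ for Crime and Punishment, the one book that matches both filters.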
