Improving the query relevance

Elasticsearch, and search engines in general, are used for searching. Of course, some use cases may require browsing a portion of the indexed data, and sometimes you even need to export whole query results. However, in most cases, scoring plays a major role in the search process. As we said in the Default Apache Lucene scoring explained section of Chapter 2, Power User Query DSL, Elasticsearch leverages the document scoring capabilities of the Apache Lucene library and allows you to use different query types to manipulate the score of the results returned by your queries. What's more, you can change the low-level algorithm used to calculate the score, which we will describe in the Altering Apache Lucene scoring section of Chapter 6, Low-level Index Control.

Given all this, when we start designing our queries, we usually go for the simplest query that returns the documents we want. However, given everything Elasticsearch offers when it comes to scoring control, such a query often returns results that are not the best in terms of the user search experience. This is because Elasticsearch can't guess our business logic or which documents are the best from our point of view when running a query. In this section, we will follow a real-life example of query relevance tuning. We want to make this chapter a bit different compared to the other ones: instead of only giving you an insight, we have decided to walk you through a full example of what the query tuning process may look like. Of course, remember that this is only an example, and you should adjust the process to match your organization's needs. Some of the examples in this section are general purpose ones, so when using them in your own application, make sure that they make sense to you.

Just to give you a little insight into what is coming: we will start with a simple query that returns the results we want; we will alter the query by introducing different Elasticsearch queries to make the results better; we will use filters; we will lower the score of the documents we think of as garbage; and finally, we will introduce faceting to render drill-down menus that allow users to narrow down the results.

Data

Of course, in order to show you the results of the query modifications that we perform, we need data. We would love to show you the real-life data we were working with, but we can't, as our clients wouldn't like this. However, there is a solution to that: for the purpose of this section, we have decided to index Wikipedia data. To do that, we will reuse the Wikipedia river plugin that we installed in the Correcting user spelling mistakes section earlier in this chapter.

The Wikipedia river will create the wikipedia index for us if it doesn't already exist. Because we already have such an index, we will delete it. We could go with the same index, but we know that we need to adjust the index fields because we need some additional analysis logic, so in order to avoid reindexing the data later, we create the index with the desired mappings upfront.

Note

Remember to remove the old river before adding the new one. To remove the old river, you should just run the following command:

curl -XDELETE 'localhost:9200/_river/wikipedia_river'

In order to reimport documents, we use the following commands:

curl -XDELETE 'localhost:9200/wikipedia'
curl -XPOST 'localhost:9200/wikipedia' -d'{
   "settings": {
      "index": {
         "analysis": {
            "analyzer": {
               "keyword_ngram": {
                  "filter": [
                     "lowercase"
                  ],
                  "tokenizer": "ngram"
               }
            }
         }
      }
   },
   "mappings": {
      "page": {
         "properties": {
            "category": {
               "type": "string",
               "fields": {
                  "untouched": {
                     "type": "string",
                     "index": "not_analyzed"
                  }
               }
            },
            "disambiguation": {
               "type": "boolean"
            },
            "link": {
               "type": "string",
               "index": "not_analyzed"
            },
            "redirect": {
               "type": "boolean"
            },
            "redirect_page": {
               "type": "string"
            },
            "special": {
               "type": "boolean"
            },
            "stub": {
               "type": "boolean"
            },
            "text": {
               "type": "string"
            },
            "title": {
               "type": "string",
               "fields": {
                  "ngram": {
                     "type": "string",
                     "analyzer": "keyword_ngram"
                  },
                  "simple": {
                     "type": "string",
                     "analyzer": "simple"
                  }
               }
            }
         }
      }
   }
}'

For now, what we have to know is that we have a page type that we are interested in, which represents a single Wikipedia page. We will use two fields for searching: the text and title fields. The first one holds the content of the page and the second one holds its title.

What we have to do next is start the Wikipedia river. Because we were interested in the latest data, we used the following command to instantiate the river and start indexing:

curl -XPUT 'localhost:9200/_river/wikipedia/_meta' -d '{
 "type" : "wikipedia"
}'

That's all; Elasticsearch will index the newest available Wikipedia dump into the index called wikipedia. All we have to do is wait. We were not that patient, though, and we decided that we would only index the first 10 million documents; after our Wikipedia river hit that number, we deleted it. We checked the final number of documents by running the following command:

curl -XGET 'localhost:9200/wikipedia/_search?q=*&size=0&pretty'

The response was as follows:

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 10425136,
    "max_score" : 0.0,
    "hits" : [ ]
  }
}

We can see that we have 10,425,136 documents in the index.
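
As a side note, when all we are interested in is the number of documents, we can also use the _count endpoint, which returns the count without the whole search response structure, for example:

curl -XGET 'localhost:9200/wikipedia/_count?pretty'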

Note

When running the examples from this chapter, please consider the fact that the data we've indexed changes over time, so the examples shown here may return different documents if you run them after some time.

The quest for relevance improvement

After we have our data indexed, we are ready to begin the process of searching. We will start with a simple query that returns the results we are interested in. After that, we will try to improve the query relevance. We will also pay attention to performance and point out where performance changes are most likely to happen.

The standard query

As you know, Elasticsearch includes the content of the documents in the _all field by default. So, why do we need to bother with specifying multiple fields in a query when we can use a single one, right? Going in that direction, let's assume that we've constructed the following query and now we send it to Elasticsearch to retrieve our precious documents using the following command:

curl -XGET 'localhost:9200/wikipedia/_search?fields=title&pretty' -d'
{
   "query": {
      "match": {
         "_all": {
            "query": "australian system",
            "operator": "OR"
         }
      }
   }
}'

Because we are only interested in getting the title field (Elasticsearch will use the _source field to return it, because the title field is not stored), we've added the fields=title request parameter and, of course, we want the output in a human-friendly format, so we added the pretty parameter as well.
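
As a side note, if we preferred Elasticsearch to return the title from a stored field instead of parsing the _source field, we could have defined the field as stored in the mappings. A sketch of such an alternative field definition (one that we didn't use) would look as follows:

...
"title": {
   "type": "string",
   "store": true
}
...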

However, the results were not as perfect as we would like them to be. The first page of documents was as follows (the whole JSON response can be found in the response_query_standard.json file provided with the book):

        Australian Honours System
        List of Australian Awards
        Australian soccer league
        Australian football league system
        AANBUS
        Australia Day Honours
        Australian rating system
        TAAATS
        Australian Arbitration system
        Western Australian Land Information System (WALIS)

Looking at the titles of the documents, it seems that some of the documents that contain both words from the query are ranked lower than the others. Let's try to improve things.

The multi match query

What we can do first is stop using the _all field. The reason for this is that we need to tell Elasticsearch how important each of the fields is. For example, in our case, the title field is more important than the content of the page, which is stored in the text field. In order to inform Elasticsearch of this, we will use the multi_match query. To send such a query to Elasticsearch, we will use the following command:

curl -XGET 'localhost:9200/wikipedia/_search?fields=title&pretty' -d'
{
   "query": {
      "multi_match": {
         "query": "australian system",
         "fields": [
            "title^100",
            "text^10",
            "_all"
         ]
      }
   }
}'

The first page of results of the preceding query was as follows (the whole JSON response can be found in the response_query_multi_match.json file provided with the book):

        Australian Antarctic Building System
        Australian rating system
        Australian Series System
        Australian Arbitration system
        Australian university system
        Australian Integrated Forecast System
        Australian Education System
        The Australian electoral system
        Australian preferential voting system
        Australian Honours System

Instead of running the query against a single _all field, we chose to run it against the title, text, and _all fields. In addition to this, we introduced boosting: the higher the boost value, the more important the field will be (the default boost value for a field is 1.0). So, we said that the title field is more important than the text field, and the text field is more important than _all.

If you look at the results now, they seem a bit more relevant, but still not as good as we would like them to be. For example, look at the first and second documents on the results list. The first document's title is Australian Antarctic Building System and the second document's title is Australian rating system. We would like the second document to rank higher than the first one.

Phrases come into play

The next idea that should come to mind is introducing phrase queries so that we can overcome the problem described previously. However, we still want the documents that don't contain the phrase to be included in the results, just below the ones where the phrase is present. So, we need to modify our query by adding a bool query on top: our current query will go into the must section and the phrase queries will go into the should section. An example command that sends the modified query looks as follows:

curl -XGET 'localhost:9200/wikipedia/_search?fields=title&pretty' -d'
{
   "query": {
      "bool": {
         "must": [
            {
               "multi_match": {
                  "query": "australian system",
                  "fields": [
                     "title^100",
                     "text^10",
                     "_all"
                  ]
               }
            }
         ],
         "should": [
            {
               "match_phrase": {
                  "title": "australian system"
               }
            },
            {
               "match_phrase": {
                  "text": "australian system"
               }
            }
         ]
      }
   }
}'

Now, if we look at the top results, they are as follows (the whole response can be found in the response_query_phrase.json file provided with the book):

        Australian honours system
        Australian Antarctic Building System
        Australian rating system
        Australian Series System
        Australian Arbitration system
        Australian university system
        Australian Integrated Forecast System
        Australian Education System
        The Australian electoral system
        Australian preferential voting system

We would really like to stop the query optimization here, but our results are still not as good as we would like them to be, although they are a bit better. This is because not all the results match a phrase exactly. What we can do is introduce the slop parameter, which lets us define how many words are allowed in between for a match to still be considered a phrase match. For example, our australian system query will be considered a phrase match for a document titled australian education system when the slop parameter is 1 or more. So, let's send our query with the slop parameter present by using the following command:

curl -XGET 'localhost:9200/wikipedia/_search?fields=title&pretty' -d'
{
   "query": {
      "bool": {
         "must": [
            {
               "multi_match": {
                  "query": "australian system",
                  "fields": [
                     "title^100",
                     "text^10",
                     "_all"
                  ]
               }
            }
         ],
         "should": [
            {
               "match_phrase": {
                  "title": {
                     "query": "australian system",
                     "slop": 1
                  }
               }
            },
            {
               "match_phrase": {
                  "text": {
                     "query": "australian system",
                     "slop": 1
                  }
               }
            }
         ]
      }
   }
}'

Now, let's look at the results (the whole response can be found in the response_query_phrase_slop.json file provided with the book):

        Australian Honours System
        Australian honours system
        Wikipedia:Articles for deletion/Australian university system
        Australian rating system
        Australian Series System
        Australian Arbitration system
        Australian university system
        Australian Education System
        The Australian electoral system
        Australian Legal System

It seems that the results are now better. However, we can always do some more tweaking and see whether we can get some more improvements.

Let's throw the garbage away

What we can do now is remove the garbage from our results. We can do this by removing redirect documents and special documents (for example, the ones that are marked for deletion). To do this, we will introduce a filter, so that it doesn't mess with the scoring of the other results (because filters are not scored). What's more, Elasticsearch will be able to cache the filter results, reuse them in our queries, and speed up their execution. The command that sends our query with filters looks as follows:

curl -XGET 'localhost:9200/wikipedia/_search?fields=title&pretty' -d'
{
   "query": {
      "filtered": {
         "query": {
            "bool": {
               "must": [
                  {
                     "multi_match": {
                        "query": "australian system",
                        "fields": [
                           "title^100",
                           "text^10",
                           "_all"
                        ]
                     }
                  }
               ],
               "should": [
                  {
                     "match_phrase": {
                        "title": {
                           "query": "australian system",
                           "slop": 1
                        }
                     }
                  },
                  {
                     "match_phrase": {
                        "text": {
                           "query": "australian system",
                           "slop": 1
                        }
                     }
                  }
               ]
            }
         },
         "filter": {
            "bool": {
               "must_not": [
                  {
                     "term": {
                        "redirect": "true"
                     }
                  },
                  {
                     "term": {
                        "special": "true"
                     }
                  }
               ]
            }
         }
      }
   }
}'

The results returned by it will look as follows:

        Australian honours system
        Australian Series System
        Australian soccer league system
        Australian Antarctic Building System
        Australian Integrated Forecast System
        Australian Defence Air Traffic System
        Western Australian Land Information System
        The Australian Advanced Air Traffic System
        Australian archaeology
        Australian Democrats

Isn't it better now? We think it is, but we can still make even more improvements.

Now, we boost

If we ever need to boost the importance of the phrase queries that we've introduced, we can do that by wrapping a phrase query with the function_score query. For example, if we want the phrase query on the title field to have a boost of 1000, we need to change the following part of the preceding query:

...
{
   "match_phrase": {
      "title": {
         "query": "australian system",
         "slop": 1
      }
   }
}
...

We need to replace the preceding part of the query with the following one:

...
{
   "function_score": {
      "boost_factor": 1000,
      "query": {
         "match_phrase": {
            "title": {
               "query": "australian system",
               "slop": 1
            }
         }
      }
   }
}
...

After introducing the preceding change, the documents with phrases will be scored even higher than before, but we will leave it for you to test.
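
To save you the trouble of assembling all the pieces, a sketch of the complete command after this change may look as follows (the boost_factor value of 1000 is only an example value that you should tune against your own data):

curl -XGET 'localhost:9200/wikipedia/_search?fields=title&pretty' -d'
{
   "query": {
      "filtered": {
         "query": {
            "bool": {
               "must": [
                  {
                     "multi_match": {
                        "query": "australian system",
                        "fields": [
                           "title^100",
                           "text^10",
                           "_all"
                        ]
                     }
                  }
               ],
               "should": [
                  {
                     "function_score": {
                        "boost_factor": 1000,
                        "query": {
                           "match_phrase": {
                              "title": {
                                 "query": "australian system",
                                 "slop": 1
                              }
                           }
                        }
                     }
                  },
                  {
                     "match_phrase": {
                        "text": {
                           "query": "australian system",
                           "slop": 1
                        }
                     }
                  }
               ]
            }
         },
         "filter": {
            "bool": {
               "must_not": [
                  {
                     "term": {
                        "redirect": "true"
                     }
                  },
                  {
                     "term": {
                        "special": "true"
                     }
                  }
               ]
            }
         }
      }
   }
}'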

Performing a misspelling-proof search

If you look back at the mappings, you will see that the title field is defined as a multifield and that one of its subfields is analyzed with the keyword_ngram analyzer we defined. By default, the ngram tokenizer will create bigrams, so from the word system, it will create the sy ys st te em bigrams. Imagine that we could allow some of them not to match during searches to make our search misspelling-proof. For the purpose of showing how we can do this, let's take a simple misspelled query sent with the following command:

curl -XGET 'localhost:9200/wikipedia/_search?fields=title&pretty' -d'
{
   "query": {
      "query_string": {
         "query": "austrelia",
         "default_field": "title",
         "minimum_should_match": "100%"
      }
   }
}'

The results returned by Elasticsearch would be as follows:

{
  "took" : 10,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

We've sent a misspelled query against the title field and, because there is no document with the misspelled term, we didn't get any results. So now, let's leverage the capabilities of the title.ngram field and allow some of the bigrams not to match so that Elasticsearch can find some documents. Our command with the modified query looks as follows:

curl -XGET 'localhost:9200/wikipedia/_search?fields=title&pretty' -d'
{
   "query": {
      "query_string": {
         "query": "austrelia",
         "default_field": "title.ngram",
         "minimum_should_match": "85%"
      }
   }
}'

We changed the default_field property from title to title.ngram in order to tell Elasticsearch to use the field with bigrams indexed. In addition to that, we've introduced the minimum_should_match property and set it to 85 percent. This tells Elasticsearch that we don't need all the terms produced by the analysis process to match, but only a percentage of them, and we don't care which terms these are.
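
If you are curious which terms are actually produced and compared during such a query, you can ask Elasticsearch to run our keyword_ngram analyzer against any text by using the _analyze API, for example, against our misspelled term:

curl -XGET 'localhost:9200/wikipedia/_analyze?analyzer=keyword_ngram&pretty' -d 'austrelia'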

Note

Lowering the value of the minimum_should_match property will give us more documents, but a less accurate search. Setting it to a higher value will decrease the number of documents returned, but they will have more bigrams in common with the query and, thus, will be more relevant.

The top results returned by the preceding query are as follows (the whole response can be found in a file called response_ngram.json provided with the book):

        Aurelia (Australia)
        Australian Kestrel
        Austrlia
        Australian-Austrian relations
        Australia-Austria relations
        Australia–Austria relations
        Australian religion
        CARE Australia
        Care Australia
        Felix Austria

If you would like to see how to use the Elasticsearch suggester to handle spellchecking, refer to the Correcting user spelling mistakes section in this chapter.

Drill downs with faceting

The last thing we want to mention is faceting and aggregations. You can do multiple things with them, for example, calculating histograms, statistics for fields, geo distance ranges, and so on. However, one thing that can help your users get to the data they are interested in is terms faceting. For example, if you go to amazon.com and enter the kids shoes query, you will see a page similar to the following screenshot:

[Screenshot: the amazon.com results page for the kids shoes query, with a list of brands for narrowing down the results on the left-hand side]

You can narrow down the results by the brand (the left-hand side of the page). The list of brands is not static and is generated on the basis of the results returned. We can achieve the same with terms faceting in Elasticsearch.

Note

Please note that we are showing queries both with faceting and with aggregations. Faceting is deprecated and will be removed from Elasticsearch at some point. However, we know that many of our readers still use it, and because of that, we show both variants of the same query.

So now, let's get back to our Wikipedia data. Let's assume that we would like to allow our users to choose the category of documents they want to see after the initial search. In order to do that, we add the facets section to our query (to simplify the example, we use the match_all query instead of our complicated one) and send the new query with the following command:

curl -XGET 'localhost:9200/wikipedia/_search?fields=title&pretty' -d '{
   "query": {
      "match_all": {}
   },
   "facets": {
      "category_facet": {
         "terms": {
            "field": "category.untouched",
            "size": 10
         }
      }
   }
}'

As you can see, we've run the facet calculation on the category.untouched field, because terms faceting is calculated on the indexed data. If we ran it on the category field, we would get individual terms in the faceting results, while we want whole categories to be present. The facets section of the results returned by the preceding query looks as follows (the whole response can be found in a file called response_query_facets.json provided with the book):

  "facets" : {
    "category_facet" : {
      "_type" : "terms",
      "missing" : 6175806,
      "total" : 16732022,
      "other" : 16091291,
      "terms" : [ {
        "term" : "Living people",
        "count" : 483501
      }, {
        "term" : "Year of birth missing (living people)",
        "count" : 39413
      }, {
        "term" : "English-language films",
        "count" : 22917
      }, {
        "term" : "American films",
        "count" : 16139
      }, {
        "term" : "Year of birth unknown",
        "count" : 15561
      }, {
        "term" : "The Football League players",
        "count" : 14020
      }, {
        "term" : "Main Belt asteroids",
        "count" : 13968
      }, {
        "term" : "Black-and-white films",
        "count" : 12945
      }, {
        "term" : "Year of birth missing",
        "count" : 12442
      }, {
        "term" : "English footballers",
        "count" : 9825
      } ]
    }
  } 

By default, the faceting results are sorted on the basis of the count property, which tells us how many documents belong to a particular category. Of course, we can do the same with aggregations by using the following query:

curl -XGET 'localhost:9200/wikipedia/_search?fields=title&pretty' -d '{
   "query": {
      "match_all": {}
   },
   "aggs": {
      "category_agg": {
         "terms": {
            "field": "category.untouched",
            "size": 10
         }
      }
   }
}'
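
As a side note, if we wanted the returned terms sorted alphabetically instead of by document count, the terms aggregation supports the order property. A variant of the preceding aggregation section may look as follows:

...
"category_agg": {
   "terms": {
      "field": "category.untouched",
      "size": 10,
      "order": {
         "_term": "asc"
      }
   }
}
...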

Now, if our user wants to narrow down the results to the English-language films category, we need to send the following query:

curl -XGET 'localhost:9200/wikipedia/_search?fields=title&pretty' -d '{
   "query": {
      "filtered": {
          "query" : {
              "match_all" : {}
          },
          "filter" : {
              "term": {
                  "category.untouched": "English-language films"
              } 
          }
      }
   },
   "facets": {
      "category_facet": {
         "terms": {
            "field": "category.untouched",
            "size": 10
         }
      }
   }
}'

We've changed our query to include a filter and, thus, we've narrowed down the set of documents on which the faceting will be calculated.
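
Note that the placement of the filter matters here. Because our filter is a part of the filtered query, Elasticsearch applies it before the facets are calculated. If we had used the top-level post filter instead (a sketch of such a query follows), the returned documents would be filtered, but the facet counts would still be calculated on all the documents matching the query:

curl -XGET 'localhost:9200/wikipedia/_search?fields=title&pretty' -d '{
   "query": {
      "match_all": {}
   },
   "post_filter": {
      "term": {
         "category.untouched": "English-language films"
      }
   },
   "facets": {
      "category_facet": {
         "terms": {
            "field": "category.untouched",
            "size": 10
         }
      }
   }
}'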

Of course, we can do the same with aggregations by using the following query:

curl -XGET 'localhost:9200/wikipedia/_search?fields=title&pretty' -d '{
   "query": {
      "filtered": {
          "query" : {
              "match_all" : {}
          },
          "filter" : {
              "term": {
                  "category.untouched": "English-language films"
              } 
          }
      }
   },
   "aggs": {
      "category_agg": {
         "terms": {
            "field": "category.untouched",
            "size": 10
         }
      }
   }
}'