Chapter 4. Make Your Search Better

In the previous chapter, we learned how to extend our index with additional information, how to handle highlighting, and how to index data that is not flat. We also implemented an autocomplete mechanism and indexed files and geographical data. In this chapter, we will go further. By the end of it, you will have learned the following:

  • Why your document was matched
  • How to influence document score
  • How to use synonyms
  • How to handle multilingual data
  • How to use term position aware queries (span queries)

Why this document was found

Compared to databases, systems capable of performing full-text search can behave in ways that are anything but obvious. We can search in many fields simultaneously, and the data in the index can differ from the data provided for indexing because of the analysis process, synonyms, language analysis, abbreviations, and so on. It gets even worse: by default, search engines sort results by score, a number that indicates how well the current document fits the current search criteria. Here, "how well" is the key phrase; the search takes into consideration many factors, such as how many of the searched words were found in the document, how frequent each word is in the whole index, and how long the field is. All this seems complicated, and finding out why one document was found and why another document is "better" is not easy. Fortunately, ElasticSearch has some tools that can answer these questions. Let's take a look at them!

Understanding how a field is analyzed

One of the common questions asked is why a given document was not found. In many cases, the problem lies in the definition of the mappings and the configuration of the analysis process. For debugging the analysis process, ElasticSearch provides a dedicated REST API endpoint. Let's see a few examples of how to use this API.

The first query asks ElasticSearch for information about the analysis process, using the default analyzer:

curl -XGET 'localhost:9200/_analyze?pretty' -d 'Crime and Punishment'

In response, we get the following data:

{
  "tokens" : [ {
    "token" : "crime",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "punishment",
    "start_offset" : 10,
    "end_offset" : 20,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}

As we can see, ElasticSearch divided the input phrase into two tokens. During processing, the common word and was omitted (because it belongs to the stop words list) and the other words were changed to their lowercase versions. Now let's take a look at something more complicated. In Chapter 3, Extending Your Structure and Search, when we talked about the autocomplete feature, we used the edge ngram filter. Let's recall that index and see how our analyzer works in this case:

curl -XGET 'localhost:9200/addressbook/_analyze?analyzer=autocomplete&pretty' -d 'John Smith'

In the preceding call, we used an additional parameter named analyzer, which you should already be familiar with—it tells ElasticSearch which analyzer should be used instead of the default one. Look at the returned result:

{
  "tokens" : [ {
    "token" : "joh",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "john",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "smi",
    "start_offset" : 5,
    "end_offset" : 8,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "smit",
    "start_offset" : 5,
    "end_offset" : 9,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "smith",
    "start_offset" : 5,
    "end_offset" : 10,
    "type" : "word",
    "position" : 5
  } ]
}

This time, in addition to splitting and lowercasing the words, the edge ngram filter was applied. Each token from our phrase was turned into a series of prefixes. Please note that the minimum length of the generated prefixes is three letters.
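As a reminder of how such an analyzer can be defined, the following is a minimal sketch of index settings that would produce tokens similar to the ones shown previously. The filter name engram and the exact gram lengths used here are assumptions for illustration; your actual configuration from Chapter 3 may differ:

curl -XPUT 'localhost:9200/addressbook' -d '{
  "settings" : {
    "analysis" : {
      "filter" : {
        "engram" : {
          "type" : "edgeNGram",
          "min_gram" : 3,
          "max_gram" : 10
        }
      },
      "analyzer" : {
        "autocomplete" : {
          "type" : "custom",
          "tokenizer" : "whitespace",
          "filter" : [ "lowercase", "engram" ]
        }
      }
    }
  }
}'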

It is worth noting that there is another form of the analysis API available, one that allows us to provide the tokenizer and filters directly in the request. It is very handy when we want to experiment with a configuration before creating the target mappings. An example of such a call is as follows:

curl -XGET 'localhost:9200/addressbook/_analyze?tokenizer=whitespace&filters=lowercase,engram&pretty' -d 'John Smith'

In the preceding example, we used an analyzer built on the fly from the whitespace tokenizer and two filters: lowercase and engram.
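The analysis API can also look up the analyzer from an existing field mapping by means of the field parameter. The following call is only a sketch and assumes that the addressbook index has a name field mapped with the autocomplete analyzer:

curl -XGET 'localhost:9200/addressbook/_analyze?field=name&pretty' -d 'John Smith'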

As we can see, the analysis API can be very useful for tracking down bugs in the mapping configuration, and it is invaluable when we want to solve problems with queries and search relevance. It can show us how our analyzers work, what terms they produce, and what the attributes of those terms are. With such information, query problems become much easier to track down.

Explaining the query

Let's look at the following example:

curl -XGET 'localhost:9200/library/book/1/_explain?pretty&q=quiet'

In the preceding call, we pointed to a specific document and provided a query to run. Using the _explain endpoint, we ask ElasticSearch to explain whether, and why, the document was matched by the provided query. If the document matches, ElasticSearch returns the information about why it was matched, along with the details of how its score was calculated:

{
  "ok" : true,
  "matches" : true,
  "explanation" : {
    "value" : 0.057534903,
    "description" : "fieldWeight(_all:quiet in 0), product of:",
    "details" : [ {
      "value" : 1.0,
      "description" : "tf(termFreq(_all:quiet)=1)"
    }, {
      "value" : 0.30685282,
      "description" : "idf(docFreq=1, maxDocs=1)"
    }, {
      "value" : 0.1875,
      "description" : "fieldNorm(field=_all, doc=0)"
    } ]
  }
}

Looks complicated, doesn't it? Well, it is complicated, and it gets even worse when we realize that this is only a simple query! ElasticSearch, or more precisely the underlying Lucene library, exposes the internal information about the scoring process. We will only scratch the surface and explain the most important parts.

The most important part is the total score calculated for the document. If it is equal to 0, the document won't be matched by the given query. Another important element is the description, which tells us about the different scoring components. Depending on the query type, the components may affect the final score in different ways. In our case, the total score is the product of the scores calculated by all the components.
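We can verify this with the response shown earlier; multiplying the values of the three components (term frequency, inverse document frequency, and field norm) gives exactly the total score reported at the top of the explanation:

1.0 * 0.30685282 * 0.1875 = 0.057534903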

The detailed information about the components is also important because it tells us where we should look for an explanation of why our document matched the query. In this example, we were looking for the word quiet. It was found in the _all field. This is expected, because we searched in the default field, which is _all (you should remember from Chapter 3, Extending Your Structure and Search, that this is the field to which all indexed data is copied by default in order to provide a default search field). In the preceding response, you can also see the information about the term frequency in the given field (which was 1 in our case). This means that the field contained only a single occurrence of the searched term. Finally, docFreq is equal to 1, which means that only one document contains the specified term, and maxDocs is also 1, which means that the whole index contains only one document. This usually means that we are dealing with a small index or that we've searched using a very rare word.
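The _explain endpoint also accepts a full query in the request body, which is handy when the query we want to debug is more complex than a single term passed in the q parameter. The following call is only a sketch and assumes that our book documents contain a title field holding the searched word:

curl -XGET 'localhost:9200/library/book/1/_explain?pretty' -d '{
  "query" : {
    "term" : {
      "title" : "quiet"
    }
  }
}'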
