Understanding the explain information

Compared to databases, the behavior of systems capable of performing full-text search is often far from obvious. We can search in many fields simultaneously, and the data in the index can differ from the values originally provided in the document fields (because of the analysis process, synonyms, abbreviations, and so on). It gets even more complicated: by default, search engines sort data by relevance, which means that each document is given a number indicating how similar it is to the query. The key point here is understanding the phrase how similar. As we discussed at the beginning of the chapter, scoring takes many factors into account – how many of the searched words were found in the document, how frequent each word is, how many terms are in the field, and so on. This seems complicated, and finding out why one document was found and why another document scored better is not easy. Fortunately, Elasticsearch provides us with tools that can answer these questions, and we will look at them in this section.

Understanding field analysis

One of the common questions asked when analyzing the returned documents is why a given document was not found. In many cases, the problem lies in the mappings definition and the analysis process configuration. To help with debugging the analysis process, Elasticsearch provides a dedicated REST endpoint – _analyze.

Using it is very simple. Let's ask Elasticsearch how the Crime and Punishment phrase is analyzed. To do that, we will send an HTTP GET request to the _analyze REST endpoint and provide the phrase as the request body. The following command does that:

curl -XGET 'localhost:9200/_analyze?pretty' -d 'Crime and Punishment'

In response, we get the following data:

{
  "tokens" : [ {
    "token" : "crime",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "and",
    "start_offset" : 6,
    "end_offset" : 9,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "punishment",
    "start_offset" : 10,
    "end_offset" : 20,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}

As we can see, Elasticsearch divided the input phrase into three tokens: during processing, the phrase was split on whitespace characters and lowercased. This shows us exactly what happens during the analysis process. We can also provide the name of the analyzer to use. For example, we can change the preceding command to something like this:

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'Crime and Punishment'

The preceding command will allow us to check how the standard analyzer analyzes the data.
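
If we want to see how the choice of analyzer changes the output, we can, for example, run the same phrase through the built-in keyword analyzer (this is just an illustration; any other analyzer available in our cluster could be used here):

curl -XGET 'localhost:9200/_analyze?analyzer=keyword&pretty' -d 'Crime and Punishment'

This time the whole phrase should come back as a single, unchanged token, because the keyword analyzer doesn't split or lowercase its input.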

It is worth noting that there is another form of the analysis API available – one that allows us to provide a tokenizer and filters. It is very handy when we want to experiment with the configuration before creating the target mappings. Instead of specifying the analyzer parameter in the request, we provide the tokenizer and the filters parameters: a single tokenizer and a comma-separated list of filters. For example, to illustrate how tokenization using the whitespace tokenizer works together with the lowercase and kstem filters, we would run the following request:

curl -XGET 'localhost:9200/library/_analyze?tokenizer=whitespace&filters=lowercase,kstem&pretty' -d 'John Smith'

As we can see, the analysis API can be very useful for tracking down bugs in the mapping configuration. It is also invaluable when we want to solve problems with queries and matching: it shows us how our analyzers work, what terms they produce, and what the attributes of those terms are. With such information, query problems become much easier to track down.
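
For example, when we suspect that a problem is caused by the analyzer configured for a particular field, we can skip naming an analyzer and use the field parameter instead, so that the text is processed exactly as it would be during indexing of that field. Assuming that our library index contains a title field, such a request could look as follows:

curl -XGET 'localhost:9200/library/_analyze?field=title&pretty' -d 'Crime and Punishment'

Elasticsearch will pick the analyzer from the mappings of the title field (or fall back to the default one if nothing is configured) and return the produced tokens, just like in the previous examples.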

Explaining the query

In addition to looking at what happened during analysis, Elasticsearch allows us to explain how the score was calculated for a particular query and document. Let's look at the following example:

curl -XGET 'localhost:9200/library/book/1/_explain?pretty&q=quiet'

The preceding request specifies a document and a query to run: the document is identified in the URI, and the query is passed using the q parameter. Using the _explain endpoint, we ask Elasticsearch to explain how the document was matched (or why it was not matched). The response returned for the preceding request looks as follows:

{
  "_index" : "library",
  "_type" : "book",
  "_id" : "1",
  "matched" : true,
  "explanation" : {
    "value" : 0.057534903,
    "description" : "sum of:",
    "details" : [ {
      "value" : 0.057534903,
      "description" : "weight(_all:quiet in 0) [PerFieldSimilarity], result of:",
      "details" : [ {
        "value" : 0.057534903,
        "description" : "fieldWeight in 0, product of:",
        "details" : [ {
          "value" : 1.0,
          "description" : "tf(freq=1.0), with freq of:",
          "details" : [ {
            "value" : 1.0,
            "description" : "termFreq=1.0",
            "details" : [ ]
          } ]
        }, {
          "value" : 0.30685282,
          "description" : "idf(docFreq=1, maxDocs=1)",
          "details" : [ ]
        }, {
          "value" : 0.1875,
          "description" : "fieldNorm(doc=0)",
          "details" : [ ]
        } ]
      } ]
    }, {
      "value" : 0.0,
      "description" : "match on required clause, product of:",
      "details" : [ {
        "value" : 0.0,
        "description" : "# clause",
        "details" : [ ]
      }, {
        "value" : 3.2588913,
        "description" : "_type:book, product of:",
        "details" : [ {
          "value" : 1.0,
          "description" : "boost",
          "details" : [ ]
        }, {
          "value" : 3.2588913,
          "description" : "queryNorm",
          "details" : [ ]
        } ]
      } ]
    } ]
  }
}

It may look slightly complicated and, well, it is complicated. It is even worse when we realize that this is only a simple query! Elasticsearch, and more specifically the underlying Lucene library, exposes the internal information about the scoring process. We will only scratch the surface here and explain the most important parts of the preceding response.

The first thing you can notice is that Elasticsearch tells us whether the document was a match for the particular query or not. If the matched property is set to true, the document matched the provided query.

The next important thing is the explanation object. It contains three properties: value, description, and details. The value is the score calculated for the given part of the query, the description is a simplified text representation of the internal score calculation, and the details array contains the nested parts of the calculation. The nice thing is that each element of the details array again contains the same three properties, and this recursive structure is how Elasticsearch describes the whole score calculation.

For example, let's analyze the following part of the response:

    "value" : 0.057534903,
    "description" : "sum of:",
    "details" : [ {
      "value" : 0.057534903,
      "description" : "weight(_all:quiet in 0) [PerFieldSimilarity], result of:",
      "details" : [ {
        "value" : 0.057534903,
        "description" : "fieldWeight in 0, product of:",
        "details" : [ {
          "value" : 1.0,
          "description" : "tf(freq=1.0), with freq of:",
          "details" : [ {
            "value" : 1.0,
            "description" : "termFreq=1.0",
            "details" : [ ]
          } ]
        }, {
          "value" : 0.30685282,
          "description" : "idf(docFreq=1, maxDocs=1)",
          "details" : [ ]
        }, {
          "value" : 0.1875,
          "description" : "fieldNorm(doc=0)",
          "details" : [ ]
        } ]
      } ]

The score of this element is 0.057534903 (the value property) and, as the description property tells us, it is the sum of all the inner elements. In the description on the first level of nesting of the preceding fragment, we can see that PerFieldSimilarity was used and that the score of that element is the result of its inner elements – the second level of nesting.

Going deeper into the details nesting, we can see the fieldWeight element, whose score is the product of the three elements nested below it. Those elements show various internal statistics retrieved from the index: the term frequency, which tells us how often the term appears in the given document field (termFreq=1.0); the inverse document frequency, which tells us in how many documents the term appears (idf(docFreq=1, maxDocs=1)); and the field length normalization factor (fieldNorm(doc=0)). Indeed, multiplying these values, 1.0 * 0.30685282 * 0.1875, gives the 0.057534903 score we saw at the top of the explanation.

The Explain API supports the following parameters: analyze_wildcard, analyzer, default_operator, df, fields, lenient, lowercase_expanded_terms, parent, preference, routing, _source, _source_exclude, and _source_include. To learn more about all these parameters, refer to the official Elasticsearch documentation regarding Explain API, which is available at https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html.
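
Of course, the query does not have to be passed in the q parameter. Just like the _search endpoint, the _explain endpoint also accepts a full query in the request body. For example, a sketch of such a request explaining the same document against a term query (assuming that our documents contain a title field) could look as follows:

curl -XGET 'localhost:9200/library/book/1/_explain?pretty' -d '{
 "query" : {
  "term" : { "title" : "quiet" }
 }
}'

The structure of the response is the same as the one we discussed previously; only the calculated values will reflect the new query.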
