Highlighting

You have probably seen highlighting even if you did not know its name: the large public search engines on the World Wide Web (WWW) use it all the time. When we talk about highlighting in the context of full text search, we usually mean showing which words or phrases from the query were matched in the resulting documents. For example, if we use Google and search for the word lucene, we would see that word bolded in the search results:

[Screenshot: Google search results with the word lucene shown in bold]

It is even more visible on the Microsoft Bing search engine:

[Screenshot: Bing search results with the matched word highlighted]

In this chapter, we will see how to use Elasticsearch highlighting capabilities to enhance our application with highlighted results.

Getting started with highlighting

There is no better way of showing how highlighting works than making a query and looking at the results returned by Elasticsearch. So let's do that. We assume that we would like to highlight the terms that are matched in the title field of our documents to improve the search experience of our users. By now you know the example data from top to bottom, so let's reuse the same data set. We want to match the term crime in the title field and we want to get highlighting results. One of the simplest queries that can achieve this looks as follows:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
 "query" : {
  "match" : {
   "title" : "crime"
  }
 },
 "highlight" : {
  "fields" : {
   "title" : {}
  }
 }
}'

The response for the preceding query is as follows:

{
  "took" : 16,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.5,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 0.5,
      "_source" : {
        "title" : "Crime and Punishment",
        "otitle" : "Преступлéние и наказáние",
        "author" : "Fyodor Dostoevsky",
        "year" : 1886,
        "characters" : [ "Raskolnikov", "Sofia Semyonovna Marmeladova" ],
        "tags" : [ ],
        "copies" : 0,
        "available" : true
      },
      "highlight" : {
        "title" : [ "<em>Crime</em> and Punishment" ]
      }
    } ]
  }
}

As you can see, apart from the standard information about the documents that matched the query, we got a new section called highlight. Elasticsearch used the <em> HTML tag as the beginning of the highlighting section and its closing counterpart to close the highlighted section. This is the default behavior of Elasticsearch, but we will learn how to change that.
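On the application side, the highlight section usually has to be pulled out of each hit. The following Python sketch shows one possible way to do that. It assumes the response has already been parsed into a dictionary; the function name collect_highlights is ours and is not part of any Elasticsearch client.

```python
import json

# A trimmed copy of the response shown above, as it would look after parsing.
response = json.loads("""
{
  "hits": {
    "total": 1,
    "hits": [ {
      "_id": "4",
      "_source": { "title": "Crime and Punishment" },
      "highlight": { "title": [ "<em>Crime</em> and Punishment" ] }
    } ]
  }
}
""")


def collect_highlights(response):
    """Map each document id to its highlighted fragments, keyed by field."""
    result = {}
    for hit in response["hits"]["hits"]:
        # A hit may come back without a "highlight" key, so fall back to {}.
        result[hit["_id"]] = hit.get("highlight", {})
    return result


for doc_id, fields in collect_highlights(response).items():
    for field, fragments in fields.items():
        for fragment in fragments:
            print(doc_id, field, fragment)
```

Each field maps to a list because Elasticsearch can return more than one fragment per field.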

Field configuration

In order to perform highlighting, Elasticsearch needs access to the original content of the field. This means that each field we want to highlight must either be marked as stored in the mappings or be included in the _source field. If the field is set to be stored in the mappings, the stored version will be used; otherwise, Elasticsearch will try to use the _source field and extract the field that needs to be highlighted.

Under the hood

Elasticsearch uses Apache Lucene under the hood and highlighting is one of the features of that library. Lucene provides three types of highlighting implementation: the standard one, which we just used; the second one called FastVectorHighlighter (https://lucene.apache.org/core/5_4_0/highlighter/org/apache/lucene/search/vectorhighlight/FastVectorHighlighter.html), which needs term vectors with positions and offsets to be able to work; and the third one called PostingsHighlighter (http://lucene.apache.org/core/5_4_0/highlighter/org/apache/lucene/search/postingshighlight/PostingsHighlighter.html). Elasticsearch chooses the highlighter implementation automatically. If the field is configured with the term_vector property set to with_positions_offsets, FastVectorHighlighter will be used. If the field is configured with the index_options property set to offsets, PostingsHighlighter will be used. Otherwise, the standard highlighter will be used by Elasticsearch.
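The selection rule described above can be written down as a small Python function. This is only a model of the rule as stated in the text, not Elasticsearch code. The function name choose_highlighter is hypothetical, and the returned strings are the short names accepted by the type property described in the next section.

```python
def choose_highlighter(field_mapping):
    """Return the highlighter a field would get under the rule in the text."""
    # Term vectors with positions and offsets select FastVectorHighlighter.
    if field_mapping.get("term_vector") == "with_positions_offsets":
        return "fvh"
    # Offsets stored in the postings list select PostingsHighlighter.
    if field_mapping.get("index_options") == "offsets":
        return "postings"
    # Everything else falls back to the standard highlighter.
    return "plain"


print(choose_highlighter({"type": "string"}))
print(choose_highlighter({"type": "string", "index_options": "offsets"}))
print(choose_highlighter({"type": "string",
                          "term_vector": "with_positions_offsets"}))
```

The order of the checks follows the order in which the text lists the conditions.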

Which highlighter to use depends on your data, your queries, and the performance you need. The standard highlighter is the general-purpose choice. However, if you want to highlight fields with lots of data, FastVectorHighlighter is the recommended one. Keep in mind that it requires term vectors to be present, which will make your index larger. Finally, the fastest highlighter, which is also recommended for natural language highlighting, is PostingsHighlighter. However, PostingsHighlighter doesn't support complex queries such as the match_phrase_prefix query, and in such cases highlighting won't be returned.

Forcing highlighter type

While Elasticsearch chooses the highlighter type for us, we can also enforce the highlighting type if we really want to. To do that, we need to set the type property to one of the following values:

  • plain: When this value is set, Elasticsearch will use the standard highlighter.
  • fvh: When this value is set, Elasticsearch will try using FastVectorHighlighter. It will require term vectors to be turned on for the field used for highlighting.
  • postings: When this value is set, Elasticsearch will try using PostingsHighlighter. It will require offsets to be turned on for the field used for highlighting.

For example, to use the standard highlighter, we will run the following query:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
 "query" : {
  "term" : {
   "title" : "crime"
  }
 },
 "highlight" : {
  "fields" : {
   "title" : { "type" : "plain" }
  }
 }
}'
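Forcing fvh only works when the field has term vectors, so the mapping and the query have to agree. The Python sketch below prints a mapping and a matching search body that could be passed to curl with -d. It assumes a new index is being created for this purpose; neither body is meant to be applied to the existing library index.

```python
import json

# Hypothetical mapping for a new index: the title field keeps term vectors
# with positions and offsets, which FastVectorHighlighter needs.
mapping = {
    "book": {
        "properties": {
            "title": {
                "type": "string",
                "term_vector": "with_positions_offsets"
            }
        }
    }
}

# Search body that forces the fvh highlighter for the title field.
search_body = {
    "query": {"term": {"title": "crime"}},
    "highlight": {"fields": {"title": {"type": "fvh"}}}
}

print(json.dumps(mapping, indent=1))
print(json.dumps(search_body, indent=1))
```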

Configuring HTML tags

The default behavior of the highlighting mechanism may not suit everyone; not all of us want the <em> and </em> tags to mark the highlighted text. Because of that, Elasticsearch allows us to change the tags that are used for that purpose. To do that, we set the pre_tags and post_tags properties to the strings that should open and close each highlighted section. Both properties are arrays, so we can provide more than a single opening and closing tag, and Elasticsearch will use each of the defined tags to highlight different words. For example, if we want to use <b> as the opening tag and </b> as the closing tag, our query will look like this:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
 "query" : {
  "term" : {
   "title" : "crime"
  }
 },
 "highlight" : {
  "pre_tags" : [ "<b>" ],
  "post_tags" : [ "</b>" ],
  "fields" : {
   "title" : {}
  }
 }
}'

The result returned by Elasticsearch to the preceding query will be as follows:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.5,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 0.5,
      "_source" : {
        "title" : "Crime and Punishment",
        "otitle" : "Преступлéние и наказáние",
        "author" : "Fyodor Dostoevsky",
        "year" : 1886,
        "characters" : [ "Raskolnikov", "Sofia Semyonovna Marmeladova" ],
        "tags" : [ ],
        "copies" : 0,
        "available" : true
      },
      "highlight" : {
        "title" : [ "<b>Crime</b> and Punishment" ]
      }
    } ]
  }
}

As you can see, the term Crime in the title field was surrounded by the tags of our choice.
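Because the tags are under our control, the application can also use them to recover which terms were matched. The following Python sketch does that with a regular expression. The helper name highlighted_terms is hypothetical, and the sketch assumes the tags do not appear in the field content itself.

```python
import re


def highlighted_terms(fragment, pre_tag="<em>", post_tag="</em>"):
    """Return the pieces of text wrapped in the given highlighting tags."""
    # Escape the tags so that characters such as "/" or "<" are taken
    # literally, and use a non-greedy group to stop at the first closing tag.
    pattern = re.escape(pre_tag) + r"(.*?)" + re.escape(post_tag)
    return re.findall(pattern, fragment)


print(highlighted_terms("<b>Crime</b> and Punishment", "<b>", "</b>"))
print(highlighted_terms("<em>Crime</em> and Punishment"))
```

Both calls print ['Crime'].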

Controlling highlighted fragments

Elasticsearch allows us to control the number of highlighted fragments returned and their sizes by exposing two properties. The first one is number_of_fragments, which defines the number of fragments returned by Elasticsearch (defaults to 5). Setting this property to 0 causes the whole field to be returned, which can be handy for short fields but expensive for longer fields; in that case fragment_size is ignored. The second property is fragment_size, which lets us specify the maximum length of the highlighted fragments in characters and defaults to 100.

An example query using these properties will look as follows:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
 "query" : {
  "term" : {
   "title" : "crime"
  }
 },
 "highlight" : {
  "fields" : {
   "title" : { "fragment_size" : 200, "number_of_fragments" : 0 }
  }
 }
}'
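When such requests are generated by an application, it helps to keep the defaults in one place. The Python sketch below builds a search body with the two properties. The function name title_highlight_body is hypothetical, and its default arguments simply repeat the Elasticsearch defaults quoted above.

```python
import json


def title_highlight_body(term, fragment_size=100, number_of_fragments=5):
    """Build a search body that highlights `term` in the title field."""
    if fragment_size < 0 or number_of_fragments < 0:
        raise ValueError("fragment settings must not be negative")
    return {
        "query": {"term": {"title": term}},
        "highlight": {
            "fields": {
                "title": {
                    "fragment_size": fragment_size,
                    "number_of_fragments": number_of_fragments,
                }
            }
        },
    }


# number_of_fragments=0 asks for the whole field instead of fragments.
body = title_highlight_body("crime", number_of_fragments=0)
print(json.dumps(body, indent=1))
```

The printed JSON can be used as the body of the curl command shown above.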

Global and local settings

The highlighting properties we discussed previously can be set both globally and per field. The global ones will be used for all the fields that don't override them and should be placed on the same level as the fields section of the highlight object, for example, like this:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
 "query" : {
  "term" : {
   "title" : "crime"
  }
 },
 "highlight" : {
  "pre_tags" : [ "<b>" ],
  "post_tags" : [ "</b>" ],
  "fields" : {
   "title" : {}
  }
 }
}'

You can also set the properties for each field. For example, if we would like to keep the default behavior for all the fields except our title field, we would do the following:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
 "query" : {
  "term" : {
   "title" : "crime"
  }
 },
 "highlight" : {
  "fields" : {
   "title" : { "pre_tags" : [ "<b>" ], "post_tags" : [ "</b>" ] }
  }
 }
}'

As you can see, instead of placing the properties on the same level as the fields section, we placed them inside the JSON object that specifies the title field behavior. Of course, each field can be configured using different properties.
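The way global and per-field settings combine can be modeled in a few lines of Python. This is a sketch of the override rule as described in the text, not of the Elasticsearch implementation; the function name effective_settings is hypothetical, and the author field is used only as a second example field.

```python
def effective_settings(highlight, field):
    """Merge global highlight settings with the overrides of one field."""
    # Everything next to "fields" is treated as a global setting.
    merged = {key: value for key, value in highlight.items()
              if key != "fields"}
    # Per-field settings win over the global ones.
    merged.update(highlight["fields"].get(field, {}))
    return merged


highlight = {
    "pre_tags": ["<b>"],
    "post_tags": ["</b>"],
    "fields": {
        "title": {},
        "author": {"pre_tags": ["<i>"], "post_tags": ["</i>"]},
    },
}

print(effective_settings(highlight, "title"))
print(effective_settings(highlight, "author"))
```

Here the title field ends up with the global <b> tags, while the author field uses its own <i> tags.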

Require matching

Sometimes (especially when highlighting multiple fields) we want highlights only for the fields that the query actually matched. To get that behavior, we need to set the require_field_match property to true. Setting this property to false will cause matching terms to be highlighted in every requested field, even in fields that the query did not match.

To see how that works, let's create a new index called users and let's index a single document there. We will do that by sending the following command:

curl -XPUT 'http://localhost:9200/users/user/1' -d '{
 "name" : "Test user",
 "description" : "Test document"
}'

So, let's assume we want to highlight the hits in both of the preceding fields. Our command sending the query to our new index will look like this:

curl -XGET 'localhost:9200/users/_search?pretty' -d '{
 "query" : {
  "term" : {
   "name" : "test"
  }
 },
 "highlight" : {
  "fields" : {
   "name" : { "pre_tags" : ["<b>"], "post_tags" : ["</b>"] },
   "description" : { "pre_tags" : ["<b>"], "post_tags" : ["</b>"] }
  }
 }
}'

The result of the preceding query will be as follows:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "users",
      "_type" : "user",
      "_id" : "1",
      "_score" : 0.19178301,
      "_source":{
        "name" : "Test user",
        "description" : "Test document"
      },
      "highlight" : {
        "name" : [ "<b>Test</b> user" ]
      }
    } ]
  }
}

Note that we only got highlighting on the name field. This is because our query matched only that field. Let's see what will happen if we set the require_field_match property to false and use a command similar to the following one:

curl -XGET 'localhost:9200/users/_search?pretty' -d '{
 "query" : {
  "term" : {
   "name" : "test"
  }
 },
 "highlight" : {
  "require_field_match" : false,
  "fields" : {
   "name" : { "pre_tags" : ["<b>"], "post_tags" : ["</b>"] },
   "description" : { "pre_tags" : ["<b>"], "post_tags" : ["</b>"] }
  }
 }
}'

Now let's look at the modified query results:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "users",
      "_type" : "user",
      "_id" : "1",
      "_score" : 0.19178301,
      "_source":{
        "name" : "Test user",
        "description" : "Test document"
      },
      "highlight" : {
        "name" : [ "<b>Test</b> user" ],
        "description" : [ "<b>Test</b> document" ]
      }
    } ]
  }
}

As you can see, Elasticsearch returned highlighting in both fields now.
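The effect of require_field_match on our test document can be reproduced with a toy model in Python. This is a deliberate simplification under our own assumptions: it looks for a single term with a case-insensitive regular expression and knows nothing about analyzers or real queries. The function name toy_highlight is hypothetical.

```python
import re


def toy_highlight(doc, query_field, term, fields, require_field_match=True,
                  pre_tag="<b>", post_tag="</b>"):
    """Toy model: highlight `term` in the requested fields of one document."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    result = {}
    for field in fields:
        # With require_field_match on, only the queried field is considered.
        if require_field_match and field != query_field:
            continue
        text = doc.get(field, "")
        if pattern.search(text):
            result[field] = [pattern.sub(
                lambda m: pre_tag + m.group(0) + post_tag, text)]
    return result


doc = {"name": "Test user", "description": "Test document"}
fields = ["name", "description"]
print(toy_highlight(doc, "name", "test", fields))
print(toy_highlight(doc, "name", "test", fields, require_field_match=False))
```

The first call returns a highlight for the name field only; the second one returns highlights for both fields, mirroring the two responses shown above.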

Custom highlighting query

There are use cases where your queries are complicated and not really suitable for highlighting, but you still want to use highlighting functionality. In such cases, Elasticsearch allows us to highlight results on the basis of a different query provided using the highlight_query property. An example of using a different highlighting query looks as follows:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
 "query" : {
  "term" : {
   "title" : "crime"
  }
 },
 "highlight" : {
  "fields" : {
   "title" : { 
    "highlight_query" : {
     "term" : {
      "title" : "punishment"
     }
    }
   }
  }
 }
}'

The preceding query will result in highlighting the term punishment in the title field, instead of the term crime.

The Postings highlighter

It is time to talk about the third available highlighter. It was added in Elasticsearch 0.90.6 and is slightly different from the previous ones. PostingsHighlighter is automatically used when the field definition has index_options set to offsets. To illustrate how PostingsHighlighter works, we will create a simple index with proper configuration that allows that highlighter to work. We will do that by using the following commands:

curl -XPUT 'localhost:9200/hl_test'
curl -XPOST 'localhost:9200/hl_test/doc/_mapping' -d '{
 "doc" : {
  "properties" : {
   "contents" : {
    "type" : "string",
    "fields" : {
     "ps" : { "type" : "string", "index_options" : "offsets" }
    }
   }
  }
 }
}'

If everything goes well, we should have a new index and the mappings. The mappings have two fields defined: one named contents and the second one named contents.ps. In this second case, we turned on the offsets by using the index_options property. This means that Elasticsearch will use the standard highlighter for the contents field and the postings highlighter for the contents.ps field.

To see the difference, we will index a single document with a fragment from Wikipedia describing the history of Birmingham. We do that by running the following command:

curl -XPUT localhost:9200/hl_test/doc/1 -d '{
  "contents" : "Birmingham''s early history is that of a remote and marginal area. The main centres of population, power and wealth in the pre-industrial English Midlands lay in the fertile and accessible river valleys of the Trent, the Severn and the Avon. The area of modern Birmingham lay in between, on the upland Birmingham Plateau and within the densely wooded and sparsely populated Forest of Arden."
}'

The last step is to send a query using both highlighters. We can do it in a single request by using the following command:

curl 'localhost:9200/hl_test/_search?pretty' -d '{
 "query": {
  "term": {
   "contents.ps": "modern"
  }
 },
 "highlight": {
  "require_field_match" : false,
  "fields": {
   "contents": {},
   "contents.ps" : {}
  }
 }
}'

If everything goes well, you will find the following snippet in the response returned by Elasticsearch:

"highlight" : {
 "contents" : [ " valleys of the Trent, the Severn and the Avon. The area of <em>modern</em> Birmingham lay in between, on the upland" ],
 "contents.ps" : [ "The area of <em>modern</em> Birmingham lay in between, on the upland Birmingham Plateau and within the densely wooded and sparsely populated Forest of Arden." ]
}

As you can see, both highlighters found the occurrence of the desired word. The difference is that the postings highlighter returns a smarter snippet: it respects sentence boundaries.

Let's try one more query:

curl 'localhost:9200/hl_test/_search?pretty' -d '{
 "query": {
  "match_phrase": {
   "contents.ps": "centres of"
  }
 },
 "highlight": {
  "require_field_match" : false,
  "fields": {
   "contents": {},
   "contents.ps": {}
  }
 }
}'

We searched for the phrase centres of. As you may expect, the results for the two highlighters will differ. For the standard highlighter, run on the contents field, we will find the following phrase in the response:

"Birminghams early history is that of a remote and marginal area. The main <em>centres</em> <em>of</em> population"

As you can clearly see, the standard highlighter divided the given phrase and highlighted individual terms. Also, not all occurrences of the terms centres and of were highlighted, but only the ones that formed the phrase.

On the other hand, PostingsHighlighter returned the following highlighted fragments:

"Birminghams early history is that <em>of</em> a remote and marginal area.", "The main <em>centres</em> <em>of</em> population, power and wealth in the pre-industrial English Midlands lay in the fertile and accessible river valleys <em>of</em> the Trent, the Severn and the Avon.", "The area <em>of</em> modern Birmingham lay in between, on the upland Birmingham Plateau and within the densely wooded and sparsely populated Forest <em>of</em> Arden."

This is a significant difference. PostingsHighlighter highlighted all the terms matching the query, not only those that formed the phrase, and returned whole sentences. This is a very nice feature, especially when you want to display the highlighting results for the user in the UI of your application.
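The sentence-oriented behavior can be imitated with a short Python sketch. It is a toy model and not the Lucene algorithm: sentences are split with a simple regular expression, every occurrence of any query term is wrapped in tags, and only sentences with at least one match are returned. The function name sentence_snippets is hypothetical.

```python
import re

# The indexed text as it appears in the highlighting output above.
TEXT = ("Birminghams early history is that of a remote and marginal area. "
        "The main centres of population, power and wealth in the "
        "pre-industrial English Midlands lay in the fertile and accessible "
        "river valleys of the Trent, the Severn and the Avon. "
        "The area of modern Birmingham lay in between, on the upland "
        "Birmingham Plateau and within the densely wooded and sparsely "
        "populated Forest of Arden.")


def sentence_snippets(text, terms, pre_tag="<em>", post_tag="</em>"):
    """Return whole sentences containing any term, with the terms tagged."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE)
    return [pattern.sub(lambda m: pre_tag + m.group(0) + post_tag, sentence)
            for sentence in sentences if pattern.search(sentence)]


for snippet in sentence_snippets(TEXT, ["centres", "of"]):
    print(snippet)
```

For the terms centres and of, the sketch prints three tagged sentences, the same shape of result as the PostingsHighlighter output shown above.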
