Highlighting

You have probably heard of highlighting, and even if the name is unfamiliar, you have almost certainly seen it on the web pages you visit. Highlighting is the process of showing which word or words from the query were matched in the resulting documents. For example, if we search Google for the word lucene, we will see it in bold in the results list.

In this chapter, we will see how to use the ElasticSearch highlighting capabilities to enhance our application with highlighted results.

Getting started with highlighting

There is no better way of showing how highlighting works than making a query and looking at the results returned by ElasticSearch, so let's do that. Let's assume that we would like to highlight the words that were matched in the title field of our documents to improve the search experience of our users. We are again looking for the word crime and we would like to get highlighted results, so the following query would have to be sent:

{
 "query" : {
  "term" : {
   "title" : "crime"
  }
 },
 "highlight" : {
  "fields" : {
   "title" : {}
  }
 }
}

The response for such a query would be as follows:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 0.19178301, "_source" : { "title": "Crime and Punishment","otitle": "Преступлéние и наказáние","author": "Fyodor Dostoevsky","year": 1886,"characters": ["Raskolnikov", "Sofia Semyonovna Marmeladova"],"tags": [],"copies": 0, "available" : true},
      "highlight" : {
        "title" : [ "<em>Crime</em> and Punishment" ]
      }
    } ]
  }
}

As you can see, apart from the standard information we got from ElasticSearch, there is a new section called highlight. Here, ElasticSearch used the <em> HTML tag as the beginning of the highlight section and its closing counterpart to close the section. This is the default behavior of ElasticSearch, but we will learn how to change that.

Field configuration

In order to perform highlighting, the original content of the field needs to be present, so the fields that we will use for highlighting have to be marked as stored. However, if the fields are not stored, ElasticSearch can use the _source field instead, and it will choose one or the other automatically.
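
A stored title field could be declared with a mapping along the following lines. This is only a sketch: it assumes the library index and book type visible in the earlier response, a string field type, and the older mapping syntax in which a field is marked as stored with "store" : "yes" (later releases accept true instead).

```json
{
 "mappings" : {
  "book" : {
   "properties" : {
    "title" : { "type" : "string", "store" : "yes" }
   }
  }
 }
}
```

If the store property is left out, highlighting can still work from the _source field, as described above.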

Under the hood

ElasticSearch uses Apache Lucene under the hood and highlighting is one of the features of that library. Lucene provides two types of highlighting implementation: the standard one, which we just used, and a second one called FastVectorHighlighter, which needs term vectors with positions and offsets to be able to work. ElasticSearch chooses the right highlighter implementation automatically. If the field is configured with the term_vector property set to with_positions_offsets, FastVectorHighlighter will be used; otherwise, the default Lucene highlighter will be used.

Keep in mind that storing term vectors makes your index larger, although highlighting then takes less time to execute. FastVectorHighlighter is also recommended for fields that store a lot of data.
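
To make ElasticSearch pick FastVectorHighlighter for the title field, the term_vector property has to be set in the field mapping. The following fragment is a sketch under the same assumptions as before (the book type, a string field, and the older mapping syntax); only the term_vector value comes from the text above.

```json
{
 "mappings" : {
  "book" : {
   "properties" : {
    "title" : {
     "type" : "string",
     "term_vector" : "with_positions_offsets"
    }
   }
  }
 }
}
```

Because term vectors are written at index time, a change like this would normally require the data to be indexed again before it has any effect.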

Configuring HTML tags

As we already mentioned, it is possible to change the default HTML tags to the ones we would like to use. For example, let's assume that we would like to use the standard HTML <b> tag for highlighting. In order to do that, we should set the pre_tags and post_tags properties to <b> and </b>. Because both of these properties are arrays, we can include more than one tag, and ElasticSearch will use each of the defined tags to highlight different words. So our example query would be like the following:

{
 "query" : {
  "term" : {
   "title" : "crime"
  }
 },
 "highlight" : {
  "pre_tags" : [ "<b>" ],
  "post_tags" : [ "</b>" ],
  "fields" : {
   "title" : {}
  }
 }
}

The result returned by ElasticSearch to the previous query would be the following:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 0.19178301, "_source" : { "title": "Crime and Punishment","otitle": "Преступлéние и наказáние","author": "Fyodor Dostoevsky","year": 1886,"characters": ["Raskolnikov", "Sofia Semyonovna Marmeladova"],"tags": [],"copies": 0, "available" : true},
      "highlight" : {
        "title" : [ "<b>Crime</b> and Punishment" ]
      }
    } ]
  }
}

As you can see, the word Crime in title was surrounded by the tags of our choice.
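
Since pre_tags and post_tags are arrays, more than one pair of tags can be given. The query below is a sketch that searches for two terms and defines two tag pairs; the bool query and the second term are assumptions made for illustration. Which tag ends up around which word is decided by ElasticSearch and may depend on the highlighter in use (multiple tags are generally associated with FastVectorHighlighter), so no response is shown here.

```json
{
 "query" : {
  "bool" : {
   "should" : [
    { "term" : { "title" : "crime" } },
    { "term" : { "title" : "punishment" } }
   ]
  }
 },
 "highlight" : {
  "pre_tags" : [ "<b>", "<em>" ],
  "post_tags" : [ "</b>", "</em>" ],
  "fields" : {
   "title" : {}
  }
 }
}
```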

Controlling highlighted fragments

ElasticSearch allows us to control the number of highlighted fragments returned and their sizes, and exposes two properties for that purpose. The first one, number_of_fragments, defines the number of fragments returned by ElasticSearch (it defaults to 5). Setting this property to 0 causes the whole field to be returned, which can be handy for short fields; however, it can be expensive for longer fields.

The second property, fragment_size, lets us specify the maximum length of the highlighted fragments in characters and defaults to 100.
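
As a sketch, both properties could be set for the title field as follows. The values 2 and 50 are arbitrary and chosen only for illustration.

```json
{
 "query" : {
  "term" : {
   "title" : "crime"
  }
 },
 "highlight" : {
  "fields" : {
   "title" : { "number_of_fragments" : 2, "fragment_size" : 50 }
  }
 }
}
```

Changing number_of_fragments to 0 in this query would return the whole title field with the highlighting applied, in which case fragment_size would presumably no longer matter.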

Global and local settings

The highlighting properties we discussed previously can be set both globally and on a per-field basis. The global ones will be used for all the fields that don't override them and should be placed on the same level as the fields section of your highlighting, for example:

{
 "query" : {
  "term" : {
   "title" : "crime"
  }
 },
 "highlight" : {
  "pre_tags" : [ "<b>" ],
  "post_tags" : [ "</b>" ],
  "fields" : {
   "title" : {}
  }
 }
}

You can also set the properties for each field. For example, if we would like to keep the default behavior for all the fields except our title field, we would do the following:

{
 "query" : {
  "term" : {
   "title" : "crime"
  }
 },
 "highlight" : {
  "fields" : {
   "title" : { "pre_tags" : [ "<b>" ], "post_tags" : [ "</b>" ] }
  }
 }
}

As you can see, instead of placing the properties on the same level as the fields section we placed them inside the empty JSON object that specifies the field title's behavior. Of course, each field can be configured to use different properties.
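
Global and local settings can also be combined in a single request. In the sketch below, the <b> tags are defined globally and apply to the title field, while the otitle field (taken from the _source shown earlier) overrides them with <i> tags, which were picked only for illustration. Whether otitle produces any highlight depends on the query and the data.

```json
{
 "query" : {
  "term" : {
   "title" : "crime"
  }
 },
 "highlight" : {
  "pre_tags" : [ "<b>" ],
  "post_tags" : [ "</b>" ],
  "fields" : {
   "title" : {},
   "otitle" : { "pre_tags" : [ "<i>" ], "post_tags" : [ "</i>" ] }
  }
 }
}
```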

Require matching

One last thing about highlighting: sometimes there may be a need (especially when using multiple highlighted fields) to highlight only the fields that matched our query. In order to have such behavior, we need to set the require_field_match property to true. Setting this property to false will cause the query terms to be highlighted in every requested field, even if that field didn't match the query.

To see how that works, let's create a new index called users and let's index a single document there. We will do that by sending the following command:

curl -XPUT 'http://localhost:9200/users/user/1' -d '{
 "name" : "Test user",
 "description" : "Test document"
}'

Let's assume we want to highlight the hits in both of the previous fields. Our query will look like the following:

{
 "query" : {
  "term" : {
   "name" : "test"
  }
 },
 "highlight" : {
  "fields" : {
   "name" : { "pre_tags" : [ "<b>" ], "post_tags" : [ "</b>" ] },
   "description" : { "pre_tags" : [ "<b>" ], "post_tags" : [ "</b>" ] }
  }
 }
}

The result of the query would be as follows:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "users",
      "_type" : "user",
      "_id" : "1",
      "_score" : 0.19178301, "_source" : {"name" : "Test user","description" : "Test document"},
      "highlight" : {
        "description" : [ "<b>Test</b> document" ],
        "name" : [ "<b>Test</b> user" ]
      }
    } ]
  }
}

Notice that even though we only matched the name field, we got highlighting results in both fields. In most cases, we want to avoid that. So now let's modify our query to use the require_field_match property:

{
 "query" : {
  "term" : {
   "name" : "test"
  }
 },
 "highlight" : {
  "require_field_match" : true,
  "fields" : {
   "name" : { "pre_tags" : [ "<b>" ], "post_tags" : [ "</b>" ] },
   "description" : { "pre_tags" : [ "<b>" ], "post_tags" : [ "</b>" ] }
  }
 }
}

And now let's look at the modified query results:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "users",
      "_type" : "user",
      "_id" : "1",
      "_score" : 0.19178301, "_source" : {"name" : "Test user","description" : "Test document"},
      "highlight" : {
        "name" : [ "<b>Test</b> user" ]
      }
    } ]
  }
}

As you can see, ElasticSearch returned only the field that was matched, in our case the name field.
