Autocomplete

Modern search is hard to imagine without autocomplete functionality. It gives users a convenient way to find items whose exact spelling they don't know, and it can also serve as a good marketing tool. For these reasons, sooner or later you'll want to know how to implement this feature.

Before we configure autocomplete, we should ask ourselves a few questions: what data do we want to use for suggestions? Do we have a set of suggestions already prepared (such as country names) or do we want to generate them dynamically, based on indexed documents? Do we want to suggest words or whole documents? Do we need information about the number of suggested items? And finally, do we want to display only one field from the document or a few (for example, product name and price)? Each possible solution has its pros and cons and supports one requirement at the cost of another. Now, let's go through three common ways to implement the autocomplete functionality in ElasticSearch.

The prefix query

The simplest way of building an autocomplete solution is using a prefix query, which we have already discussed. For example, if we want to suggest country names, we just index them (for example, into the country field) and search like the following:

curl -XGET 'localhost:9200/countries/_search' -d '
{ 
 "query" : {
  "prefix" : {
   "country": "r"
  } 
 } 
}'

This returns every country that starts with the letter r. This is very simple, but not ideal. If you have more data, you will notice that the prefix query is expensive: ElasticSearch has to iterate over the list of indexed terms to find the matching ones, which doesn't scale well for large or open-ended datasets in which values repeat across documents. Fortunately, if we run into performance problems, we can modify this method to use edge ngrams.

Edge ngrams

The prefix query works well, but in order for it to work, ElasticSearch must iterate through the list of terms to find the ones that match the given prefix. The idea for optimizing this is quite simple: since looking up a particular term is much cheaper than iterating over many of them, we can split each term into smaller parts at index time. For example, the word Britain can be stored as a series of terms such as Bri, Brit, Brita, Britai, and Britain. Thanks to this, we can find documents containing the whole word by supplying only a part of that word. You may wonder why we start with three-letter tokens. In real-world scenarios, suggestions for shorter user input are not very useful, because too many of them would be returned.

Let's see a full-index configuration for a simple address book application:

{
  "settings" : {
    "index" : {
      "analysis" : {
        "analyzer" : {
          "autocomplete" : {
            "tokenizer" : "engram",
            "filter" : ["lowercase"]
          }
        },
        "tokenizer" : {
          "engram" : {
            "type" : "edgeNGram",
            "min_gram" : 3,
            "max_gram" : 10
          }
        }
      }
    }
  },
  "mappings" : {
    "contact" : {
      "properties" : {
        "name" : {
          "type" : "string",
          "index_analyzer" : "autocomplete",
          "index" : "analyzed",
          "search_analyzer" : "standard"
        },
        "country" : { "type" : "string" }
      }
    }
  }
}

The mapping for this index contains the name field. This is the field that we'll use to generate suggestions. As you can see, this field has different analyzers defined for indexing and for searching. During indexing, ElasticSearch cuts the input words into edge ngrams, but while searching this is not necessary (and not desired), as the user already provides a part of the field value. Note the engram tokenizer configuration.

There are two options we are currently interested in. They are:

  • min_gram: Tokens shorter than the value of this parameter won't be built. This value directly determines the minimum number of characters a user has to type in order to get suggestions. In our case it's 3.
  • max_gram: The tokenizer won't build tokens longer than the value of this parameter, so this is the maximum number of characters for which suggestions will be available. You can use any reasonable limit; in this example, we assume that no one will need suggestions for longer input because by then the user probably knows what he or she wants to find. (The example following this list shows what these settings produce.)
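
To see what these settings actually produce, we can run the configured analyzer against a sample value with the _analyze API. This is only a quick verification sketch; it assumes that the index from the previous listing was created under the name addressbook:

curl -XGET 'localhost:9200/addressbook/_analyze?analyzer=autocomplete&pretty' -d 'Britain'

The response should contain the tokens bri, brit, brita, britai, and britain, which are exactly the edge ngrams (lowercased by the filter) that end up in the index for this field.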

Now, let's look at how it works. For this example, we can use the simplest form of query, as follows:

curl -XGET 'localhost:9200/addressbook/_search?q=name:joh&pretty'

For query terms shorter than three characters, the search returns no results. For three or more characters, the number of hits equals the number of matching documents, that is, the available suggestions.
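
If you want to reproduce this on an empty index, you first need a document whose name matches the prefix. The following is only an illustrative sketch; the document identifier and the field values are assumptions:

curl -XPOST 'localhost:9200/addressbook/contact/1' -d '
{
 "name" : "John Smith",
 "country" : "United Kingdom"
}'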

If you look again at the specified analyzer, you'll notice that the analyzer used for the name field doesn't split the text into words before changing it into ngrams. Because of that, this version of autocomplete is well suited to situations where we assume that our users will type the field content from its beginning. This is usually the case when autocomplete is used as a tool for quickly choosing a particular value. That works fine for countries, but for an address book and people's names we need to suggest something regardless of whether the user starts to type the first name or the last name. After a few small changes, our mapping will look like the following:

{
  "settings" : {
    "index" : {
      "analysis" : {
        "analyzer" : {
          "autocomplete" : {
            "tokenizer" : "whitespace",
            "filter" : ["lowercase", "engram"]
          }
        },
        "filter" : {
          "engram" : {
            "type" : "edgeNGram",
            "min_gram" : 3,
            "max_gram" : 10
          }
        }
      }
    }
  },
  "mappings" : {
    "contact" : {
      "properties" : {
        "name" : { "type" : "string", "index_analyzer" : "autocomplete", "index" : "analyzed", "search_analyzer" : "standard" },
        "country" : { "type" : "string" }
      }
    }
  }
}

Note the changed fragments. First, the tokenizer was changed to whitespace. This divides our document field's data into words on the basis of whitespace characters. Because there can be only one tokenizer, we moved the engram definition to a filter. Remember that tokenizers and filters are different objects with different roles (a tokenizer takes the input and divides it into tokens, while filters operate on the token stream provided by the tokenizer and can change the tokens), but fortunately ElasticSearch provides edgeNGram both as a tokenizer and as a filter, so the change is simple. The rest is the same as in the previous example, but now we can fetch suggestions for a part of every word in the name field. For example, unlike in the previous example, the following query will find the record with Joseph Heller:

curl -XGET 'localhost:9200/addressbook/_search?q=name:jos&pretty'

The following query finds the record with Joseph Heller as well:

curl -XGET 'localhost:9200/addressbook/_search?q=name:hell&pretty'
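
It is easy to see why both queries match if we check how the new analyzer processes the stored value. Again, this is just a verification sketch, assuming the index was created with the modified mapping shown above:

curl -XGET 'localhost:9200/addressbook/_analyze?analyzer=autocomplete&pretty' -d 'Joseph Heller'

The whitespace tokenizer first splits the value into Joseph and Heller, and the engram filter then produces jos, jose, josep, joseph, hel, hell, helle, and heller, so both jos and hell are present as terms in the index.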

Faceting

The third possible way of implementing the autocomplete functionality is based on faceting. We haven't written about faceting yet, so don't worry if you have no idea how it works. Everything will be explained in the chapter dedicated to faceting (Chapter 6, Beyond Searching). For now, let's assume that faceting is a functionality that allows us to get information about the distribution of a particular value among the documents in the result set. In fact, this solution is an extension of the previous idea: it adds the ability to work with values that repeat across documents, and it is suitable for suggestions based on non-dictionary data. First, let's look at the rewritten index configuration:

{
 "settings" : {
  "index" : {
   "analysis" : {
    "analyzer" : {
     "autocomplete" : {
      "tokenizer" : "whitespace",
      "filter" : ["lowercase", "engram"]
     }
    },
    "filter" : {
     "engram" : {
      "type" : "edgeNGram",
      "min_gram" : 3,
      "max_gram" : 10
     }
    }
   }
  }
 },
 "mappings" : {
  "contact" : {
   "properties" : {
    "name" : { 
     "type" : "multi_field",
     "fields" : {
      "name" : { "type" : "string", "index" : "not_analyzed" },
      "autocomplete" : { "type" : "string", "index_analyzer" : "autocomplete", "index" : "analyzed", "search_analyzer" : "standard" }
     }
    },
    "country" : { "type" : "string" }
   }
  }
 }
}

The only difference from the previous example is the additional not_analyzed field, which we will use as the facet label. This is a common technique for functionalities such as autocomplete: we prepare several forms of one field, where each form has its own use. For example, if we want to search on this field as well, we can add another analyzed copy.
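
For instance, such an additional analyzed copy could be defined as one more sub-field of the multi_field definition. The following fragment is only a sketch; the search sub-field name and its use of the standard analyzer are assumptions, not something required by the example:

"name" : {
 "type" : "multi_field",
 "fields" : {
  "name" : { "type" : "string", "index" : "not_analyzed" },
  "autocomplete" : { "type" : "string", "index_analyzer" : "autocomplete", "index" : "analyzed", "search_analyzer" : "standard" },
  "search" : { "type" : "string", "index" : "analyzed", "analyzer" : "standard" }
 }
}

A regular full-text query could then be run against name.search, while name and name.autocomplete keep serving the facet label and the suggestion matching, respectively.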

Since this time the query will be more complicated, we put it in the facet_query.json file. Its contents are:

{
 "size" : 0,
 "query" : {
  "term" : { "name.autocomplete" : "jos" }
 },
 "facets" : {
  "name" : {
   "terms" : {
    "field" : "name"
   }
  }
 }
}

We are searching for every name starting with jos, exactly as in the previous example. But look at the size parameter: we don't want any documents to be returned. Why? Because all the information we need is in the facets, and the document data would only be extra baggage. Now, let's execute our search by sending the following command:

curl -XGET 'localhost:9200/addressbook/_search?pretty' -d @facet_query.json

You know a little about faceting, so this time we'll show the returned data:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.095891505,
    "hits" : [ ]
  },
  "facets" : {
    "name" : {
      "_type" : "terms",
      "missing" : 0,
      "total" : 1,
      "other" : 0,
      "terms" : [ {
        "term" : "Joseph Heller",
        "count" : 1
      } ]
    }
  }
}

As you can see in the facets section, we have a single suggestion returned. In addition to the suggestion itself, you can also see the count parameter, which holds the information about how many times it appeared in the matched documents. If we had more suggestions, the first 10 of them would be shown as values in the terms array. Why 10, and how can this be changed? This is something you can learn in Chapter 6, Beyond Searching, in the section dedicated to faceting.
