Using suggesters

A long time ago, starting with Elasticsearch 0.90 (released on April 29, 2013), we got the ability to use so-called suggesters. We can define a suggester as a functionality that allows us to correct the user's spelling mistakes and build autocomplete functionality with performance in mind. This section is dedicated to these functionalities and will help you learn about them. We will discuss each available suggester type and show the most common properties that allow us to control them. However, keep in mind that this section is not a comprehensive guide describing each and every property. Describing all the details of the suggesters is a very broad topic and out of the scope of this book. If you want to dig into their functionality, refer to the official Elasticsearch documentation (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters.html) or to the Mastering Elasticsearch Second Edition book published by Packt Publishing.

Available suggester types

These have changed since the initial introduction of the Suggest API to Elasticsearch. We are now able to use four types of suggesters:

  • term: A suggester returning corrections for each word passed to it. Useful for suggestions that are not phrases, such as single term queries.
  • phrase: A suggester working on phrases, returning a proper phrase.
  • completion: A suggester designed to provide fast and efficient autocomplete results.
  • context: An extension to the Suggest API of Elasticsearch that allows us to handle parts of the suggest queries in memory and is thus very effective in terms of performance.

Including suggestions

Let's now try getting suggestions along with the query results. For example, let's use a match_all query and try getting a suggestion for the serlock holnes phrase, which has two terms spelled incorrectly. To do this, we run the following command:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "query" : {
  "match_all" : {}
 },
 "suggest" : {
  "first_suggestion" : {
   "text" : "serlock holnes",
   "term" : {
    "field" : "_all"
   }
  }
 }
}'

As you can see, we've introduced a new section to our query – the suggest one. We've specified the text we want to get corrections for by using the text property. We've specified the suggester we want to use (the term one) and configured it by specifying, in the field property, the name of the field that should be used for building suggestions. first_suggestion is the name we give to our suggester; we need to name it because a single request can contain multiple suggesters. This is the general form of a suggestion request.

If we want to get multiple suggestions for the same text, we can embed our suggestions in the suggest object and place the text property as an option of the suggest object itself. For example, if we want to get suggestions for the serlock holnes text for the title field and for the _all field, we run the following command:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "query" : {
  "match_all" : {}
 },
 "suggest" : {
  "text" : "serlock holnes",
  "first_suggestion" : {
   "term" : {
    "field" : "_all"
   }
  },
  "second_suggestion" : {
   "term" : {
    "field" : "title"
   }
  }
 }
}'

Suggester response

Now let's look at the response of the first query we sent. As you can guess, the response includes both the query results and the suggestions:

{
  "took" : 10,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 1.0,
    "hits" : [ ... ]
  },
  "suggest" : {
    "first_suggestion" : [ {
      "text" : "serlock",
      "offset" : 0,
      "length" : 7,
      "options" : [ {
        "text" : "sherlock",
        "score" : 0.85714287,
        "freq" : 1
      } ]
    }, {
      "text" : "holnes",
      "offset" : 8,
      "length" : 6,
      "options" : [ {
        "text" : "holmes",
        "score" : 0.8333333,
        "freq" : 1
      } ]
    } ]
  }
}

We can see that we got both the search results and the suggestions in the response (we've omitted the hits to make the example more readable). The term suggester returned an array of possible suggestions for each term present in the text parameter. Looking at the data returned for the serlock term, we can see the original word (the text property), its offset in the original text parameter (the offset property), and its length (the length property).

The options array contains suggestions for the given word and will be empty if Elasticsearch doesn't find any. Each entry in this array is a suggestion and is described by the following properties:

  • text: Text of the suggestion.
  • score: Suggestion score; the higher the score, the better the suggestion.
  • freq: Frequency of the suggestion. The frequency represents how many times the word appears in the documents in the index we are running the suggestion query against.
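To see how these properties come together, here is a minimal client-side sketch (plain Python, not an official Elasticsearch client; the response dict mirrors the example shown above) that picks the best correction for each term:

```python
def best_corrections(response, suggestion_name):
    """Return the highest-scoring option for each term, or the
    original term when the options array is empty."""
    corrected = []
    for entry in response["suggest"][suggestion_name]:
        options = entry["options"]
        if options:
            # Pick the option with the highest score.
            best = max(options, key=lambda o: o["score"])
            corrected.append(best["text"])
        else:
            corrected.append(entry["text"])
    return corrected

# The same shape as the response we just examined:
response = {
    "suggest": {
        "first_suggestion": [
            {"text": "serlock", "offset": 0, "length": 7,
             "options": [{"text": "sherlock", "score": 0.85714287, "freq": 1}]},
            {"text": "holnes", "offset": 8, "length": 6,
             "options": [{"text": "holmes", "score": 0.8333333, "freq": 1}]},
        ]
    }
}

print(" ".join(best_corrections(response, "first_suggestion")))
# prints "sherlock holmes"
```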

Term suggester

The term suggester works on the basis of string edit distance. This means that the suggestion requiring the fewest characters to be changed, added, or removed to make it look like the original word is considered the best one. For example, let's take the words worl and work. To change the worl term to work, we need to change the l letter to k, which means an edit distance of 1. The text provided to the suggester is of course analyzed first, and then terms are chosen to be suggested.
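The distance itself can be illustrated with the classic dynamic-programming algorithm. This is a standalone sketch of the concept, not Elasticsearch's internal implementation:

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(edit_distance("worl", "work"))         # 1: change l to k
print(edit_distance("serlock", "sherlock"))  # 1: insert h
print(edit_distance("holnes", "holmes"))     # 1: change n to m
```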

Term suggester configuration options

The most commonly used term suggester options can be used with all the suggester implementations that are based on the term suggester. Currently, these are the phrase suggester and, of course, the base term suggester. The available options are:

  • text: The text we want to get the suggestions for. This parameter is required in order for the suggester to work.
  • field: Another required parameter that we need to provide. The field parameter allows us to set which field the suggestions should be generated for.
  • analyzer: The name of the analyzer which should be used to analyze the text provided in the text parameter. If not set, Elasticsearch utilizes the analyzer used for the field provided by the field parameter.
  • size: Defaults to 5 and specifies the maximum number of suggestions returned for each term provided in the text parameter.
  • suggest_mode: Controls which suggestions will be included and for what terms the suggestions will be returned. The possible options are: missing – the default behavior, which means that the suggester will only provide suggestions for terms that are not present in the index; popular – means that the suggestions will only be returned when they are more frequent than the provided term; and finally always means that suggestions will be returned every time.
  • sort: Allows us to specify how the suggestions are sorted in the result returned by Elasticsearch. By default, it is set to score, which tells Elasticsearch that the suggestions should be sorted by the suggestion score first, the suggestion document frequency next, and finally by the term. The second possible value is frequency, which means that the results are first sorted by the document frequency, then by the score, and finally by the term.
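The two sort orders described above can be sketched as follows; the option values here are made up for illustration:

```python
options = [
    {"text": "holmes", "score": 0.83, "freq": 1},
    {"text": "hones",  "score": 0.83, "freq": 3},
    {"text": "holes",  "score": 0.67, "freq": 5},
]

# sort=score: score first, then document frequency, then the term.
by_score = sorted(options,
                  key=lambda o: (-o["score"], -o["freq"], o["text"]))

# sort=frequency: document frequency first, then score, then the term.
by_freq = sorted(options,
                 key=lambda o: (-o["freq"], -o["score"], o["text"]))

print([o["text"] for o in by_score])  # ['hones', 'holmes', 'holes']
print([o["text"] for o in by_freq])   # ['holes', 'hones', 'holmes']
```

Note how the tie between holmes and hones (equal scores) is broken by their document frequency in the first ordering.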

Additional term suggester options

In addition to the preceding common term suggester options, Elasticsearch allows us to use additional ones that make sense only for the term suggester itself. Some of these options are as follows:

  • lowercase_terms: When set to true, it tells Elasticsearch to lowercase all the terms that are produced from the text field after analysis.
  • max_edits: It defaults to 2 and specifies the maximum edit distance that the suggestion can have to be returned as a term suggestion. Elasticsearch allows us to set this value to 1 or 2.
  • prefix_len: By default, it is set to 1 and specifies how many initial characters of a suggestion must match the beginning of the original term. If we are struggling with suggester performance, increasing this value will improve it, because fewer candidate suggestions will need to be processed.
  • min_word_len: It defaults to 4 and specifies the minimum number of characters a suggestion must have in order to be returned on the suggestions list.
  • shard_size: It defaults to the value specified by the size parameter and allows us to set the maximum number of suggestions that should be read from each shard. Setting this property to values higher than the size parameter can result in more accurate document frequency at the cost of degradation in suggester performance.
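The effect of prefix_len can be sketched as a simple candidate filter. This is a conceptual model, not Elasticsearch internals, and the dictionary below is made up for illustration:

```python
def candidates(term, dictionary, prefix_len):
    """Keep only dictionary words sharing the first prefix_len
    characters with the (possibly misspelled) input term."""
    prefix = term[:prefix_len]
    return [w for w in dictionary if w.startswith(prefix)]

dictionary = ["sherlock", "sheep", "serpent", "series", "stock"]

# With the default prefix_len of 1, every s-word must be scored:
print(candidates("serlock", dictionary, 1))
# ['sherlock', 'sheep', 'serpent', 'series', 'stock']

# With prefix_len of 3, far fewer candidates remain:
print(candidates("serlock", dictionary, 3))
# ['serpent', 'series']
```

Note that with the longer required prefix the correct sherlock candidate is filtered out: raising prefix_len trades the ability to correct mistakes in the first characters of a term for performance.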

    Note

    The provided list of parameters does not contain all the options that are available for the term suggester. Refer to the official Elasticsearch documentation for reference, at https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-term.html.

Phrase suggester

The term suggester provides a great way to correct user spelling mistakes on a per-term basis, but it is not great for phrases. That's why the phrase suggester was introduced. It is built on top of the term suggester, but adds additional phrase calculation logic to it.

Let's start with an example of how to use the phrase suggester. This time we will omit the query section in our query. We do that by running the following command:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "suggest" : {
  "text" : "sherlock holnes",
  "our_suggestion" : {
   "phrase" : { "field" : "_all" }
  }
 }
}'

As you can see, the preceding command is almost the same as the one we sent when using the term suggester, but instead of specifying the term suggester type we've specified the phrase type. The response to the preceding command is as follows:

{
  "took" : 24,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 1.0,
    "hits" : [ ... ]
  },
  "suggest" : {
    "our_suggestion" : [ {
      "text" : "sherlock holnes",
      "offset" : 0,
      "length" : 15,
      "options" : [ {
        "text" : "sherlock holmes",
        "score" : 0.12227806
      } ]
    } ]
  }
}

As you can see, the response is very similar to the one returned by the term suggester, but instead of single words, the suggestion is already combined and returned as a whole phrase.

Configuration

Because the phrase suggester is based on the term suggester, it can also use some of the configuration options provided by it. Those options are: text, size, analyzer, and shard_size. In addition to the mentioned properties, the phrase suggester exposes additional options. Some of these options are:

  • max_errors: Specifies the maximum number (or percentage) of terms that can be erroneous in order for a correction to be created. The value of this property can be either an integer, such as 1, or a float between 0 and 1, which will be treated as a percentage. By default, it is set to 1, which means that at most a single term can be misspelled in a given correction.
  • separator: Defaults to a whitespace character and specifies the separator that will be used to divide the terms in the resulting bigram field.

    Note

    The provided list of parameters does not contain all the options that are available for the phrase suggester. In fact, the list is way more extensive than what we've provided. Refer to the official Elasticsearch documentation for reference, at https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-phrase.html, or to Mastering Elasticsearch Second Edition published by Packt Publishing.

Completion suggester

The completion suggester allows us to create autocomplete functionality in a very performance-effective way, because it stores the complicated structures in the index instead of calculating them during query time. We need to prepare Elasticsearch for that by using a dedicated field type called completion. Let's assume that we want to create an autocomplete feature that shows book authors. In addition to the author's name, we want to return the identifiers of the books she/he wrote. We start with creating the authors index by running the following command:

curl -XPOST 'localhost:9200/authors' -d '{
 "mappings" : {
  "author" : {
   "properties" : {
    "name" : { "type" : "string" },
    "ac" : {
     "type" : "completion",
     "payloads" : true,
     "analyzer" : "standard",
     "search_analyzer" : "standard"
    }
   }
  }
 }
}'

Our index will contain a single type called author. Each document will have two fields: the name field and the ac field, which is the field we will use for autocomplete. We've defined the ac field using the completion type. In addition to that, we've used the standard analyzer for both index and query time. The last thing is the payload – the additional, optional information we will return along with the suggestion; in our case it will be an array of book identifiers.

Indexing data

To index the data, we need to provide some additional information along with the data we usually provide during indexing. Let's look at the following commands, which index two documents describing authors:

curl -XPOST 'localhost:9200/authors/author/1' -d '{
 "name" : "Fyodor Dostoevsky",
 "ac" : {
  "input" : [ "fyodor", "dostoevsky" ],
  "output" : "Fyodor Dostoevsky",
  "payload" : { "books" : [ "123456", "123457" ] }
 }
}'
curl -XPOST 'localhost:9200/authors/author/2' -d '{
 "name" : "Joseph Conrad",
 "ac" : {
  "input" : [ "joseph", "conrad" ],
  "output" : "Joseph Conrad",
  "payload" : { "books" : [ "121211" ] }
 }
}'

Note the structure of the data for the ac field. We have provided the input, output, and payload properties. The optional payload property is used to provide the additional information that will be returned. The input property is used to provide the input information that will be used for building the completion used by the suggester. It will be used for user input matching. The optional output property is used to tell the suggester which data should be returned for the document.

We can also omit the additional parameters section and index data in the way we are used to, just like in the following example:

curl -XPOST 'localhost:9200/authors/author/1' -d '{
 "name" : "Fyodor Dostoevsky",
 "ac" : "Fyodor Dostoevsky"
}'

However, because the completion suggester uses an FST under the hood, we won't be able to find the preceding document when the user starts typing from the second part of the ac field value. That's why we think that indexing the data the way we showed first is more convenient, because we can explicitly control what we want to match and what we want to show as the output.
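The difference between the two indexing approaches can be modeled with plain prefix matching (a real FST is far more compact, but behaves the same for this purpose; the data below mirrors our author examples):

```python
def complete(user_input, indexed):
    """indexed maps each indexed input value (lowercased, as an
    analyzer would produce it) to the output shown to the user."""
    return sorted({output
                   for value, output in indexed.items()
                   if value.startswith(user_input.lower())})

# Indexed with explicit input values, as in the first example:
explicit = {"fyodor": "Fyodor Dostoevsky",
            "dostoevsky": "Fyodor Dostoevsky"}

# Indexed as a plain string, as in the second example:
plain = {"fyodor dostoevsky": "Fyodor Dostoevsky"}

print(complete("dost", explicit))  # ['Fyodor Dostoevsky']
print(complete("dost", plain))     # [] - no match from the second word
print(complete("fyo", plain))      # ['Fyodor Dostoevsky']
```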

Querying indexed completion suggester data

If we want to find documents that have authors starting with fyo, we run the following command:

curl -XGET 'localhost:9200/authors/_suggest?pretty' -d '{
 "authorsAutocomplete" : {
  "text" : "fyo",
  "completion" : {
   "field" : "ac"
  }
 }
}'

Before we look at the results, let's discuss the query. As you can see, we've sent the command to the _suggest endpoint, because we don't want to run a standard query; we are just interested in the autocomplete results. The query is quite simple. We set its name to authorsAutocomplete, we set the text we want to get the completion for (the text property), and we added the completion object with the configuration in it. The result of the preceding command looks as follows:

{
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "authorsAutocomplete" : [ {
    "text" : "fyo",
    "offset" : 0,
    "length" : 3,
    "options" : [ {
      "text" : "Fyodor Dostoevsky",
      "score" : 1.0,
      "payload" : {
        "books" : [ "123456", "123457" ]
      }
    } ]
  } ]
}

As you can see in the response, we get the document we were looking for along with the payload information, if it is available (for the preceding response, it is).

We can also use fuzzy searches, which allow us to tolerate spelling mistakes. We do that by including the additional fuzzy section in our query. For example, to enable fuzzy matching in the completion suggester and set the maximum edit distance to 2 (which means that a maximum of two errors are allowed), we send the following query:

curl -XGET 'localhost:9200/authors/_suggest?pretty' -d '{
 "authorsAutocomplete" : {
  "text" : "fio",
  "completion" : {
   "field" : "ac",
   "fuzzy" : {
     "fuzziness" : 2
   }
  }
 }
}'

Although we've made a spelling mistake, we will still get the same results as we got earlier.

Custom weights

By default, the term frequency is used to determine the weight of the document returned by the completion suggester. However, this may not be the best solution. In such cases, it is useful to define the weight of the suggestion by specifying the weight property for the field defined as completion. The weight property should be set to an integer value. The higher the weight property value, the more important the suggestion. For example, if we want to specify a weight for the first document in our example, we run the following command:

curl -XPOST 'localhost:9200/authors/author/1' -d '{
 "name" : "Fyodor Dostoevsky",
 "ac" : {
  "input" : [ "fyodor", "dostoevsky" ],
  "output" : "Fyodor Dostoevsky",
  "payload" : { "books" : [ "123456", "123457" ] },
  "weight" : 30
 }
}'

Now if we run our example query, the results will be as follows:

{
  ...
  "authorsAutocomplete" : [ {
    "text" : "fyo",
    "offset" : 0,
    "length" : 3,
    "options" : [ {
      "text" : "Fyodor Dostoevsky",
      "score" : 30.0, 
      "payload":{
        "books":["123456","123457"]
      }
    } ]
  } ]
}

Look how the score of the result changed. In our initial example it was 1.0, and now it is 30.0, because we set the weight parameter to 30 during indexing.
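The resulting ordering can be sketched as a simple descending sort on the weight; the authors and weights below are made up for illustration:

```python
suggestions = [
    {"text": "Fyodor Dostoevsky", "weight": 30},
    {"text": "Joseph Conrad",     "weight": 1},
    {"text": "Franz Kafka",       "weight": 10},
]

# A higher weight means a more important suggestion,
# so rank in descending weight order.
ranked = sorted(suggestions, key=lambda s: s["weight"], reverse=True)
print([s["text"] for s in ranked])
# ['Fyodor Dostoevsky', 'Franz Kafka', 'Joseph Conrad']
```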

Context suggester

The context suggester is an extension to the Elasticsearch Suggest API that we just discussed, available in Elasticsearch 2.1 and older versions. When describing the completion suggester, we mentioned that it allows us to handle suggester-related searches entirely in memory. Using the context suggester, we can define a so-called context for the query that will limit the suggestions to a subset of documents. Because we define the context in the mappings, it is calculated during indexing, which makes query-time calculations easier and less demanding in terms of performance.

Note

Remember that this section relates to Elasticsearch 2.1. Contexts in Elasticsearch 2.2 are handled differently and were discussed along with the completion suggester.

Context types

Elasticsearch 2.1 supports two types of context: category and geo. The category context allows us to assign a document to one or more categories during index time. Later, at query time, we can tell Elasticsearch which categories we are interested in and Elasticsearch will limit the suggestions to those categories. The geo context allows us to limit the documents returned by the suggesters to a given location or to a certain distance from a point. The nice thing about contexts is that we can have more than one; for example, we can have both a category context and a geo context for the same document. Let's now see what we need to do to use contexts in suggestions.

Using context

Using the geo and category context is very similar – they just differ in parameters. We will show you how to use contexts in an example using the simpler category context and later we will get back to the geo context and show you what we need to provide.

The first step when using the context suggester is creating a proper mapping. Let's get back to our author mapping, but this time let's assume that each author can be given one or more categories – the brand of books she/he writes. This will be our context. The mappings using the context look as follows:

curl -XPOST 'localhost:9200/authors_context' -d '{
 "mappings" : {
  "author" : {
   "properties" : {
    "name" : { "type" : "string" },
    "ac" : {
     "type" : "completion",
     "analyzer" : "simple",
     "search_analyzer" : "simple",
     "context" : {
      "brand" : {
       "type" : "category",
       "default" : [ "none" ]
      }
     }
    }
   }
  }
 }
}'

We've introduced a new section in our ac field definition: context. Each context is given a name, which is brand in our case, and inside that object we provide configuration. We need to provide the type using the type property – we will be using the category context suggester now. In addition to that, we've set the default array, which provides us with the value or values that should be used as the default context. If we want, we can also provide the path property, which will point Elasticsearch to a field in the documents from which the context value should be taken.

We can now index a single author by modifying the commands we used earlier, because we need to provide the context:

curl -XPOST 'localhost:9200/authors_context/author/1' -d '{
 "name" : "Fyodor Dostoevsky",
 "ac" : {
  "input" : "Fyodor Dostoevsky",
  "context" : {
   "brand" : "drama"
  }
 }
}'

As you can see, the ac field definition is a bit different now; it is an object. The input property is used to provide the value for autocomplete and the context object is used to provide the values for each of the contexts defined in the mappings.

Finally, we can query the data. As you could imagine, we will again provide the context we are interested in. The query that does that looks as follows:

curl -XGET 'localhost:9200/authors_context/_suggest?pretty' -d '{
 "authorsAutocomplete" : {
  "text" : "fyo",
  "completion" : {
   "field" : "ac",
   "context" : {
    "brand" : "drama"
   }
  }
 }
}'

As you can see, we've included the context object in the query inside the completion section and we've set the context we are interested in using the context name. The response returned by Elasticsearch is as follows:

{
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "authorsAutocomplete" : [ {
    "text" : "fyo",
    "offset" : 0,
    "length" : 3,
    "options" : [ {
      "text" : "Fyodor Dostoevsky",
      "score" : 1.0
    } ]
  } ]
}

However, if we change the brand context to comedy, for example, Elasticsearch will return no results, because we don't have authors with such a context. Let's test it by running the following query:

curl -XGET 'localhost:9200/authors_context/_suggest?pretty' -d '{
 "authorsAutocomplete" : {
  "text" : "fyo",
  "completion" : {
   "field" : "ac",
   "context" : {
    "brand" : "comedy"
   }
  }
 }
}'

This time Elasticsearch returns the following response:

{
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "authorsAutocomplete" : [ {
    "text" : "fyo",
    "offset" : 0,
    "length" : 3,
    "options" : [ ]
  } ]
}

This is because no author with the brand context and the value of comedy is present in the authors_context index.

Using the geo location context

The geo context is similar to the category context when it comes to using it. However, instead of filtering by terms, we filter using geographical points and distances. When we use the geo context, we need to provide precision, which defines the precision of the calculated geohash. The second property that we provide is the neighbors one, which can be set to true or false. By default, it is set to true, which means that the neighboring geohashes will be included in the context.

In addition to that, similar to the category context, we can provide path, which specifies which field to use as the lookup for the geographical point, and the default property, specifying the default geopoint for the documents.

For example, let's assume that we want to filter on the birth place of our authors. The mappings for such a suggester will look as follows:

curl -XPOST 'localhost:9200/authors_geo_context' -d '{
 "mappings" : {
  "author" : {
    "properties" : {
    "name" : { "type" : "string" },
    "ac" : {
     "type" : "completion",
     "analyzer" : "simple",
     "search_analyzer" : "simple",
     "context" : {
      "birth_location" : {
       "type" : "geo",
       "precision" : [ "1000km" ],
       "neighbors" : true,
       "default" : {
        "lat" : 0.0,
        "lon" : 0.0
       }
      }
     }
    }
   }
  }
 }
}'

Now we can index the documents and provide the birth location. For our example author, it will look as follows (the centre of Moscow):

curl -XPOST 'localhost:9200/authors_geo_context/author/1' -d '{
 "name" : "Fyodor Dostoevsky",
 "ac" : {
  "input" : "Fyodor Dostoevsky",
  "context" : {
   "birth_location" : {
    "lat" : 55.75,
    "lon" : 37.61
   }
  }
 }
}'

As you can see, we've provided the birth_location context for our author.

Now, during query time, we need to provide the context we are interested in, and we can (but are not obligated to) provide the precision as a subset of the precision values provided in the mappings. We've set the precision to 1000 km, so let's find all the authors starting with fyo that were born in Kazan, which is about 800 km from Moscow. We should find our example author.
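We can sanity-check these distances with the standard haversine formula (plain Python, not Elasticsearch code; 6371 km is the usual mean Earth radius):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    h = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

moscow = (55.75, 37.61)  # where our author document was indexed
kazan = (55.45, 49.8)

print(round(haversine_km(*moscow, *kazan)))  # roughly 770 km, inside 1000 km
print(haversine_km(*moscow, 0.0, 0.0) > 1000)  # True: far outside the context
```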

The query that does that looks as follows:

curl -XGET 'localhost:9200/authors_geo_context/_suggest?pretty' -d '{
 "authorsAutocomplete" : {
  "text" : "fyo",
  "completion" : {
   "field" : "ac",
   "context" : {
    "birth_location" : {
     "lat" : 55.45,
     "lon" : 49.8
    }
   }
  }
 }
}'

The response returned by Elasticsearch looks as follows:

{
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "authorsAutocomplete" : [ {
    "text" : "fyo",
    "offset" : 0,
    "length" : 3,
    "options" : [ {
      "text" : "Fyodor Dostoevsky",
      "score" : 1.0
    } ]
  } ]
}

However, if we run the same query but point to a location far away from Moscow (a latitude and longitude of 0, in the Gulf of Guinea), we will get no results:

curl -XGET 'localhost:9200/authors_geo_context/_suggest?pretty' -d '{
 "authorsAutocomplete" : {
  "text" : "fyo",
  "completion" : {
   "field" : "ac",
   "context" : {
    "birth_location" : {
     "lat" : 0.0,
     "lon" : 0.0
    }
   }
  }
 }
}'

The following is the response from Elasticsearch in this case:

{
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "authorsAutocomplete" : [ {
    "text" : "fyo",
    "offset" : 0,
    "length" : 3,
    "options" : [ ]
  } ]
}