The words having the same meaning

You may have heard of synonyms: words that have the same or a similar meaning. Sometimes you want several words to match when any one of them is entered into the search box. Let's recall our sample data from Chapter 2, Searching Your Data; there was a book called "Crime and Punishment". What if we want that book to be matched not only when the words crime or punishment are used, but also when words like criminality and abuse are used? However silly it may sound, let's use that example to see how synonyms can be used in ElasticSearch.

Synonym filter

To use the synonym filter, we need to define our own analyzer (please refer to Chapter 1, Getting Started with ElasticSearch Cluster, to see how to do that). Our analyzer will be called synonym and will use the whitespace tokenizer and a single filter, also called synonym. The filter's type property needs to be set to synonym, which tells ElasticSearch that this is a synonym filter. In addition, we want to ignore case so that uppercase and lowercase synonyms are treated equally (set the ignore_case property to true). So, to define our custom synonym analyzer that uses a synonym filter, we need the following definition:

{
  "index" : {
    "analysis" : {
      "analyzer" : {
        "synonym" : {
          "tokenizer" : "whitespace",
          "filter" : [
            "synonym"
          ]
        }
      },
      "filter" : {
        "synonym" : {
          "type" : "synonym",
          "ignore_case" : true,
          "synonyms" : [
            "crime => criminality"
          ]
        }
      }
    }
  }
}
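
As a minimal sketch, the same definition can be assembled and serialized in Python before it is sent to ElasticSearch at index creation. Only the standard library is used and nothing is actually sent; the helper name build_synonym_settings, the index name library, and the URL in the comment are assumptions made for illustration.

```python
import json


def build_synonym_settings(rules):
    """Return index settings defining the 'synonym' analyzer and filter."""
    return {
        "index": {
            "analysis": {
                "analyzer": {
                    "synonym": {
                        "tokenizer": "whitespace",
                        "filter": ["synonym"],
                    }
                },
                "filter": {
                    "synonym": {
                        "type": "synonym",
                        "ignore_case": True,
                        "synonyms": list(rules),
                    }
                },
            }
        }
    }


settings = build_synonym_settings(["crime => criminality"])

# Wrap the settings in the body of an index-creation request.
# The body could then be sent over HTTP, for example to
# http://localhost:9200/library (hypothetical host and index name).
body = json.dumps({"settings": settings}, indent=2)
print(body)
```

Keeping the rules as a function argument makes it easy to generate the same analyzer with a different rule list.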

Synonyms in mappings

In the preceding definition, we specified the synonym rule directly in the mappings we send to ElasticSearch. To do that, we add the synonyms property, which is an array of synonym rules, for example:

"synonyms" : [
  "crime => criminality"
]

We will discuss defining the synonym rules in just a second.

Synonyms in files

ElasticSearch allows us to use file-based synonyms. To use a file, we need to specify the synonyms_path property instead of the synonyms one. The synonyms_path property should be set to the name of the file that holds the synonym definitions; the path is relative to the ElasticSearch config directory. So, if we store our synonyms in the synonyms.txt file and save that file in the config directory, we should set synonyms_path to synonyms.txt.

For example, this is how the synonym filter (the one from the preceding mappings) will look if we want to use the synonyms stored in a file:

"filter" : {
  "synonym" : {
    "type" : "synonym",
    "synonyms_path" : "synonyms.txt"
  }
}
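
The following Python sketch shows one way such a file could be produced, with one rule per line, together with the matching filter definition. A temporary directory stands in for the ElasticSearch config directory, and the helper name write_synonyms_file is an assumption made for illustration.

```python
import os
import tempfile


def write_synonyms_file(config_dir, rules, name="synonyms.txt"):
    """Write one synonym rule per line and return the relative file name."""
    path = os.path.join(config_dir, name)
    with open(path, "w", encoding="utf-8") as handle:
        for rule in rules:
            handle.write(rule + "\n")
    return name


# Assumption: a temporary directory plays the role of the config directory.
config_dir = tempfile.mkdtemp()
relative_path = write_synonyms_file(
    config_dir, ["criminality => crime", "abuse => punishment"]
)

# The filter refers to the file by its path relative to the config directory.
synonym_filter = {
    "type": "synonym",
    "synonyms_path": relative_path,
}
print(synonym_filter)
```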

Defining synonym rules

So far, we have discussed what we have to do in order to use synonym expansion in ElasticSearch. Now, let's see what synonym formats can be used.

Using Apache Solr synonyms

The most common synonym structure in the Apache Lucene world is probably the one used by Apache Solr, the search engine built on top of Lucene, just like ElasticSearch. This is the default way of handling synonyms in ElasticSearch, and the possible ways of defining a new synonym are discussed in the following sections.

Explicit synonyms

A simple mapping allows us to map a list of words into other words. So, in our case, if we want the word criminality to be mapped to crime and the word abuse to be mapped to punishment, we need to define the following entries:

criminality => crime
abuse => punishment

Of course, a single word can be mapped into multiple ones and multiple ones can be mapped into a single one, for example:

star wars, wars => starwars

The preceding example means that star wars and wars will be changed to starwars by the synonym filter.
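
To make the effect of explicit rules concrete, here is a simplified Python model of how they rewrite a stream of whitespace-separated tokens. It is only an illustration under stated assumptions, not the algorithm Lucene uses: token positions are not modelled, the longest matching phrase wins, and the function names parse_explicit_rules and apply_rules are invented for this sketch.

```python
def parse_explicit_rules(lines):
    """Parse 'a, b => c' lines into {source phrase (tuple): [target tokens]}."""
    rules = {}
    for line in lines:
        left, right = line.split("=>")
        targets = []
        for phrase in right.split(","):
            # Simplification: alternatives are listed one after another,
            # whereas a real filter would place them at the same position.
            targets.extend(phrase.split())
        for phrase in left.split(","):
            key = tuple(phrase.lower().split())
            if key:
                rules[key] = targets
    return rules


def apply_rules(tokens, rules):
    """Scan left to right, replacing the longest matching phrase each time."""
    longest = max((len(key) for key in rules), default=1)
    output, i = [], 0
    while i < len(tokens):
        for size in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(token.lower() for token in tokens[i:i + size])
            if key in rules:
                output.extend(rules[key])
                i += size
                break
        else:
            # No rule matched at this position; keep the token unchanged.
            output.append(tokens[i])
            i += 1
    return output


rules = parse_explicit_rules([
    "criminality => crime",
    "abuse => punishment",
    "star wars, wars => starwars",
])
print(apply_rules("Criminality and abuse".split(), rules))
# ['crime', 'and', 'punishment']
print(apply_rules("star wars".split(), rules))
# ['starwars']
```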

Equivalent synonyms

In addition to the explicit mapping, ElasticSearch allows us to use equivalent synonyms. For example, the following definition will make all the words exchangeable so that you can use any of them to match a document that has one of them in its contents:

star, wars, star wars, starwars

Expanding synonyms

A synonym filter allows us to use one additional property when it comes to Apache Solr format synonyms: the expand property. When this is set to true (which is the default), all synonyms are expanded by ElasticSearch to all equivalent forms; when it is set to false, every word in the list is reduced to the first one. For example, let's say we have the following filter configuration:

"filter" : {
  "synonym" : {
    "type" : "synonym",
    "expand": false,
    "synonyms" : [
      "one, two, three"
    ] 
  }
}

ElasticSearch will map the preceding synonym definition to the following:

one, two, three => one

This means that the words one, two, and three will be changed to one. However, if we set the expand property to true, the same synonym definition will be interpreted in the following way:

one, two, three => one, two, three

This means that each of the words on the left side of the definition will be expanded to all the words on the right side.
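
The two interpretations can be sketched in a few lines of Python. This is a simplified model of the behaviour described above, not ElasticSearch code, and the function name expand_equivalents is an assumption made for illustration.

```python
def expand_equivalents(definition, expand):
    """Turn an equivalent-synonym list 'a, b, c' into explicit rules.

    Returns a dict mapping each source word to its list of target words.
    """
    words = [word.strip() for word in definition.split(",") if word.strip()]
    # expand=True: every word maps to all words.
    # expand=False: every word maps to the first word only.
    targets = words if expand else words[:1]
    return {word: list(targets) for word in words}


print(expand_equivalents("one, two, three", expand=False))
# {'one': ['one'], 'two': ['one'], 'three': ['one']}
print(expand_equivalents("one, two, three", expand=True))
# {'one': ['one', 'two', 'three'], 'two': ['one', 'two', 'three'], 'three': ['one', 'two', 'three']}
```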

Using WordNet synonyms

If we want to use WordNet-structured synonyms (to learn more about WordNet, please visit http://wordnet.princeton.edu/), we need to provide an additional property for our synonym filter. The property name is format, and we should set its value to wordnet in order for ElasticSearch to understand that format.
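
As a rough illustration, the Python sketch below builds such a filter definition and shows what the WordNet lines mean. It assumes the WordNet Prolog layout s(synset_id, w_num, 'word', ss_type, sense_number, tag_count), in which words sharing a synset ID are synonyms. The synset IDs are illustrative values rather than real WordNet entries, and group_by_synset is a helper invented for this sketch.

```python
import re
from collections import defaultdict

# One WordNet Prolog entry: s(synset_id,w_num,'word',ss_type,sense_number,tag_count).
LINE = re.compile(r"s\((\d+),(\d+),'(.*)',([a-z]),(\d+),(\d+)\)\.")


def group_by_synset(lines):
    """Group the words of WordNet Prolog lines by their synset ID."""
    groups = defaultdict(list)
    for line in lines:
        match = LINE.fullmatch(line.strip())
        if match:
            synset_id, word = match.group(1), match.group(3)
            # Prolog escapes an apostrophe inside a word by doubling it.
            groups[synset_id].append(word.replace("''", "'"))
    return dict(groups)


# Illustrative entries; the IDs are not taken from the real WordNet database.
wordnet_rules = [
    "s(100000001,1,'abstain',v,1,0).",
    "s(100000001,2,'refrain',v,1,0).",
    "s(100000001,3,'desist',v,1,0).",
]

synonym_filter = {
    "type": "synonym",
    "format": "wordnet",
    "synonyms": wordnet_rules,
}

print(group_by_synset(wordnet_rules))
# {'100000001': ['abstain', 'refrain', 'desist']}
```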

Query- or index-time synonym expansion

As with any analyzer, you may wonder when the synonym filter should be used: during indexing, during querying, or during both. Of course, it depends on your needs. However, please remember that using index-time synonyms requires re-indexing the data after each synonym change, because the synonyms need to be reapplied to all the documents. If we use only query-time synonyms, we can update the synonym lists and have them applied without re-indexing (for example, after updating the mappings, which we will talk about later in this book).
