Searching content in different languages

Until now, when discussing language analysis, we've talked mostly about theory. We haven't yet seen a practical example of language analysis or of handling data that can be written in multiple languages. This is about to change, as this section is dedicated to showing how we can handle data in multiple languages.

Handling languages differently

As you already know, Elasticsearch allows us to choose different analyzers for our data. We can have our data divided on the basis of whitespaces, have it lowercased, and so on. This can usually be done regardless of the language; the same whitespace-based tokenization will work for English, German, and Polish, although it won't work for Chinese. However, what if you want to find documents that contain words such as cat and cats by only sending the word cat to Elasticsearch? This is where language analysis comes into play, with stemming algorithms for different languages, which allow the analyzed words to be reduced to their root forms. And now the worst part: we can't use one general stemming algorithm for all the languages in the world; we have to choose the appropriate algorithm for each language. The following sections in this chapter will help you with some parts of the language analysis process.
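
For example, we can see the English stemmer in action using the _analyze API (a quick check, assuming a locally running Elasticsearch instance):

curl 'localhost:9200/_analyze?analyzer=english&text=cats&pretty'

The only token returned is cat, which is why a query for cat is able to match a document containing cats.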

Handling multiple languages

There are a few ways of handling multiple languages in Elasticsearch, and all of them have some pros and cons. We won't discuss all of them, but just to give you an idea, a few of those methods are as follows:

  • Storing documents in different languages as different types
  • Storing documents in different languages in separate indices
  • Storing language data in different fields of a single document

For the purpose of this book, we will focus on a single method: the one that allows us to store documents in different languages in a single index. We will focus on a problem where we have a single type of document, but each document may come from anywhere in the world and can therefore be written in any of a number of languages. We would also like to let our users use all the analysis capabilities, such as stemming and stop words, for different languages, not only for English.

Note

Note that stemming algorithms perform differently for different languages, both in terms of analysis performance and the resulting terms. For example, English stemmers are very good, but you can run into issues with other European languages, such as German.

Detecting the language of the document

Before we continue with showing you how to solve the problem of handling multiple languages in Elasticsearch, we would like to tell you about one additional thing: language detection. There are situations where you just don't know what language your document or query is in. In such cases, language detection libraries may be a good choice, especially when using Java as your programming language of choice. Some of those libraries are as follows:

  • Apache Tika (http://tika.apache.org/)
  • Language detection (https://github.com/shuyo/language-detection)

The language detection library claims to have over 99 percent precision for 53 languages; that's a lot if you ask us.

You should remember, though, that language detection is more precise for longer texts. Because query texts are usually short, you can expect some degree of error when identifying the query language.

Sample document

Let's start by introducing a sample document, which is as follows:

{
     "title" : "First test document",
     "content" : "This is a test document"
}

As you can see, the document is pretty simple; it contains the following two fields:

  • title: This field holds the title of the document
  • content: This field holds the actual content of the document

This document is quite simple but, from the search point of view, the information about the document language is missing. What we should do is enrich the document by adding the needed information. We can do that by using one of the previously mentioned libraries, which will try to detect the language.

After we have the language detected, we inform Elasticsearch which analyzer should be used and modify the document to directly show the language of each field. Each of the fields would have to be analyzed by a language analyzer dedicated to the detected language.

Note

A full list of these language analyzers can be found at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html.

If a document is written in a language that we don't support, we will just fall back to a default field with the default analyzer. For example, our document, processed and prepared for indexing, could look like this:

{
     "title_english" : "First test document",
     "content_english" : "This is a test document"
}

The thing is that all the processing we've mentioned would have to be done outside Elasticsearch, or in some kind of custom plugin that would implement the described logic.

Note

In previous versions of Elasticsearch, it was possible to choose an analyzer on the basis of the value of an additional field that contained the analyzer name. This was a more convenient and elegant way, but it introduced some uncertainty about the field contents: you always had to deliver the proper analyzer when using the given field, or strange things happened. The Elasticsearch team made the difficult decision to remove this feature.

There is also a simpler way: we can take our first document and index it in several ways, independently of the input language. Let's focus on this solution.

The mappings

To handle our solution, which will process the document using several defined languages, we need new mappings. Let's look at the mappings we've created to index our documents (we've stored them in the mappings.json file):

{
  "mappings" : {
    "doc" : {
      "properties" : {
        "title" : {
          "type" : "string",
          "index" : "analyzed",
          "fields" : {
            "english" : {
              "type" : "string",
              "index" : "analyzed",
              "analyzer" : "english"
            },
            "russian" : {
              "type" : "string",
              "index" : "analyzed",
              "analyzer" : "russian"
            },
            "german" : {
              "type" : "string",
              "index" : "analyzed",
              "analyzer" : "german"
            }
          }
        },
        "content" : {
          "type" : "string",
          "index" : "analyzed",
          "fields" : {
            "english" : {
              "type" : "string",
              "index" : "analyzed",
              "analyzer" : "english"
            },
            "russian" : {
              "type" : "string",
              "index" : "analyzed",
              "analyzer" : "russian"
            },
            "german" : {
              "type" : "string",
              "index" : "analyzed",
              "analyzer" : "german"
            }
          }
        }
      }
    }
  }
}

In the preceding mappings, we've shown the definition for the title and content fields (if you are not familiar with any aspect of the mappings definition, refer to the Mappings configuration section of Chapter 2, Indexing Your Data). We have used the multi-field feature of Elasticsearch: each field can be indexed in several ways using various language analyzers (in our example, the English, Russian, and German analyzers).

In addition, the base field uses the default analyzer, which we may use at query time when the language is unknown. So, each field will actually be indexed in four ways: with the default analyzer and with the three language-oriented ones.

In order to create a sample index called docs that uses our mappings, we will use the following command:

curl -XPUT 'localhost:9200/docs' -d @mappings.json
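
If we want to verify which analyzer a given sub-field uses, we can ask the index to analyze sample text against that field (a quick check, assuming the index was created with the preceding command):

curl 'localhost:9200/docs/_analyze?field=content.english&text=documents&pretty'

This command should return the single term document, while running it against the content field returns the unstemmed term documents.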

Querying

Now let's see how we can query our data to use the newly created language fields. We can divide the querying situation into two different cases. Of course, to start querying we need documents. Let's index our example document by running the following command:

curl -XPOST 'localhost:9200/docs/doc/1' -d '{"title" : "First test document","content" : "This is a test document"}'

Queries with an identified language

The first case is when we have our query language identified. Let's assume that the identified language is English. In such cases, our query is as follows:

curl 'localhost:9200/docs/_search?pretty' -d '{
  "query" : {
    "match" : {
      "content.english" : "documents"
    }
  }
}'

The things to put emphasis on in the preceding query are the field used for querying and the query type. The field used is content.english, which also indicates which analyzer we want to use. We used that field because we had identified our language before running the query. Thanks to this, the English analyzer can find our document even though the document contains only the singular form of the word. The response returned by Elasticsearch will be as follows:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "docs",
      "_type" : "doc",
      "_id" : "1",
      "_score" : 0.19178301,
      "_source": {
        "title" : "First test document",
        "content" : "This is a test document"
      }
    } ]
  }
}

The other thing to note is the query type: the match query. We used the match query because it analyzes its body with the analyzer used by the field it is run against. We need that to properly match the data in the query with the data in the index.
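
To illustrate this, we can compare the match query with the term query, which doesn't analyze its input (a hypothetical counter-example):

curl 'localhost:9200/docs/_search?pretty' -d '{
  "query" : {
    "term" : {
      "content.english" : "documents"
    }
  }
}'

This query returns no results; the literal term documents was never written to the index, because the english analyzer stored the stemmed form document instead.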

Queries with an unknown language

Now let's look at the second situation: handling queries when we couldn't identify the language of the query. In such cases, we can't use a field name pointing to one of the languages, such as content.german. Instead, we use the default analyzer by sending the query to the base content field. The query will look as follows:

curl 'localhost:9200/docs/_search?pretty' -d '{
  "query" : {
    "match" : {
      "content" : "documents"
    }
  }
}'

However, this time we didn't get any results, because the default analyzer doesn't stem: the plural form documents in our query doesn't match the singular form document stored in the index.
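
We can verify that the default field still works for exact terms by searching for the singular form instead (a quick sanity check):

curl 'localhost:9200/docs/_search?pretty' -d '{
  "query" : {
    "match" : {
      "content" : "document"
    }
  }
}'

This time the document is found, because the term document is exactly what the default analyzer wrote to the index.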

Combining queries

To additionally boost the documents that perfectly match our default analyzer, we can combine the two preceding queries with the bool query. Such a combined query will look as follows:

curl -XGET 'localhost:9200/docs/_search?pretty=true' -d '{
  "query" : {
    "bool" : {
      "minimum_should_match" : 1,
      "should" : [
        {
          "match" : {
            "content.english" : "documents"
          }
        },
        {
          "match" : {
            "content" : "documents"
          }
        }
      ]
    }
  }
}'

For the document to be returned, at least one of the defined queries must match. If they both match, the document will have a higher score value and will be placed higher in the results.

There is one additional advantage to the preceding combined query. If our language analyzer doesn't find a document (for example, when the analysis is different from the one used during indexing), the second query has a chance to find the terms that are only tokenized on whitespace characters and lowercased.
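
If we want the exact match on the default field to carry even more weight, we can, for example, raise its boost (a sketch; the boost value of 2 is arbitrary):

curl 'localhost:9200/docs/_search?pretty' -d '{
  "query" : {
    "bool" : {
      "minimum_should_match" : 1,
      "should" : [
        {
          "match" : {
            "content.english" : "documents"
          }
        },
        {
          "match" : {
            "content" : {
              "query" : "documents",
              "boost" : 2
            }
          }
        }
      ]
    }
  }
}'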
