Until now, when discussing language analysis, we've talked mostly about theory. We haven't yet seen practical examples of language analysis or of handling data that can be written in multiple languages. This will now change, as this section is dedicated to showing how we can handle data in multiple languages.
As you already know, Elasticsearch allows us to choose different analyzers for our data. We can have our data divided on the basis of whitespaces, have it lowercased, and so on. This can usually be done regardless of the language – the same tokenization on the basis of whitespaces will work for English, German, and Polish, although it won't work for Chinese. However, what if you want to find documents that contain words such as cat and cats by only sending the word cat to Elasticsearch? This is where language analysis comes into play, with stemming algorithms for different languages that allow the analyzed words to be reduced to their root forms. And now the worst part – we can't use one general stemming algorithm for all the languages in the world; we have to choose the algorithm appropriate for the language of the text. The following sections of the chapter will help you with some parts of the language analysis process.
There are a few ways of handling multiple languages in Elasticsearch, and all of them have some pros and cons. We won't discuss all of them, but just to give you an idea, a few of these methods are as follows:
For the purpose of this book, we will focus on a single method – the one that allows storing documents in different languages in a single index. We will focus on a problem where we have a single type of document, but each document may come from anywhere in the world and thus can be written in various languages. Also, we would like to enable our users to use all the analysis capabilities, such as stemming and stop words, for different languages, not only for English.
Before we continue with showing you how to solve our problem of handling multiple languages in Elasticsearch, we would like to tell you about one additional thing: language detection. There are situations where you just don't know what language your document or query is in. In such cases, language detection libraries may be a good choice, especially when using Java as your programming language of choice. Some of the libraries are as follows:
The language detection library claims to have over 99 percent precision for 53 languages; that's a lot if you ask us.
You should remember, though, that language detection will be more precise for longer texts. Because the text of queries is usually short, you can expect a higher error rate when identifying the language of a query.
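To get an intuition for why short texts are harder, consider a naive detector that scores languages by stop-word overlap. This is only a toy sketch of the idea for illustration purposes; the guess_language function and the tiny word lists are our own invention and have nothing to do with the libraries mentioned above, which use far more accurate character n-gram models:

```python
# Toy language detector: picks the language whose stop words
# overlap most with the input text.
STOPWORDS = {
    "english": {"the", "is", "a", "this", "and", "of"},
    "german": {"der", "die", "das", "ist", "ein", "und"},
    "polish": {"to", "jest", "i", "w", "na", "nie"},
}

def guess_language(text):
    tokens = set(text.lower().split())
    scores = {lang: len(tokens & words) for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    # With no stop-word hits at all, we cannot decide.
    return best if scores[best] > 0 else None

print(guess_language("This is a test document"))   # english
print(guess_language("Das ist ein Testdokument"))  # german
print(guess_language("Lorem ipsum dolor"))         # None
```

Note how a short query such as "Lorem ipsum dolor" contains no stop words at all and cannot be classified; this is exactly the kind of ambiguity that makes query language identification error prone.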
Let's start with introducing a sample document, which is as follows:
{
  "title" : "First test document",
  "content" : "This is a test document"
}
As you can see, the document is pretty simple; it contains the following two fields:
title: This field holds the title of the document
content: This field holds the actual content of the document
This document is quite simple but, from the search point of view, the information about the document language is missing. What we should do is enrich the document by adding the needed information. We can do that by using one of the previously mentioned libraries, which will try to detect the language.
After the language has been detected, we can tell Elasticsearch which analyzer should be used by modifying the document so that each field directly indicates its language. Each of the fields would then be analyzed by a language analyzer dedicated to the detected language.
A full list of these language analyzers can be found at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html.
If a document is written in a language that we are not supporting, we will just fall back to a default field with the default analyzer. For example, our document, processed and prepared for indexing, could look like this:
{
  "title_english" : "First test document",
  "content_english" : "This is a test document"
}
The thing is that all the processing we've mentioned would have to be done outside of Elasticsearch, or in some kind of custom plugin implementing the described logic.
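As a sketch of that outside-of-Elasticsearch step, the enrichment could look like the following Python fragment. The enrich helper and the set of supported languages are assumptions of ours, not an existing API; the detection itself would be delegated to one of the libraries mentioned earlier:

```python
# Languages for which we have dedicated analyzers in our mappings.
SUPPORTED_LANGUAGES = {"english", "russian", "german"}

def enrich(document, detected_language):
    """Rename each field so that it carries the detected language
    as a suffix, for example title -> title_english. For unsupported
    languages we fall back to the original field names, which are
    handled by the default analyzer."""
    if detected_language not in SUPPORTED_LANGUAGES:
        return dict(document)
    return {
        "%s_%s" % (field, detected_language): value
        for field, value in document.items()
    }

doc = {"title": "First test document", "content": "This is a test document"}
print(enrich(doc, "english"))
# {'title_english': 'First test document', 'content_english': 'This is a test document'}
```

After enrichment, the resulting document would be sent to Elasticsearch for indexing in the usual way.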
In previous versions of Elasticsearch, it was possible to choose an analyzer on the basis of the value of an additional field that contained the analyzer name. This was a more convenient and elegant way, but it introduced some uncertainty about the field contents: you always had to deliver the proper analyzer when using the given field, or strange things happened. The Elasticsearch team made the difficult decision to remove this feature.
There is also a simpler way: we can take our first document and index it in several ways, independently of the input language. Let's focus on this solution.
To handle our solution, which will process the document using several defined languages, we need new mappings. Let's look at the mappings we've created to index our documents (we've stored them in the mappings.json file):
{
  "mappings" : {
    "doc" : {
      "properties" : {
        "title" : {
          "type" : "string",
          "index" : "analyzed",
          "fields" : {
            "english" : { "type" : "string", "index" : "analyzed", "analyzer" : "english" },
            "russian" : { "type" : "string", "index" : "analyzed", "analyzer" : "russian" },
            "german" : { "type" : "string", "index" : "analyzed", "analyzer" : "german" }
          }
        },
        "content" : {
          "type" : "string",
          "index" : "analyzed",
          "fields" : {
            "english" : { "type" : "string", "index" : "analyzed", "analyzer" : "english" },
            "russian" : { "type" : "string", "index" : "analyzed", "analyzer" : "russian" },
            "german" : { "type" : "string", "index" : "analyzed", "analyzer" : "german" }
          }
        }
      }
    }
  }
}
In the preceding mappings, we've shown the definition for the title and content fields (if you are not familiar with any aspect of the mappings definition, refer to the Mappings configuration section of Chapter 2, Indexing Your Data). We have used the multifield feature of Elasticsearch: each field can be indexed in several ways using various language analyzers (in our example, those analyzers are English, Russian, and German).
In addition, the base field uses the default analyzer, which we may use at query time when the language is unknown. So, each field will actually consist of four fields – the default one and three language-oriented ones.
In order to create a sample index called docs that uses our mappings, we will use the following command:
curl -XPUT 'localhost:9200/docs' -d @mappings.json
Now let's see how we can query our data using the newly created language fields. We can divide the querying situation into two different cases. Of course, to start querying we need documents, so let's index our example document by running the following command:
curl -XPOST 'localhost:9200/docs/doc/1' -d '{"title" : "First test document","content" : "This is a test document"}'
The first case is when we have our query language identified. Let's assume that the identified language is English. In such cases, our query is as follows:
curl 'localhost:9200/docs/_search?pretty' -d '{
  "query" : {
    "match" : {
      "content.english" : "documents"
    }
  }
}'
The things to put emphasis on in the preceding query are the field used for querying and the query type. The field used is content.english, which also indicates which analyzer we want to use. We used that field because we had identified our language before running the query. Thanks to this, the English analyzer can find our document even though the document contains only the singular form of the word. The response returned by Elasticsearch will be as follows:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "docs",
      "_type" : "doc",
      "_id" : "1",
      "_score" : 0.19178301,
      "_source" : {
        "title" : "First test document",
        "content" : "This is a test document"
      }
    } ]
  }
}
The other thing to note is the query type: the match query. We used the match query because it analyzes its body with the analyzer used by the field it is run against. We need that to properly match the data in the query with the data in the index.
Now let's look at the second situation – handling queries whose language we couldn't identify. In such cases, we can't use a field name pointing to one of the languages, such as content.german. Instead, we use the base field with the default analyzer and send the query to the content field. The query will look as follows:
curl 'localhost:9200/docs/_search?pretty' -d '{
  "query" : {
    "match" : {
      "content" : "documents"
    }
  }
}'
This time, however, we didn't get any results, because the default analyzer can't match the singular form of a word stored in the document when we search using the plural form.
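This behavior is easy to reproduce by hand. The default analysis roughly lowercases the text and splits it into terms, so the indexed terms never contain the plural form we search for. The following is a simplified simulation of ours; the real standard analyzer also handles punctuation and more:

```python
def simple_analyze(text):
    # Rough approximation of the default analysis chain:
    # lowercase the text and split it on whitespace.
    return text.lower().split()

indexed_terms = simple_analyze("This is a test document")
print(indexed_terms)                 # ['this', 'is', 'a', 'test', 'document']
print("documents" in indexed_terms)  # False: the plural form never matches
```

The English analyzer, in contrast, would stem both "document" and "documents" to the same root form, which is why the content.english query succeeded.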
To additionally boost the documents that also match with our default analyzer, we can combine the two preceding queries with the bool query. Such a combined query will look as follows:
curl -XGET 'localhost:9200/docs/_search?pretty=true' -d '{
  "query" : {
    "bool" : {
      "minimum_should_match" : 1,
      "should" : [
        { "match" : { "content.english" : "documents" } },
        { "match" : { "content" : "documents" } }
      ]
    }
  }
}'
For a document to be returned, at least one of the defined queries must match. If both match, the document will have a higher score and will be placed higher in the results.
There is one additional advantage of the preceding combined query. If the language analyzer doesn't find a document (for example, when the query analysis differs from the one used during indexing), the second query still has a chance to find the terms, which are only tokenized on whitespace characters and lowercased.