Searching content in different languages

So far, we've talked about language analysis mostly in theory, for example, about handling the multiple languages our data can consist of. This will now change, as we discuss how we can handle multiple languages in our data in practice.

Why we need to handle languages differently

As you already know, ElasticSearch allows us to choose different analyzers for our data; we can have our data divided into words on the basis of whitespace characters, have them lowercased, and so on. This can usually be done regardless of the language: you get the same whitespace-based tokenization for English, German, and Polish (that doesn't apply to Chinese, though). However, what if you want to find documents that contain words like cat and cats by sending only the word cat to ElasticSearch? This is where language analysis comes into play, with stemming algorithms for different languages that reduce the analyzed words to their root forms.
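For example, we can see stemming in action by asking the english analyzer to process the plural form with the _analyze API (a quick sketch; the exact response format may differ slightly between ElasticSearch versions):

curl -XGET 'localhost:9200/_analyze?analyzer=english&pretty=true' -d 'cats'

The returned token should be cat, the same root form the singular word is reduced to, which is why a query for cat can match documents containing cats.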

And now the worst part: we can't use one general stemming algorithm for all the languages in the world; we have to choose the appropriate one for each language. The following sections will help you with some parts of the language analysis process.

How to handle multiple languages

There are a few ways of handling multiple languages in ElasticSearch, and all of them have some pros and cons. We won't discuss all of them, but just to give you an idea, some of those ways are as follows:

  • Storing documents in different languages as different types
  • Storing documents in different languages in separate indices
  • Storing different versions of fields in a single document so that they contain different languages

However, we will focus on a single method that allows us to store documents in different languages in a single index (with some slight modifications). We will focus on a case where we have a single type of document, but the documents may come from all over the world and thus be written in multiple languages. We would also like to enable our users to use all the analysis capabilities, such as stemming and stop words, for different languages, not only English.

Detecting a document's language

If you don't know the language of your documents and queries (and this is mostly the case), you can use software for language detection that can detect (with some probability) the language of your documents and queries.

If you use Java, you can use one of the few available language detection libraries, for example, the Language Detection library, which claims to have over 99 percent precision for 53 languages; that's a lot if you ask me.

You should remember, though, that language detection is more precise for longer texts. Because of that, your documents' language will probably be identified correctly. However, because query text is usually short, you should expect some degree of error during query language identification.

Sample document

Let's start by introducing a sample document, which is as follows:

{
  "title" : "First test document",
  "content" : "This is a test document",
  "lang" : "english"
}

As you can see, the document is pretty simple; it contains three fields:

  • title: Holds the title of the document
  • content: Holds the actual content of the document
  • lang: The language identified

The first two fields are taken from our user's document, and the third one is the language our hypothetical user chose when uploading the document.

In order to inform ElasticSearch which analyzer should be used, we map the lang field to one of the analyzers that exist in ElasticSearch (the full list of language analyzers can be found at http://www.elasticsearch.org/guide/reference/index-modules/analysis/lang-analyzer.html). If the user enters a language that is not supported, we don't specify the lang field at all, so ElasticSearch uses the default analyzer.
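For example, a document written in German (a hypothetical example; german is one of the language analyzers available in ElasticSearch) could look as follows:

{
  "title" : "Erstes Testdokument",
  "content" : "Dies ist ein Testdokument",
  "lang" : "german"
}

If the document had been written in a language for which ElasticSearch has no analyzer, we would simply leave out the lang field.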

Mappings

So now, let's look at the mappings created for holding the preceding documents (we stored them in mappings.json):

{
  "mappings" : {
    "doc" : {
      "_analyzer" : {
        "path" : "lang"
      },
      "properties" : {
        "title" : {
          "type" : "multi_field",
          "fields" : {
            "title" : {
              "type" : "string",
              "index" : "analyzed",
              "store" : "no"
            },
            "default" : {
              "type" : "string",
              "index" : "analyzed",
              "store" : "no",
              "analyzer" : "simple"
            }
          }
        },
        "content" : {
          "type" : "multi_field",
          "fields" : {
            "content" : {
              "type" : "string",
              "index" : "analyzed",
              "store" : "no"
            },
            "default" : {
              "type" : "string",
              "index" : "analyzed",
              "store" : "no",
              "analyzer" : "simple"
            }
          }
        },
        "lang" : {
          "type" : "string",
          "index" : "not_analyzed",
          "store" : "yes"
        }
      }
    }
  }
}

In the preceding mappings, the things we are most interested in are the analyzer definition and the title and content fields (if you are not familiar with any parts of the mappings, please refer to Chapter 1, Getting Started with ElasticSearch, and Chapter 3, Extending Your Structure and Search). We want the analyzer to be chosen on the basis of the lang field. Because of that, each document needs to hold, in its lang field, a value equal to the name of one of the analyzers known to ElasticSearch (a default one or one defined by us).

Now come the definitions of the two fields that hold the actual data. As you can see, we've used the multi_field definition in order to index the title and content fields. The first of the multi fields is indexed with the analyzer specified by the lang field (because we didn't specify an analyzer name explicitly, the one pointed to by the _analyzer path is used). We will use that field when we know which language the query is written in. The second of the multi fields uses the simple analyzer and will be used for searching when the query language is unknown. However, the simple analyzer is only an example; you can also use the standard analyzer or any other analyzer of your choice.

In order to create the docs index with the preceding mappings, we used the following command:

curl -XPUT 'localhost:9200/docs' -d @mappings.json
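To have something to search against, we also index our sample document. A minimal example (assuming we give it the identifier 1, which matches the identifier you'll see in the search results later) could be as follows:

curl -XPUT 'localhost:9200/docs/doc/1' -d '{
  "title" : "First test document",
  "content" : "This is a test document",
  "lang" : "english"
}'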

Querying

Now let's see how we can query our data. We can divide the querying situation into two different cases.

Queries with a known language

Let's assume we identified that our user has sent a query written in English and we know that English matches the english analyzer. In such a case, our query could be as follows:

curl -XGET 'localhost:9200/docs/_search?pretty=true' -d '{
  "query" : {
    "match" : {
      "content" : {
        "query" : "documents",
        "analyzer" : "english"
      }
    }
  }
}'

Notice the analyzer parameter, which indicates which analyzer we want to use. We set that parameter to the name of the analyzer corresponding to the identified language. Also notice that the term we are searching for is documents, while the term in the document is document, but the english analyzer should take care of that and find the document:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "docs",
      "_type" : "doc",
      "_id" : "1",
      "_score" : 0.19178301
    } ]
  }
}

As you can see, that document was found.

Queries with an unknown language

Now let's assume that we don't know the language the user's query is written in. In this case, we can't use the field analyzed with the analyzer specified by our lang field, because we don't know which language-specific analyzer we should use. In that case, we will use the simple analyzer and send the query to the content.default field instead of content. The query could be as follows:

curl -XGET 'localhost:9200/docs/_search?pretty=true' -d '{
  "query" : {
    "match" : {
      "content.default" : {
        "query" : "documents",
        "analyzer" : "simple"
      }
    }
  }
}'

However, this time we didn't get any results, because the simple analyzer doesn't perform stemming, so a query for the plural form of a word can't match a document that contains only the singular form.
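We can verify that by looking at how the simple analyzer processes our query term with the _analyze API (a quick sketch, analogous to the earlier stemming example):

curl -XGET 'localhost:9200/_analyze?analyzer=simple&pretty=true' -d 'documents'

The only token produced is documents, which doesn't match the document term stored in the content.default field, and that's why nothing was found.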

Combining queries

To additionally boost the documents that also match our default analyzer, we can combine the two preceding queries with the bool query, so that the whole request looks like the following:

curl -XGET 'localhost:9200/docs/_search?pretty=true' -d '{
  "query" : {
    "bool" : {
      "minimum_number_should_match" : 1,
      "should" : [
        {
          "match" : {
            "content" : {
              "query" : "documents",
              "analyzer" : "english"
            }
          }
        },
        {
          "match" : {
            "content.default" : {
              "query" : "documents",
              "analyzer" : "simple"
            }
          }
        }
      ]
    }
  }
}'

At least one of those queries must match; if both match, the document will get a higher score in the results.

There is one additional advantage to the preceding combined query: if the language-specific analysis doesn't find a document (for example, when the query-time analysis differs from the one used during indexing), the second query still has a chance to match the terms that were only divided into words and lowercased by the simple analyzer.
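If we wanted to prefer documents matched through the default field even more, we could, for example, add a boost parameter to the second clause (the value 2 is only an illustration; the match query supports boost), like this:

{
  "match" : {
    "content.default" : {
      "query" : "documents",
      "analyzer" : "simple",
      "boost" : 2
    }
  }
}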
