Chapter 6. Low-level Index Control

In the previous chapter, we talked about shards and the general index architecture. We started by learning how to choose the right number of shards and replicas, and we used routing during indexing and querying as well as in conjunction with aliases. We also discussed shard allocation behavior adjustments, and finally, we looked at what query execution preference can bring us.

In this chapter, we will take a deeper dive into more low-level aspects of handling shards in Elasticsearch. By the end of this chapter, you will have learned:

  • Altering the Apache Lucene scoring by using different similarity models
  • Altering index writing by using codecs
  • Near real-time indexing and querying
  • Data flushing, index refresh, and transaction log handling
  • I/O throttling
  • Segment merge control and visualization
  • Elasticsearch caching

Altering Apache Lucene scoring

With the release of Apache Lucene 4.0 in 2012, all the users of this great full text search library were given the opportunity to alter the default TF/IDF-based scoring algorithm. The Lucene API was changed to allow easier modification and extension of the scoring formula. However, this was not the only change made to Lucene when it comes to document score calculation. Lucene 4.0 was shipped with additional similarity models, which basically allow us to use a different scoring formula for our documents. In this section, we will take a deeper look at what Lucene 4.0 brings and how these features were incorporated into Elasticsearch.

Available similarity models

As already mentioned, the original and default similarity model available before Apache Lucene 4.0 was the TF/IDF model. We already discussed it in detail in the Default Apache Lucene scoring explained section in Chapter 2, Power User Query DSL.

The five new similarity models that we can use are:

  • Okapi BM25: This similarity model is based on a probabilistic model that estimates the probability of finding a document for a given query. In order to use this similarity in Elasticsearch, you need to use the BM25 name. The Okapi BM25 similarity is said to perform best when dealing with short text documents where term repetitions are especially hurtful to the overall document score.
  • Divergence from randomness (DFR): This similarity model is based on the probabilistic model of the same name. In order to use this similarity in Elasticsearch, you need to use the DFR name. It is said that the divergence from randomness similarity model performs well on text similar to natural language text.
  • Information-based: This is very similar to the model used by Divergence from randomness. In order to use this similarity in Elasticsearch, you need to use the IB name. Similar to the DFR similarity, it is said that the information-based model performs well on data similar to natural language text.
  • LM Dirichlet: This similarity model uses Bayesian smoothing with Dirichlet priors. To use this similarity, we need to use the LMDirichlet name. More information about it can be found at https://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html.
  • LM Jelinek Mercer: This similarity model is based on the Jelinek Mercer smoothing method. To use this similarity, we need to use the LMJelinekMercer name. More information about it can be found at https://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html.

Note

All the mentioned similarity models require mathematical knowledge to fully understand them, and a deep explanation of these models is far beyond the scope of this book. However, if you would like to explore these models and increase your knowledge about them, please go to http://en.wikipedia.org/wiki/Okapi_BM25 for the Okapi BM25 similarity and http://terrier.org/docs/v3.5/dfr_description.html for the divergence from randomness similarity.

Setting a per-field similarity

Since Elasticsearch 0.90, we are allowed to set a different similarity for each of the fields we have in our mappings. For example, let's assume that we have the following simple mappings that we use in order to index blog posts (stored in the posts_no_similarity.json file):

{
 "mappings" : {
  "post" : {
   "properties" : {
    "id" : { "type" : "long", "store" : "yes" },
    "name" : { "type" : "string", "store" : "yes", "index" : "analyzed" },
    "contents" : { "type" : "string", "store" : "no", "index" : "analyzed" }
   }
  }
 }
}

What we would like to do is use the BM25 similarity model for the name field and the contents field. In order to do this, we need to extend our field definitions and add the similarity property with the value of the chosen similarity name. Our changed mappings (stored in the posts_similarity.json file) would look like this:

{
 "mappings" : {
  "post" : {
   "properties" : {
    "id" : { "type" : "long", "store" : "yes" },
    "name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "similarity" : "BM25" },
    "contents" : { "type" : "string", "store" : "no", "index" : "analyzed", "similarity" : "BM25" }
   }
  }
 }
}

That's all; nothing more is needed. After the preceding change, Apache Lucene will use the BM25 similarity to calculate the score factor for the name and contents fields.
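
For example, assuming that Elasticsearch is running on the local host and the default 9200 port, and that we want our index to be called posts (the index name is our own choice here, not something dictated by the mappings file), we could create the index with the preceding mappings by using a command similar to the following one:

curl -XPOST 'localhost:9200/posts' -d @posts_similarity.json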

Note

Please note that in the case of the Divergence from randomness and Information-based similarities, we need to configure some additional properties to specify these similarities' behavior. How to do that is covered later in this section, in the Configuring the chosen similarity model part.

Similarity model configuration

Now that we know how to set the desired similarity for each field in our index, it's time to see how to configure the similarity models themselves, which is actually pretty easy. What we need to do is provide an additional similarity section in the index settings, for example, like this (this example is stored in the posts_custom_similarity.json file):

{
 "settings" : {
  "index" : {
   "similarity" : {
    "mastering_similarity" : {
     "type" : "default",
     "discount_overlaps" : false
    }
   }
  }
 },
 "mappings" : {
  "post" : {
   "properties" : {
    "id" : { "type" : "long", "store" : "yes" },
    "name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "similarity" : "mastering_similarity" },
    "contents" : { "type" : "string", "store" : "no", "index" : "analyzed" }
   }
  }
 }
}

You can, of course, have more than one similarity configuration, but let's focus on the preceding example. We've defined a new similarity model named mastering_similarity, which is based on the default similarity, which is the TF/IDF one. We've set the discount_overlaps property to false for this similarity, and we've used it as the similarity for the name field. We'll talk about what properties can be used for different similarities further in this section. Now, let's see how to change the default similarity model Elasticsearch will use.
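
If you would like to verify that the similarity configuration was actually applied, you can ask Elasticsearch for the index settings (again, we assume a local Elasticsearch instance and an index called posts):

curl -XGET 'localhost:9200/posts/_settings?pretty'

The response should contain the similarity section with our mastering_similarity definition.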

Choosing the default similarity model

In order to change the similarity model used by default, we need to provide a configuration of a similarity model that will be called default. For example, if we would like to use our mastering_similarity configuration as the default one, we would have to change the preceding configuration to the following one (the whole example is stored in the posts_default_similarity.json file):

{
 "settings" : {
  "index" : {
   "similarity" : {
    "default" : {
     "type" : "default",
     "discount_overlaps" : false
    }
   }
  }
 },
 ...
}

Because the query norm and coordination factors (which were explained in the Default Apache Lucene scoring explained section in Chapter 2, Power User Query DSL) are used globally by all similarity models and are taken from the default similarity, Elasticsearch allows us to change them when needed. To do this, we need to define another similarity, one called base. It is defined in exactly the same way as we've shown previously, but instead of setting its name to default, we set it to base, just like this (the whole example is stored in the posts_base_similarity.json file):

{
 "settings" : {
  "index" : {
   "similarity" : {
    "base" : {
     "type" : "default",
     "discount_overlaps" : false
    }
   }
  }
 },
 ...
}

If the base similarity is present in the index configuration, Elasticsearch will use it to calculate the query norm and coord factors when calculating the score using other similarity models.
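
For illustration, a settings fragment that defines both the base similarity and an additional custom similarity could look like the following sketch (the mastering_bm25 name is our own choice):

"similarity" : {
 "base" : {
  "type" : "default",
  "discount_overlaps" : false
 },
 "mastering_bm25" : {
  "type" : "BM25"
 }
}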

Configuring the chosen similarity model

Each of the newly introduced similarity models can be configured to match our needs. Elasticsearch allows us to use the default and BM25 similarities without any configuration, because they are preconfigured for us. In the case of DFR and IB, we need to provide the configuration in order to use them. Let's now see what properties each of the similarity model implementations provides.

Configuring the TF/IDF similarity

In the case of the TF/IDF similarity, we are allowed to set only a single parameter—discount_overlaps, which defaults to true. By default, the tokens that have their position increment set to 0 (and therefore, are placed at the same position as the one before them) will not be taken into consideration when calculating the score. If we want them to be taken into consideration, we need to configure the similarity with the discount_overlaps property set to false.
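
For example, a TF/IDF-based similarity that takes such overlapping tokens into account could be defined like this (the similarity name is our own choice):

"similarity" : {
 "no_overlap_discount_similarity" : {
  "type" : "default",
  "discount_overlaps" : false
 }
}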

Configuring the Okapi BM25 similarity

In the case of the Okapi BM25 similarity, we can configure the following parameters: k1 (a float value that controls saturation, that is, the nonlinear term frequency normalization), b (a float value that controls how the document length affects the term frequency values), and discount_overlaps, which works exactly the same as in the TF/IDF similarity.
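
For example, an Okapi BM25 similarity configuration could look like the following one (the similarity name is our own choice and the parameter values are only illustrative, not tuning recommendations):

"similarity" : {
 "esserverbook_bm25_similarity" : {
  "type" : "BM25",
  "k1" : "1.2",
  "b" : "0.75",
  "discount_overlaps" : true
 }
}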

Configuring the DFR similarity

In the case of the DFR similarity, we have these parameters that we can configure: basic_model (which can take the value be, d, g, if, in, or ine), after_effect (with values of no, b, and l), and the normalization (which can be no, h1, h2, h3, or z). If we choose a normalization other than no, we need to set the normalization factor. Depending on the chosen normalization, we should use normalization.h1.c (the float value) for the h1 normalization, normalization.h2.c (the float value) for the h2 normalization, normalization.h3.c (the float value) for the h3 normalization, and normalization.z.z (the float value) for the z normalization. For example, this is what the example similarity configuration could look like:

"similarity" : {
 "esserverbook_dfr_similarity" : {
  "type" : "DFR",
  "basic_model" : "g",
  "after_effect" : "l",
  "normalization" : "h2",
  "normalization.h2.c" : "2.0"
 }
}

Configuring the IB similarity

In the case of the IB similarity, we have these parameters that we can configure: the distribution property (which can take the value of ll or spl) and the lambda property (which can take the value of df or ttf). In addition to this, we can choose the normalization factor, which is the same as the one used for the DFR similarity, so we'll omit describing it for the second time. This is what the example IB similarity configuration could look like:

"similarity" : {
 "esserverbook_ib_similarity" : {
  "type" : "IB",
  "distribution" : "ll",
  "lambda" : "df",
  "normalization" : "z",
  "normalization.z.z" : "0.25"
 }
}

Configuring the LM Dirichlet similarity

In the case of the LM Dirichlet similarity, we can configure the mu property, which is set to 2000 by default. An example configuration of this could look as follows:

"similarity" : {
 "esserverbook_lm_dirichlet_similarity" : {
  "type" : "LMDirichlet",
  "mu" : "1000"
 }
}

Configuring the LM Jelinek Mercer similarity

When it comes to the LM Jelinek Mercer similarity, we can configure the lambda property, which is set to 0.1 by default. An example configuration of this could look as follows:

"similarity" : {
 "esserverbook_lm_jelinek_mercer_similarity" : {
  "type" : "LMJelinekMercer",
  "lambda" : "0.7"
 }
}

Note

It is said that for short fields (like the document title) the optimal lambda value is around 0.1, while for long fields the lambda should be set to 0.7.
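
For example, assuming that we follow this advice, we could define two LM Jelinek Mercer similarities and assign them to a short and a long field (the field and similarity names used here are our own, following the earlier mappings example):

{
 "settings" : {
  "index" : {
   "similarity" : {
    "name_lm_similarity" : {
     "type" : "LMJelinekMercer",
     "lambda" : "0.1"
    },
    "contents_lm_similarity" : {
     "type" : "LMJelinekMercer",
     "lambda" : "0.7"
    }
   }
  }
 },
 "mappings" : {
  "post" : {
   "properties" : {
    "name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "similarity" : "name_lm_similarity" },
    "contents" : { "type" : "string", "store" : "no", "index" : "analyzed", "similarity" : "contents_lm_similarity" }
   }
  }
 }
}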
