In the previous chapter, we talked about shards and index architecture in general. We started by learning how to choose the right number of shards and replicas, we used routing during indexing and querying and in conjunction with aliases, we discussed adjustments to shard allocation behavior, and finally, we looked at what query execution preference can bring us.
In this chapter, we will take a deeper dive into more low-level aspects of handling shards in Elasticsearch. By the end of this chapter, you will have learned:
With the release of Apache Lucene 4.0 in 2012, all users of this great full text search library were given the opportunity to alter the default TF/IDF-based algorithm. The Lucene API was changed to allow easier modification and extension of the scoring formula. However, this was not the only change made to Lucene when it comes to document score calculation. Lucene 4.0 shipped with additional similarity models, which allow us to use a different scoring formula for our documents. In this section, we will take a deeper look at what Lucene 4.0 brings and how these features were incorporated into Elasticsearch.
As already mentioned, the original and default similarity model available before Apache Lucene 4.0 was the TF/IDF model. We already discussed it in detail in the Default Apache Lucene scoring explained section in Chapter 2, Power User Query DSL.
The five new similarity models that we can use are:

- Okapi BM25: In order to use this similarity in Elasticsearch, we need to use the BM25 name. The Okapi BM25 similarity is said to perform best when dealing with short text documents where term repetitions are especially hurtful to the overall document score.
- Divergence from randomness (DFR): In order to use this similarity in Elasticsearch, we need to use the DFR name. It is said that the divergence from randomness model performs well on data similar to natural language text.
- Information-based (IB): In order to use this similarity in Elasticsearch, we need to use the IB name. Similar to the DFR similarity, it is said that the information-based model performs well on data similar to natural language text.
- LM Dirichlet: In order to use this similarity in Elasticsearch, we need to use the LMDirichlet name. More information about it can be found at https://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html.
- LM Jelinek Mercer: In order to use this similarity in Elasticsearch, we need to use the LMJelinekMercer name. More information about it can be found at https://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html.

All the mentioned similarity models require mathematical knowledge to fully understand them, and a deep explanation of these models is far beyond the scope of this book. However, if you would like to explore these models and increase your knowledge about them, please go to http://en.wikipedia.org/wiki/Okapi_BM25 for the Okapi BM25 similarity and http://terrier.org/docs/v3.5/dfr_description.html for the divergence from randomness similarity.
Since Elasticsearch 0.90, we are allowed to set a different similarity for each of the fields we have in our mappings. For example, let's assume that we have the following simple mappings that we use in order to index blog posts (stored in the posts_no_similarity.json
file):
{ "mappings" : { "post" : { "properties" : { "id" : { "type" : "long", "store" : "yes" }, "name" : { "type" : "string", "store" : "yes", "index" : "analyzed" }, "contents" : { "type" : "string", "store" : "no", "index" : "analyzed" } } } } }
What we would like to do is use the BM25
similarity model for the name
field and the contents
field. In order to do this, we need to extend our field definitions and add the similarity property with the value of the chosen similarity name. Our changed mappings (stored in the posts_similarity.json
file) would look like this:
{ "mappings" : { "post" : { "properties" : { "id" : { "type" : "long", "store" : "yes" }, "name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "similarity" : "BM25" }, "contents" : { "type" : "string", "store" : "no", "index" : "analyzed", "similarity" : "BM25" } } } } }
That's all; nothing more is needed. After the preceding change, Apache Lucene will use the BM25 similarity to calculate the score factor for the name
and contents
fields.
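If you generate mappings programmatically, for example when many fields should share a similarity, the same body can be assembled in a few lines of Python. This is just an illustrative sketch; the helper function and its name are our own invention, not part of any Elasticsearch client:

```python
import json

def analyzed_field(es_type, store, similarity=None):
    """Build one analyzed field definition for the mappings body (illustrative helper)."""
    definition = {"type": es_type, "store": store, "index": "analyzed"}
    if similarity is not None:
        definition["similarity"] = similarity
    return definition

mappings = {
    "mappings": {
        "post": {
            "properties": {
                "id": {"type": "long", "store": "yes"},
                # Both analyzed fields use the BM25 similarity, as in the text.
                "name": analyzed_field("string", "yes", similarity="BM25"),
                "contents": analyzed_field("string", "no", similarity="BM25"),
            }
        }
    }
}

# The serialized body matches the posts_similarity.json mappings shown above.
print(json.dumps(mappings, indent=2))
```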
As we now know how to set the desired similarity for each field in our index, it's time to see how to configure the similarity models themselves, which is actually pretty easy. What we need to do is use the index settings section to provide an additional similarity section, for example, like this (this example is stored in the posts_custom_similarity.json
file):
{ "settings" : { "index" : { "similarity" : { "mastering_similarity" : { "type" : "default", "discount_overlaps" : false } } } }, "mappings" : { "post" : { "properties" : { "id" : { "type" : "long", "store" : "yes" }, "name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "similarity" : "mastering_similarity" }, "contents" : { "type" : "string", "store" : "no", "index" : "analyzed" } } } } }
You can, of course, have more than one similarity configuration, but let's focus on the preceding example. We've defined a new similarity model named mastering_similarity, which is based on the default similarity: the TF/IDF one. We've set the discount_overlaps property to false for this similarity, and we've used it as the similarity for the name field. We'll talk about the properties that can be used with different similarities later in this section. Now, let's see how to change the default similarity model Elasticsearch will use.
In order to change the similarity model used by default, we need to provide a configuration of a similarity model that will be called default
. For example, if we would like to use our mastering_similarity
"name" as the default one, we would have to change the preceding configuration to the following one (the whole example is stored in the posts_default_similarity.json
file):
{
  "settings" : {
    "index" : {
      "similarity" : {
        "default" : {
          "type" : "default",
          "discount_overlaps" : false
        }
      }
    }
  },
  ...
}
Because the query norm and coordination factors (which were explained in the Default Apache Lucene scoring explained section in Chapter 2, Power User Query DSL) are used globally by all similarity models and are taken from the default similarity, Elasticsearch allows us to change them when needed. To do this, we need to define another similarity, called base. It is defined exactly the same as what we've shown previously, but instead of setting its name to default, we set it to base, just like this (the whole example is stored in the posts_base_similarity.json
file):
{
  "settings" : {
    "index" : {
      "similarity" : {
        "base" : {
          "type" : "default",
          "discount_overlaps" : false
        }
      }
    }
  },
  ...
}
If the base similarity is present in the index configuration, Elasticsearch will use it to calculate the query norm
and coord
factors when calculating the score using other similarity models.
Each of the newly introduced similarity models can be configured to match our needs. Elasticsearch allows us to use the default and BM25
similarities without any configuration, because they are preconfigured for us. In the case of DFR
and IB
, we need to provide the configuration in order to use them. Let's now see what properties each of the similarity model implementations provides.
In the case of the TF/IDF similarity, we are allowed to set only a single parameter—discount_overlaps
, which defaults to true
. By default, the tokens that have their position increment set to 0
(and therefore, are placed at the same position as the one before them) will not be taken into consideration when calculating the score. If we want them to be taken into consideration, we need to configure the similarity with the discount_overlaps
property set to false
.
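To see what "overlap" tokens are, consider a synonym filter that emits a synonym at the same position as the original term, that is, with a position increment of 0. The following toy Python sketch (our own illustration, not Lucene's actual code) shows how the field length used for normalization differs depending on whether overlaps are discounted:

```python
# Each token is (term, position_increment); a synonym emitted at the same
# position as the original term has a position increment of 0.
tokens = [("fast", 1), ("quick", 0), ("brown", 1), ("fox", 1)]

def field_length(tokens, discount_overlaps=True):
    """Count the tokens contributing to length normalization (toy model)."""
    if discount_overlaps:
        # Tokens with a 0 position increment are ignored (the default behavior).
        return sum(1 for _term, inc in tokens if inc > 0)
    return len(tokens)

print(field_length(tokens))                          # overlaps discounted
print(field_length(tokens, discount_overlaps=False)) # overlaps counted
```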
In the case of the Okapi BM25 similarity, we have these parameters: we can configure k1
(controls the saturation—nonlinear term frequency normalization) as a float value, b
(controls how the document length affects the term frequency values) as a float value, and discount_overlaps
, which is exactly the same as in TF/IDF similarity.
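The roles of k1 and b can be seen in the textbook BM25 term-frequency normalization, sketched below for illustration (this is the standard formula rather than Lucene's exact implementation, and the idf factor is omitted):

```python
def bm25_tf(tf, k1=1.2, b=0.75, doc_len=100, avg_doc_len=100):
    """Textbook BM25 term-frequency component: tf*(k1+1) / (tf + k1*(1-b+b*dl/avgdl))."""
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + norm)

# k1 controls saturation: each repetition of a term adds less than the last.
gain_1_to_2 = bm25_tf(2) - bm25_tf(1)
gain_9_to_10 = bm25_tf(10) - bm25_tf(9)

# b controls length normalization: with b=0, document length is ignored entirely.
short_doc = bm25_tf(3, b=0.0, doc_len=10)
long_doc = bm25_tf(3, b=0.0, doc_len=1000)
```

With the default b, the same term frequency scores higher in a short document than in a long one, which is consistent with BM25 performing well on short documents where repetitions would otherwise dominate.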
In the case of the DFR similarity, we have these parameters that we can configure: basic_model
(which can take the value be
, d
, g
, if
, in
, or ine
), after_effect
(with values of no
, b
, and l
), and the normalization (which can be no
, h1
, h2
, h3
, or z
). If we choose a normalization other than no
, we need to set the normalization factor. Depending on the chosen normalization, we should use normalization.h1.c
(the float value) for the h1
normalization, normalization.h2.c
(the float value) for the h2
normalization, normalization.h3.c
(the float value) for the h3
normalization, and normalization.z.z
(the float value) for the z
normalization. For example, this is what the example similarity configuration could look like:
"similarity" : { "esserverbook_dfr_similarity" : { "type" : "DFR", "basic_model" : "g", "after_effect" : "l", "normalization" : "h2", "normalization.h2.c" : "2.0" } }
In the case of the IB similarity, we have these parameters that we can configure: the distribution property (which can take the value of ll
or spl
) and the lambda property (which can take the value of df
or tff
). In addition to this, we can choose the normalization factor, which is the same as the one used for the DFR similarity, so we'll omit describing it for the second time. This is what the example IB similarity configuration could look like:
"similarity" : { "esserverbook_ib_similarity" : { "type" : "IB", "distribution" : "ll", "lambda" : "df", "normalization" : "z", "normalization.z.z" : "0.25" } }
In the case of the LM Dirichlet similarity, we have the mu property that we can configure, which is by default set to 2000. An example configuration of this could look as follows:
"similarity" : { "esserverbook_lm_dirichlet_similarity" : { "type" : "LMDirichlet", "mu" : "1000" } }