Controlling multimatching

Until Elasticsearch 1.1, we had limited control over the multi_match query. We could specify the fields the query should be run against, we could use disjunction max queries (by setting the use_dis_max property to true), and we could tell Elasticsearch how important each field is by using boosting. An example query run against multiple fields could look as follows:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "query" : {
  "multi_match" : {
   "query" : "complete conan doyle",
   "fields" : [ "title^20", "author^10", "characters" ]
  }
 }
}'

This simple query matches documents that have any of the given tokens in any of the listed fields. In addition, the boosts make a match in the title field count more than a match in the author field, which in turn counts more than a match in the characters field.

Of course, we could also use the disjunction max query:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "query" : {
  "multi_match" : {
   "query" : "complete conan doyle",
   "fields" : [ "title^20", "author^10", "characters" ],
   "use_dis_max" : true
  }
 }
}'

But apart from the score calculation for the resulting documents, using disjunction max didn't change much.

Multimatch types

With the release of Elasticsearch 1.1, the use_dis_max property was deprecated and the Elasticsearch developers introduced a new property: type. This property controls how the multi_match query is executed internally. Let's now look at the available ways of controlling how Elasticsearch runs queries against multiple fields.

Note

Please note that the tie_breaker property was not deprecated and we can still use it without worrying about future compatibility.

Best fields matching

To use the best fields type of matching, set the type property of the multi_match query to best_fields. This type of multimatching generates a match query for each field specified in the fields property, and it is best suited to searching for multiple words in the same, best matching field. For example, let's look at the following query:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "query" : {
  "multi_match" : {
   "query" : "complete conan doyle",
   "fields" : [ "title", "author", "characters" ],
   "type" : "best_fields",
   "tie_breaker" : 0.8
  }
 }
}'

The preceding query would be translated into a query similar to the following one:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "query" : {
  "dis_max" : {
   "queries" : [
    {
     "match" : {
      "title" : "complete conan doyle"
     }
    },
    {
     "match" : {
      "author" : "complete conan doyle"
     }
    },
    {
     "match" : {
      "characters" : "complete conan doyle"
     }
    }
   ],
   "tie_breaker" : 0.8
  }
 }
}'
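The translation shown above can be sketched in a few lines of Python. This is only an illustration of the mapping described in the text, not Elasticsearch's actual implementation; the helper name expand_best_fields is hypothetical, and the sketch assumes plain field names (boosts such as title^20 are not handled).

```python
import json


def expand_best_fields(multi_match):
    """Expand a best_fields multi_match body into a similar dis_max body.

    Hypothetical helper for illustration only; it mirrors the translation
    described in the text and ignores boosts and other options.
    """
    query_text = multi_match["query"]
    operator = multi_match.get("operator")
    clauses = []
    for field in multi_match["fields"]:
        if operator is None:
            # Short form: {"match": {"title": "some text"}}
            clauses.append({"match": {field: query_text}})
        else:
            # Long form, so the operator is applied to every generated clause
            clauses.append(
                {"match": {field: {"query": query_text, "operator": operator}}}
            )
    dis_max = {"queries": clauses}
    if "tie_breaker" in multi_match:
        dis_max["tie_breaker"] = multi_match["tie_breaker"]
    return {"dis_max": dis_max}


multi_match = {
    "query": "complete conan doyle",
    "fields": ["title", "author", "characters"],
    "type": "best_fields",
    "tie_breaker": 0.8,
}
expanded = expand_best_fields(multi_match)
print(json.dumps(expanded, indent=1))
```

The printed structure has one match clause per field and carries the tie_breaker value over unchanged, which is the shape of the dis_max request above.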

Running either the multi_match query or its dis_max equivalent against our index returns the following:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.033352755,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "3",
      "_score" : 0.033352755,
      "_source":{ "title": "The Complete Sherlock Holmes","author": "Arthur Conan Doyle","year": 1936,"characters": ["Sherlock Holmes","Dr. Watson", "G. Lestrade"],"tags": [],"copies": 0, "available" : false, "section" : 12}
    } ]
  }
}

Both queries returned exactly the same results, with the same score calculated for the document. One thing to remember is how the score is calculated. If the tie_breaker value is present, the score of a document is the score of its best matching field plus the sum of the scores of the other matching fields multiplied by the tie_breaker value. If the tie_breaker value is not present, the document is assigned a score equal to the score of the best matching field.
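As a worked example of that rule, the following Python sketch combines per-field scores the way a disjunction max query does. The function name and the per-field scores are made up for illustration (they are not the scores Elasticsearch would compute for our index), and a tie_breaker of 0.5 is used only to keep the arithmetic easy to follow.

```python
def dis_max_score(field_scores, tie_breaker=0.0):
    """Combine per-field scores as described in the text:
    best score + tie_breaker * (sum of the remaining scores)."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie_breaker * others


# Hypothetical scores for the title, author, and characters clauses
scores = [0.5, 0.25, 0.125]

without_tie_breaker = dis_max_score(scores)       # only the best field counts
with_tie_breaker = dis_max_score(scores, 0.5)     # 0.5 + 0.5 * (0.25 + 0.125)
plain_sum = dis_max_score(scores, 1.0)            # every field counts fully

print(without_tie_breaker, with_tie_breaker, plain_sum)
```

A tie_breaker of 0 reduces the score to the best field alone, while a value of 1.0 makes the result a plain sum of all matching fields; values in between reward documents that match in more than one field without letting the extra fields dominate.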

There is one more question when it comes to the best_fields matching: what happens when we would like to use the AND operator or the minimum_should_match property? The answer is simple: the best_fields matching is translated into many match queries, and both the operator property and the minimum_should_match property are applied to each of the generated match queries. Because of that, the following query wouldn't return any documents in our case:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "query" : {
  "multi_match" : {
   "query" : "complete conan doyle",
   "fields" : [ "title", "author", "characters" ],
   "type" : "best_fields",
   "operator" : "and"
  }
 }
}'

This is because the preceding query would be translated into:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "query" : {
  "dis_max" : {
   "queries" : [
    {
     "match" : {
      "title" : {
       "query" : "complete conan doyle",
       "operator" : "and"
      }
     }
    },
    {
     "match" : {
      "author" : {
       "query" : "complete conan doyle",
       "operator" : "and"
      }
     }
    },
    {
     "match" : {
      "characters" : {
       "query" : "complete conan doyle",
       "operator" : "and"
      }
     }
    }
   ]
  }
 }
}'

And the preceding query looks as follows on the Lucene level:

(+title:complete +title:conan +title:doyle) | (+author:complete +author:conan +author:doyle) | (+characters:complete +characters:conan +characters:doyle)

We don't have any document in the index that has the complete, conan, and doyle terms all in a single field. However, if we would like the terms to be matched across different fields of the same document, we can use the cross-field matching.
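To see why no field satisfies the AND condition, here is a minimal Python sketch that checks the sample book field by field. The tokenize function is a rough stand-in for an analyzer (lowercasing and splitting on non-alphanumeric characters) and is an assumption of this sketch, not the analysis chain Elasticsearch actually applies.

```python
import re


def tokenize(value):
    """Rough stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    values = value if isinstance(value, list) else [value]
    tokens = set()
    for text in values:
        tokens.update(re.findall(r"[a-z0-9]+", text.lower()))
    return tokens


book = {
    "title": "The Complete Sherlock Holmes",
    "author": "Arthur Conan Doyle",
    "characters": ["Sherlock Holmes", "Dr. Watson", "G. Lestrade"],
}
terms = ["complete", "conan", "doyle"]
fields = ["title", "author", "characters"]

# For each field, list the query terms that the field does NOT contain
missing_per_field = {
    field: [term for term in terms if term not in tokenize(book[field])]
    for field in fields
}

# best_fields with the AND operator needs one field with nothing missing
best_fields_and_matches = any(not missing for missing in missing_per_field.values())

for field, missing in missing_per_field.items():
    print(field, "is missing", missing)
print("matches:", best_fields_and_matches)
```

Every field is missing at least one term (the title lacks conan and doyle, the author lacks complete), so none of the generated match clauses can succeed on its own.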

Cross fields matching

The cross_fields type matching is perfect when we want all the terms from the query to be found in the mentioned fields inside the same document. Let's recall our previous query, but this time instead of the best_fields matching, let's use the cross_fields matching type:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "query" : {
  "multi_match" : {
   "query" : "complete conan doyle",
   "fields" : [ "title", "author", "characters" ],
   "type" : "cross_fields",
   "operator" : "and"
  }
 }
}'

This time, the results returned by Elasticsearch were as follows:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.08154379,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "3",
      "_score" : 0.08154379,
      "_source":{ "title": "The Complete Sherlock Holmes","author": "Arthur Conan Doyle","year": 1936,"characters": ["Sherlock Holmes","Dr. Watson", "G. Lestrade"],"tags": [],"copies": 0, "available" : false, "section" : 12}
    } ]
  }
}

This is because our query was translated into the following Lucene query:

+(title:complete author:complete characters:complete) +(title:conan author:conan characters:conan) +(title:doyle author:doyle characters:doyle)

The results will only contain documents that have every term from the query, where each term may appear in any of the mentioned fields. Of course, this is only the case when we use the AND Boolean operator. With the OR operator, we will get documents having at least a single match in any of the fields.
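The term-by-term logic of the Lucene query above can be sketched in Python as follows. The function name is hypothetical, the documents are represented as already tokenized sets, and the second document is invented purely to contrast the AND and OR operators; scoring is left out entirely.

```python
def cross_fields_matches(doc_tokens, terms, operator="or"):
    """Term-centric check: each term may be found in ANY of the fields.

    doc_tokens maps a field name to the set of tokens indexed for it.
    """
    term_found = [
        any(term in tokens for tokens in doc_tokens.values()) for term in terms
    ]
    return all(term_found) if operator == "and" else any(term_found)


# The sample book from the text, pre-tokenized by hand
sherlock = {
    "title": {"the", "complete", "sherlock", "holmes"},
    "author": {"arthur", "conan", "doyle"},
    "characters": {"sherlock", "holmes", "dr", "watson", "g", "lestrade"},
}
# An invented document that only shares the author terms
other_book = {
    "title": {"the", "lost", "world"},
    "author": {"arthur", "conan", "doyle"},
    "characters": {"professor", "challenger"},
}
terms = ["complete", "conan", "doyle"]

sherlock_and = cross_fields_matches(sherlock, terms, "and")
other_and = cross_fields_matches(other_book, terms, "and")
other_or = cross_fields_matches(other_book, terms, "or")
print(sherlock_and, other_and, other_or)
```

The sample book passes the AND check because complete is found in the title while conan and doyle are found in the author field. The invented second document fails the AND check (no field contains complete) but still passes with OR.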

One more thing that is taken care of when using the cross_fields type is the problem of different term frequencies for each field. Elasticsearch handles that by blending the term frequencies for all the fields that are mentioned in a query. To put it simply, Elasticsearch gives almost the same weight to all the terms in the fields that are used in a query.

Most fields matching

Another type of multi_match configuration is the most_fields type. As the official documentation states, it was designed to help run queries against documents that contain the same text analyzed in different ways. One example is having text in multiple languages stored in different fields. For example, if we would like to search for books that have the terms die leiden in their title or original title, we could run the following query:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "query" : {
  "multi_match" : {
   "query" : "Die Leiden",
   "fields" : [ "title", "otitle" ],
   "type" : "most_fields"
  }
 }
}'

Internally, the preceding request would be translated to the following query:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "query" : {
  "bool" : {
   "should" : [
    {
     "match" : {
      "title" : "die leiden"
     }
    },
    {
     "match" : {
      "otitle" : "die leiden"
     }
    }
   ]
  }
 }
}'

The resulting documents are given a score equal to the sum of scores from each match query divided by the number of matching match clauses.

Phrase matching

The phrase matching is very similar to the best_fields matching we already discussed. However, instead of translating the query using match queries, it uses match_phrase queries. Let's take a look at the following query:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "query" : {
  "multi_match" : {
   "query" : "sherlock holmes",
   "fields" : [ "title", "author" ],
   "type" : "phrase"
  }
 }
}'

Because we use the phrase matching, it would be translated into the following:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "query" : {
  "dis_max" : {
   "queries" : [
    {
     "match_phrase" : {
      "title" : "sherlock holmes"
     }
    },
    {
     "match_phrase" : {
      "author" : "sherlock holmes"
     }
    }
   ]
  }
 }
}'

Phrase with prefixes matching

This works exactly like the phrase matching, but instead of the match_phrase query, the match_phrase_prefix query is used. Let's assume we run the following query:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "query" : {
  "multi_match" : {
   "query" : "sherlock hol",
   "fields" : [ "title", "author" ],
   "type" : "phrase_prefix"
  }
 }
}'

What Elasticsearch would do internally is run a query similar to the following one:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "query" : {
  "dis_max" : {
   "queries" : [
    {
     "match_phrase_prefix" : {
      "title" : "sherlock hol"
     }
    },
    {
     "match_phrase_prefix" : {
      "author" : "sherlock hol"
     }
    }
   ]
  }
 }
}'
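The difference between the phrase and phrase_prefix types can be illustrated with a small Python sketch that works on already tokenized field values. It is a simplification under stated assumptions: the function name is hypothetical, terms must be adjacent and in order (a slop of 0), and the limit on how many terms the prefix may expand to is ignored.

```python
def phrase_matches(field_tokens, query_tokens, prefix_last=False):
    """Check whether query_tokens occur consecutively and in order.

    With prefix_last=True the final query token only has to be a prefix
    of the token found at that position (phrase_prefix behaviour).
    """
    size = len(query_tokens)
    if size == 0:
        return False
    for start in range(len(field_tokens) - size + 1):
        window = field_tokens[start:start + size]
        head_ok = window[:-1] == query_tokens[:-1]
        if prefix_last:
            last_ok = window[-1].startswith(query_tokens[-1])
        else:
            last_ok = window[-1] == query_tokens[-1]
        if head_ok and last_ok:
            return True
    return False


title = ["the", "complete", "sherlock", "holmes"]
author = ["arthur", "conan", "doyle"]

exact_full = phrase_matches(title, ["sherlock", "holmes"])
exact_partial = phrase_matches(title, ["sherlock", "hol"])
prefix_partial = phrase_matches(title, ["sherlock", "hol"], prefix_last=True)
wrong_order = phrase_matches(title, ["holmes", "sherlock"])
in_author = phrase_matches(author, ["sherlock", "hol"], prefix_last=True)
print(exact_full, exact_partial, prefix_partial, wrong_order, in_author)
```

The incomplete word hol only matches when the last token is treated as a prefix, which is what makes the phrase_prefix type useful for search-as-you-type scenarios.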

As you can see, by using the type property of the multi_match query, you can achieve different results without writing complicated queries. What's more, Elasticsearch will also take care of the scoring and the problems related to it.
