Until Elasticsearch 1.1, we had limited control over the multi_match query. Of course, we could specify the fields we wanted our query to be run against, and we could use disjunction max queries (by setting the use_dis_max property to true). Finally, we could inform Elasticsearch about the importance of each field by using boosting. Our example query run against multiple fields could look as follows:
curl -XGET 'localhost:9200/library/_search?pretty' -d '{ "query" : { "multi_match" : { "query" : "complete conan doyle", "fields" : [ "title^20", "author^10", "characters" ] } } }'
This is a simple query that will match documents having any of the given tokens in any of the mentioned fields. In addition, the boosts make the title field more important than the author field, which in turn is more important than the characters field.
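The field^boost notation used above is plain string syntax, so the request body can be assembled programmatically. The following is a minimal sketch in Python; the boosted_fields helper and its input format are hypothetical and only illustrate how the fields array from the example could be built:

```python
import json


def boosted_fields(boosts):
    """Build a multi_match "fields" list from a {field: boost} mapping.

    A boost of None means the field is listed without a boost.
    This helper is hypothetical; it is not part of any Elasticsearch client.
    """
    return [
        name if boost is None else f"{name}^{boost}"
        for name, boost in boosts.items()
    ]


# Same query as in the curl example above.
body = {
    "query": {
        "multi_match": {
            "query": "complete conan doyle",
            "fields": boosted_fields({"title": 20, "author": 10, "characters": None}),
        }
    }
}

# The JSON printed here can be passed to curl with the -d option.
print(json.dumps(body))
```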
Of course, we could also use the disjunction max query:
curl -XGET 'localhost:9200/library/_search?pretty' -d '{ "query" : { "multi_match" : { "query" : "complete conan doyle", "fields" : [ "title^20", "author^10", "characters" ], "use_dis_max" : true } } }'
But apart from the score calculation for the resulting documents, using disjunction max didn't change much.
With the release of Elasticsearch 1.1, the use_dis_max property was deprecated and Elasticsearch developers introduced a new property, type. This property allows control over how the multi_match query is internally executed. Let's now look at the possibilities of controlling how Elasticsearch runs queries against multiple fields.
To use the best fields type matching, one should set the type property of the multi_match query to best_fields. This type of multimatching will generate a match query for each field specified in the fields property, and it is best used for searching for multiple words in the same, best matching field. For example, let's look at the following query:
curl -XGET 'localhost:9200/library/_search?pretty' -d '{ "query" : { "multi_match" : { "query" : "complete conan doyle", "fields" : [ "title", "author", "characters" ], "type" : "best_fields", "tie_breaker" : 0.8 } } }'
The preceding query would be translated into a query similar to the following one:
curl -XGET 'localhost:9200/library/_search?pretty' -d '{ "query" : { "dis_max" : { "queries" : [ { "match" : { "title" : "complete conan doyle" } }, { "match" : { "author" : "complete conan doyle" } }, { "match" : { "characters" : "complete conan doyle" } } ], "tie_breaker" : 0.8 } } }'
If you looked at the results of both of the preceding queries, you would see the following:
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.033352755, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "3", "_score" : 0.033352755, "_source":{ "title": "The Complete Sherlock Holmes","author": "Arthur Conan Doyle","year": 1936,"characters": ["Sherlock Holmes","Dr. Watson", "G. Lestrade"],"tags": [],"copies": 0, "available" : false, "section" : 12} } ] } }
Both queries return exactly the same results and the same score for the document. One thing to remember is how the score is calculated. If the tie_breaker value is present, the score for each document is the score of the best matching field plus the scores of the other matching fields, each multiplied by the tie_breaker value. If the tie_breaker value is not present, the document is assigned a score equal to the score of the best matching field.
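The scoring rule just described can be written down as a small function. This is only a sketch of the arithmetic in Python: the dis_max_score name and the per-field scores are made up for illustration and are not values returned by Elasticsearch.

```python
def dis_max_score(field_scores, tie_breaker=0.0):
    """Combine per-field scores the way the text describes for dis_max.

    The best field contributes its full score; every other matching field
    contributes its score multiplied by tie_breaker. With tie_breaker left
    at 0.0, only the best field counts.
    """
    if not field_scores:
        return 0.0
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie_breaker * others


# Hypothetical scores for the title, author, and characters fields.
scores = [0.5, 0.25, 0.0]

print(dis_max_score(scores))                   # best field only
print(dis_max_score(scores, tie_breaker=0.8))  # 0.5 + 0.8 * 0.25
```

Setting tie_breaker to 1.0 in this sketch simply adds all the field scores together, which shows why values between 0 and 1 act as a dial between "only the best field matters" and "every matching field matters".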
There is one more question when it comes to the best_fields matching: what happens when we would like to use the AND operator or the minimum_should_match property? The answer is simple: the best_fields matching is translated into many match queries, and both the operator property and the minimum_should_match property are applied to each of the generated match queries. Because of that, a query as follows wouldn't return any documents in our case:
curl -XGET 'localhost:9200/library/_search?pretty' -d '{ "query" : { "multi_match" : { "query" : "complete conan doyle", "fields" : [ "title", "author", "characters" ], "type" : "best_fields", "operator" : "and" } } }'
This is because the preceding query would be translated into:
curl -XGET 'localhost:9200/library/_search?pretty' -d '{ "query" : { "dis_max" : { "queries" : [ { "match" : { "title" : { "query" : "complete conan doyle", "operator" : "and" } } }, { "match" : { "author" : { "query" : "complete conan doyle", "operator" : "and" } } }, { "match" : { "characters" : { "query" : "complete conan doyle", "operator" : "and" } } } ] } } }'
And the preceding query looks as follows on the Lucene level:
(+title:complete +title:conan +title:doyle) | (+author:complete +author:conan +author:doyle) | (+characters:complete +characters:conan +characters:doyle)
We don't have any document in the index that has the complete, conan, and doyle terms in a single field. However, if we would like the terms to be matched across different fields, we can use the cross-field matching.
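To see why the example book is not returned, the per-field AND semantics can be imitated in a few lines of Python. This is a simplified sketch, not how Elasticsearch evaluates the query: the tokens function is a rough stand-in for the analyzer (lowercasing and splitting on non-alphanumeric characters), scoring is ignored, and all function names are hypothetical.

```python
import re


def tokens(value):
    """Rough stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    values = value if isinstance(value, list) else [value]
    return {t for v in values for t in re.findall(r"[a-z0-9]+", v.lower())}


def best_fields_and_matches(doc, query, fields):
    """True if at least one single field contains every query term."""
    terms = tokens(query)
    return any(terms <= tokens(doc.get(field, "")) for field in fields)


# The document with identifier 3 from the example index.
book = {
    "title": "The Complete Sherlock Holmes",
    "author": "Arthur Conan Doyle",
    "characters": ["Sherlock Holmes", "Dr. Watson", "G. Lestrade"],
}
fields = ["title", "author", "characters"]

# No single field holds all three terms, so the document does not match.
print(best_fields_and_matches(book, "complete conan doyle", fields))  # False
# Both terms are present in the author field, so this one matches.
print(best_fields_and_matches(book, "conan doyle", fields))  # True
```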
The cross_fields type matching is perfect when we want all the terms from the query to be found in the mentioned fields inside the same document. Let's recall our previous query, but this time, instead of the best_fields matching, let's use the cross_fields matching type:
curl -XGET 'localhost:9200/library/_search?pretty' -d '{ "query" : { "multi_match" : { "query" : "complete conan doyle", "fields" : [ "title", "author", "characters" ], "type" : "cross_fields", "operator" : "and" } } }'
This time, the results returned by Elasticsearch were as follows:
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.08154379, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "3", "_score" : 0.08154379, "_source":{ "title": "The Complete Sherlock Holmes","author": "Arthur Conan Doyle","year": 1936,"characters": ["Sherlock Holmes","Dr. Watson", "G. Lestrade"],"tags": [],"copies": 0, "available" : false, "section" : 12} } ] } }
This is because our query was translated into the following Lucene query:
+(title:complete author:complete characters:complete) +(title:conan author:conan characters:conan) +(title:doyle author:doyle characters:doyle)
The results will only contain documents in which every term is found in at least one of the mentioned fields. Of course, this is only the case when we use the AND Boolean operator. With the OR operator, we will get documents having at least a single match in any of the fields.
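The term-centric behavior shown in the Lucene query above can be imitated with a short Python sketch. It treats all the listed fields as one combined bag of terms, which is a simplification: the tokens function is only a rough stand-in for the analyzer, scoring is ignored, and the function names are hypothetical.

```python
import re


def tokens(value):
    """Rough stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    values = value if isinstance(value, list) else [value]
    return {t for v in values for t in re.findall(r"[a-z0-9]+", v.lower())}


def cross_fields_matches(doc, query, fields, operator="and"):
    """Check each query term against the terms of all listed fields together."""
    combined = set().union(*(tokens(doc.get(field, "")) for field in fields))
    hits = [term in combined for term in tokens(query)]
    return all(hits) if operator == "and" else any(hits)


# The document with identifier 3 from the example index.
book = {
    "title": "The Complete Sherlock Holmes",
    "author": "Arthur Conan Doyle",
    "characters": ["Sherlock Holmes", "Dr. Watson", "G. Lestrade"],
}
fields = ["title", "author", "characters"]

# "complete" comes from the title, "conan" and "doyle" from the author field.
print(cross_fields_matches(book, "complete conan doyle", fields, "and"))  # True
# "moriarty" is in none of the fields, so AND fails while OR still matches.
print(cross_fields_matches(book, "complete moriarty", fields, "and"))  # False
print(cross_fields_matches(book, "complete moriarty", fields, "or"))  # True
```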
One more thing that is taken care of when using the cross_fields type is the problem of different term frequencies for each field. Elasticsearch handles that by blending the term frequencies for all the fields that are mentioned in a query. To put it simply, Elasticsearch gives almost the same weight to all the terms in the fields that are used in a query.
Another type of multi_match configuration is the most_fields type. As the official documentation states, it was designed to help run queries against documents that contain the same text analyzed in different ways. One of the examples is having multiple languages in different fields. For example, if we would like to search for books that have the die leiden terms in their title or original title, we could run the following query:
curl -XGET 'localhost:9200/library/_search?pretty' -d '{ "query" : { "multi_match" : { "query" : "Die Leiden", "fields" : [ "title", "otitle" ], "type" : "most_fields" } } }'
Internally, the preceding request would be translated to the following query:
curl -XGET 'localhost:9200/library/_search?pretty' -d '{ "query" : { "bool" : { "should" : [ { "match" : { "title" : "die leiden" } }, { "match" : { "otitle" : "die leiden" } } ] } } }'
The resulting documents are given a score equal to the sum of scores from each match query divided by the number of matching match clauses.
The phrase matching is very similar to the best_fields matching we already discussed. However, instead of translating the query using match queries, it uses match_phrase queries. Let's take a look at the following query:
curl -XGET 'localhost:9200/library/_search?pretty' -d '{ "query" : { "multi_match" : { "query" : "sherlock holmes", "fields" : [ "title", "author" ], "type" : "phrase" } } }'
Because we use the phrase matching, it would be translated into the following:
curl -XGET 'localhost:9200/library/_search?pretty' -d '{ "query" : { "dis_max" : { "queries" : [ { "match_phrase" : { "title" : "sherlock holmes" } }, { "match_phrase" : { "author" : "sherlock holmes" } } ] } } }'
The phrase_prefix matching works exactly like the phrase matching, but instead of the match_phrase query, the match_phrase_prefix query is used. Let's assume we run the following query:
curl -XGET 'localhost:9200/library/_search?pretty' -d '{ "query" : { "multi_match" : { "query" : "sherlock hol", "fields" : [ "title", "author" ], "type" : "phrase_prefix" } } }'
What Elasticsearch would do internally is run a query similar to the following one:
curl -XGET 'localhost:9200/library/_search?pretty' -d '{ "query" : { "dis_max" : { "queries" : [ { "match_phrase_prefix" : { "title" : "sherlock hol" } }, { "match_phrase_prefix" : { "author" : "sherlock hol" } } ] } } }'
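The translations shown for the best_fields, phrase, and phrase_prefix types all follow the same pattern: one inner query per field, wrapped in a dis_max query. The following Python sketch mirrors that rewriting on plain dictionaries. It is an illustration of the pattern described in the text, not Elasticsearch's actual implementation; the expand_to_dis_max function is hypothetical, and boosts, operator, and minimum_should_match are deliberately not handled.

```python
# Inner query generated for each field, per multi_match type (as described above).
INNER_QUERY = {
    "best_fields": "match",
    "phrase": "match_phrase",
    "phrase_prefix": "match_phrase_prefix",
}


def expand_to_dis_max(multi_match):
    """Rewrite a multi_match body into the equivalent-looking dis_max body."""
    inner = INNER_QUERY[multi_match["type"]]
    queries = [
        {inner: {field: multi_match["query"]}}
        for field in multi_match["fields"]
    ]
    dis_max = {"queries": queries}
    # tie_breaker, when given, is carried over to the dis_max query.
    if "tie_breaker" in multi_match:
        dis_max["tie_breaker"] = multi_match["tie_breaker"]
    return {"dis_max": dis_max}


expanded = expand_to_dis_max({
    "query": "sherlock hol",
    "fields": ["title", "author"],
    "type": "phrase_prefix",
})
print(expanded)
```

For the phrase_prefix request above, the sketch produces two match_phrase_prefix clauses, one for title and one for author, matching the structure of the translated query shown in the text.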
As you can see, by using the type property of the multi_match query, you can achieve different results without the need to write complicated queries. What's more, Elasticsearch will also take care of the scoring and problems related to it.