So, we now know what an ElasticSearch query is, how to construct it, and finally, how to send it using an HTTP request. What we don't know yet is what kind of queries ElasticSearch exposes, and thus, what we can use in order to achieve the desired results. In the next few pages of this chapter, we will try to learn which basic queries ElasticSearch allows us to use and what we can do with them.
The term query is one of the simplest queries in ElasticSearch; it matches any document that has the given term in a given field. You are already familiar with this query type because we have used it before, but we repeat it here to keep all the query types in one place. The simplest term query is as follows:
{ "query" : { "term" : { "title" : "crime" } } }
It will match the documents that have the term "crime" in the title field. Please remember that the term query is not analyzed, so you need to provide the exact term that will match the term in the indexed document. You can also include the boost attribute in your term query; this will affect the importance of the given term. For example, if we wanted to change our previous query and give our term query a boost of 10.0, we would send the following query:
{ "query" : { "term" : { "title" : { "value" : "crime", "boost" : 10.0 } } } }
As you can see, the query changes a bit. Instead of a simple term value, we nest a new JSON object, which contains the value property and the boost property. The value property should contain the term we are interested in, and the boost property is the boost value we want to use.
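The two forms of the term query can be generated from one small helper. The following Python sketch only builds the request body as a dictionary; the function name term_query is hypothetical and is not part of any ElasticSearch client library.

```python
import json


def term_query(field, value, boost=None):
    """Build a term query body (hypothetical helper, for illustration only)."""
    if boost is None:
        # Short form: the field name maps directly to the term.
        clause = {field: value}
    else:
        # Extended form: a nested object with the value and boost properties.
        clause = {field: {"value": value, "boost": boost}}
    return {"query": {"term": clause}}


# Print both forms as JSON request bodies.
print(json.dumps(term_query("title", "crime")))
print(json.dumps(term_query("title", "crime", boost=10.0)))
```

The resulting JSON can be sent as the request body in the same way as the hand-written queries.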
The terms query allows us to match documents that have certain terms in their contents. For example, let's say that we want to get all the documents that have the terms "novel" or "book" in the tags field. To achieve that, we could run the following query:
{ "query" : { "terms" : { "tags" : [ "novel", "book" ], "minimum_match" : 1 } } }
The preceding query returns all the documents that have one or both of the searched terms in the tags field. Why is that? Because we set the minimum_match property to 1, which means that at least one term should be matched. If we wanted the query to match only documents with both of the provided terms, we would set the minimum_match property to 2.
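The effect of minimum_match can be illustrated with a toy model in Python. This is only a sketch of the counting logic described above, not how ElasticSearch evaluates the query internally; the function name terms_match and the sample tag lists are made up for the example.

```python
def terms_match(doc_terms, query_terms, minimum_match=1):
    """Return True if at least `minimum_match` query terms occur in the document.

    Toy model of the terms query; both arguments are plain lists of
    already-indexed (not analyzed) terms.
    """
    matched = len(set(doc_terms) & set(query_terms))
    return matched >= minimum_match


tags = ["novel", "classics"]
print(terms_match(tags, ["novel", "book"], minimum_match=1))
print(terms_match(tags, ["novel", "book"], minimum_match=2))
```

With minimum_match set to 1 the sample document matches because it has the "novel" tag; with 2 it does not, because "book" is missing.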
The match query takes the values given in the query parameter, analyzes them, and constructs the appropriate query out of them. When using a match query, ElasticSearch will choose the proper analyzer for the field we've chosen, so you can be sure that the terms passed to the match query will be processed by the same analyzer that was used during indexing. Please remember that the match query (and, as explained further on, the multi match query) doesn't support Lucene query syntax (discussed in the query string query section, later in this chapter); however, it fits perfectly as a query handler for your search box. The simplest (and the default) match query can look like this:
{ "query" : { "match" : { "title" : "crime and punishment" } } }
The preceding query would match all the documents that have the terms "crime" or "and" or "punishment" in the title field. However, the preceding query is only the simplest one; there are multiple types of match query. They are covered in the following sections.
The Boolean match query is a query that analyzes the provided text and makes a Boolean query out of it. There are a few parameters that allow us to control the behavior of Boolean match queries:
operator: This can take the value "or" or "and" and controls which Boolean operator is used to connect the created Boolean clauses. The default value is "or".
analyzer: This specifies the name of the analyzer that will be used to analyze the query text and defaults to the default analyzer.
fuzziness: Providing a value for this parameter allows one to construct fuzzy queries. It should take values from 0.0 to 1.0 for a string object. While constructing fuzzy queries, this parameter will be used to set the similarity.
prefix_length: This allows one to control the behavior of the fuzzy query. For more information on the value of this parameter, please see the fuzzy like this query section in this chapter.
max_expansions: This allows one to control the behavior of the fuzzy query. For more information on the value of this parameter, please see the fuzzy like this query section in this chapter.
To run a Boolean match query with the "and" operator against the title field, we could send a query like so:
{ "query" : { "match" : { "title" : { "query" : "crime and punishment", "operator" : "and" } } } }
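The difference between the "or" and "and" operators can be illustrated with a toy model in Python. The analyzer here is an assumption (lowercasing and splitting on whitespace), which is much simpler than the analyzers ElasticSearch actually applies, and the function names are hypothetical.

```python
def analyze(text):
    """Toy analyzer (an assumption): lowercase the text and split on whitespace."""
    return text.lower().split()


def match(field_text, query_text, operator="or"):
    """Toy model of a Boolean match query against a single field value."""
    doc_terms = set(analyze(field_text))
    hits = [term in doc_terms for term in analyze(query_text)]
    # "and" requires every query term; "or" requires at least one.
    return all(hits) if operator == "and" else any(hits)


print(match("Crime Scene", "crime and punishment"))
print(match("Crime Scene", "crime and punishment", operator="and"))
```

With the default "or" operator a title containing only "crime" is enough for a match; with "and" every analyzed term of the query text has to be present.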
A phrase match query is similar to the Boolean query, but instead of constructing the Boolean clauses from the analyzed text, it constructs a phrase query. The following parameters are available:
slop: This is an integer value that defines how many unknown words can be put between terms in the text query for a match to be considered a phrase.
analyzer: This specifies the name of the analyzer that will be used to analyze the query text and defaults to the default analyzer.
A sample phrase match query against the title field could look like the following code:
{ "query" : { "match_phrase" : { "title" : { "query" : "crime and punishment", "slop" : 1 } } } }
The last type of the match query is the match phrase prefix query. This query is almost the same as the phrase match query, but in addition, it allows prefix matches on the last term in the query text. Also, in addition to the parameters exposed by the match phrase query, it exposes an additional one, the max_expansions parameter, which controls how many prefixes the last term will be rewritten to. Our sample query changed to the match phrase prefix query could look like this:
{ "query" : { "match_phrase_prefix" : { "title" : { "query" : "crime and punishment", "slop" : 1, "max_expansions" : 20 } } } }
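The idea of a phrase whose last term is treated as a prefix can be sketched in Python. This toy model makes several simplifying assumptions: it lowercases and splits on whitespace instead of using a real analyzer, and it ignores the slop and max_expansions parameters entirely. The function name is hypothetical.

```python
def phrase_prefix_match(field_text, query_text):
    """Toy model: consecutive terms must match, the last one only as a prefix."""
    doc = field_text.lower().split()
    query = query_text.lower().split()
    if not query:
        return False
    n = len(query)
    # Slide a window of n terms over the document terms.
    for start in range(len(doc) - n + 1):
        window = doc[start:start + n]
        if window[:-1] == query[:-1] and window[-1].startswith(query[-1]):
            return True
    return False


print(phrase_prefix_match("Crime and Punishment", "crime and pun"))
print(phrase_prefix_match("Crime and Punishment", "crime pun"))
```

The second call does not match in this model because no slop is allowed; a real query with slop set to 1 could tolerate the missing word.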
The multi match query is the same as the match query, but instead of running against a single field, it can be run against multiple fields with the use of the fields parameter. Of course, all the parameters you use with the match query can be used with the multi match query. So, if we want to modify our match query to be run against the title and otitle fields, we could run the following query:
{ "query" : { "multi_match" : { "query" : "crime punishment", "fields" : [ "title", "otitle" ] } } }
Unlike the other queries discussed so far, the query string query supports full Apache Lucene query syntax, so it uses a query parser to construct an actual query from the provided text. A sample query string query can look like this:
{ "query" : { "query_string" : { "query" : "title:crime^10 +title:punishment -otitle:cat +author:(+Fyodor +dostoevsky)", "default_field" : "title" } } }
You may wonder what that weird syntax in the query parameter is; we will get to it in the Lucene query syntax part of the query string query description.
As with most of the queries in ElasticSearch, the query string query provides a few parameters that allow us to control query behavior:
query: This specifies the query text.
default_field: This specifies the default field the query will be executed against. It defaults to the index.query.default_field property, which is by default set to _all.
default_operator: This specifies the default logical operator (or/and) used when no operator is specified. The default value of this parameter is or.
analyzer: This specifies the name of the analyzer used to analyze the query provided in the query parameter.
allow_leading_wildcard: This specifies whether a wildcard is allowed as the first character of a term; it defaults to true.
lowercase_expand_terms: This specifies whether terms rewritten by the query are lowercased. It defaults to true.
enable_position_increments: This specifies whether position increments are turned on in the result query. It defaults to true.
fuzzy_prefix_length: This is the prefix length for generated fuzzy queries, and it defaults to 0. To learn more about it, please look at the fuzzy query section.
fuzzy_min_sim: This specifies the minimum similarity for fuzzy queries and defaults to 0.5. To learn more about it, please look at the fuzzy query section.
phrase_slop: This specifies the phrase slop and defaults to 0. To learn more about it, please look at the phrase match query section.
boost: This is the boost value used and defaults to 1.0.
analyze_wildcard: This specifies whether terms containing wildcard characters should be analyzed. It defaults to false.
auto_generate_phrase_queries: This specifies whether phrase queries should be automatically generated. It defaults to false.
minimum_should_match: This controls how many of the generated Boolean clauses should match to consider a hit for a given document. The value should be provided as a percentage, for example 50%, which would mean that at least 50 percent of the given terms should match.
lenient: This parameter can take the value of true or false. If it is set to true, format-based failures will be ignored.
Please note that the query string query can be rewritten by ElasticSearch, and because of that, ElasticSearch allows us to pass additional parameters that control the rewrite method. For more details about that process, see the Query rewrite section later in this chapter.
As we have already discussed, Apache Lucene is the full text search library on top of which ElasticSearch is built. Because of that, some of the queries in ElasticSearch (such as the one currently discussed) support the Lucene query parser syntax, the language that allows you to construct queries. Let's take a look at it and discuss some of its basic features. To read about the full Lucene query syntax, please visit http://lucene.apache.org/core/3_6_1/queryparsersyntax.html.
A query we pass to Lucene is divided into terms and operators by the query parser. Let's start with the terms; there are two types of them, single terms and phrases. For example, to query the term "book" in the title field, we would pass the following query:
title:book
To query the phrase "elasticsearch book" in the title field, we would pass the following query:
title:"elasticsearch book"
You may have noticed the name of the field in the beginning and the term or phrase later.
As we have already said, Lucene query syntax supports operators. For example, the + operator tells Lucene that the given part must be matched against the document to consider that document a match, while the - operator is the opposite, which means that such a part of the query can't be present in the document. A part of the query without the + or - operator will be treated as a part of the query that can be matched, but it is not mandatory. So, if we wanted to find a document with the term "book" in the title field and without the term "cat" in the description field, we would pass the following query:
+title:book -description:cat
We can also group multiple terms with parentheses, for example, the following query:
title:(crime punishment)
We can also boost parts of the query with the ^ operator followed by the boost value, for example, the following query:
title:book^4
So now that we know the basics of the Lucene query syntax, let's get back to the query we sent using the query_string query. As you can see, we wanted to get the documents that may have the term "crime" in the title field, and such documents should be boosted with the value of 10. Next, we want only the documents that have the term "punishment" in the title field, and we don't want documents with the term "cat" in the otitle field. Finally, we tell Lucene that we only want the documents that have the terms "fyodor" and "dostoevsky" in the author field.
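To make the structure of that query text more tangible, the following Python sketch assembles it from individual clauses. The clause helper is hypothetical, and it performs no escaping of special characters, so it is only suitable for illustration.

```python
def clause(field, text, required=None, boost=None):
    """Format one Lucene query syntax clause (hypothetical helper).

    required=True adds the + operator, required=False adds the - operator,
    and required=None leaves the clause optional.
    """
    prefix = {True: "+", False: "-", None: ""}[required]
    part = f"{prefix}{field}:{text}"
    if boost is not None:
        # The ^ operator attaches a boost value to the clause.
        part += f"^{boost}"
    return part


query_text = " ".join([
    clause("title", "crime", boost=10),                        # optional, boosted
    clause("title", "punishment", required=True),              # must match
    clause("otitle", "cat", required=False),                   # must not match
    clause("author", "(+Fyodor +dostoevsky)", required=True),  # grouped terms
])
print(query_text)
```

The printed text is the same as the value of the query parameter in the sample query string query.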
It is possible to run the query string query against multiple fields. In order to do that, one needs to provide the fields parameter in the query body, which should hold an array of field names. There are two methods of running the query string query against multiple fields; the default method uses the Boolean query to make queries, and the other method can use the DisMax query.
DisMax is an abbreviation of Disjunction Max. The "Disjunction" part refers to the fact that the search is executed across multiple fields and the fields can be given different boost weights. The "Max" part means that only the maximum score for a given term will be included in the final document score, not the sum of all the scores from all the fields that have the matched term (which is what a simple Boolean query would do).
In order to use the DisMax query, one should add the use_dis_max property in the query body and set it to true. A sample query can look like this:
{ "query" : { "query_string" : { "query" : "crime punishment", "fields" : [ "title", "otitle" ], "use_dis_max" : true } } }
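The difference between the two scoring approaches can be shown with a toy calculation in Python. The per-field scores below are made-up numbers, and real scoring involves more factors than this sketch covers; it only contrasts summing with taking the maximum.

```python
def boolean_score(field_scores):
    """Toy Boolean combination: add up the scores from all matching fields."""
    return sum(field_scores.values())


def dis_max_score(field_scores):
    """Toy DisMax combination: keep only the best-scoring field."""
    return max(field_scores.values(), default=0.0)


# Hypothetical scores of one term in two fields of the same document.
scores = {"title": 2.0, "otitle": 1.0}
print(boolean_score(scores))
print(dis_max_score(scores))
```

In this model a document that repeats the same term in many fields gains nothing extra under DisMax, while the Boolean sum keeps growing with every matching field.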
The field query is a simplified version of the query string query that we just discussed. For example, let's say we would like to find all the documents that have the term "crime" in the title field, that may have the term "nothing", and that don't have the term "let" in the same field. For that, we could run the following query:
{ "query" : { "field" : { "title" : "+crime nothing -let" } } }
You can also use all the properties that apply to the query string query. To do that, we should wrap all the parameters in the field name and pass the actual query in the query parameter. So, the preceding query with the boost parameter added would look like this:
{ "query" : { "field" : { "title" : { "query" : "+crime nothing -let", "boost" : 20.0 } } } }
The identifiers query is a simple query that filters the returned documents to only those with the provided identifiers. It works on the internal _uid field, so it doesn't require the _id field to be enabled. The simplest version of such a query could look like this:
{ "query" : { "ids" : { "values" : [ "10", "11", "12", "13" ] } } }
This query would only return documents that have one of the identifiers present in the values array. We can complicate the identifiers query a bit and also limit the documents on the basis of their type. For example, if we want to only include documents from the book type, we could send the following query:
{
"query" : {
"ids" : {
"type" : "book",
"values" : [ "10", "11", "12", "13" ]
}
}
}
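Both variants of the identifiers query can be produced by one small Python helper. This is only a sketch that builds the request body; the name ids_query is hypothetical and not part of any client library.

```python
import json


def ids_query(values, doc_type=None):
    """Build an identifiers query body (hypothetical helper)."""
    # Identifiers are sent as strings, as in the sample queries.
    body = {"values": [str(v) for v in values]}
    if doc_type is not None:
        # Optionally restrict the query to a single document type.
        body["type"] = doc_type
    return {"query": {"ids": body}}


print(json.dumps(ids_query([10, 11, 12, 13])))
print(json.dumps(ids_query([10, 11, 12, 13], doc_type="book")))
```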
The prefix query is similar to the term query in terms of configuration and to the multi term query when looking into its logic. The prefix query allows us to match documents that have a value in a certain field that starts with a given prefix. For example, if we want to find all the documents that have values starting with cri in the title field, we could run the following query:
{ "query" : { "prefix" : { "title" : "cri" } } }
As with the term query, you can also include the boost attribute with your prefix query; this will affect the importance of the given prefix. For example, if we wanted to change our previous query and give it a boost of 3.0, we would send the following query:
{ "query" : { "prefix" : { "title" : { "value" : "cri", "boost" : 3.0 } } } }
The fuzzy like this query is similar to the more like this query. It finds all the documents that are similar to the provided text but works a bit differently from the more like this query because it makes use of fuzzy strings and picks the best differencing terms produced. For example, if we want to run a fuzzy like this query against the title and otitle fields and find all the documents similar to the crime punishment query, we could run the following query:
{ "query" : { "fuzzy_like_this" : { "fields" : ["title", "otitle"], "like_text" : "crime punishment" } } }
The following query parameters are supported:
fields: This is an array of fields that the query should be run against. It defaults to the _all field.
like_text: This is a required parameter that holds the text we compare the documents to.
ignore_tf: This specifies whether term frequencies should be ignored; this parameter defaults to false.
max_query_terms: This specifies the maximum number of query terms that will be included in a generated query. It defaults to 25.
min_similarity: This specifies the minimum similarity that differencing terms should have. It defaults to 0.5.
prefix_length: This specifies the length of the common prefix of the differencing terms. It defaults to 0.
boost: This is the boost value that will be used when boosting queries. It defaults to 1.
analyzer: This specifies the name of the analyzer that will be used to analyze the text we provided.
The fuzzy like this field query is similar to the fuzzy like this query but works only against a single field, and because of that, it doesn't support the fields property. Instead of specifying the fields that should be used for query analysis, we should wrap the query parameters into the field name. Our sample query to a title field should look like the following code:
{ "query" : { "fuzzy_like_this_field" : { "title" : { "like_text" : "crime and punishment" } } } }
All the other parameters from the fuzzy like this query work the same for this type of query.
The third type of fuzzy query matches documents on the basis of the edit distance algorithm that is calculated on the terms we provide against the searched documents. This query can be expensive when it comes to CPU resources but can help us when we need fuzzy matching, for example, when users make spelling mistakes. In our example, let's assume that, instead of crime, our user enters cirme into the search box and we would like to run the simplest form of fuzzy query. Such a query could look like this:
{
"query" : {
"fuzzy" : {
"title" : "cirme"
}
}
}
And the response for this query would be as follows:
{ "took" : 2, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.625, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "4", "_score" : 0.625, "_source" : { "title": "Crime and Punishment","otitle": "Преступлéние и наказáние","author": "Fyodor Dostoevsky","year": 1886,"characters": ["Raskolnikov", "Sofia Semyonovna Marmeladova"],"tags": [],"copies": 0, "available" : true} } ] } }
As you can see, even though we made a typo, ElasticSearch managed to find the document we were interested in.
You can control the fuzzy query behavior by using the following parameters:
value: This specifies the actual query (in case we want to pass more parameters).
boost: This specifies the boost value for the query. It defaults to 1.0.
min_similarity: This specifies the minimum similarity for a term to be counted as a match. In the case of string fields, this value should be between 0 and 1, both inclusive. For numeric fields, this value can be greater than one; for example, for a query with value equal to 20 and min_similarity set to 3, we would get values from 17 to 23. For date fields, we can have min_similarity values such as 1d, 2d, and 1m, which correspond to one day, two days, and one month, respectively.
prefix_length: This is the length of the common prefix of the differencing terms, which defaults to 0.
max_expansions: This specifies the number of terms the query will be expanded to. The default value is unbounded.
The parameters should be wrapped in the name of the field we are running the query against. So, if we would like to modify the previous query and add additional parameters, the query could look like the following code:
{ "query" : { "fuzzy" : { "title" : { "value" : "cirme", "min_similarity" : 0.2 } } } }
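To see why the misspelled cirme can still find crime, it helps to compute the edit distance by hand. The following Python sketch implements the classic Levenshtein distance and turns it into a similarity value. The similarity formula (one minus the distance divided by the length of the shorter term) is an assumption modelled on older Lucene versions and may not match what a given ElasticSearch version computes.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed similarity formula: 1 - distance / length of the shorter term."""
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 0.0
    return 1.0 - edit_distance(a, b) / shortest


print(edit_distance("cirme", "crime"))
print(similarity("cirme", "crime"))
```

Under this assumption the two swapped letters count as two substitutions, giving a similarity of about 0.6, which is above both the 0.5 default and the 0.2 threshold used in the sample query.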
The match all query is a simple query that matches all documents in the index. So, to match all the documents, we would run the following query:
{ "query" : { "match_all" : {} } }
If we want to use index-time boosting for some field and want the match all query to take the index-time boosts into consideration during query execution, we can add the norms_field property with the value of the boosted field. For example, if we boost the title field during indexing and want the match all query to influence the score of the documents with it, the following query will have to be run:
{ "query" : { "match_all" : { "norms_field" : "title" } } }
The wildcard query is a query that allows us to use the * and ? wildcards in the values we search for. Apart from that, the wildcard query is very similar to the term query in its body. To send a query that will match all the documents with the value of the term cr?me, with ? meaning any single character, we will use:
{ "query" : { "wildcard" : { "title" : "cr?me" } } }
It will match the documents that have any of the terms matching cr?me in the title field. You can also include the boost attribute with your wildcard query, which will affect the importance of each term that matches the given value. For example, if we want to change our previous query and give our wildcard query a boost of 20.0, we will send the following query:
{ "query" : { "wildcard" : { "title" : { "value" : "cr?me", "boost" : 20.0 } } } }
Please note that wildcard queries are not very performance oriented and should be avoided if possible; leading wildcards (terms starting with wildcards) should especially be avoided.
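The matching rules for * and ? can be tried out locally with Python's fnmatch module, which uses the same two wildcards. This is only an approximation for illustration: fnmatch additionally treats square brackets as character sets, which the wildcard query does not, and the list of indexed terms below is made up.

```python
from fnmatch import fnmatchcase


def wildcard_terms(pattern, indexed_terms):
    """Return the terms matching the pattern (* = any run, ? = one character)."""
    return [term for term in indexed_terms if fnmatchcase(term, pattern)]


# Hypothetical terms from the title field of some index.
terms = ["crime", "creme", "cream", "crme", "crimea"]
print(wildcard_terms("cr?me", terms))
print(wildcard_terms("cr*", terms))
```

The sketch also hints at why such queries are costly: every candidate term has to be compared against the pattern, and a leading wildcard means that no common prefix can be used to narrow down the candidates.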
Please note that the wildcard query is rewritten by ElasticSearch, and because of that, ElasticSearch allows us to pass an additional parameter, controlling the rewrite method. However, for more details about that process, please go to the Query rewrite section later in this chapter.
The more like this query allows us to get documents that are similar to the provided text. ElasticSearch supports a few parameters that define how more like this queries should work:
fields: This is an array of fields that the query should be run against. It defaults to the _all field.
like_text: This is a required parameter that holds the text to which we compare the documents.
percent_terms_to_match: This specifies the percentage of terms that must match for a document to be considered similar. It defaults to 0.30, which translates to 30 percent.
min_term_freq: This is the minimum term frequency (for the terms in the documents) below which terms will be ignored. It defaults to 2.
max_query_terms: This is the maximum number of terms that will be included in a generated query; it defaults to 25.
stop_words: This specifies an array of words that will be ignored.
min_doc_freq: This specifies the minimum number of documents in which terms have to be present in order not to be ignored. It defaults to 5.
max_doc_freq: This specifies the maximum number of documents in which terms may be present in order not to be ignored. By default, it is unbounded.
min_word_len: This specifies the minimum length of a single word below which it will be ignored. It defaults to 0.
max_word_len: This specifies the maximum length of a single word above which it will be ignored. By default, it is unbounded.
boost_terms: This specifies the boost value that will be used when boosting each term; it defaults to 1.
boost: This specifies the boost value that will be used when boosting a query. It defaults to 1.
analyzer: This is the name of the analyzer that will be used to analyze the text we provided.
An example more like this query could look like this:
{ "query" : { "more_like_this" : { "fields" : [ "title", "otitle" ], "like_text" : "crime and punishment", "min_term_freq" : 1, "min_doc_freq" : 1 } } }
The more like this field query is similar to the more like this query but works only against a single field, and because of that, it doesn't support the fields property. Instead of specifying fields that should be used for query analysis, we should wrap query parameters into the field name. So, our example query to a title field would look like the following code:
{ "query" : { "more_like_this_field" : { "title" : { "like_text" : "crime and punishment", "min_term_freq" : 1, "min_doc_freq" : 1 } } } }
All the other parameters from the more like this query work the same for this type of query.
The range query allows us to find documents within a certain range and works for numerical fields as well as for string-based fields (it just maps to a different Apache Lucene query). The range query should be run against a single field, and the query parameters should be wrapped in the field name. The following parameters are supported:
from: This is the lower bound of the range and defaults to the first value.
to: This is the upper bound of the range and defaults to unbounded.
include_lower: This specifies whether the lower bound of the range is inclusive. It defaults to true.
include_upper: This specifies whether the upper bound of the range is inclusive. It defaults to true.
boost: This specifies the boost that will be given for the query.
So, for example, if we would like to find all the books that have values ranging from 1700 to 1900 in the year field, we could run the following query:
{ "query" : { "range" : { "year" : { "from" : 1700, "to" : 1900 } } } }
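The interplay of the bounds and the two include flags can be written down as a small predicate. The following Python function is a toy model of the range semantics described above, not ElasticSearch code; the parameter is named from_ because from is a reserved word in Python.

```python
def in_range(value, from_=None, to=None, include_lower=True, include_upper=True):
    """Toy model of the range query check for a single field value."""
    if from_ is not None:
        # Below the lower bound, or exactly on it when it is exclusive.
        if value < from_ or (value == from_ and not include_lower):
            return False
    if to is not None:
        # Above the upper bound, or exactly on it when it is exclusive.
        if value > to or (value == to and not include_upper):
            return False
    return True


print(in_range(1886, from_=1700, to=1900))
print(in_range(1900, from_=1700, to=1900, include_upper=False))
```

A missing bound is treated as unbounded, so in_range(2000, from_=1700) is true in this model.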
In some cases, ElasticSearch must rewrite your query into another query to allow efficient query execution. This happens, for example, with the prefix query; behind the scenes, ElasticSearch changes the prefix query to a logical disjunction of all possible tokens with this prefix. Because of the rewriting process, ElasticSearch will set a static score equal to the query boost for each of the documents returned by such queries, but we can change that.
In order to control query rewriting, we need to add the rewrite property to our query with one of the following values:
scoring_boolean: This rewrite method translates each generated term into a Boolean should clause. This method may be CPU-intensive (because the score for each term is calculated and kept), and for queries that have many terms, it may exceed the Boolean query limit.
constant_score_boolean: This is similar to scoring_boolean, but less CPU-intensive because scoring is not computed, and instead, each term receives a score equal to the query boost.
constant_score_filter: This method rewrites the query using a filter for each generated term and marks all the documents for that filter. Matching documents are given a constant score equal to the query boost.
top_terms_N: This rewrite method translates each generated term into a Boolean should clause, but keeps only the N top scoring terms. Scoring is calculated and maintained for each query.
top_terms_boost_N: This rewrite method translates each generated term into a Boolean should clause, but keeps only the N top scoring terms. Scoring is calculated as the boost given for the query.
When the rewrite property is not set, it defaults to either constant_score_boolean or constant_score_filter, depending on the query.
So our example prefix query with the rewrite property could look like the following code:
{ "query" : { "prefix" : { "title" : { "value" : "cri", "boost" : 3.0, "rewrite" : "top_terms_10" } } } }
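The rewriting of a prefix query into a disjunction of terms can be pictured with a toy Python model. It assumes that the indexed terms of the field are available as a simple set (the term_dictionary argument is hypothetical), and it ignores scoring and the individual rewrite methods entirely.

```python
def rewrite_prefix(field, prefix, term_dictionary):
    """Toy rewrite: expand a prefix into one should clause per matching term."""
    expanded = sorted(t for t in term_dictionary if t.startswith(prefix))
    # Each expanded term becomes its own term clause in a Boolean disjunction.
    return {"bool": {"should": [{"term": {field: t}} for t in expanded]}}


# Hypothetical terms indexed in the title field.
title_terms = {"crime", "crisis", "critic", "cat", "punishment"}
print(rewrite_prefix("title", "cri", title_terms))
```

The more terms share the prefix, the more clauses the rewritten query contains, which is why the rewrite methods differ in whether and how scores are computed for each term.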