Basic queries

So, we now know what an ElasticSearch query is, how to construct it, and finally, how to send it using an HTTP request. What we don't know yet is what kind of queries ElasticSearch exposes, and thus, what we can use in order to achieve the desired results. In the next few pages of this chapter, we will try to learn which basic queries ElasticSearch allows us to use and what we can do with them.

The term query

The term query is one of the simplest queries in ElasticSearch and just matches any document that has a term in a given field. You are already familiar with this query type because we have used it before, but we repeat it here to have all the query types in one place. The simplest term query is as follows:

{
 "query" : {
  "term" : {
   "title" : "crime"
  }
 }
}

It will match the documents that have the term "crime" in the title field. Please remember that the term query is not analyzed, so you need to provide the exact term that will match the term in the indexed document. However, you can also include the boost attribute in your term query; this will affect the importance of the given term. For example, if we wanted to change our previous query and give our term query a boost of 10.0, we would send the following query:

{
 "query" : {
  "term" : {
   "title" : {
    "value" : "crime",
    "boost" : 10.0
   }
  }
 }
}

As you can see, the query changes a bit. Instead of a simple term value we nest a new JSON object, which contains the value property and the boost property. The value of the value property should contain the term we are interested in and the boost property is the boost value we want to use.
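To make the "not analyzed" remark more concrete, here is a minimal Python sketch. It is only a toy model, not ElasticSearch code: the analyze function (whitespace splitting plus lowercasing) is an assumed stand-in for whatever analyzer the title field really uses, and term_query_matches is a hypothetical helper name.

```python
def analyze(text):
    # Assumed stand-in for the index-time analyzer: split on whitespace, lowercase.
    return [token.lower() for token in text.split()]


def term_query_matches(indexed_terms, term):
    # The term from the query is compared as-is; no analysis is applied to it.
    return term in indexed_terms


indexed = analyze("Crime and Punishment")
print(term_query_matches(indexed, "crime"))  # the term exactly as stored in the index
print(term_query_matches(indexed, "Crime"))  # capitalized, so it is not found
```

Under these assumptions, only the lowercase term matches, which is why the term passed to the term query has to look exactly like the term stored in the index.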

The terms query

This is a query that allows us to match documents that have certain terms in their contents. For example, let's say that we want to get all the documents that have the terms "novel" or "book" in the tags field. To achieve that, we could run the following query:

{
 "query" : {
  "terms" : {
   "tags" : [ "novel", "book" ],
   "minimum_match" : 1
  }
 }
}

The preceding query returns all the documents that have one or both of the searched terms in the tags field. Why is that? Because we set the minimum_match property to 1, which basically means that at least one term should be matched. If we wanted the query to match only documents with both of the provided terms, we would set the minimum_match property to 2.
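The effect of minimum_match can be sketched in a few lines of Python. This is a simplified model of the matching rule only (no scoring), and terms_query_matches is a hypothetical helper name used for illustration:

```python
def terms_query_matches(field_terms, query_terms, minimum_match=1):
    # Count how many distinct query terms are present in the field.
    matched = sum(1 for term in set(query_terms) if term in field_terms)
    return matched >= minimum_match


tags = {"novel", "classic"}
print(terms_query_matches(tags, ["novel", "book"], minimum_match=1))  # one term is enough
print(terms_query_matches(tags, ["novel", "book"], minimum_match=2))  # both terms required
```

With minimum_match set to 1, a document tagged only with "novel" is a hit; with 2, it is not.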

The match query

The match query takes the values given in the query parameter, analyzes them, and constructs the appropriate query out of them. When using a match query, ElasticSearch will choose the proper analyzer for the field we've chosen, so you can be sure that the terms passed to the match query will be processed by the same analyzer that was used during indexing. Please remember that the match query (and, as explained further on, the multi match query) doesn't support Lucene query syntax (discussed in The query string query section, later in this chapter); however, it fits perfectly as a query handler for your search box. The simplest (and the default) match query can look like this:

{
 "query" : {
  "match" : {
   "title" : "crime and punishment"
  }
 }
}

The preceding query would match all the documents that have the terms "crime" or "and" or "punishment" in the title field. However, the preceding query is only the simplest one; there are multiple types of match query. They are covered in the following sections.

The Boolean match query

The Boolean match query is a query that analyzes the provided text and makes a Boolean query out of it. There are a few parameters that allow us to control the behavior of Boolean match queries:

  • operator: This can take the value of or or and, and controls which Boolean operator is used to connect the created Boolean clauses. The default value is or.
  • analyzer: This specifies the name of the analyzer that will be used to analyze the query text and defaults to the default analyzer.
  • fuzziness: Providing the value of this parameter allows one to construct fuzzy queries. It should take values from 0.0 to 1.0 for a string object. While constructing fuzzy queries, this parameter will be used to set the similarity.
  • prefix_length: This allows one to control the behavior of the fuzzy query. For more information on the value of this parameter, please see The fuzzy like this query section in this chapter.
  • max_expansions: This allows one to control the behavior of the fuzzy query. For more information on the value of this parameter, please see The fuzzy like this query section in this chapter.

The parameters should be wrapped in the name of the field we are running the query against. So, if we wanted to run a sample Boolean match query against the title field, we could send a query like so:

{
 "query" : {
  "match" : {
   "title" : {
    "query" : "crime and punishment",
    "operator" : "and"
   }
  }
 }
}
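The difference between the or and and operators can be illustrated with a small Python sketch. It assumes a very simple analyzer (lowercasing and whitespace splitting) and ignores scoring; analyze and match_query are hypothetical names, not part of any ElasticSearch API:

```python
def analyze(text):
    # Assumed toy analyzer: lowercase and split on whitespace.
    return text.lower().split()


def match_query(field_text, query_text, operator="or"):
    doc_terms = set(analyze(field_text))
    hits = [term in doc_terms for term in analyze(query_text)]
    # "and" requires every analyzed term, "or" requires at least one.
    return all(hits) if operator == "and" else any(hits)


print(match_query("Crime Scene", "crime and punishment"))                  # or: one term matches
print(match_query("Crime Scene", "crime and punishment", operator="and"))  # and: two terms missing
```

In this model, the default or operator accepts a title that contains only "crime", while the and operator accepts only titles that contain all three terms.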

The phrase match query

A phrase match query is similar to the Boolean match query, but instead of constructing the Boolean clauses from the analyzed text, it constructs a phrase query. The following parameters are available:

  • slop: This is an integer value that defines how many unknown words can be put between terms in the text query for a match to be considered a phrase.
  • analyzer: This specifies the name of the analyzer that will be used to analyze the query text and defaults to the default analyzer.

A sample phrase match query against the title field could look like the following code:

{
 "query" : {
  "match_phrase" : {
   "title" : {
    "query" : "crime and punishment",
    "slop" : 1
   }
  }
 }
}
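To get a feeling for what slop means, consider the following Python sketch. It is a deliberately simplified model: terms must appear in the given order, and slop is treated as the total number of extra words allowed between them. The real phrase query in Lucene is more general (for example, it can also account for reordered terms), and phrase_matches is a hypothetical helper name.

```python
def phrase_matches(field_text, phrase, slop=0):
    doc = field_text.lower().split()
    terms = phrase.lower().split()
    if not terms:
        return False

    def search(term_index, prev_pos, gaps_used):
        # All phrase terms placed: the phrase matches.
        if term_index == len(terms):
            return True
        for pos in range(prev_pos + 1, len(doc)):
            gap = pos - prev_pos - 1  # words skipped since the previous term
            if gaps_used + gap > slop:
                break
            if doc[pos] == terms[term_index] and search(term_index + 1, pos, gaps_used + gap):
                return True
        return False

    # Try every occurrence of the first term as a starting point.
    return any(search(1, start, 0) for start, word in enumerate(doc) if word == terms[0])


title = "crime and punishment"
print(phrase_matches(title, "crime punishment", slop=0))  # "and" sits in between
print(phrase_matches(title, "crime punishment", slop=1))  # one extra word is allowed
```

With a slop of 0 the two terms would have to be adjacent; a slop of 1 tolerates the single word between them.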

The match phrase prefix query

The last type of the match query is the match phrase prefix query. This query is almost the same as the phrase match query, but in addition, it allows prefix matches on the last term in the query text. Also, in addition to the parameters exposed by the match phrase query, it exposes an additional one, the max_expansions parameter, which controls how many prefixes the last term will be rewritten to. Our sample query changed to the match phrase prefix query could look like this:

{
 "query" : {
  "match_phrase_prefix" : {
   "title" : {
    "query" : "crime and punishment",
    "slop" : 1,
    "max_expansions" : 20
   }
  }
 }
}

The multi match query

This is the same as the match query, but instead of running against a single field, it can be run against multiple fields with the use of the fields parameter. Of course, all the parameters you use with the match query can be used with the multi match query. So, if we want to modify our match query to be run against the title and otitle fields, we could run the following query:

{
 "query" : {
  "multi_match" : {
   "query" : "crime punishment",
   "fields" : [ "title", "otitle" ]
  }
 }
}

The query string query

In comparison with the other queries available, the query string query supports the full Apache Lucene query syntax, so it uses a query parser to construct an actual query using the provided text. A sample query string query can look like this:

{
 "query" : {
  "query_string" : {
   "query" : "title:crime^10 +title:punishment -otitle:cat +author:(+Fyodor +dostoevsky)",
   "default_field" : "title"
  }
 }
}

You may wonder what that weird syntax in the query parameter is; we will get to it in the Lucene query syntax part of the query string query description.

As with most of the queries in ElasticSearch, the query string query provides a few parameters that allow us to control query behavior:

  • query: This specifies the query text.
  • default_field: This specifies the default field the query will be executed against. It defaults to the index.query.default_field property, which is by default set to _all.
  • default_operator: This specifies the default logical operator (or/and) used when no operator is specified. The default value of this parameter is or.
  • analyzer: This specifies the name of the analyzer used to analyze the query provided in the query parameter.
  • allow_leading_wildcard: This specifies whether a wildcard is allowed as the first character of a term; it defaults to true.
  • lowercase_expand_terms: This specifies whether terms rewritten by the query are lowercased. It defaults to true.
  • enable_position_increments: This specifies whether position increments are turned on in the result query. It defaults to true.
  • fuzzy_prefix_length: This is the prefix length for generated fuzzy queries, and it defaults to 0. To learn more about it, please look at The fuzzy query section.
  • fuzzy_min_sim: This specifies the minimum similarity for fuzzy queries and defaults to 0.5. To learn more about it, please look at The fuzzy query section.
  • phrase_slop: This specifies the phrase slop and defaults to 0. To learn more about it, please look at The phrase match query section.
  • boost: This is the boost value used and defaults to 1.0.
  • analyze_wildcard: This specifies whether terms containing wildcard characters should be analyzed. It defaults to false.
  • auto_generate_phrase_queries: This specifies whether phrase queries should be automatically generated. It defaults to false.
  • minimum_should_match: This controls how many of the generated Boolean clauses should match to consider a hit for a given document. The value should be provided as a percentage, for example 50%, which would mean that at least 50 percent of the given terms should match.
  • lenient: This parameter can take the value of true or false. If it is set to true, format-based failures will be ignored.

Please note that the query string query can be rewritten by ElasticSearch, and because of that, ElasticSearch allows us to pass additional parameters that control the rewrite method. However, for more details about that process, see the Query rewrite section later in this chapter.

Lucene query syntax

As we have already discussed, Apache Lucene is the full text search library on top of which ElasticSearch is built. Because of that, some of the queries in ElasticSearch (such as the one currently discussed) support the Lucene query parser syntax, the language that allows you to construct queries. Let's take a look at it and discuss some of its basic features. To read about the full Lucene query syntax, please visit http://lucene.apache.org/core/3_6_1/queryparsersyntax.html.

A query we pass to Lucene is divided into terms and operators by the query parser. Let's start with the terms; you can distinguish them into two types, single terms and phrases. For example, to query the term "book" in the title field, we would pass the following query:

title:book

To query the phrase "elasticsearch book" in the title field, we would pass the following query:

title:"elasticsearch book"

You may have noticed the name of the field in the beginning and the term or phrase later.

As we have already said, Lucene query syntax supports operators. For example, the + operator tells Lucene that the given part must be matched against the document to consider that document a match, while the - operator is the opposite, which means that such a part of the query can't be present in the document. A part of the query without the + or - operator will be treated as part of the query that can be matched, but it is not mandatory. So, if we wanted to find a document with the term "book" in the title field and without the term "cat" in the description field, we would pass the following query:

+title:book -description:cat

We can also group multiple terms with parentheses, for example, the following query:

title:(crime punishment)

We can also boost parts of the query with the ^ operator and the boost value after it. For example, the following query:

title:book^4
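The meaning of the + and - operators can be sketched with a tiny Python evaluator. This is only an illustration under strong assumptions: it handles space-separated field:term clauses and nothing else (no phrases, grouping, or boosts), it does no scoring, and parse and matches are hypothetical helper names rather than anything provided by Lucene.

```python
def parse(query):
    # Turn "+title:book -description:cat" into (occur, field, term) tuples.
    clauses = []
    for part in query.split():
        occur = "should"
        if part[0] == "+":
            occur, part = "must", part[1:]
        elif part[0] == "-":
            occur, part = "must_not", part[1:]
        field, _, term = part.partition(":")
        clauses.append((occur, field, term))
    return clauses


def matches(doc, query):
    clauses = parse(query)

    def has(field, term):
        return term in doc.get(field, "").lower().split()

    if any(occur == "must" and not has(f, t) for occur, f, t in clauses):
        return False
    if any(occur == "must_not" and has(f, t) for occur, f, t in clauses):
        return False
    # Without any mandatory clause, at least one optional clause has to match.
    if not any(occur == "must" for occur, _, _ in clauses):
        return any(occur == "should" and has(f, t) for occur, f, t in clauses)
    return True


doc = {"title": "ElasticSearch book", "description": "a dog story"}
print(matches(doc, "+title:book -description:cat"))
```

In this model a document is rejected as soon as a + clause is missing or a - clause is present, while clauses without an operator only matter when no mandatory clause exists.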

Explaining the query string

So now that we know the basics of the Lucene query syntax, let's get back to the query we sent using the query_string query. As you can see, we wanted to get the documents that may have the term "crime" in the title field, and such documents should be boosted with the value of 10. Next, we want only the documents that have the term "punishment" in the title field, and we don't want documents with the term "cat" in the otitle field. Finally, we tell Lucene that we only want the documents that have the terms "fyodor" and "dostoevsky" in the author field.

Running the query string query against multiple fields

It is possible to run the query string query against multiple fields. In order to do that, one needs to provide the fields parameter in the query body, which should hold an array of field names. There are two methods of running the query string query against multiple fields; the default method will use the Boolean query to make queries, and the other method can use the DisMax query.

Note

DisMax is an abbreviation of Disjunction Max. The "Disjunction" part refers to the fact that the search is executed across multiple fields and the fields can be given different boost weights. The "Max" part means that only the maximum score for a given term will be included in the final document score, not the sum of all the scores from all the fields that have the matched term (which is what a simple Boolean query would do).

In order to use the DisMax query, one should add the use_dis_max property in the query body and set it to true. A sample query can look like this:

{
 "query" : {
  "query_string" : {
   "query" : "crime punishment",
   "fields" : [ "title", "otitle" ],
   "use_dis_max" : true
  }
 }
}
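The difference between the two methods, as described in the preceding note, comes down to how per-field scores are combined. The following Python sketch shows only that combination step; the per-field scores are made-up numbers, and boolean_score and dis_max_score are hypothetical names used for illustration:

```python
def boolean_score(field_scores):
    # A simple Boolean query adds up the scores from every matching field.
    return sum(field_scores.values())


def dis_max_score(field_scores):
    # DisMax keeps only the best-scoring field.
    return max(field_scores.values(), default=0.0)


# Assumed scores of one term in two fields of the same document.
scores = {"title": 0.5, "otitle": 0.25}
print(boolean_score(scores))  # 0.75
print(dis_max_score(scores))  # 0.5
```

So a document that matches the same term in many fields is not automatically ranked higher under DisMax than a document that matches it very well in a single field.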

The field query

The field query is a simplified version of the query string query that we just discussed. Let's say that we would like to find all the documents that have the term "crime" in the title field, that may have the term "nothing", and that don't have the term "let" in the same field. For that, we could run the following query:

{
 "query" : {
  "field" : {
   "title" : "+crime nothing -let"
  }
 }
}

You can also apply all the properties that apply to the query string query. To do that, we should wrap all the parameters in the field name and pass the actual query in the query parameter. So, the preceding query with the boost parameter added would look like this:

{
 "query" : {
  "field" : {
   "title" : {
    "query" : "+crime nothing -let",
    "boost" : 20.0
   }
  }
 }
}

The identifiers query

This is a simple query that filters the returned documents to only those with the provided identifiers. It works on the internal _uid field, so it doesn't require the _id field to be enabled. The simplest version of such a query could look like the following query:

{
 "query" : {
  "ids" : {
   "values" : [ "10", "11", "12", "13" ]
  }
 }
}

This query would only return documents that have one of the identifiers present in the values array. We can complicate the identifiers query a bit and also limit the documents on the basis of their type. For example, if we want to only include documents from the book type, we could send the following query:

{
 "query" : {
  "ids" : {
   "type" : "book",
   "values" : [ "10", "11", "12", "13" ]
  }
 }
}

The prefix query

The prefix query is similar to the term query in terms of configuration and to the multi term query when looking into its logic. The prefix query allows us to match documents that have a value in a certain field that starts with a given prefix. For example, if we want to find all the documents that have values starting with cri in the title field, we could run the following query:

{
 "query" : {
  "prefix" : {
   "title" : "cri"
  }
 }
}

As with the term query, you can also include the boost attribute with your prefix query; this will affect the importance of the given prefix. For example, if we wanted to change our previous query and give it a boost of 3.0, we would send the following query:

{
 "query" : {
  "prefix" : {
   "title" : {
    "value" : "cri",
    "boost" : 3.0
   }
  }
 }
}

Note

Please note that the prefix query is rewritten by ElasticSearch, and because of that, ElasticSearch allows us to pass an additional parameter, controlling the rewrite method. However, for more details about that process please see the Query rewrite section later in this chapter.

The fuzzy like this query

The fuzzy like this query is similar to the more like this query. It finds all the documents that are similar to the provided text but works a bit differently from the more like this query because it makes use of fuzzy strings and picks the best differencing terms produced. For example, if we want to run a fuzzy like this query against the title and otitle fields and find all the documents similar to the crime punishment query, we could run the following query:

{
 "query" : {
  "fuzzy_like_this" : {
   "fields" : ["title", "otitle"],
   "like_text" : "crime punishment"
  }
 }
}

The following query parameters are supported:

  • fields: This is an array of fields that the query should be run against. It defaults to the _all field.
  • like_text: This is a required parameter that holds the text we compare the documents to.
  • ignore_tf: This specifies whether term frequencies should be ignored; this parameter defaults to false.
  • max_query_terms: This specifies the maximum number of query terms that will be included in a generated query. It defaults to 25.
  • min_similarity: This specifies the minimum similarity that differencing terms should have. It defaults to 0.5.
  • prefix_length: This specifies the length of the common prefix of the differencing terms. It defaults to 0.
  • boost: This is the boost value that will be used when boosting queries. It defaults to 1.
  • analyzer: This specifies the name of the analyzer that will be used to analyze the text we provided.

The fuzzy like this field query

The fuzzy like this field query is similar to the fuzzy like this query but works only against a single field, and because of that, it doesn't support the fields property. Instead of specifying the fields that should be used for query analysis, we should wrap the query parameters into the field name. Our sample query to a title field should look like the following code:

{
 "query" : {
  "fuzzy_like_this_field" : {
   "title" : {
    "like_text" : "crime and punishment"
   }
  }
 }
}

All the other parameters from the fuzzy like this query work the same for this type of query.

The fuzzy query

The third type of fuzzy query matches documents on the basis of the edit distance algorithm that is calculated on the terms we provide against the searched documents. This query can be expensive when it comes to CPU resources but can help us when we need fuzzy matching, for example, when users make spelling mistakes. In our example, let's assume that, instead of crime, our user enters cirme into the search box and we would like to run the simplest form of fuzzy query. Such a query could look like this:

{
 "query" : {
  "fuzzy" : {
   "title" : "cirme"
  }
 }
}

And the response for this query would be as follows:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.625,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 0.625, "_source" : { "title": "Crime and Punishment","otitle": "Преступлéние и наказáние","author": "Fyodor Dostoevsky","year": 1886,"characters": ["Raskolnikov", "Sofia Semyonovna Marmeladova"],"tags": [],"copies": 0, "available" : true}
    } ]
  }
}

As you can see, even though we made a typo, ElasticSearch managed to find the document we were interested in.

You can control the fuzzy query behavior by using the following parameters:

  • value: This specifies the actual query (in case we want to pass more parameters).
  • boost: This specifies the boost value for the query. It defaults to 1.0.
  • min_similarity: This specifies the minimum similarity for a term to be counted as a match. In the case of string fields, this value should be between 0 and 1, both inclusive. For numeric fields, this value can be greater than one, for example, for a query with value equal to 20 and min_similarity set to 3, we would get values from 17 to 23. For date fields, we can have min_similarity values that include 1d, 2d, and 1m. These values correspond to one day, two days, and one month, and so on.
  • prefix_length: This is the length of the common prefix of the differencing terms, which defaults to 0.
  • max_expansions: This specifies the number of terms the query will be expanded to. The default value is unbounded.

The parameters should be wrapped in the name of the field we are running the query against. So, if we would like to modify the previous query and add additional parameters, the query could look like the following code:

{
 "query" : {
  "fuzzy" : {
   "title" : {
    "value" : "cirme",
    "min_similarity" : 0.2
   }
  }
 }
}
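To see why cirme can still find crime with the default settings, we can compute the edit distance ourselves. The following Python sketch implements the classic Levenshtein distance; the similarity formula (1 minus the distance divided by the length of the shorter term) is an assumption based on how older Lucene fuzzy queries calculated similarity, so treat the exact numbers as illustrative.

```python
def levenshtein(a, b):
    # Dynamic programming over one row at a time: insert, delete, substitute.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    # Assumed formula: 1 - distance / length of the shorter term.
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return 1.0 - levenshtein(a, b) / shorter


print(levenshtein("cirme", "crime"))  # 2, the swapped letters count as two edits
print(similarity("cirme", "crime"))
```

Under this assumption, the similarity of the two terms is about 0.6, which is above a min_similarity of 0.5, so the document is returned.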

The match all query

The match all query is a simple query that matches all documents in the index. So, to match all the documents, we would run the following query:

{
 "query" : {
  "match_all" : {}
 }
}

If we want to use index-time boosting for some field and want the match all query to take the index-time boosts into consideration during query execution, we can add the norms_field property with the value of the boosted field. For example, if we boost the title field during indexing and want the match all query to influence the score of the documents with it, the following query will have to be run:

{
 "query" : {
  "match_all" : {
   "norms_field" : "title"
  }
 }
}

The wildcard query

The wildcard query is a query that allows us to use the * and ? wildcards in the values we search for. Apart from that, the wildcard query is very similar to the term query in its body. To send a query that will match all the documents with the value of the term cr?me, with ? meaning any character, we will use:

{
 "query" : {
  "wildcard" : {
   "title" : "cr?me"
  }
 }
}

It will match the documents that have any of the terms matching cr?me in the title field. You can also include the boost attribute with your wildcard query, which will affect the importance of each term that matches the given value. For example, if we want to change our previous query and give our wildcard query a boost of 20.0, we will send the following query:

{
 "query" : {
  "wildcard" : {
   "title" : {
    "value" : "cr?me",
    "boost" : 20.0
   }
  }
 }
}

Note

Please note that wildcard queries are not very performance oriented and should be avoided if possible; leading wildcards (terms starting with wildcards) should especially be avoided.

Please note that the wildcard query is rewritten by ElasticSearch, and because of that, ElasticSearch allows us to pass an additional parameter, controlling the rewrite method. However, for more details about that process, please go to the Query rewrite section later in this chapter.
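The semantics of the two wildcard characters can be sketched in Python by translating a wildcard pattern into a regular expression. This is only a model of the matching rule, assuming ? stands for exactly one character and * for any run of characters (including none); wildcard_to_regex and wildcard_matches are hypothetical helper names.

```python
import re


def wildcard_to_regex(pattern):
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")           # any run of characters, possibly empty
        elif ch == "?":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is matched literally
    return re.compile("".join(parts))


def wildcard_matches(pattern, term):
    # The whole term has to match, not just a part of it.
    return wildcard_to_regex(pattern).fullmatch(term) is not None


for term in ["crime", "creme", "crme"]:
    print(term, wildcard_matches("cr?me", term))
```

The sketch also hints at why leading wildcards are costly: a pattern such as *me gives no fixed starting characters, so every term in the field has to be checked.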

The more like this query

The more like this query allows us to get documents that are similar to the provided text. ElasticSearch supports a few parameters to define how more like this queries should work:

  • fields: This is an array of fields that the query should be run against. It defaults to the _all field.
  • like_text: This is a required parameter that holds the text to which we compare the documents.
  • percent_terms_to_match: This specifies the percentage of terms that must match for a document to be considered similar. It defaults to 0.30, which translates to 30 percent.
  • min_term_freq: This is the minimum term frequency (for the terms in the documents) below which terms will be ignored. It defaults to 2.
  • max_query_terms: This is the maximum number of terms that will be included in a generated query; it defaults to 25.
  • stop_words: This specifies an array of words that will be ignored.
  • min_doc_freq: This specifies the minimum number of documents in which terms have to be present in order not to be ignored. It defaults to 5.
  • max_doc_freq: This specifies the maximum number of documents in which terms may be present in order not to be ignored. It defaults to being unbounded.
  • min_word_len: This specifies the minimum length of a single word below which it will be ignored. It defaults to 0.
  • max_word_len: This specifies the maximum length of a single word above which it will be ignored. It defaults to being unbounded.
  • boost_terms: This specifies the boost value that will be used when boosting each term; it defaults to 1.
  • boost: This specifies the boost value that will be used when boosting a query. It defaults to 1.
  • analyzer: This is the name of the analyzer that will be used to analyze the text we provided.

An example more like this query could look like this:

{
 "query" : {
  "more_like_this" : {
   "fields" : [ "title", "otitle" ],
   "like_text" : "crime and punishment",
   "min_term_freq" : 1,
   "min_doc_freq" : 1
  }
 }
}

The more like this field query

The more like this field query is similar to the more like this query but works only against a single field, and because of that, it doesn't support the fields property. Instead of specifying fields that should be used for query analysis, we should wrap query parameters into the field name. So, our example query to a title field would look like the following code:

{
 "query" : {
  "more_like_this_field" : {
   "title" : {
    "like_text" : "crime and punishment",
    "min_term_freq" : 1,
    "min_doc_freq" : 1
   }
  }
 }
}

All the other parameters from the more like this query work the same for this type of query.

The range query

This is a query that allows us to find documents within a certain range and works for numerical fields as well as for string-based fields (it just maps to a different Apache Lucene query). The range query should be run against a single field, and the query parameters should be wrapped in the field name. The following parameters are supported:

  • from: This is the lower bound of the range and defaults to the first value.
  • to: This is the upper bound of the range and defaults to unbounded.
  • include_lower: This specifies whether the left side of the range should be inclusive or not. It defaults to true.
  • include_upper: This specifies whether the right side of the range should be inclusive or not. It defaults to true.
  • boost: This specifies the boost that will be given for the query.

So, for example, if we would like to find all the books that have values ranging from 1700 to 1900 in the year field, we could run the following query:

{
 "query" : {
  "range" : {
   "year" : {
    "from" : 1700,
    "to" : 1900
   }
  }
 }
}
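How the bounds and the two include flags interact can be expressed as a short Python sketch. It models only the membership test for a single value; the parameter names lower and upper stand in for from and to (from is a reserved word in Python), and in_range is a hypothetical helper name.

```python
def in_range(value, lower=None, upper=None, include_lower=True, include_upper=True):
    # A missing bound means that side of the range is open-ended.
    if lower is not None:
        if value < lower or (value == lower and not include_lower):
            return False
    if upper is not None:
        if value > upper or (value == upper and not include_upper):
            return False
    return True


print(in_range(1886, lower=1700, upper=1900))                       # inside the range
print(in_range(1900, lower=1700, upper=1900))                       # bound included by default
print(in_range(1900, lower=1700, upper=1900, include_upper=False))  # bound excluded
```

With the default settings, a book from exactly 1700 or 1900 would be returned by the sample query; setting include_lower or include_upper to false drops the corresponding boundary value.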

Query rewrite

In some cases, ElasticSearch must rewrite your query into another query to allow efficient query execution. This happens, for example, with the prefix query; behind the scenes, ElasticSearch changes the prefix query to a logical disjunction of all possible tokens with this prefix. Because of the rewriting process, ElasticSearch will set a static score equal to the query boost for each of the documents returned by such queries, but we can change that.

In order to control query rewriting, we need to add the rewrite property to our query with one of the following values:

  • scoring_boolean: This rewrite method translates each generated term into a Boolean should clause. This method may be CPU-intensive (because the score for each term is calculated and kept), and for queries that have many terms, it may exceed the Boolean query limit.
  • constant_score_boolean: This is similar to scoring_boolean, but less CPU-intensive because scoring is not computed, and instead, each term receives a score equal to the query boost.
  • constant_score_filter: This method rewrites the query using a filter for each generated term and marks all the documents for that filter. Matching documents are given a constant score equal to the query boost.
  • top_terms_N: This rewrite method translates each generated term into a Boolean should clause, but keeps only the N number of top scoring terms. Scoring is calculated and maintained for each query.
  • top_terms_boost_N: This rewrite method translates each generated term into a Boolean should clause, but keeps only the N number of top scoring terms. Scoring is calculated as the boost given for the query.

When the rewrite property is not set, it defaults to either constant_score_boolean or constant_score_filter depending on the query.

So our example prefix query with the rewrite property could look like the following code:

{
 "query" : {
  "prefix" : {
   "title" : {
    "value" : "cri",
    "boost" : 3.0,
    "rewrite" : "top_terms_10"
   }
  }
 }
}
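The idea of rewriting a prefix query into a disjunction of terms can be sketched in Python. This is a conceptual illustration, not what ElasticSearch does internally: the set of indexed terms is made up, the result is just the logical shape of a Boolean query with one optional clause per expanded term, and rewrite_prefix is a hypothetical helper name.

```python
def rewrite_prefix(field, prefix, index_terms):
    """Expand a prefix into one optional term clause per matching indexed term."""
    terms = sorted(term for term in index_terms if term.startswith(prefix))
    return {"bool": {"should": [{"term": {field: term}} for term in terms]}}


# Assumed terms present in the title field of our index.
indexed_terms = {"crime", "cricket", "punishment"}
print(rewrite_prefix("title", "cri", indexed_terms))
```

The number of generated clauses grows with the number of indexed terms that share the prefix, which is why a short prefix on a large index can run into the Boolean query limit mentioned for scoring_boolean.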