Querying ElasticSearch

Up to now, whenever we talked to ElasticSearch through the REST API using an HTTP request, we mostly sent JSON-structured data, regardless of whether it was a mappings change, alias creation, or document indexation. The same applies when we want to send more than a simple query to ElasticSearch: we structure it using JSON objects and send it in the request. This is called Query DSL. Broadly, ElasticSearch supports two kinds of queries: basic ones and compound ones. Basic queries, such as the term query, are used just for querying. We will cover these in the Basic queries section in this chapter. The second type of query is the compound query, such as the bool query, which can combine multiple queries. We will cover these in the Compound queries section in this chapter.

However, this is not the whole picture. In addition to these two types of queries, your query can have filter queries, which are used to narrow your results with certain criteria.

To make it even more complicated, queries can contain other queries (don't worry, we will try to explain most of it!). Furthermore, some queries can contain filters, and others can contain both queries and filters. Although this is not everything, we will stick with this working explanation for now. We will go over this in detail in the Compound queries and Filtering your results sections in this chapter.

Simple query

The simplest way to query ElasticSearch is to use the URI request query. For example, if we wanted to search for the word "crime" in the title field, we would send a query like this one:

curl -XGET 'localhost:9200/library/book/_search?q=title:crime&pretty=true'
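As a rough sketch, the same URI request can be assembled programmatically. The Python snippet below only builds the URL and does not send it; the helper name uri_search_url and the default host are assumptions made for this illustration.

```python
from urllib.parse import urlencode

def uri_search_url(index, doc_type, field, term, host="localhost:9200"):
    """Build a URI request query URL such as .../_search?q=title:crime."""
    # urlencode percent-encodes the colon in "field:term" (title%3Acrime).
    params = urlencode({"q": f"{field}:{term}", "pretty": "true"})
    return f"http://{host}/{index}/{doc_type}/_search?{params}"

print(uri_search_url("library", "book", "title", "crime"))
```

The percent-encoded colon is equivalent to the literal one used in the curl command.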

If we look from the ElasticSearch Query DSL point of view, the simplest query is the term query, which searches for the documents that have a given term (a word) in a given field. For example, if we wanted to search for the term "crime" (please remember that the term query is not analyzed, and thus, you need to provide the exact term you are searching for) in the title field, we would send the following query to ElasticSearch:

{
 "query" : {
  "term" : { "title" : "crime" }
 }
}

But how do we query our data? We send a GET HTTP request to the _search REST endpoint, pointing to the index/type we want to search (of course, we can omit type, index, or both at the same time). So, if we wanted to search our example library index, we would use the following command:

curl -XGET 'localhost:9200/library/book/_search?pretty=true' -d '{
 "query" : {
  "term" : { "title" : "crime" }
 }
}'

As you can see, we used the request body (the -d switch) to send the whole JSON-structured query to ElasticSearch. The pretty=true request parameter tells ElasticSearch to structure the response in a way such that we humans can read it more easily. In response, we got the following text:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 0.19178301, "_source" : { "title": "Crime and Punishment","otitle": "Преступлéние и наказáние","author": "Fyodor Dostoevsky","year": 1886,"characters": ["Raskolnikov", "Sofia Semyonovna Marmeladova"],"tags": [],"copies": 0, "available" : true}
    } ]
  }
}
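Because the response is plain JSON, it is easy to process in a client. The following Python sketch parses a trimmed copy of the response (only two _source fields are kept) and pulls out the total and one row per hit; the function name summarize_hits is hypothetical.

```python
import json

# Trimmed copy of the response shown above (only a few _source fields kept).
response_text = """
{
  "took": 1,
  "timed_out": false,
  "hits": {
    "total": 1,
    "max_score": 0.19178301,
    "hits": [
      {"_index": "library", "_type": "book", "_id": "4",
       "_score": 0.19178301,
       "_source": {"title": "Crime and Punishment", "year": 1886}}
    ]
  }
}
"""

def summarize_hits(text):
    """Return (total, [(id, score, title), ...]) from a search response."""
    response = json.loads(text)
    hits = response["hits"]
    rows = [(h["_id"], h["_score"], h["_source"]["title"]) for h in hits["hits"]]
    return hits["total"], rows

total, rows = summarize_hits(response_text)
print(total, rows)
```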

As we said earlier, a query can be directed to a particular index and type, but this is not the only possibility. We can query several indices in parallel or query one index regardless of the type. Let's sum up the possible call types and see what the addressing looks like:

  1. Request to index and type:
    curl -XGET 'localhost:9200/library/book/_search' -d @query.json
  2. Request to index and all types in it:
    curl -XGET 'localhost:9200/library/_search' -d @query.json
  3. Request to all indices:
    curl -XGET 'localhost:9200/_search' -d @query.json
  4. Request to several indices:
    curl -XGET 'localhost:9200/library,bookstore/_search' -d @query.json
  5. Request to multiple indices and multiple types in them:
    curl -XGET 'localhost:9200/library,bookstore/book,recipes/_search' -d @query.json

Neat! We got our first search results!
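The addressing forms listed above follow one pattern: comma-separated index names, then optional comma-separated type names, then _search. A minimal Python sketch of that pattern follows; the helper search_path is hypothetical, and rejecting a type-only request is a choice of this sketch, since the list shows no such form.

```python
def search_path(indices=(), types=()):
    """Return the _search path for the given index and type names."""
    if types and not indices:
        # The list above shows no type-only form, so this sketch rejects it.
        raise ValueError("types require at least one index in this sketch")
    segments = []
    if indices:
        segments.append(",".join(indices))
    if types:
        segments.append(",".join(types))
    segments.append("_search")
    return "/" + "/".join(segments)

for args in [(["library"], ["book"]), (["library"],), ((),),
             (["library", "bookstore"],),
             (["library", "bookstore"], ["book", "recipes"])]:
    print(search_path(*args))
```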

Paging and results size

As we would expect, ElasticSearch allows us to control how many results we want to get (at most) and from which result we want to start. There are two additional properties that can be set in the request body:

  • from: This specifies from which document we want to have our results and defaults to 0, which means we want our results from the first document
  • size: This specifies the maximum number of documents we want as a result of a single query (defaults to 10)

So, if we wanted our query to get documents starting from the tenth on the list and get 20 of them, we would send the following query (from is a zero-based offset, so the tenth document has the value 9):

{
 "from" :  9,
 "size" : 20,
 "query" : {
  "term" : { "title" : "crime" }
 }
}
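The off-by-one between a 1-based position ("the tenth document") and the zero-based from property is easy to get wrong, so it can help to hide it in a small helper. The Python sketch below is one way to do that; paged_term_query and its parameter names are made up for this example.

```python
import json

def paged_term_query(field, term, first_result, size):
    """Build a term query body that starts at a 1-based result position."""
    if first_result < 1 or size < 0:
        raise ValueError("first_result must be >= 1 and size must be >= 0")
    return {
        "from": first_result - 1,   # "from" is a zero-based offset
        "size": size,
        "query": {"term": {field: term}},
    }

body = paged_term_query("title", "crime", first_result=10, size=20)
print(json.dumps(body, indent=1))
```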

Returning the version

In addition to all the information returned, ElasticSearch can return the version of the document. To do that, we need to add the version property with the value true to the top level of our JSON object, so that it looks like the following query:

{
 "version" : true,
 "query" : {
  "term" : { "title" : "crime" }
 }
}

After running it, we get the following results:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_version" : 1,
      "_score" : 0.19178301, "_source" : { "title": "Crime and Punishment","otitle": "Преступлéние и наказáние","author": "Fyodor Dostoevsky","year": 1886,"characters": ["Raskolnikov", "Sofia Semyonovna Marmeladova"],"tags": [],"copies": 0, "available" : true}
    } ]
  }
}

As you can see, the _version section is present for the single hit we got.

Limiting the score

For nonstandard use cases, ElasticSearch provides a feature that lets us filter the results on the basis of the minimum score that a document must have to be considered a match. To use it, we provide the min_score property at the top level of our JSON object with the value of the minimum score. For example, if we wanted our query to return only documents with a score of at least 0.75, we would send the following query:

{
 "min_score" : 0.75,
 "query" : {
  "term" : { "title" : "crime" }
 }
}

We get the following response after running the preceding query:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

Look at the previous examples; the score of our document was 0.19178301, which is lower than 0.75, and thus, we didn't get any document in response.

Limiting the score usually doesn't make much sense, because comparing scores between queries is quite hard. However, your use case may need this functionality.
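To make the effect of the cutoff concrete, here is a client-side illustration in Python. It is not how ElasticSearch implements min_score; it only mimics the filtering on a list of hits, and it assumes that a score exactly equal to the cutoff is kept.

```python
def apply_min_score(hits, min_score):
    """Keep only the hits whose _score reaches min_score."""
    # Assumption: a score exactly equal to min_score is kept.
    return [hit for hit in hits if hit["_score"] >= min_score]

hits = [{"_id": "4", "_score": 0.19178301}]
print(apply_min_score(hits, 0.75))   # the only hit falls below the cutoff
print(apply_min_score(hits, 0.1))    # a lower cutoff keeps it
```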

Choosing the fields we want to return

With the use of the fields array in the request body, ElasticSearch allows us to define which fields should be included in the response. Please remember that you can only return fields that are marked as stored in the mappings used to create the index or if the _source field was used (ElasticSearch will use the _source field to provide the stored values). So, for example, if we want to return only the title and year fields in the results (for each document), we would send the following query to ElasticSearch:

{
 "fields" : [ "title", "year" ],
 "query" : {
  "term" : { "title" : "crime" }
 }
}

And, in response, we would get the following result:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 0.19178301,
      "fields" : {
        "title" : "Crime and Punishment",
        "year" : 1886
      }
    } ]
  }
}

As you can see, everything worked as we wanted.
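When fields are requested, the values arrive under the fields key of each hit rather than under _source. The Python sketch below reads them from a copy of the hits section of the response; the helper returned_fields is hypothetical, and reporting a missing field as None is a choice of this sketch.

```python
import json

# The "hits" part of the response shown above.
response = json.loads("""
{"hits": {"total": 1, "max_score": 0.19178301, "hits": [
  {"_index": "library", "_type": "book", "_id": "4", "_score": 0.19178301,
   "fields": {"title": "Crime and Punishment", "year": 1886}}
]}}
""")

def returned_fields(response, names):
    """Yield one dict per hit with the requested field names."""
    for hit in response["hits"]["hits"]:
        fields = hit.get("fields", {})
        # A field that was not returned for a hit is reported as None.
        yield {name: fields.get(name) for name in names}

rows = list(returned_fields(response, ["title", "year"]))
print(rows)
```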

There are three things I would like to share with you:

  • If we don't define the fields array, it will use the default value and return the _source field if available
  • If we use the _source field and request a field that is not stored, that field will be extracted from the _source field (however, please remember that it requires additional processing)
  • If you want to return all stored fields, just pass * as the field name

Note

Please note that if you use the _source field, from the performance point of view, it's better to return the _source field instead of multiple stored fields.

Partial fields

In addition to choosing which fields are returned, ElasticSearch allows the use of so-called partial fields. Partial fields let us control how fields are loaded from the _source field. ElasticSearch exposes the include and exclude properties of the partial_fields object, so we can include and exclude fields on the basis of those properties. For example, for our query to include fields that start with titl and exclude the ones that start with chara, we would send the following query:

{
 "partial_fields" : {
  "partial1" : {
   "include" : [ "titl*" ],
   "exclude" : [ "chara*" ]
  }
 },
 "query" : {
  "term" : { "title" : "crime" }
 }
}
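The include and exclude patterns can be pictured as wildcard matching over the field names of _source. The Python sketch below approximates that on the top-level fields of our sample document; it is an illustration only, not ElasticSearch's implementation, and it assumes that a field is returned only when it matches an include pattern and no exclude pattern.

```python
from fnmatch import fnmatchcase

def select_partial(source, include=(), exclude=()):
    """Approximate include/exclude wildcard selection on top-level fields."""
    selected = {}
    for name, value in source.items():
        included = any(fnmatchcase(name, p) for p in include)
        excluded = any(fnmatchcase(name, p) for p in exclude)
        # Assumption: with no include patterns, nothing is selected.
        if included and not excluded:
            selected[name] = value
    return selected

source = {
    "title": "Crime and Punishment",
    "otitle": "Преступлéние и наказáние",
    "author": "Fyodor Dostoevsky",
    "year": 1886,
    "characters": ["Raskolnikov", "Sofia Semyonovna Marmeladova"],
}
print(select_partial(source, include=["titl*"], exclude=["chara*"]))
```

Note that otitle is not selected, because the titl* pattern has to match from the start of the field name.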

Using script fields

ElasticSearch allows us to return script-evaluated values with the result documents. In order to use script fields, we need to add the script_fields section to our JSON query object and, inside it, an object with the name we choose for each scripted value we want to return. For example, to return a value named correctYear that is calculated as the year field minus 1800, we would run the following query:

{
 "script_fields" : {
  "correctYear" : {
   "script" : "doc['year'].value - 1800"
  }
 },
 "query" : {
  "term" : { "title" : "crime" }
 }
}

However, if you run the preceding query against our sample data, you get an exception in the response as we don't store the year field. Yes, that's right. Only stored fields or the ones available in _source can be used. So, we will modify our query to use the _source field. After modifications, it should look like the following code:

{
 "script_fields" : {
  "correctYear" : {
   "script" : "_source.year - 1800"
  }
 },
 "query" : {
  "term" : { "title" : "crime" }
 }
}

Notice that we didn't use the value part of the equation. The following response will be returned by ElasticSearch for this query:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 0.19178301,
      "fields" : {
        "correctYear" : 86
      }
    } ]
  }
}

As you can see, the correctYear field is in the response.

Passing parameters to script fields

Let's look at one more feature of script fields: passing parameters. Instead of having the value 1800 in the equation, we can use a variable name and pass its value in the params section. If we did that, our query would look as follows:

{
 "script_fields" : {
  "correctYear" : {
   "script" : "_source.year - paramYear",
   "params" : {
    "paramYear" : 1800
   }
  }
 },
 "query" : {
  "term" : { "title" : "crime" }
 }
}

As you can see, we added the paramYear variable as a part of the scripted equation and we provided its value in the params section.
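Keeping the constant in params means the script string stays the same while only the parameter values change. The Python sketch below builds such a request body; the helper script_field is hypothetical, and the final subtraction only checks the arithmetic locally for the sample document. Note the plain ASCII minus sign in the script string.

```python
import json

def script_field(name, script, params=None):
    """Build the script_fields entry for one scripted value."""
    entry = {"script": script}
    if params:
        entry["params"] = params
    return {name: entry}

body = {
    "script_fields": script_field(
        "correctYear", "_source.year - paramYear", {"paramYear": 1800}),
    "query": {"term": {"title": "crime"}},
}
print(json.dumps(body, indent=1))

# Local check of the arithmetic for the sample document (year 1886).
year, param_year = 1886, 1800
print(year - param_year)
```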

You can find more about script usage at the end of this chapter in the Using scripts section.

Choosing the right search type (advanced)

ElasticSearch allows us to choose how we want our query to be processed internally. This is exposed to the end user because there are different situations where different search types are appropriate. To control how queries are executed, we can pass the search_type request parameter and set it to one of the following values:

  • query_and_fetch: This is usually the fastest and the simplest search type implementation. The query is executed against all the needed shards in parallel, and all the shards return results equal in number to the value of the size parameter. The maximum number of returned documents will be equal to the value of the size parameter times the number of shards.
  • query_then_fetch: In the first step, the query is executed to get the information needed to sort and rank documents. Only then are the relevant shards asked for the actual content of the documents. Unlike query_and_fetch, the maximum number of results returned by this search type is equal to the size parameter.
  • dfs_query_and_fetch: This is similar to the query_and_fetch search type, but in addition to what query_and_fetch does, an initial query phase is executed that calculates the distributed term frequencies to allow more precise scoring of returned documents.
  • dfs_query_then_fetch: This is similar to the query_then_fetch search type, but in addition to what query_then_fetch does, an initial query phase is executed that calculates the distributed term frequencies to allow more precise scoring of returned documents.
  • count: This is a special search type that only returns the number of documents that matched the query.
  • scan: This is another special search type. The scan type should only be used if you expect your query to return a large number of results. It differs a bit from the usual queries because, after the first request, ElasticSearch responds with a scroll identifier, and all subsequent requests need to be run against the _search/scroll REST endpoint and send the returned scroll identifier in the request body. You can find more about this functionality in the Why is the result on the later pages slow section in Chapter 8, Dealing with Problems.
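The practical difference between the and_fetch and then_fetch families is the upper bound on how many documents come back. The Python sketch below encodes the bounds stated in the list above; the function name is hypothetical, and applying the same bounds to the dfs_ variants is an assumption based on their being described as similar. The count and scan types are deliberately left out.

```python
def max_returned_documents(search_type, size, shards):
    """Upper bound on returned documents, per the descriptions above."""
    per_shard = {"query_and_fetch", "dfs_query_and_fetch"}
    merged = {"query_then_fetch", "dfs_query_then_fetch"}
    if search_type in per_shard:
        return size * shards      # every shard returns up to `size` documents
    if search_type in merged:
        return size               # ranking happens before documents are fetched
    raise ValueError(f"not covered by this sketch: {search_type}")

for name in ("query_and_fetch", "query_then_fetch"):
    print(name, max_returned_documents(name, size=10, shards=5))
```

With the five shards seen in the sample responses and the default size of 10, this gives at most 50 documents for query_and_fetch and at most 10 for query_then_fetch.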

So, if we wanted to use the simplest search type, we would run the following command:

curl -XGET 'localhost:9200/library/book/_search?pretty=true&search_type=query_and_fetch' -d '{
 "query" : {
  "term" : { "title" : "crime" }
 }
}'

Search execution preference (advanced)

In addition to all the previous possibilities of controlling your search, you have one more: you can control what types of shards the search will be executed on. By default, ElasticSearch uses both shards and replicas, available both on the node we've sent the request to and on the other nodes in the cluster. This default behavior is the right shard preference for most queries, but there may be times when we want to change it. To do that, we can set the preference request parameter to one of the following values:

  • _primary: This specifies that the operation will be only executed on primary shards, so replicas won't be used.
  • _primary_first: This specifies that the operation will be executed on primary shards if they are available. If not, it will be executed on other shards.
  • _local: This specifies that the operation will only be executed on the shards available on the node we are sending the request to (if possible).
  • _only_node:node_id: This specifies that the operation will be executed on the node with the provided node identifier.
  • A custom value: This can be any custom string value that may be passed. Requests with the same values provided will be executed on the same shards.

    For example, if we wanted to execute a query only on local shards, we would run the following command:

    curl -XGET 'localhost:9200/library/_search?preference=_local' -d '{
     "query" : {
      "term" : { "title" : "crime" }
     }
    }'
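Both preference and search_type are ordinary request parameters, so they can be appended to the search URL like any other query string value. The Python sketch below only builds the URL and does not send a request; the helper search_url and the default host are assumptions made for this illustration.

```python
from urllib.parse import urlencode

def search_url(path, host="localhost:9200", **params):
    """Append request parameters such as preference or search_type."""
    query = urlencode(params)
    return f"http://{host}{path}" + (f"?{query}" if query else "")

print(search_url("/library/_search", preference="_local"))
print(search_url("/library/book/_search", pretty="true",
                 search_type="query_and_fetch"))
```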