Chapter 3. Searching Your Data

In the previous chapter, we dived into Elasticsearch indexing and learned a lot about data handling. We saw how to tune Elasticsearch's schema-less mechanism and how to create our own mappings. We also looked at the core types of Elasticsearch and used analyzers, both the ones that come out of the box with Elasticsearch and the ones we defined ourselves. We used bulk indexing and added additional internal information to our indices. Finally, we learned what segment merging is and how to fine-tune it, and how to use routing in Elasticsearch and what it gives us. This chapter is fully dedicated to querying. By the end of this chapter, you will have learned the following topics:

  • How to query Elasticsearch
  • What happens internally when queries are run
  • What are the basic queries in Elasticsearch
  • What are the compound queries in Elasticsearch that allow us to group other queries
  • How to use position aware queries – span queries
  • How to choose the right query for the job

Querying Elasticsearch

So far, when we have searched our data, we used the REST API and a simple query sent with a GET request. Similarly, when we were changing the index, we also used the REST API and sent JSON-structured data to Elasticsearch. Regardless of the type of operation we wanted to perform, whether it was a mapping change or document indexing, we used a JSON-structured request body to inform Elasticsearch about the operation details.

The situation is similar when we want to send more than a simple query to Elasticsearch: we structure it using JSON objects and send it in the request body. This is called the query DSL. In a broader view, Elasticsearch supports two kinds of queries: basic ones and compound ones. Basic queries, such as the term query, are used for querying the actual data. We will cover these in the Basic queries section of this chapter. The second type of query is the compound query, such as the bool query, which can combine multiple queries. We will cover these in the Compound queries section of this chapter.

However, this is not the whole picture. In addition to these two types of queries, certain queries can have filters that are used to narrow down your results with certain criteria. Filter queries don't affect scoring and are usually very efficient and easily cached.

To make it even more complicated, queries can contain other queries (don't worry; we will try to explain all this!). Furthermore, some queries can contain filters and others can contain both queries and filters. Although this is not everything, we will stick with this working explanation for now. We will go over this in greater detail in the Compound queries section in this chapter and the Filtering your results section in Chapter 4, Extending Your Querying Knowledge.

The example data

If not stated otherwise, the following mappings will be used for the rest of the chapter:

{
  "book" : {
    "properties" : {
      "author" : {
        "type" : "string"
      },
      "characters" : {
        "type" : "string"
      },
      "copies" : {
        "type" : "long",
        "ignore_malformed" : false
      },
      "otitle" : {
        "type" : "string"
      },
      "tags" : {
        "type" : "string",
        "index" : "not_analyzed"
      },
      "title" : {
        "type" : "string"
      },
      "year" : {
        "type" : "long",
        "ignore_malformed" : false,
        "index" : "analyzed"
      },
      "available" : {
        "type" : "boolean"
      }
    }
  }
}

The preceding mappings represent a simple library and were used to create the library index. One thing to remember is that Elasticsearch will analyze the string-based fields unless we configure it differently.

The preceding mappings were stored in the mapping.json file and, in order to create the mentioned library index, we can use the following commands:

curl -XPOST 'localhost:9200/library'
curl -XPUT 'localhost:9200/library/book/_mapping' -d @mapping.json

We also used the following sample data for the examples in this chapter:

{ "index": {"_index": "library", "_type": "book", "_id": "1"}}
{ "title": "All Quiet on the Western Front","otitle": "Im Westen nichts Neues","author": "Erich Maria Remarque","year": 1929,"characters": ["Paul Bäumer", "Albert Kropp", "Haie Westhus", "Fredrich Müller", "Stanislaus Katczinsky", "Tjaden"],"tags": ["novel"],"copies": 1, "available": true, "section" : 3}
{ "index": {"_index": "library", "_type": "book", "_id": "2"}}
{ "title": "Catch-22","author": "Joseph Heller","year": 1961,"characters": ["John Yossarian", "Captain Aardvark", "Chaplain Tappman", "Colonel Cathcart", "Doctor Daneeka"],"tags": ["novel"],"copies": 6, "available" : false, "section" : 1}
{ "index": {"_index": "library", "_type": "book", "_id": "3"}}
{ "title": "The Complete Sherlock Holmes","author": "Arthur Conan Doyle","year": 1936,"characters": ["Sherlock Holmes","Dr. Watson", "G. Lestrade"],"tags": [],"copies": 0, "available" : false, "section" : 12}
{ "index": {"_index": "library", "_type": "book", "_id": "4"}}
{ "title": "Crime and Punishment","otitle": "Преступлéние и наказáние","author": "Fyodor Dostoevsky","year": 1886,"characters": ["Raskolnikov", "Sofia Semyonovna Marmeladova"],"tags": [],"copies": 0, "available" : true}

We stored our sample data in the documents.json file and we use the following command to index it:

curl -s -XPOST 'localhost:9200/_bulk' --data-binary @documents.json

This command runs bulk indexing. You can learn more about it in the Batch indexing to speed up your indexing process section in Chapter 2, Indexing Your Data.
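If the sample data lives in application code rather than in a file, the body of the bulk request can be generated. The following Python sketch is only an illustration (the to_bulk_body function is a name of our own, not part of any Elasticsearch client). It pairs each document with an index action line and ends every line with a newline character, as the documents.json file shown earlier does:

```python
import json

def to_bulk_body(docs, index="library", doc_type="book"):
    """Build a newline-delimited _bulk body from (id, document) pairs."""
    lines = []
    for doc_id, doc in docs:
        # Action line: tells Elasticsearch where the next document goes.
        action = {"index": {"_index": index, "_type": doc_type, "_id": str(doc_id)}}
        lines.append(json.dumps(action))
        # Source line: the document itself, serialized on a single line.
        lines.append(json.dumps(doc, ensure_ascii=False))
    # Every line, including the last one, ends with a newline.
    return "".join(line + "\n" for line in lines)

docs = [
    (2, {"title": "Catch-22", "author": "Joseph Heller", "year": 1961}),
    (4, {"title": "Crime and Punishment", "author": "Fyodor Dostoevsky", "year": 1886}),
]
body = to_bulk_body(docs)
print(body, end="")
```

The resulting string can be saved to a file and sent with the same curl command, using the --data-binary switch so that the newlines are preserved.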

A simple query

The simplest way to query Elasticsearch is to use the URI request query. We already discussed it in the Searching with the URI request query section of Chapter 1, Getting Started with Elasticsearch Cluster. For example, to search for the word crime in the title field, you could send a query using the following command:

curl -XGET 'localhost:9200/library/book/_search?q=title:crime&pretty' 

This is a very simple, but limited, way of submitting queries to Elasticsearch. If we look from the point of view of the Elasticsearch query DSL, the preceding query is a query_string query. It searches for the documents that have the term crime in the title field and can be rewritten as follows:

{
  "query" : { 
    "query_string" : { "query" : "title:crime" }
  }
}

Sending a query using the query DSL is a bit different, but still not rocket science. We send an HTTP GET request to the _search REST endpoint as earlier and include the query in the request body (POST is also accepted, in case your tool or library doesn't allow sending a request body in HTTP GET requests). Let's take a look at the following command:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "query" : {
    "query_string" : { "query" : "title:crime" }
  }
}'

As you can see, we used the request body (the -d switch) to send the whole JSON-structured query to Elasticsearch. The pretty request parameter tells Elasticsearch to structure the response in such a way that we humans can read it more easily. In response to the preceding command, we get the following output:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.5,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 0.5,
      "_source" : {
        "title" : "Crime and Punishment",
        "otitle" : "Преступлéние и наказáние",
        "author" : "Fyodor Dostoevsky",
        "year" : 1886,
        "characters" : [ "Raskolnikov", "Sofia Semyonovna Marmeladova" ],
        "tags" : [ ],
        "copies" : 0,
        "available" : true
      }
    } ]
  }
}

Nice! We got our first search results with the query DSL.
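The same request can also be prepared from a program. The following is a minimal Python sketch that uses only the standard library. The build_search_request helper is a name we made up for illustration, and the host, index, and type are assumed to match the examples in this chapter. It uses POST, which, as mentioned, Elasticsearch also accepts for searches with a request body:

```python
import json
import urllib.request

def build_search_request(query_body, host="localhost:9200",
                         index="library", doc_type="book"):
    """Prepare (but do not send) a _search request carrying a query DSL body."""
    url = "http://{}/{}/{}/_search?pretty".format(host, index, doc_type)
    data = json.dumps(query_body).encode("utf-8")
    return urllib.request.Request(url, data=data, method="POST")

query_body = {"query": {"query_string": {"query": "title:crime"}}}
request = build_search_request(query_body)

# To actually run the search against a live node, uncomment the following:
# with urllib.request.urlopen(request) as response:
#     result = json.loads(response.read().decode("utf-8"))
#     print(result["hits"]["total"])
```

Keeping the request construction separate from sending it makes the query body easy to inspect before it reaches the cluster.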

Paging and result size

Elasticsearch allows us to control how many results we want to get (at most) and from which result we want to start. The following are the two additional properties that can be set in the request body:

  • from: This property specifies the document that we want to have our results from. Its default value is 0, which means that we want to get our results from the first document.
  • size: This property specifies the maximum number of documents we want as the result of a single query (which defaults to 10). For example, if we are only interested in aggregation results and don't care about the documents returned by the query, we can set this parameter to 0.

If we want our query to get documents starting from the tenth item on the list and fetch 20 documents, we send the following query:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "from" :  9,
  "size" : 20,
  "query" : {
    "query_string" : { "query" : "title:crime" }
  }
}'
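Because from is a zero-based offset, it is easy to be off by one. A small helper can make the conversion explicit. This Python sketch (the paging function is hypothetical) turns a one-based starting position into the from and size properties:

```python
def paging(first_item, count):
    """Return the from/size properties for a 1-based starting position."""
    if first_item < 1 or count < 0:
        raise ValueError("first_item must be >= 1 and count must be >= 0")
    # from is a zero-based offset, so the tenth item has offset 9.
    return {"from": first_item - 1, "size": count}

body = {"query": {"query_string": {"query": "title:crime"}}}
body.update(paging(first_item=10, count=20))
print(body["from"], body["size"])  # prints: 9 20
```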

Tip

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

  • Log in or register to our website using your e-mail address and password
  • Hover the mouse pointer on the SUPPORT tab at the top
  • Click on Code Downloads & Errata
  • Enter the name of the book in the Search box
  • Select the book for which you're looking to download the code files
  • Choose from the drop-down menu where you purchased this book from
  • Click on Code Download

Once the file is downloaded, make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

Returning the version value

In addition to all the information returned, Elasticsearch can return the version of the document (we mentioned versioning in Chapter 1, Getting Started with Elasticsearch Cluster). To do this, we need to add the version property with the value of true to the top level of our JSON object. So, the final query, which requests the version information, will look as follows:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
     "version" : true,
     "query" : {
       "query_string" : { "query" : "title:crime" }
     }
}'

After running the preceding query, we get the following results:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.5,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_version" : 1,
      "_score" : 0.5,
      "_source" : {
        "title" : "Crime and Punishment",
        "otitle" : "Преступлéние и наказáние",
        "author" : "Fyodor Dostoevsky",
        "year" : 1886,
        "characters" : [ "Raskolnikov", "Sofia Semyonovna Marmeladova" ],
        "tags" : [ ],
        "copies" : 0,
        "available" : true
      }
    } ]
  }
}

As you can see, the _version section is present for the single hit we got.

Limiting the score

For nonstandard use cases, Elasticsearch provides a feature that lets us filter the results on the basis of a minimum score value that the document must have to be considered a match. In order to use this feature, we must provide the min_score property at the top level of our JSON object with the value of the minimum score. For example, if we want our query to only return documents with a score of at least 0.75, we send the following query:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "min_score" : 0.75,
  "query" : {
    "query_string" : { "query" : "title:crime" }
  }
}'

We get the following response after running the preceding query:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

If you look at the previous examples, the score of our document was 0.5, which is lower than 0.75, and thus we didn't get any documents in response.

Limiting the score usually doesn't make much sense because comparing scores between the queries is quite hard. However, maybe in your case, this functionality will be needed.
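To make the cutoff concrete, the following Python sketch applies the same comparison on the client side to a hit shaped like the one in the earlier responses. It only illustrates the rule; Elasticsearch applies min_score on the server, and the apply_min_score function is a name of our own:

```python
def apply_min_score(hits, min_score):
    """Keep only hits whose _score is not lower than min_score."""
    return [hit for hit in hits if hit["_score"] >= min_score]

hits = [{"_id": "4", "_score": 0.5}]
print(apply_min_score(hits, 0.75))  # prints: []
print(apply_min_score(hits, 0.5))   # the hit with _id 4 is kept
```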

Choosing the fields that we want to return

With the use of the fields array in the request body, Elasticsearch allows us to define which fields to include in the response. Remember that you can only return these fields if they are marked as stored in the mappings used to create the index, or if the _source field was used (Elasticsearch uses the _source field to provide the stored values and the _source field is turned on by default).

So, for example, to return only the title and the year fields in the results (for each document), send the following query to Elasticsearch:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "fields" : [ "title", "year" ],
  "query" : {
    "query_string" : { "query" : "title:crime" }
  }
}'

In response, we get the following output:

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.5,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 0.5,
      "fields" : {
        "title" : [ "Crime and Punishment" ],
        "year" : [ 1886 ]
      }
    } ]
  }
}

As you can see, everything worked as we wanted. There are four things we would like to share with you at this point, which are as follows:

  • If we don't define the fields array, it will use the default value and return the _source field if available.
  • If we use the _source field and request a field that is not stored, then that field will be extracted from the _source field (however, this requires additional processing).
  • If we want to return all the stored fields, we just pass an asterisk (*) as the field name.
  • From a performance point of view, it's better to return the _source field instead of multiple stored fields. This is because getting multiple stored fields may be slower compared to retrieving a single _source field.
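Note that in the response shown earlier in this section, each requested field comes back as an array, even when it holds a single value. When processing results in code, we may want to unwrap such values. A minimal Python sketch (flatten_fields is a hypothetical helper) could look as follows:

```python
def flatten_fields(hit):
    """Unwrap single-element arrays in the fields section of a hit."""
    flat = {}
    for name, values in hit.get("fields", {}).items():
        # Keep multi-valued fields as lists; unwrap single values.
        flat[name] = values[0] if len(values) == 1 else values
    return flat

hit = {
    "_id": "4",
    "_score": 0.5,
    "fields": {"title": ["Crime and Punishment"], "year": [1886]},
}
print(flatten_fields(hit))
```

For a multivalued field such as characters, the helper leaves the list untouched, so the calling code has to be ready for both shapes.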

Source filtering

In addition to choosing which fields are returned, Elasticsearch allows us to use so-called source filtering. This functionality allows us to control which fields are returned from the _source field. Elasticsearch exposes several ways to do this. The simplest form of source filtering allows us to decide whether the source of a document should be returned at all. Consider the following query:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "_source" : false,
  "query" : {
    "query_string" : { "query" : "title:crime" }
  }
}'

The result returned by Elasticsearch should be similar to the following one:

{
  "took" : 12,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.5,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 0.5
    } ]
  }
}

Note that the response is limited to the basic information about a document and the _source field was not included. If you use Elasticsearch as a secondary source of data and the content of the document is served from an SQL database or a cache, the document identifier is all you need.
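In such a setup, the application typically collects the identifiers and looks up the documents elsewhere. A minimal Python sketch of the first step follows; hit_ids is a hypothetical helper, and the response dictionary mirrors the relevant part of the output shown above:

```python
def hit_ids(response):
    """Collect the document identifiers from a search response."""
    return [hit["_id"] for hit in response["hits"]["hits"]]

response = {
    "hits": {
        "total": 1,
        "max_score": 0.5,
        "hits": [{"_index": "library", "_type": "book", "_id": "4", "_score": 0.5}],
    }
}
print(hit_ids(response))  # prints: ['4']
```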

The second way is similar to the fields array described in the preceding section, although here we define which fields should be returned in the document source itself. Let's see that using the following example query:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "_source" : ["title", "otitle"],
  "query" : {
    "query_string" : { "query" : "title:crime" }
  }
}'

We wanted to get the title and the otitle document fields in the returned _source field. Elasticsearch extracted those values from the original _source value and included the _source field only with the requested fields. The whole response returned by Elasticsearch looked as follows:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.5,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 0.5,
      "_source" : {
        "otitle" : "Преступлéние и наказáние",
        "title" : "Crime and Punishment"
      }
    } ]
  }
}

We can also use an asterisk to select which fields should be returned in the _source field; for example, title* will return values for the title field and for title10 (if we have such a field in our data). If we have documents with nested parts, we can use dot notation; for example, title.* selects all the fields nested under the title object.

Finally, we can also specify explicitly which fields we want to include and which to exclude from the _source field. We can include fields using the include property and we can exclude fields using the exclude property (both of them are arrays of values). For example, if we want the returned _source field to include all the fields starting with the letter t but not the title field, we will run the following query:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "_source" : { 
    "include" : [ "t*"], 
    "exclude" : ["title"] 
  },
  "query" : {
    "query_string" : { "query" : "title:crime" }
  }
}'
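To see which fields such a combination selects, we can imitate the matching on the client side. The following Python sketch is only an approximation for top-level field names, written with the standard fnmatch module (which also understands ? and [] patterns that we do not use here). It is not how Elasticsearch implements source filtering, and filter_source is a name of our own:

```python
from fnmatch import fnmatchcase

def filter_source(source, include, exclude):
    """Keep top-level fields matching any include pattern and no exclude pattern."""
    result = {}
    for name, value in source.items():
        included = any(fnmatchcase(name, pattern) for pattern in include)
        excluded = any(fnmatchcase(name, pattern) for pattern in exclude)
        if included and not excluded:
            result[name] = value
    return result

source = {
    "title": "Crime and Punishment",
    "otitle": "Преступлéние и наказáние",
    "author": "Fyodor Dostoevsky",
    "year": 1886,
    "tags": [],
}
print(filter_source(source, include=["t*"], exclude=["title"]))  # prints: {'tags': []}
```

For our sample document, only the tags field starts with the letter t and is not excluded, so that is the only field we would expect in the filtered source.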

Using the script fields

Elasticsearch allows us to use script-evaluated values that will be returned with the result documents (we will discuss Elasticsearch scripting capabilities in greater detail in the Scripting capabilities of Elasticsearch section in Chapter 6, Make Your Search Better). To use the script fields functionality, we add the script_fields section to our JSON query object and an object with a name of our choice for each scripted value that we want to return. For example, to return a value named correctYear, which is calculated as the year field minus 1800, we run the following query:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "script_fields" : {
    "correctYear" : {
      "script" : "doc[\"year\"].value - 1800"
    } 
  }, 
  "query" : {
    "query_string" : { "query" : "title:crime" }
  }
}'

Note

By default, Elasticsearch doesn't allow us to use dynamic scripting. If you tried the preceding query, you probably got an error with information stating that the scripts of type [inline] with operation [search] and language [groovy] are disabled. To make this example work, you should add the script.inline: on property to the elasticsearch.yml file. However, this exposes a security threat. Make sure to read the Scripting capabilities of Elasticsearch section in Chapter 6, Make Your Search Better, to learn about the consequences.

Using the doc notation, like we did in the preceding example, allows us to cache the field values and speed up script execution at the cost of higher memory consumption. We are also limited to single-valued and single-term fields. If we care about memory usage, or if we are using more complicated field values, we can always use the _source field. The same query using the _source field looks as follows:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "script_fields" : {
    "correctYear" : {
      "script" : "_source.year - 1800"
    } 
  }, 
  "query" : {
    "query_string" : { "query" : "title:crime" }
  }
}'

The following response is returned by Elasticsearch with dynamic scripting enabled:

{
  "took" : 76,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.5,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 0.5,
      "fields" : {
        "correctYear" : [ 86 ]
      }
    } ]
  }
}

As you can see, we got the calculated correctYear field in response.

Passing parameters to the script fields

Let's take a look at one more feature of the script fields: the passing of additional parameters. Instead of having the value 1800 in the equation, we can use a variable name and pass its value in the params section. If we do this, our query will look as follows:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "script_fields" : {
    "correctYear" : {
      "script" : "_source.year - paramYear",
      "params" : {
        "paramYear" : 1800
      }
    } 
  }, 
  "query" : {
    "query_string" : { "query" : "title:crime" }
  }
}'

As you can see, we added the paramYear variable as part of the scripted equation and provided its value in the params section. This allows Elasticsearch to execute the same script with different parameter values in a slightly more efficient way.
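Because only the params section changes between requests, the script text can stay constant while the value varies. The following is a minimal Python sketch of building such request bodies; correct_year_body is a hypothetical helper, and the script syntax is the one used in this chapter:

```python
import json

def correct_year_body(param_year, query="title:crime"):
    """Build a search body whose script text is fixed and whose value is a parameter."""
    return {
        "script_fields": {
            "correctYear": {
                "script": "_source.year - paramYear",
                "params": {"paramYear": param_year},
            }
        },
        "query": {"query_string": {"query": query}},
    }

# The script string is identical in both bodies; only paramYear differs.
for base_year in (1800, 1900):
    print(json.dumps(correct_year_body(base_year)))
```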
