Sorting data

So far we've run our queries and received the results in the order determined by the score of each document. However, this is not enough for every use case. It is really handy to be able to sort our results on the basis of field values. For example, when you are searching logs or time-based data in general, you probably want to have the most recent data first. In addition to that, Elasticsearch allows us to control how the documents should be sorted not only by using field values, but also by using more sophisticated methods, such as scripts or sorting on fields that have multiple values. We will cover all that in this section.

Default sorting

Let's look at the following query that returns all the books with at least one of the specified words:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "query" : {
    "terms" : {
      "title" : [ "crime", "front", "punishment" ]
    }
  }
}'

Under the hood, we can imagine that Elasticsearch sees the preceding query as follows:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "query" : {
    "terms" : {
      "title" : [ "crime", "front", "punishment" ]
    }
  },
  "sort" : { "_score" : "desc" }
}'

Look at the sort section in the preceding query. This is the default sorting used by Elasticsearch. For better visibility, we can change the formatting slightly and show this fragment as follows:

"sort" : [
  { "_score" : "desc" }
]

The preceding section defines how the documents should be sorted in the results list. In this case, Elasticsearch will show the documents with the highest score on top of the results list. The simplest modification is to reverse the ordering by changing the sort section to the following one:

 "sort" : [
   { "_score" : "asc" }
 ]
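
To make the two orderings concrete, here is a minimal Python sketch that models them in memory. It is not how Elasticsearch implements sorting; the hits and their scores are made up for illustration, and sort_by_score is a hypothetical helper name.

```python
# Hypothetical hits; the _score values are invented for illustration only.
hits = [
    {"_id": "1", "_score": 0.5},
    {"_id": "4", "_score": 1.2},
    {"_id": "2", "_score": 0.8},
]


def sort_by_score(hits, order="desc"):
    """Order hits by _score, mimicking "sort": [{"_score": order}]."""
    return sorted(hits, key=lambda h: h["_score"], reverse=(order == "desc"))


# Default behavior: highest score first.
default_ids = [h["_id"] for h in sort_by_score(hits)]
# Reversed ordering: lowest score first.
reversed_ids = [h["_id"] for h in sort_by_score(hits, "asc")]

print(default_ids)
print(reversed_ids)
```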

Selecting fields used for sorting

Default sorting is boring, isn't it? So, let's change it to sort on the basis of the values of the fields present in the documents. Let's choose the title field, which means that the sort section of our query will look as follows:

"sort" : [
  { "title" : "asc" }
]

Unfortunately, this doesn't work as expected. Although Elasticsearch sorted the documents, the ordering is somewhat strange. Look closely at the response. With every document, Elasticsearch returns information about the sorting; for example, for the Crime and Punishment book, the returned document looks like the following code:

    {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : null,
      "_source" : {
        "title" : "Crime and Punishment",
        "otitle" : "Преступлéние и наказáние",
        "author" : "Fyodor Dostoevsky",
        "year" : 1886,
        "characters" : [ "Raskolnikov", "Sofia Semyonovna Marmeladova" ],
        "tags" : [ ],
        "copies" : 0,
        "available" : true
      },
      "sort" : [ "punishment" ]
    }

If you compare the title field and the returned sorting information, everything should be clear. Elasticsearch, during the analysis process, splits the field into several tokens. Since sorting is done using a single token, Elasticsearch chooses one of the produced tokens. It does the best that it can by sorting these tokens alphabetically and choosing the first one. This is the reason why, in the sorting value, we find only a single word instead of the whole content of the title field. If you would like to see how Elasticsearch behaves when using different fields for sorting, you can try fields such as copies:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "query" : {
    "terms" : {
      "title" : [ "crime", "front", "punishment" ]
    }
  },
  "sort" : [
    { "copies" : "asc" }
  ]
}'

In general, it is a good idea to sort on a field that is not analyzed. We can use fields with multiple values for sorting, but, in most cases, it doesn't make much sense and has limited usage.

As an example of using two different fields, one for sorting and another for searching, let's change our title field. The changed title field definition will look as follows:

"title" : {
  "type": "string",
  "fields": {
    "sort": { "type" : "string", "index": "not_analyzed" }
  }
}

After changing the title field in the mappings (we've used the same mappings as in Chapter 3, Searching Your Data) and re-indexing the data, we can try sorting on the title.sort field and see whether it works. To do this, we will need to send the following query:

{
  "query" : { 
    "match_all" : { }
  },
  "sort" : [
    {"title.sort" : "asc" }
  ]
}

Now, it works properly. As you can see, we used the new title.sort field. We set it to be not analyzed, so there is a single token for that field in the Elasticsearch index.
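
The difference between the two fields can be sketched in a few lines of Python. This is only a rough model under stated assumptions: the analyzer is approximated by lowercasing and splitting on whitespace, the function names are hypothetical, and the two titles are sample data that is not part of our library index.

```python
def analyzed_sort_key(text):
    """Approximate an analyzed field: tokenize, then keep the first token alphabetically."""
    tokens = text.lower().split()  # crude stand-in for a real analyzer
    return min(tokens)


def not_analyzed_sort_key(text):
    """A not analyzed field is indexed as a single token: the whole value."""
    return text


# Hypothetical titles, chosen so that the two orderings differ.
titles = ["Zen and the Art", "Brave New World"]

by_token = sorted(titles, key=analyzed_sort_key)      # compares "and" with "brave"
by_whole = sorted(titles, key=not_analyzed_sort_key)  # compares the full titles

print(by_token)
print(by_whole)
```

Sorting on a single token puts the title starting with Z first, which is the kind of strange ordering we saw earlier; sorting on the whole value gives the ordering a reader would expect.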

Sorting mode

In the response from Elasticsearch, every document contains information about the value used for sorting. For example, let's look at one of the documents returned by the query in which we used the title field for sorting:

    {
      "_index" : "library",
      "_type" : "book",
      "_id" : "1",
      "_score" : null,
      "_source" : {
        "title" : "All Quiet on the Western Front",
        "otitle" : "Im Westen nichts Neues",
        "author" : "Erich Maria Remarque",
        "year" : 1929,
        "characters" : [ "Paul Bäumer", "Albert Kropp", "Haie Westhus", "Fredrich Müller", "Stanislaus Katczinsky", "Tjaden" ],
        "tags" : [ "novel" ],
        "copies" : 1,
        "available" : true,
        "section" : 3
      },
      "sort" : [ "all" ]
    }

The sorting used in the query that returned the preceding document was as follows:

"sort" : [
  { "title" : "asc" }
]

However, because we are sorting on an analyzed field that contains more than a single value, the sorting definition is in fact equivalent to the longer form, which looks as follows:

"sort" : [
  { "title" : { "order" : "asc", "mode" : "min" } }
]

mode defines which token should be used for comparison when sorting on a field which has more than one value. The available values we can choose from are:

  • min: Sorting will use the lowest value (or the first alphabetical value on the text based fields)
  • max: Sorting will use the highest value (or the last alphabetical value on the text based fields)
  • avg: Sorting will use the average value
  • median: Sorting will use the median value
  • sum: Sorting will use the sum of all the values in the field

    Note

    The modes such as median, avg, and sum are useful for numerical multivalued fields, but don't make much sense when it comes to text based fields.
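
The modes listed above can be modeled with a small Python sketch. It only illustrates how a single sort value is derived from several values; the MODES table and the sort_value helper are hypothetical names, and the ratings list is invented sample data.

```python
from statistics import mean, median

# Map each mode name to the function that reduces many values to one.
MODES = {"min": min, "max": max, "avg": mean, "median": median, "sum": sum}


def sort_value(values, mode="min"):
    """Reduce the values of a multivalued field to the single value used for sorting."""
    return MODES[mode](values)


# Hypothetical multivalued numeric field.
ratings = [2, 4, 9]
numeric_values = {mode: sort_value(ratings, mode) for mode in MODES}
print(numeric_values)

# On text, only min and max are meaningful; here are the tokens of one title.
tokens = ["all", "quiet", "on", "the", "western", "front"]
print(sort_value(tokens, "min"))
```

With the min mode, the tokens of All Quiet on the Western Front reduce to all, which is the sort value we saw in the response.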

Note that sort, in both the request and the response, is given as an array. This suggests that we can use several different orderings. Elasticsearch uses the next element in the sorting definition list to order the documents that have the same value for the previous sorting clause. So, if two documents have the same value in the title field, they will be sorted by the next field that we specify. For example, if we would like to get the documents with the most copies first and then sort the documents with the same number of copies by title, we will run the following query:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "query" : {
    "terms" : {
      "title" : [ "crime", "front", "punishment" ]
    }
  },
  "sort" : [
    { "copies" : "desc" }, { "title" : "asc" }
  ]
}'
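
The tie-breaking logic can be sketched in Python with a composite sort key. This is a simplified model: it compares whole titles rather than analyzed tokens, and the third book and its number of copies are hypothetical additions to the sample data.

```python
books = [
    {"title": "The Trial", "copies": 1},  # hypothetical extra document
    {"title": "Crime and Punishment", "copies": 0},
    {"title": "All Quiet on the Western Front", "copies": 1},
]


def sort_books(books):
    """Sort by copies descending, then by title ascending to break ties."""
    # Negating copies turns the ascending tuple comparison into descending order.
    return sorted(books, key=lambda b: (-b["copies"], b["title"]))


ordered_titles = [b["title"] for b in sort_books(books)]
print(ordered_titles)
```

The two books with one copy come first, ordered between themselves by title, and the book with no copies comes last.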

Specifying behavior for missing fields

What about when some of the documents that match the query don't have the field we want to sort on? By default, Elasticsearch places documents without the given field at the end of the results list, because the missing parameter defaults to _last. However, sometimes this is not exactly what we want to achieve.

When we use sorting on numeric fields, we can change the default Elasticsearch behavior for documents with missing fields. For example, let's take a look at the following query:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "query" : {
    "match_all" : { }
  },
  "sort" : [
    { 
      "section" : {
        "order" : "asc",
        "missing" : "_last"
      }
    }
  ]
}'

Note the extended form of the sort section of our query. We've added the missing parameter to it. By setting the missing parameter to _last, Elasticsearch will place the documents without the given field at the bottom of the results list. Setting the missing parameter to _first will result in Elasticsearch placing documents without the given field at the top of the results list. It is worth mentioning that besides the _last and _first values, Elasticsearch also allows us to use any number. In such a case, a document without a defined field will be treated as the document with this given value.
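
The three variants of the missing parameter can be modeled as follows. This is a minimal sketch rather than the Elasticsearch implementation; sort_with_missing is a hypothetical helper, and the document with the identifier 2 and its section value are invented for the example.

```python
def sort_with_missing(docs, field, order="asc", missing="_last"):
    """Sort docs on a numeric field, handling documents that lack the field."""
    reverse = order == "desc"
    if missing in ("_first", "_last"):
        present = [d for d in docs if field in d]
        absent = [d for d in docs if field not in d]
        ordered = sorted(present, key=lambda d: d[field], reverse=reverse)
        return absent + ordered if missing == "_first" else ordered + absent
    # Any number: treat documents without the field as if they had this value.
    return sorted(docs, key=lambda d: d.get(field, missing), reverse=reverse)


docs = [
    {"_id": "1", "section": 3},
    {"_id": "4"},                # no section field
    {"_id": "2", "section": 1},  # hypothetical document
]

last_ids = [d["_id"] for d in sort_with_missing(docs, "section", missing="_last")]
first_ids = [d["_id"] for d in sort_with_missing(docs, "section", missing="_first")]
as_two_ids = [d["_id"] for d in sort_with_missing(docs, "section", missing=2)]

print(last_ids, first_ids, as_two_ids)
```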

Dynamic criteria

As we mentioned in the previous section, Elasticsearch allows us to sort using fields that have multiple values. We can control how the comparison is made using scripts. We do that by showing Elasticsearch how to calculate the value that should be used for sorting. Let's assume that we want to sort by the first value indexed in the tags field. Let's take a look at the following example query (note that running the following query requires the script.inline property set to on in the elasticsearch.yml file):

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "query" : { 
    "match_all" : { }
  },
  "sort" : {
    "_script" : {
      "script" : "doc[\"tags\"].values.size() > 0 ? doc[\"tags\"].values[0] : \"\u19999\"",
      "type" : "string",
      "order" : "asc"
    }
  }
}'

In the preceding example, we replaced every nonexistent value with the Unicode code of a character that should be low enough in the list. The main idea of this code is to check if our array contains at least a single element. If it does, then the first value from the array is returned. If the array is empty, we return the Unicode character that should be placed at the bottom of the results list. Besides the script parameter, this option of sorting requires us to specify the order (ascending, in our case) and type parameters that will be used for the comparison (we return string from our script).
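
To see what the script computes, we can model it in Python. This is only a sketch of the logic, not of how Elasticsearch evaluates scripts; script_sort_key is a hypothetical name, and the document with the identifier 2 and its tags are invented sample data.

```python
# The fallback used when a document has no tags: the character U+1999
# followed by the digit 9, which sorts after ordinary Latin letters.
FALLBACK = "\u19999"


def script_sort_key(doc):
    """Return the first tag, or the fallback value when there are no tags."""
    tags = doc.get("tags", [])
    return tags[0] if len(tags) > 0 else FALLBACK


docs = [
    {"_id": "1", "tags": ["novel"]},
    {"_id": "4", "tags": []},
    {"_id": "2", "tags": ["classics", "novel"]},  # hypothetical document
]

# Ascending order on the computed string value.
sorted_ids = [d["_id"] for d in sorted(docs, key=script_sort_key)]
print(sorted_ids)
```

The document without tags ends up last because its computed value compares higher than any of the tags.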

Calculate scoring when sorting

By default, Elasticsearch assumes that when you use sorting, the score is completely unimportant. Usually, this is a good assumption: why do additional computations when the importance of the documents is given by the sorting formula? Sometimes, however, you want to know how good the document is in relation to the current query, even if the documents are presented in a different order. This is when the track_scores parameter should be used and set to true. An example query using it looks as follows:

curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{
  "query" : {
    "match_all" : { }
  },
  "track_scores" : true,
  "sort" : [
    { "title" : { "order" : "asc" }}
 ]
}'

The preceding query calculates the score for every document. In fact, in our example, the score is boring and is always equal to 1.0 because of the match_all query which treats all the documents as equal.
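
The effect of track_scores can be illustrated with a toy Python model. It is a sketch under simplifying assumptions, not the real search execution: search and match_all are hypothetical helpers, and the titles are compared as whole strings.

```python
def match_all(doc):
    """Model of the match_all query: every document gets the same score."""
    return 1.0


def search(docs, score_fn, sort_key, track_scores=False):
    """Return hits in sort order; compute _score only when track_scores is set."""
    hits = []
    for doc in sorted(docs, key=sort_key):
        score = score_fn(doc) if track_scores else None  # None models JSON null
        hits.append({"_source": doc, "_score": score})
    return hits


books = [
    {"title": "Crime and Punishment"},
    {"title": "All Quiet on the Western Front"},
]


def by_title(doc):
    return doc["title"]


without_scores = search(books, match_all, by_title)
with_scores = search(books, match_all, by_title, track_scores=True)

print([h["_score"] for h in without_scores])
print([h["_score"] for h in with_scores])
```

Without track_scores, the model leaves the score empty, just like the null value we saw in the earlier responses; with it, every hit carries the score of 1.0.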
