Chapter 8. Beyond Full-text Searching

The previous chapter was fully dedicated to data analysis and how we can perform it with Elasticsearch. We learned how to use aggregations, what types of aggregations are available, and how to use each of them. In this chapter, we will return to query-related topics. By the end of this chapter, you will have learned the following topics:

  • What the percolator is and how to use it
  • What geospatial capabilities Elasticsearch offers
  • How to use and build functionalities using Elasticsearch suggesters
  • How to use the Scroll API to efficiently fetch large numbers of results

Percolator

Have you ever wondered what would happen if we reversed the traditional model of using queries to find documents in Elasticsearch? Does it make sense to have a document and search for queries matching it? It is not surprising that there is a whole range of use cases where this model is very useful. Whenever you operate on an unbounded stream of input data and search for the occurrences of particular events, you can use this approach. It can be used for the detection of failures in a monitoring system, or for the "Tell me when a product matching the defined criteria becomes available in this shop" functionality. In this section, we will look at how the Elasticsearch percolator works and how we can use it to implement one of the aforementioned use cases.

The index

In all the examples used when discussing the percolator functionality, we will use an index called notifier. We create this index by using the following command:

curl -XPOST 'localhost:9200/notifier' -d '{
  "mappings": {
    "book" : {
      "properties" : {
        "title" : {
          "type" : "string"
        },
        "otitle" : {
          "type" : "string"
        },
        "year" : {
          "type" : "integer"
        },
        "available" : {
          "type" : "boolean"
        },
        "tags" : {
          "type" : "string",
          "index" : "not_analyzed"
        }
      }
    }
  }
}'

It is quite simple. It contains a single type and five fields, which will be used during our journey through the world of the percolator.

Percolator preparation

Elasticsearch exposes a special type called .percolator that is treated differently. This means that we can store documents in it and also search it like an ordinary type in any index. If you look at any Elasticsearch query, you will notice that each is a valid JSON document, which means that we can index and store it as a document as well. The thing is that the percolator allows us to invert the search logic and search for queries that match a given document. This is possible because of the two features just discussed: the special .percolator type and the fact that queries in Elasticsearch are valid JSON documents.

Let's get back to the library example from Chapter 2, Indexing Your Data, and try to index one of the queries in the percolator. We assume that our users need to be informed when any book matching the criteria defined by the query is available.

Look at the following query1.json file that contains an example query generated by the user:

{
  "query" : {
    "bool" : {
      "must" : {
        "term" : {
          "title" : "crime"
        }
      },
      "should" : {
        "range" : {
          "year" : {
            "gt" : 1900,
            "lt" : 2000
          }
        }
      },
      "must_not" : {
        "term" : {
          "otitle" : "nothing"
        }
      }
    }
  }
}

To enhance the example, we also assume that our users are allowed to define filters using our hypothetical user interface. For example, our user may be interested in the available books that were written before the year 2010. An example query that could have been constructed by such a user interface would look as follows (the query was written to the query2.json file):

{
  "query" : {
    "bool": {
      "must" : {
        "range" : {
          "year" : {
            "lt" : 2010
          }
        }
      },
      "filter" : {
        "term" : {
          "available" : true
        }
      }
    }
  }
}

Now, let's register both queries in the percolator (note that we are registering the queries and haven't indexed any documents). In order to do this, we will run the following commands:

curl -XPUT 'localhost:9200/notifier/.percolator/1' -d @query1.json
curl -XPUT 'localhost:9200/notifier/.percolator/old_books' -d @query2.json

In the preceding examples, we used two completely different identifiers. We did that in order to show that we can use an identifier that best describes the query. It is up to us to decide under which name we would like the query to be registered.

We are now ready to use our percolator. Our application will provide documents to the percolator and check if any of the already registered queries match each document. This is exactly what a percolator allows us to do: reverse the search logic. Instead of indexing the documents and running queries against them, we store the queries and send the documents to find the matching queries.

Let's use an example document that will match both stored queries; it will have the required title and the release date, and will mention whether it is currently available. The command to send such a document to the percolator looks as follows:

curl -XGET 'localhost:9200/notifier/book/_percolate?pretty' -d '{
  "doc" : {
    "title": "Crime and Punishment",
    "otitle": "Преступлéние и наказáние",
    "author": "Fyodor Dostoevsky",
    "year": 1886,
    "characters": ["Raskolnikov", "Sofia Semyonovna Marmeladova"],
    "tags": [],
    "copies": 0,
    "available" : true
  }
}'

As we expected, both queries matched and the Elasticsearch response includes the identifiers of the matching queries. Such a response looks as follows:

{
  "took" : 36,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "total" : 2,
  "matches" : [ {
    "_index" : "notifier",
    "_id" : "old_books"
  }, {
    "_index" : "notifier",
    "_id" : "1"
  } ]
}

This works like a charm. One very important thing to note is the endpoint used in this query: _percolate. Using this endpoint is required when we want to use the percolator. The index name corresponds to the index where the queries were stored, and the type is equal to the type defined in the mappings.

Note

The response format contains information about the index and the query identifier. This information is included for cases when we search against multiple indices at once. When using a single index, adding an additional query parameter, percolate_format=ids, will change the response as follows:

  "matches" : [ "old_books", "1" ]

Getting deeper

Because the queries registered in a percolator are in fact documents, we can use a normal query sent to Elasticsearch in order to choose which queries stored in the .percolator type should be used in the percolation process. This may sound weird, but it really gives a lot of possibilities. In our library, we can have several groups of users. Let's assume that some of them have permissions to borrow very rare books, or that we have several branches in the city and the user can declare where he or she would like to get the book from.

Let's see how such use cases can be implemented by using the percolator. To do this, we will need to update our mapping and include the branch information. We do that by running the following command:

curl -XPOST 'localhost:9200/notifier/.percolator/_mapping' -d '{
  ".percolator" : {
    "properties" : {
      "branches" : {
        "type" : "string",
        "index" : "not_analyzed"
      }
    }
  }
}'

Now, in order to register a query, we use the following command:

curl -XPUT 'localhost:9200/notifier/.percolator/3' -d '{
  "query" : {
    "term" : {
      "title" : "crime"
    }
  },
  "branches" : ["brA", "brB", "brD"]
}'

In the preceding example, we registered a query that shows a user's interest. Our hypothetical user is interested in any book with the term crime in the title field (the term query is responsible for this). He or she wants to borrow this book from one of the three listed branches. When specifying the mappings, we defined that the branches field is a non-analyzed string field. We can now include a query along with the document we sent previously. Let's look at how to do this.

Our book system just got the book, and it is ready to report the book and check whether the book is of interest to anyone. To check this, we send the document that describes the book and add an additional query to such a request - the query that will limit the users to only the ones interested in the brB branch. Such a request looks as follows:

curl -XGET 'localhost:9200/notifier/book/_percolate?pretty' -d '{
  "doc" : {
    "title": "Crime and Punishment",
    "otitle": "Преступлéние и наказáние",
    "author": "Fyodor Dostoevsky",
    "year": 1886,
    "characters": ["Raskolnikov", "Sofia Semyonovna Marmeladova"],
    "tags": [],
    "copies": 0,
    "available" : true
  },
  "size" : 10,
  "filter" : {
    "term" : {
      "branches" : "brB"
    }
  }
}'

If everything was executed correctly, the response returned by Elasticsearch should look as follows (we indexed our query with 3 as an identifier):

{
  "took" : 27,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "total" : 1,
  "matches" : [ {
    "_index" : "notifier",
    "_id" : "3"
  } ]
}

Controlling the size of returned results

When it comes to the percolator, the size of the returned results matters. The more queries a single document matches, the more results are returned and the more memory Elasticsearch needs. Because of this, there is one additional thing to note: the size parameter, which allows us to limit the number of matches returned.
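For instance, a request limiting the output to a single matching query could look like the following sketch (the document is a shortened version of our book; the total count of matching queries is still reported in the response, only the matches section is truncated):

```shell
# Percolate a document, returning at most one matching query
curl -XGET 'localhost:9200/notifier/book/_percolate?pretty' -d '{
  "doc" : {
    "title": "Crime and Punishment",
    "available" : true
  },
  "size" : 1
}'
```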

Percolator and score calculation

In the previous examples, we filtered our queries using a single term query, but we didn't think about the scoring process at all. Elasticsearch allows us to calculate the score when using the percolator. Let's change the previously used document sent to the percolator and adjust it so that scoring is used:

curl -XGET 'localhost:9200/notifier/book/_percolate?pretty' -d '{
  "doc" : {
    "title": "Crime and Punishment",
    "otitle": "Преступлéние и наказáние",
    "author": "Fyodor Dostoevsky",
    "year": 1886,
    "characters": ["Raskolnikov", "Sofia Semyonovna Marmeladova"],
    "tags": [],
    "copies": 0,
    "available" : true
  },
  "size" : 10,
  "query" : {
    "term" : {
      "branches" : "brB"
    }
  },
  "track_scores" : true,
  "sort" : {
    "_score" : "desc"
  }
}'

As you can see, we used the query section and included an additional track_scores attribute set to true. This is needed because, by default, Elasticsearch won't calculate the score for the documents, for performance reasons. If we need scores in the percolation process, we should be aware that such queries will be slightly more demanding in terms of CPU processing power than the ones that omit score calculation.

Note

In the preceding example, we told Elasticsearch to sort our results on the basis of the score, in descending order. This is the default behavior when track_scores is turned on, so we can omit the sort declaration. At the time of writing, sorting on the score in descending order is the only available option.

Combining percolators with other functionalities

If we are allowed to use queries along with the documents sent for percolation, why not use other Elasticsearch functionalities? Of course, this is possible. For example, the following document is sent along with an aggregation, and the results will include the aggregation calculation:

curl -XGET 'localhost:9200/notifier/book/_percolate?pretty' -d '{
  "doc": {
    "title": "Crime and Punishment",
    "available": true
  },
  "aggs" : {
    "test" : {
      "terms" : {
        "field" : "branches"
      }
    }
  }
}'

As we can see, the percolator allows us to run queries and aggregations. Highlighting is supported as well. Look at the following example document:

curl -XGET 'localhost:9200/notifier/book/_percolate?pretty' -d '{
  "doc": {
    "title": "Crime and Punishment",
    "year": 1886,
    "available": true
  },
  "size" : 10,
  "highlight": {
    "fields": {
      "title": {}
    }
  }
}'

As you can see, it contains a highlighting section. A fragment of the response returned by Elasticsearch looks as follows:

  {
    "_index" : "notifier",
    "_id" : "3",
    "highlight" : {
      "title" : [ "<em>Crime</em> and Punishment" ]
    }
  }

Note

Note that there are some limitations when it comes to the query types supported by the percolator functionality. In the current implementation, parent-child relations are not available in the percolator, so you can't use queries such as has_child, top_children, and has_parent.

Getting the number of matching queries

Sometimes you don't care which queries matched and only want to know how many of them did. In such cases, sending a document to the standard percolator endpoint is not efficient. Elasticsearch exposes the _percolate/count endpoint to handle such cases in an efficient way. An example of such a command follows:

curl -XGET 'localhost:9200/notifier/book/_percolate/count?pretty' -d '{
  "doc" : { ... }
}'
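A complete count request could look like the following sketch, reusing a shortened version of the book document we percolated earlier. The response contains the total number of matching queries, but no matches section:

```shell
# Count the queries matching a document without listing them
curl -XGET 'localhost:9200/notifier/book/_percolate/count?pretty' -d '{
  "doc" : {
    "title": "Crime and Punishment",
    "year": 1886,
    "available" : true
  }
}'
```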

Indexed document percolation

To close the percolation section, we want to show you one more thing: the possibility of percolating a document that is already indexed. To do this, we need to use the GET operation on the document and provide information about which percolator index should be used. Let's look at the following command:

curl -XGET 'localhost:9200/library/book/1/_percolate?percolate_index=notifier'

This command checks the document with identifier 1 from our library index against the percolator in the index defined by the percolate_index parameter. Remember that, by default, Elasticsearch uses the percolator in the same index as the document; that's why we've specified the percolate_index parameter.
