Introduction to routing

By default, Elasticsearch tries to distribute your documents evenly among all the shards of the index. However, that's not always the desired situation. To retrieve the documents, Elasticsearch must query all the shards and merge the results. What if we could divide our data on some basis (for example, the client identifier) and use that information to put data with the same properties in the same place in the cluster? Elasticsearch allows us to do that by exposing a powerful mechanism for controlling document and query distribution: routing. In short, it allows us to choose the shard to be used to index or search the data.

Default indexing

When you send a document for indexing, Elasticsearch looks at its identifier to choose the shard in which the document should be indexed. By default, Elasticsearch calculates the hash value of the document's identifier and, on the basis of that, puts the document in one of the available primary shards. The document is then copied to the replicas of that shard. The following diagram shows a simple illustration of how indexing works by default:

Default indexing
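
The default shard selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: zlib.crc32 is a stand-in for Elasticsearch's internal hash function (which is different), and pick_shard and the shard count of 5 are hypothetical names and values chosen for this example.

```python
import zlib


def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    """Map a document identifier to a primary shard number.

    crc32 is a stand-in hash for illustration only; it is NOT the
    hash function Elasticsearch uses internally.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


if __name__ == "__main__":
    # Assumed shard count for the example.
    for doc_id in ["1", "2", "3", "4", "5"]:
        print(doc_id, "->", "shard", pick_shard(doc_id, 5))
```

The important property is that the same identifier always maps to the same shard, while different identifiers spread across all the primary shards.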

Default searching

Searching is a bit different from indexing because, in most situations, you need to query all the shards to get the data you are interested in, at least in the initial scatter phase of the query (we will talk about that in Chapter 3, Searching Your Data). Imagine a situation in which you have the following mappings describing your index:

   {
     "mappings" : {
       "post" : {
         "properties" : {
           "id" : { "type" : "long" },
           "name" : { "type" : "string" },
           "contents" : { "type" : "string" },
           "userId" : { "type" : "long" }
         }
       }
     }
   }

As you can see, our index consists of four fields: the identifier (the id field), name of the document (the name field), contents of the document (the contents field), and the identifier of the user to which the documents belong (the userId field). To get all the documents for a particular user, one with userId equal to 12, you can run the following query:

curl -XGET 'http://localhost:9200/posts/_search?q=userId:12'

Elasticsearch will run your query according to the search type (we will talk more about it in Chapter 3, Searching Your Data). Usually, this means that it first queries all the shards for the identifiers and scores of the matching documents, and then sends a second internal query only to the relevant shards (the ones containing the needed documents) to fetch the documents needed to build the response.

A very simplified view of how the default searching works during its initial phase is shown in the following illustration:

Default searching

What if we could put all the documents for a single user into a single shard and query only that shard? Wouldn't that be wise for performance? Yes, it would, and that is what routing allows you to do.

Routing

Routing controls which shard your documents and queries are forwarded to. By now, you will probably have guessed that we can specify the routing value both during indexing and during querying. In fact, if you decide to specify explicit routing values, you'll probably want to do so in both cases.

In our case, we will use the userId value to set routing during indexing and the same value will be used during searching. Because we will use the same routing value for all the documents for a single user, the same hash value will be calculated and thus all the documents for that particular user will be placed in the same shard. Using the same value during search will result in searching a single shard instead of the whole index.

There is one thing you should remember when using routing during searching: routing alone is not enough. You should also add a query part that limits the returned documents to the ones for the given user. This is because you'll probably have more distinct routing values than the number of shards in your index. For example, your index can be built of 10 shards while you have hundreds of users. It is impossible to dedicate a single shard to each user, and it is usually not a good idea from a scaling point of view either. As a result, a few distinct routing values can point to the same shard; in our case, the data of a few users will be placed in the same shard. Because of that, we need a query part that limits the data to a particular user identifier, such as a term query.
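
To see why routing alone is not enough, here is a small in-memory simulation in Python. It is a sketch, not Elasticsearch code: zlib.crc32 stands in for the real internal hash, the dictionary of lists stands in for shards, and the names shard_for, index_docs, and search are hypothetical. It uses 10 shards and 200 users to mirror the example above.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 10  # same number of shards as in the example in the text


def shard_for(routing: str, num_shards: int = NUM_SHARDS) -> int:
    # Stand-in hash for illustration; not Elasticsearch's real hash function.
    return zlib.crc32(routing.encode("utf-8")) % num_shards


def index_docs(docs):
    """Place every document in the shard chosen by its userId routing value."""
    shards = defaultdict(list)
    for doc in docs:
        shards[shard_for(str(doc["userId"]))].append(doc)
    return shards


def search(shards, user_id, filter_by_user=True):
    """Look only at the routed shard; optionally filter by userId as well."""
    candidates = shards[shard_for(str(user_id))]
    if filter_by_user:
        return [d for d in candidates if d["userId"] == user_id]
    return list(candidates)


# 1000 documents spread over 200 hypothetical users.
docs = [{"id": i, "userId": i % 200} for i in range(1000)]
shards = index_docs(docs)

routed_only = search(shards, 12, filter_by_user=False)
routed_and_filtered = search(shards, 12)

print("routing only:", len(routed_only), "documents")
print("routing + filter:", len(routed_and_filtered), "documents")
```

With 200 users and only 10 shards, at least one shard must hold the data of 20 or more users, so a search that relies on routing alone may return documents that belong to other users. The filter on userId plays the role of the term query mentioned above.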

The following diagram shows a very simple illustration of how searching works with a provided custom routing value:

Routing

As you can see, Elasticsearch will send our query to a single shard. Now let's look at how we can specify the routing values.

The routing parameters

The idea is very simple. The endpoints used for all the operations connected with fetching or storing documents in Elasticsearch allow us to use an additional parameter called routing. You can add it to your HTTP request or set it by using the client library of your choice.

So, in order to index a sample document to the previously shown index, we will use the following command:

curl -XPUT 'http://localhost:9200/posts/post/1?routing=12' -d '{
  "id": "1",
  "name": "Test document",
  "contents": "Test document",
  "userId": "12"
}'

If we now get back to our previous query fetching our user's data and we modify it to use routing, it would look as follows:

curl -XGET 'http://localhost:9200/posts/_search?routing=12&q=userId:12'

As you can see, the same routing value was used during indexing and querying. This is possible in most cases when routing is used. We know which user data we are indexing and we will probably know which user is searching for the data. In our case, our imaginary user was given the identifier of 12 and we used that value during indexing and searching.

Note that during searching you can specify multiple routing values separated by commas. For example, if we want the preceding query to be additionally routed by the value of the section parameter (if it existed) and we also want to filter by this parameter, our query will look like the following:

curl -XGET 'http://localhost:9200/posts/_search?routing=12,6654&q=userId:12+AND+section:6654'

Of course, the preceding command can now match multiple shards, because the values given to routing can point to different shards. This is also why you must provide only a single routing value during indexing: Elasticsearch needs to be pointed to a single shard, or indexing will fail. A search, on the other hand, can cover multiple shards at the same time, so multiple routing values can be provided.
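
The way a comma-separated routing parameter widens a search can be sketched in Python as follows. This is an illustration only: zlib.crc32 is a stand-in for the hash Elasticsearch really uses, the shard count of 5 is an assumption, and shard_for and shards_for_search are hypothetical helper names.

```python
import zlib


def shard_for(routing: str, num_shards: int) -> int:
    # Stand-in hash for illustration; not Elasticsearch's real hash function.
    return zlib.crc32(routing.encode("utf-8")) % num_shards


def shards_for_search(routing_param: str, num_shards: int) -> set:
    """Resolve a comma-separated routing parameter to the set of shards to query."""
    values = [v for v in routing_param.split(",") if v]
    return {shard_for(v, num_shards) for v in values}


if __name__ == "__main__":
    # Assumed: an index with 5 primary shards.
    print(shards_for_search("12", 5))       # always exactly one shard
    print(shards_for_search("12,6654", 5))  # one or two shards
```

A single routing value always resolves to exactly one shard, while two values resolve to one or two shards depending on whether their hashes happen to collide.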

Note

Remember that routing is not the only thing that is required to get results for a given user. That's because we usually have far fewer shards than unique routing values, which means that a single shard will hold data from multiple users. So, when using routing, you should also narrow down your results to the ones for a given user. You'll learn more about how you can do that in Chapter 3, Searching Your Data.

Routing fields

Specifying the routing value with each request is critical when using an index operation. Without it, Elasticsearch falls back to the default way of determining where the document should be stored: it uses the hash value of the document identifier. This may lead to a situation where one document exists in many versions on different shards. A similar problem may occur when fetching a document. If the document was stored with a given routing value and we fetch it without that value, we may hit the wrong shard and the document may not be found.

Elasticsearch allows us to change the default behavior and make the routing value mandatory for a given type. To do that, we need to add the following section to our type definition:

   "_routing" : {
     "required" : true
   }

The preceding definition means that the routing value needs to be provided (the "required": true property); without it, an index request will fail.
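
To show where this section fits, the following Python sketch assembles the mappings shown earlier in this section together with the _routing section and prints the result as JSON. It assumes that _routing sits next to properties inside the post type, which is how we read "type definition" here; build_index_body is a hypothetical helper name.

```python
import json


def build_index_body(require_routing: bool = True) -> dict:
    """Build the mappings for the posts index, optionally requiring routing."""
    post_type = {
        "properties": {
            "id": {"type": "long"},
            "name": {"type": "string"},
            "contents": {"type": "string"},
            "userId": {"type": "long"},
        }
    }
    if require_routing:
        # Assumed placement: alongside "properties" in the type definition.
        post_type["_routing"] = {"required": True}
    return {"mappings": {"post": post_type}}


body = build_index_body()
print(json.dumps(body, indent=2))
```

The printed JSON is the kind of body you would send when creating the index.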
