Routing explained

In the Choosing the right amount of shards and replicas section in this chapter, we mentioned routing as a way to limit the shards on which a query is executed, possibly to a single one. Now it's time to look closer at this functionality.

Shards and data

Usually, it is not important how Elasticsearch divides data into shards and which shard holds a particular document. During query time, the query will be sent to all the shards of a particular index, so the only crucial thing is to use an algorithm that spreads our data evenly so that each shard contains a similar amount of data. We don't want one shard to hold 99 percent of the data while the other shard holds the rest; that would not be efficient.

The situation becomes slightly more complicated when we want to remove a document or add a newer version of it. Elasticsearch must be able to determine which shard should be updated. Although it may seem troublesome, in practice, it is not a huge problem. It is enough to use a sharding algorithm that always generates the same value for the same document identifier. If we have such an algorithm, Elasticsearch will know which shard to point to when dealing with a document.

However, there are times when it would be nice to be able to hit the same shard for some portion of data. For example, we would like to store every book of a particular type only on a particular shard and, while searching for that kind of book, we could avoid searching on many shards and merging results from them. Instead, because we know the value we used for routing, we could point Elasticsearch to the same shard we used during indexing. This is exactly what routing does. It allows us to provide information that will be used by Elasticsearch to determine which shard should be used for document storage and for querying; the same routing value will always result in the same shard. It's basically something like saying "search for documents on the shard where you've put the documents by using the provided routing value".
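The idea described above can be sketched in a few lines of shell. This is only an illustration: Elasticsearch uses its own internal hash function, and the POSIX cksum utility stands in for it here. The shard_for function name is made up for this sketch.

```shell
# Illustrative stand-in: Elasticsearch uses its own internal hash function;
# the POSIX cksum CRC is used here only to show the idea.
shard_for() {
  # $1 = routing value (the document identifier unless a routing value is given)
  # $2 = number of primary shards in the index
  crc=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
  echo $(( crc % $2 ))
}

# Default behavior: the document identifier decides the shard.
for id in 1 2 3 4; do
  echo "document $id -> shard $(shard_for "$id" 2)"
done

# With routing: documents that share a routing value share a shard.
for routing in A B A A; do
  echo "routing $routing -> shard $(shard_for "$routing" 2)"
done
```

Because the result depends only on the input value and the number of shards, the same value always leads to the same shard, both during indexing and during querying.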

Let's test routing

To show you an example that will illustrate how Elasticsearch allocates shards and which documents are placed on the particular shard, we will use an additional plugin. It will help us visualize what Elasticsearch did with our data. Let's install the Paramedic plugin using the following command:

bin/plugin -install karmi/elasticsearch-paramedic

After restarting Elasticsearch, we can point our browser to http://localhost:9200/_plugin/paramedic/index.html and we will be able to see a page with various statistics and information about indices. For our example, the most interesting information is the cluster color that indicates the cluster state and the list of shards and replicas next to every index.

Let's start two Elasticsearch nodes and create an index by running the following command:

curl -XPUT 'localhost:9200/documents' -d '{
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 2
  }
}'

We've created an index without replicas, which is built of two shards. This means that the index can be spread across at most two nodes; any additional node will hold no data from this index unless we increase the number of replicas (you can read about this in the Choosing the right amount of shards and replicas section of this chapter). The next operation is to index some documents; we will do that by using the following commands:

curl -XPUT localhost:9200/documents/doc/1 -d '{ "title" : "Document No. 1" }'
curl -XPUT localhost:9200/documents/doc/2 -d '{ "title" : "Document No. 2" }'
curl -XPUT localhost:9200/documents/doc/3 -d '{ "title" : "Document No. 3" }'
curl -XPUT localhost:9200/documents/doc/4 -d '{ "title" : "Document No. 4" }'

After that, if we look at the Paramedic plugin, we will see our two primary shards created and assigned.

In the information about nodes, we can also find what we are currently interested in: each of the nodes in the cluster holds exactly two documents. This leads us to the conclusion that the sharding algorithm did its work perfectly, and we have an index built of shards with evenly distributed documents.

Now, let's create some chaos and shut down the second node. Paramedic should now show the changed state of the cluster.

The first thing we see is that the cluster is now in the red state. This means that at least one primary shard is missing, so some of the data in the index is not available. Nevertheless, Elasticsearch allows us to execute queries; it is our decision as to what the application should do: inform the user about the possibility of incomplete results or block querying attempts. Let's try to run a simple query by using the following command:

curl -XGET 'localhost:9200/documents/_search?pretty'

The response returned by Elasticsearch will look as follows:

{
  "took" : 26,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "documents",
      "_type" : "doc",
      "_id" : "2",
      "_score" : 1.0,
      "_source":{ "title" : "Document No. 2" }
    }, {
      "_index" : "documents",
      "_type" : "doc",
      "_id" : "4",
      "_score" : 1.0,
      "_source":{ "title" : "Document No. 4" }
    } ]
  }
}

As you can see, Elasticsearch returned information about the failure: the _shards section shows that only one of the two shards answered successfully, so one of the shards is not available. In the returned result set, we can only see the documents with identifiers of 2 and 4. The other documents are unavailable, at least until the failed primary shard is back online. If you start the second node, after a while (depending on the network and gateway module settings), the cluster should return to the green state and all documents should be available. Now, we will try to do the same using routing, and we will try to observe the difference in the Elasticsearch behavior.
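An application that wants to warn the user about incomplete results could compare the total and successful counters in the _shards section of the response. The following is a minimal sketch under two assumptions: the response is pretty-printed as in the preceding listing, and simple sed matching is good enough for a demonstration (a real application should use a proper JSON parser). The function names are made up for this sketch.

```shell
shards_field() {
  # $1 = search response body, $2 = counter name inside "_shards"
  # Keep only the "_shards" block, then pull the number out of the matching line.
  printf '%s\n' "$1" |
    sed -n '/"_shards"/,/}/p' |
    sed -n 's/.*"'"$2"'" *: *\([0-9][0-9]*\).*/\1/p' |
    head -n 1
}

all_shards_answered() {
  # Succeeds (exit status 0) only when every shard answered the query.
  total=$(shards_field "$1" total)
  ok=$(shards_field "$1" successful)
  [ -n "$total" ] && [ "$total" = "$ok" ]
}

# Example use (needs a running node):
#   response=$(curl -s -XGET 'localhost:9200/documents/_search?pretty')
#   all_shards_answered "$response" || echo "Warning: results may be incomplete"
```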

Indexing with routing

With routing, we can control the target shard Elasticsearch will choose to send the documents to by specifying the routing parameter. The value of the routing parameter is irrelevant; you can use whatever value you choose. The important thing is that the same value of the routing parameter should be used to place different documents together in the same shard. To put it simply, using the same routing value for different documents ensures that these documents will be placed in the same shard.

There are a few possibilities as to how we can provide the routing information to Elasticsearch. The simplest way is to add the routing URI parameter when indexing a document, for example:

curl -XPUT 'localhost:9200/books/doc/1?routing=A' -d '{ "title" : "Document" }'

Of course, we can also provide the routing value when using bulk indexing. In such cases, routing is given in the metadata for each document by using the _routing property, for example:

curl -XPUT localhost:9200/_bulk --data-binary '
{ "index" : { "_index" : "books", "_type" : "doc", "_routing" : "A" }}
{ "title" : "Document" }
'

Another option is to place a _routing field inside the document. However, this will work properly only when the _routing field is defined in the mappings. For example, let's create an index called books_routing by using the following command:

curl -XPUT 'localhost:9200/books_routing' -d '{
  "mappings": {
    "doc": {
      "_routing": {
        "required": true,
        "path": "_routing"
      },
      "properties": {
        "title" : {"type": "string" }
      }
    }
  }
}'

Now we can use _routing inside the document body, for example, like this:

curl -XPUT localhost:9200/books_routing/doc/1 -d '{ "title" : "Document", "_routing" : "A" }'

In the preceding example, we used a _routing field. It is worth mentioning that the path parameter can point to any non-analyzed field in the document. This is a very powerful feature and one of the main advantages of routing. For example, if we extend our document with a library_id field that indicates the library where the book is available, it is logical that all queries based on library can be more efficient when we set up routing based on this library_id field. However, you have to remember that getting the routing value from a field requires additional parsing.
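A mapping that takes the routing value from the library_id field could look like the following sketch. It assumes the same mapping syntax as the books_routing example; the index name books_by_library and the helper function names are made up for illustration, and library_id is mapped as not_analyzed so that it can serve as the routing path.

```shell
# Assumptions: same mapping syntax as the books_routing example;
# the index name books_by_library is hypothetical.
library_mapping='{
  "mappings": {
    "doc": {
      "_routing": {
        "required": true,
        "path": "library_id"
      },
      "properties": {
        "title": { "type": "string" },
        "library_id": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}'

create_library_index() {
  curl -XPUT 'localhost:9200/books_by_library' -d "$library_mapping"
}

index_book() {
  # $1 = document identifier, $2 = title, $3 = library identifier (routing source)
  curl -XPUT "localhost:9200/books_by_library/doc/$1" \
    -d "{ \"title\" : \"$2\", \"library_id\" : \"$3\" }"
}

# Example calls (need a running node):
#   create_library_index
#   index_book 1 "Document" "library-7"
```

With such a mapping, the routing value does not have to be repeated in the URI; it is read from the document itself, at the cost of the additional parsing mentioned above.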

Routing in practice

Now let's get back to our initial example and repeat what we did, this time using routing. The first thing is to delete the old documents. If we do not do this and add documents with the same identifiers, routing may cause a document to be placed in a different shard than before, leaving the old copy behind in the original shard. Therefore, we run the following command to delete all the documents from our index:

curl -XDELETE 'localhost:9200/documents/_query?q=*:*'

After that, we index our data again, but this time, we add the routing information. The commands used to index our documents now look as follows:

curl -XPUT 'localhost:9200/documents/doc/1?routing=A' -d '{ "title" : "Document No. 1" }'
curl -XPUT 'localhost:9200/documents/doc/2?routing=B' -d '{ "title" : "Document No. 2" }'
curl -XPUT 'localhost:9200/documents/doc/3?routing=A' -d '{ "title" : "Document No. 3" }'
curl -XPUT 'localhost:9200/documents/doc/4?routing=A' -d '{ "title" : "Document No. 4" }'

As we said, the routing parameter tells Elasticsearch in which shard the document should be placed. Of course, it may happen that documents with different routing values will be placed in the same shard. That's because you usually have fewer shards than routing values. If we now kill one node, Paramedic will again show the cluster in the red state. Let's query for all the documents by using the following command (the result depends on which node you kill):

curl -XGET 'localhost:9200/documents/_search?q=*&pretty'

The response from Elasticsearch would be as follows:

{
  "took" : 24,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "documents",
      "_type" : "doc",
      "_id" : "1",
      "_score" : 1.0,
      "_source":{ "title" : "Document No. 1" }
    }, {
      "_index" : "documents",
      "_type" : "doc",
      "_id" : "3",
      "_score" : 1.0,
      "_source":{ "title" : "Document No. 3" }
    }, {
      "_index" : "documents",
      "_type" : "doc",
      "_id" : "4",
      "_score" : 1.0,
      "_source":{ "title" : "Document No. 4" }
    } ]
  }
}

In our case, the document with the identifier 2 is missing. We lost a node with the documents that had the routing value of B. If we were less lucky, we could lose three documents!

Querying

Routing allows us to tell Elasticsearch which shards should be used for querying. Why send queries to all the shards that build the index if we want to get data from a particular subset of the whole index? For example, to get the data from a shard where routing A was used, we can run the following query:

curl -XGET 'localhost:9200/documents/_search?pretty&q=*&routing=A'

We just added a routing parameter with the value we are interested in. Elasticsearch replied with the following result:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "documents",
      "_type" : "doc",
      "_id" : "1",
      "_score" : 1.0, "_source" : { "title" : "Document No. 1" }
    }, {
      "_index" : "documents",
      "_type" : "doc",
      "_id" : "3",
      "_score" : 1.0, "_source" : { "title" : "Document No. 3" }
    }, {
      "_index" : "documents",
      "_type" : "doc",
      "_id" : "4",
      "_score" : 1.0, "_source" : { "title" : "Document No. 4" }
    } ]
  }
}

Everything works like a charm. But look closer! We forgot to start the node that holds the shard with the documents that were indexed with the routing value of B. Even though we didn't have a full index view, the reply from Elasticsearch doesn't contain information about shard failures. This is proof that queries with routing hit only a chosen shard and ignore the rest. If we run the same query with routing=B, we will get an exception like the following one:

{
  "error" : "SearchPhaseExecutionException[Failed to execute phase [query_fetch], all shards failed]",
  "status" : 503
}

We can test the preceding behavior by using the Search Shards API. For example, let's run the following command:

curl -XGET 'localhost:9200/documents/_search_shards?pretty&routing=A' -d '{"query":{"match_all":{}}}'

The response from Elasticsearch would be as follows:

{
  "nodes" : {
    "QK5r_d5CSfaV1Wx78k633w" : {
      "name" : "Western Kid",
      "transport_address" : "inet[/10.0.2.15:9301]"
    }
  },
  "shards" : [ [ {
    "state" : "STARTED",
    "primary" : true,
    "node" : "QK5r_d5CSfaV1Wx78k633w",
    "relocating_node" : null,
    "shard" : 0,
    "index" : "documents"
  } ] ]
}

As we can see, only a single node will be queried.

There is one important thing that we would like to repeat. Routing ensures that, during indexing, documents with the same routing value are indexed in the same shard. However, you need to remember that a given shard may have many documents with different routing values. Routing allows you to limit the number of shards used during queries, but it cannot replace filtering! This means that a query with routing and without routing should have the same set of filters. For example, if we use user identifiers as routing values, then when we search for that user's data, we should also include a filter on that identifier.
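Combining routing with a filter, as described above, could be sketched as follows. The user_id field and the function names are hypothetical; the sketch assumes that every document stores the user identifier in user_id and that the same value was used as the routing value during indexing. The filtered query is used to wrap the filter.

```shell
# Hypothetical field: user_id holds the same value that is used for routing.
user_search_body() {
  # $1 = user identifier
  cat <<EOF
{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": { "term": { "user_id": "$1" } }
    }
  }
}
EOF
}

search_user() {
  # Routing narrows the search to one shard; the filter drops documents of
  # other users that happen to live in the same shard.
  curl -s -XGET "localhost:9200/documents/_search?pretty&routing=$1" \
    -d "$(user_search_body "$1")"
}

# Example call (needs a running node):
#   search_user A
```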

Aliases

If you work as a search engine specialist, you probably want to hide some configuration details from programmers in order to allow them to work faster and not care about search details. In an ideal world, they should not worry about routing, shards, and replicas. Aliases allow us to use shards with routing as ordinary indices. For example, let's create an alias by running the following command:

curl -XPOST 'http://localhost:9200/_aliases' -d '{
  "actions" : [
    {
      "add" : {
        "index" : "documents",
        "alias" : "documentsA",
        "routing" : "A"
      }
    }
  ]
}'

In the preceding example, we created an alias named documentsA for the documents index. In addition, operations that use this alias will be limited to the shard that routing value A points to. Thanks to this approach, you can give information about the documentsA alias to developers, and they may use it for querying and indexing like any other index.

Multiple routing values

Elasticsearch gives us the possibility to search with several routing values in a single query. Depending on which shards hold the documents with the given routing values, it could mean searching on one or more shards. Let's look at the following query:

curl -XGET 'localhost:9200/documents/_search?routing=A,B'

After executing it, Elasticsearch will send the search request to two shards in our index (which in our case, happens to be the whole index), because the routing value of A covers one of two shards of our index and the routing value of B covers the second shard of our index.

Of course, multiple routing values are supported in aliases as well. The following example shows you the usage of these features:

curl -XPOST 'http://localhost:9200/_aliases' -d '{
  "actions" : [
    {
      "add" : {
        "index" : "documents",
        "alias" : "documentsA",
        "search_routing" : "A,B",
        "index_routing" : "A"
      }
    }
  ]
}'

The preceding example shows you two additional configuration parameters we didn't talk about until now—we can define different values of routing for searching and indexing. In the preceding case, we've defined that during querying (the search_routing parameter) two values of routing (A and B) will be applied. When indexing (index_routing parameter), only one value (A) will be used. Note that indexing doesn't support multiple routing values, and you should also remember proper filtering (you can add it to your alias).
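An alias that carries both the routing value and the matching filter could be sketched as follows. The user_id field, the alias name documentsUserA used in the example call, and the function names are made up for illustration; the sketch assumes that documents store the routing value in a user_id field.

```shell
alias_actions() {
  # $1 = index name, $2 = alias name, $3 = routing value, also used in the filter
  cat <<EOF
{
  "actions": [
    {
      "add": {
        "index": "$1",
        "alias": "$2",
        "routing": "$3",
        "filter": { "term": { "user_id": "$3" } }
      }
    }
  ]
}
EOF
}

create_filtered_alias() {
  curl -XPOST 'http://localhost:9200/_aliases' -d "$(alias_actions "$1" "$2" "$3")"
}

# Example call (needs a running node):
#   create_filtered_alias documents documentsUserA A
```

Searches that go through such an alias hit only the shard chosen by the routing value and return only the documents that match the filter, so developers do not have to repeat either of them in their queries.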
