The indices that live inside your Elasticsearch cluster can be built of multiple shards, and each shard can have multiple replicas. The ability to divide a single index into multiple shards lets us spread the data across multiple physical instances. The reasons for doing this vary: we may want to parallelize indexing to get more throughput, we may want smaller shards so that our queries are faster, or we may simply have too many documents to fit on a single machine. With replicas, we can parallelize the query load by having multiple physical copies of each shard. We can say that, using shards and replicas, we scale Elasticsearch out. However, Elasticsearch has to figure out where in the cluster to place shards and replicas, that is, on which server/node each shard or replica should end up.
One of the most common use cases for explicitly controlling shard and replica allocation in Elasticsearch is time-based data, that is, logs. Each log event has a timestamp associated with it, and in most organizations the amount of logs is enormous. You need a lot of processing power to index them, but you don't usually search historical data. Of course, you may want to do that, but such queries are less frequent than queries against the most recent data.
Because of this, we can divide the cluster into two so-called tiers: the hot tier and the cold tier. The hot tier contains the more powerful nodes, ones that have very fast disks, lots of CPU processing power, and plenty of memory. These nodes handle both heavy indexing and queries for recent data. The cold tier, on the other hand, contains nodes that have very large disks but are not very fast. We won't be indexing into the cold tier; we will only store our historical indices there and search them from time to time. With the default Elasticsearch behavior, we can't be sure where the shards and replicas will be placed, but luckily Elasticsearch allows us to control this.
The idea is to create the indices that hold today's data on the hot nodes and, when we stop writing to them (when the next day starts), update the index settings so that they are moved to the cold tier. Let's now see how we can do this.
So let's divide our cluster into two tiers. We say tiers, but they can have any name you want; we just like the term "tier" and it is commonly used. We assume that we have six nodes. We want our more powerful nodes, numbered 1 and 2, to be placed in the tier called hot, and the nodes numbered 3, 4, 5, and 6, which are smaller in terms of CPU and memory but very large in terms of disk space, to be placed in a tier called cold.
To configure this, we add the following property to the elasticsearch.yml configuration file on nodes 1 and 2 (the more powerful ones):

node.tier: hot

Of course, we add a similar property to the elasticsearch.yml configuration file on nodes 3, 4, 5, and 6 (the less powerful ones):

node.tier: cold
Now let's create our daily index for today's data, called logs_2015-12-10. As we said earlier, we want it placed on the nodes in the hot tier. We do this by running the following command:

curl -XPUT 'http://localhost:9200/logs_2015-12-10' -d '{
  "settings" : {
    "index" : {
      "routing.allocation.include.tier" : "hot"
    }
  }
}'
The preceding command will result in the creation of the logs_2015-12-10 index with the index.routing.allocation.include.tier property set for it. We set this property to the hot value, which means that we want to place the logs_2015-12-10 index on the nodes that have the node.tier property set to hot.
Now, when the day ends and we need to create a new index, we again put it on the hot nodes. We do this by running the following command:
curl -XPUT 'http://localhost:9200/logs_2015-12-11' -d '{
  "settings" : {
    "index" : {
      "routing.allocation.include.tier" : "hot"
    }
  }
}'
Finally, we need to tell Elasticsearch to move the index holding the previous day's data to the cold tier. We do this by updating the index settings and setting the index.routing.allocation.include.tier property to cold. This is done using the following command:

curl -XPUT 'http://localhost:9200/logs_2015-12-10/_settings' -d '{
  "index.routing.allocation.include.tier" : "cold"
}'
After running the preceding command, Elasticsearch will start relocating the logs_2015-12-10 index to the nodes that have the node.tier property set to cold in their elasticsearch.yml file, without any manual work needed from us.
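The daily cycle described above (create today's index on the hot tier, move yesterday's index to the cold tier) can be sketched as a small helper that produces the index name and the two request bodies. The logs_YYYY-MM-DD naming follows the examples; the helper functions themselves are illustrative, not part of any Elasticsearch API:

```python
import datetime
import json

def daily_index_name(day: datetime.date) -> str:
    # Follows the naming convention used in the examples, e.g. logs_2015-12-10.
    return "logs_" + day.isoformat()

def creation_body(tier: str) -> str:
    # Body for creating today's index pinned to the given tier.
    return json.dumps({
        "settings": {"index": {"routing.allocation.include.tier": tier}}
    })

def move_body(tier: str) -> str:
    # Body for the _settings update that relocates an existing index.
    return json.dumps({"index.routing.allocation.include.tier": tier})

today = datetime.date(2015, 12, 11)
yesterday = today - datetime.timedelta(days=1)
print(daily_index_name(today))      # logs_2015-12-11
print(creation_body("hot"))
print(daily_index_name(yesterday))  # logs_2015-12-10
print(move_body("cold"))
```

A cron job or index lifecycle tool would run this once per day, PUT-ting the creation body to the new index and the move body to the previous day's _settings endpoint.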
In the same manner as we specified on which nodes an index should be placed, we can also exclude nodes from index allocation. Referring to the previous example, if we want the logs_2015-12-10 index not to be placed on the nodes with the node.tier property set to cold, we would run the following command:

curl -XPUT 'localhost:9200/logs_2015-12-10/_settings' -d '{
  "index.routing.allocation.exclude.tier" : "cold"
}'
Notice that instead of the index.routing.allocation.include.tier property, we've used the index.routing.allocation.exclude.tier property.
In addition to inclusion and exclusion rules, we can also specify rules that must all match in order for a shard to be allocated to a given node. The difference is that when using the index.routing.allocation.include property, the index will be placed on any node that matches at least one of the provided property values. Using index.routing.allocation.require, Elasticsearch will place the index only on nodes that have all the defined values. For example, let's assume that we've set the following settings for the logs_2015-12-10 index:

curl -XPUT 'localhost:9200/logs_2015-12-10/_settings' -d '{
  "index.routing.allocation.require.tier" : "hot",
  "index.routing.allocation.require.disk_type" : "ssd"
}'
After running the preceding command, Elasticsearch would only place the shards of the logs_2015-12-10 index on nodes with the node.tier property set to hot and the node.disk_type property set to ssd.
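The include/exclude/require semantics just described can be illustrated with a short sketch. The attribute names (tier, disk_type) mirror the examples above, but the matching logic is a simplified model, not the real Elasticsearch implementation:

```python
# Simplified model of allocation filters: exclude rules out matching
# nodes, require demands an exact match on every attribute, include
# accepts a node matching at least one of the comma-separated values.

def node_allowed(node_attrs, include=None, exclude=None, require=None):
    # exclude: a node matching any excluded value is ruled out.
    for attr, values in (exclude or {}).items():
        if node_attrs.get(attr) in values.split(","):
            return False
    # require: the node must match every required attribute exactly.
    for attr, value in (require or {}).items():
        if node_attrs.get(attr) != value:
            return False
    # include: the node must match at least one of the listed values.
    for attr, values in (include or {}).items():
        if node_attrs.get(attr) not in values.split(","):
            return False
    return True

hot_ssd = {"tier": "hot", "disk_type": "ssd"}
cold_hdd = {"tier": "cold", "disk_type": "hdd"}
print(node_allowed(hot_ssd, require={"tier": "hot", "disk_type": "ssd"}))   # True
print(node_allowed(cold_hdd, require={"tier": "hot", "disk_type": "ssd"}))  # False
print(node_allowed(cold_hdd, exclude={"tier": "cold"}))                     # False
```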
Instead of adding a special parameter to the node configuration, we can use IP addresses to specify which nodes to include in or exclude from shard and replica allocation. To do this, instead of the tier part of the index.routing.allocation.include.tier or index.routing.allocation.exclude.tier properties, we use the special _ip property. For example, if we would like our logs_2015-12-10 index to be placed only on the nodes with the 10.1.2.10 and 10.1.2.11 IP addresses, we would run the following command:

curl -XPUT 'localhost:9200/logs_2015-12-10/_settings' -d '{
  "index.routing.allocation.include._ip" : "10.1.2.10,10.1.2.11"
}'
In addition to the allocation filtering methods described so far, Elasticsearch gives us disk-based shard allocation, which lets us set allocation rules based on the nodes' disk usage.
There are four properties that control the behavior of disk-based shard allocation. All of them can be updated dynamically or set in the elasticsearch.yml configuration file.
The first of these is cluster.info.update.interval, which defaults to 30 seconds and defines how often Elasticsearch updates information about disk usage on the nodes.
The second property is cluster.routing.allocation.disk.watermark.low, which defaults to 0.85. This means that Elasticsearch will not allocate new shards to a node that uses more than 85% of its disk space.
The third property is cluster.routing.allocation.disk.watermark.high, which controls when Elasticsearch will start relocating shards away from a given node. It defaults to 0.90, which means that Elasticsearch will start relocating shards when the disk usage on a given node reaches or exceeds 90%.
Both the cluster.routing.allocation.disk.watermark.low and cluster.routing.allocation.disk.watermark.high properties can be set to a percentage value (such as 0.60, meaning 60%) or to an absolute value (such as 600mb, meaning 600 megabytes).
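A small sketch can make the two watermark forms concrete. It assumes (as in recent Elasticsearch documentation) that a ratio watermark refers to used disk space while an absolute watermark refers to the amount of free space that must remain; the function itself is illustrative:

```python
# Sketch of a watermark check: ratio values ("0.85") compare against
# used space, absolute values ("600mb") compare against free space.
# This is a simplified model, not the Elasticsearch implementation.

UNITS = {"kb": 1024, "mb": 1024**2, "gb": 1024**3}

def breaches_watermark(used_bytes, total_bytes, watermark):
    for suffix, factor in UNITS.items():
        if watermark.endswith(suffix):
            # Absolute watermark: breached when free space drops below it.
            free_bytes = total_bytes - used_bytes
            return free_bytes < float(watermark[:-len(suffix)]) * factor
    # Otherwise treat the watermark as a used-space ratio.
    return used_bytes / total_bytes >= float(watermark)

# A node with 90 GB used out of 100 GB breaches the default 0.85 low watermark:
print(breaches_watermark(90 * 1024**3, 100 * 1024**3, "0.85"))   # True
# With 50 GB free, a 600mb absolute watermark is not breached:
print(breaches_watermark(50 * 1024**3, 100 * 1024**3, "600mb"))  # False
```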
Finally, the last property is cluster.routing.allocation.disk.include_relocations, which defaults to true. It tells Elasticsearch to take into account shards that are still in the process of being copied to a node as if they were already there. With this behavior turned on, the disk-based allocation mechanism is more pessimistic about available disk space while shards are relocating, but we won't run into situations where shards can't be relocated because the assumptions about disk space were wrong.
Disk-based shard allocation is enabled by default. We can disable it by setting the cluster.routing.allocation.disk.threshold_enabled property to false. We can do this in the elasticsearch.yml file or dynamically using the cluster settings API:

curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient" : {
    "cluster.routing.allocation.disk.threshold_enabled" : false
  }
}'
In addition to the allocation filters, we can also specify the maximum number of shards of a single index that can be placed on a single node. For example, if we would like our logs_2015-12-10 index to have at most a single shard per node, we would run the following command:

curl -XPUT 'localhost:9200/logs_2015-12-10/_settings' -d '{
  "index.routing.allocation.total_shards_per_node" : 1
}'
This property can be set in the elasticsearch.yml file or updated on live indices using the preceding command. Please remember that your cluster can stay in the red state if Elasticsearch is not able to allocate all the primary shards.
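The red-state risk is easy to see with a toy feasibility check: if an index has more shard copies (primaries plus replicas) than nodes multiplied by the per-node limit, some shards cannot be allocated. This check is illustrative, not something Elasticsearch exposes:

```python
# Toy feasibility check for index.routing.allocation.total_shards_per_node.

def can_allocate_all(num_shard_copies, num_nodes, total_shards_per_node):
    # Each node offers at most total_shards_per_node "slots" for this index.
    return num_shard_copies <= num_nodes * total_shards_per_node

print(can_allocate_all(6, 4, 1))  # False: six copies, only four slots
print(can_allocate_all(4, 4, 1))  # True
```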
The Elasticsearch allocation mechanism can be throttled, which means that we can control how many resources Elasticsearch uses during the shard allocation and recovery process. There are five properties that control this behavior:
cluster.routing.allocation.node_concurrent_recoveries: This property defines how many concurrent shard recoveries may happen at the same time on a node. It defaults to 2 and should be increased if you would like more shards to be recovered at the same time on a single node. However, increasing this value results in more resource consumption during recovery. Also, remember that during replica recovery, data is copied from other nodes over the network, which can be slow.

cluster.routing.allocation.node_initial_primaries_recoveries: This property defaults to 4 and defines how many primary shards are recovered at the same time on a given node. Because primary shard recovery uses data from local disks, this process should be very fast.

cluster.routing.allocation.same_shard.host: A Boolean property that defaults to false and applies only when multiple Elasticsearch nodes are started on the same machine. When set to true, Elasticsearch checks, based on host name and host address, whether physical copies of the same shard would end up on a single physical machine, and prevents such allocation. The default false value means no check is performed.

indices.recovery.concurrent_streams: The number of network streams that can be used concurrently on a single node to copy data from other nodes. The more streams, the faster the data is copied, but at the cost of more resource consumption. This property defaults to 3.

indices.recovery.concurrent_small_file_streams: Similar to the indices.recovery.concurrent_streams property, but defines how many concurrent streams Elasticsearch uses to copy small files (ones under 5mb in size). This property defaults to 2.
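For reference, the five throttling properties above can be bundled into a single cluster settings request. This sketch only builds the JSON body; the values shown are the defaults quoted above:

```python
import json

# Transient cluster settings body with the recovery-throttling defaults
# discussed above; PUT this to /_cluster/settings to apply it.
body = json.dumps({
    "transient": {
        "cluster.routing.allocation.node_concurrent_recoveries": 2,
        "cluster.routing.allocation.node_initial_primaries_recoveries": 4,
        "cluster.routing.allocation.same_shard.host": False,
        "indices.recovery.concurrent_streams": 3,
        "indices.recovery.concurrent_small_file_streams": 2,
    }
})
print(body)
```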
In addition to the per-index allocation settings, Elasticsearch also allows us to control shard and index allocation on a cluster-wide basis, the so-called shard allocation awareness. This is especially useful when we have nodes in different physical racks and we would like primary shards and their replicas to be placed in different racks.
Let's start with a simple example. We assume that we have a cluster built of four nodes, each in a different physical rack. The simple graphic that illustrates this is as follows:
As you can see, our cluster is built from four nodes. Each node is bound to a specific IP address, and each node has been given a tag property and a group property (added to elasticsearch.yml as the node.tag and node.group properties). This cluster will serve the purpose of showing how shard allocation filtering works. The group and tag properties can be given whatever names you want; you just need to prefix the property name of your choice with node, so, for example, if you would like to use a party property name, you would just add node.party: party1 to your elasticsearch.yml file.
Allocation awareness allows us to configure shard and replica allocation with the use of generic parameters. In order to illustrate how allocation awareness works, we will use our example cluster. For the example to work, we should add the following property to the elasticsearch.yml file:

cluster.routing.allocation.awareness.attributes: group

This tells Elasticsearch to use the node.group property as the awareness parameter.
After this, let's start the first two nodes, the ones with the node.group parameter equal to groupA, and create an index by running the following command:

curl -XPOST 'localhost:9200/awareness' -d '{
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 1
    }
  }
}'
After this command, our two-node cluster will look more or less like this:
As you can see, the index was divided evenly between the two nodes. Now let's see what happens when we launch the rest of the nodes (the ones with node.group set to groupB):
Notice the difference: the primary shard was not moved from its original node, but the replica was relocated to a node with a different node.group value. This is expected: when using shard allocation awareness, Elasticsearch won't allocate a primary shard and its replicas to nodes with the same value of the property used to determine the allocation awareness (node.group in our case).
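The awareness rule just described can be sketched as a tiny placement function: a replica is never placed in the same node.group as the primary. The group names come from the example cluster; the function itself is a simplified illustration:

```python
# Minimal sketch of allocation awareness: pick a group for a replica
# that differs from the primary's group; with no such group available,
# the replica stays unassigned (None).

def pick_group_for_replica(groups, primary_group):
    for group in groups:
        if group != primary_group:
            return group
    return None

print(pick_group_for_replica(["groupA", "groupB"], "groupA"))  # groupB
print(pick_group_for_replica(["groupA"], "groupA"))            # None
```

The None case mirrors what we saw before the groupB nodes were started: with only groupA available, Elasticsearch keeps the replica unassigned rather than co-locating it with its primary.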
Forcing allocation awareness can come in handy when we know in advance how many values our awareness attribute can take and we don't want more replicas than needed to be allocated in our cluster, for example, so as not to overload the cluster with too many replicas. For this, we can force allocation awareness to be active only for certain attribute values. We specify these values using the cluster.routing.allocation.awareness.force.group.values property (the attribute name, group in our case, is part of the property name), providing a comma-separated list of values. For example, if we would like allocation awareness to use only the groupA and groupB values of the node.group property, we would add the following to the elasticsearch.yml file:

cluster.routing.allocation.awareness.attributes: group
cluster.routing.allocation.awareness.force.group.values: groupA,groupB
Elasticsearch allows us to configure allocation for the entire cluster or at the index level. In the case of cluster-wide allocation, we can use the following property prefixes:
cluster.routing.allocation.include
cluster.routing.allocation.require
cluster.routing.allocation.exclude
When it comes to index-specific allocation, we can use the following property prefixes:
index.routing.allocation.include
index.routing.allocation.require
index.routing.allocation.exclude
The previously mentioned prefixes can be used with the properties that we've defined in the elasticsearch.yml file (our tag and group properties) and with a special property called _ip that allows us to include or exclude nodes by their IP addresses, for example:

cluster.routing.allocation.include._ip: 192.168.2.1
If we would like to include nodes whose group property matches the groupA value, we would set the following property:

cluster.routing.allocation.include.group: groupA

Notice that we've used the cluster.routing.allocation.include prefix and concatenated it with the name of the property, which is group in our case.
If you look closely at the preceding parameters, you will notice that there are three kinds:

include: This type results in including all the nodes that match the defined value. If multiple include conditions are present, all the nodes that match at least one of those conditions will be taken into consideration when allocating shards. For example, if we add two cluster.routing.allocation.include.tag parameters to our configuration, one with the value node1 and a second with the value node2, we would end up with the index (actually its shards) being allocated to the first and second nodes (counting from left to right). To sum up, nodes that match an include condition will be taken into consideration by Elasticsearch when choosing the nodes to place shards on, but this doesn't mean that Elasticsearch will put shards on all of them.

require: This type of allocation filter, introduced in Elasticsearch 0.90, requires nodes to match all of the defined values. For example, if we add a cluster.routing.allocation.require.tag parameter with the value node1 and a cluster.routing.allocation.require.group parameter with the value groupA to our configuration, we would end up with shards allocated only to the first node (the one with the IP address 192.168.2.1).

exclude: This allows us to exclude nodes with given property values from the allocation process. For example, if we set cluster.routing.allocation.exclude.group to groupA, we would end up with indices being allocated only to the nodes with the IP addresses 192.168.3.1 and 192.168.3.2 (the third and fourth nodes in our example).

Property values can also use simple wildcard characters. For example, if we want to include all the nodes whose group parameter value begins with group, we could set the cluster.routing.allocation.include.group property to group*. In our example cluster, this would match the nodes with the groupA and groupB group parameter values.
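These simple wildcards behave like glob patterns, which Python's standard library can demonstrate directly (rackC is an invented non-matching value for the example):

```python
from fnmatch import fnmatch

# group* matches both groupA and groupB from the example cluster.
groups = ["groupA", "groupB", "rackC"]
matched = [g for g in groups if fnmatch(g, "group*")]
print(matched)  # ['groupA', 'groupB']
```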
The last thing we wanted to discuss is the ability to manually move shards between nodes. Elasticsearch exposes the _cluster/reroute REST endpoint, which allows us to control that. The available operations are moving a shard from one node to another (move), cancelling an on-going allocation (cancel), and allocating an unassigned shard to a given node (allocate). Let's look closely at each of them.
Let's say we have two nodes called es_node_one and es_node_two, Elasticsearch has placed both shards of the shop index on the first node, and we would like to move the second shard to the second node. In order to do this, we can run the following command:

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands" : [
    {
      "move" : {
        "index" : "shop",
        "shard" : 1,
        "from_node" : "es_node_one",
        "to_node" : "es_node_two"
      }
    }
  ]
}'
We've specified the move command, which allows us to move shards (and replicas) of the index given by the index property. The shard property is the number of the shard we want to move. Finally, the from_node property specifies the name of the node we want to move the shard from, and the to_node property specifies the name of the node the shard should be placed on.
If we would like to cancel an on-going allocation process, we can run the cancel command and specify the index, node, and shard for which we want to cancel the allocation. For example:

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands" : [
    {
      "cancel" : {
        "index" : "shop",
        "shard" : 0,
        "node" : "es_node_one"
      }
    }
  ]
}'
The preceding command would cancel the allocation of shard 0 of the shop index on the es_node_one node.
In addition to cancelling and moving shards and replicas, we can also allocate an unallocated shard to a specific node. For example, if we have an unallocated shard numbered 0 for the users index and we would like Elasticsearch to allocate it to es_node_two, we would run the following command:

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands" : [
    {
      "allocate" : {
        "index" : "users",
        "shard" : 0,
        "node" : "es_node_two"
      }
    }
  ]
}'
We can, of course, include multiple commands in a single HTTP request. For example:
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands" : [
    {
      "move" : {
        "index" : "shop",
        "shard" : 1,
        "from_node" : "es_node_one",
        "to_node" : "es_node_two"
      }
    },
    {
      "cancel" : {
        "index" : "shop",
        "shard" : 0,
        "node" : "es_node_one"
      }
    }
  ]
}'
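When scripting such maintenance, it can be convenient to compose the reroute body programmatically. This sketch builds the same multi-command body as above; the helper functions are illustrative, while the command shapes (move, cancel) follow the examples:

```python
import json

# Helpers for composing _cluster/reroute command bodies.

def move(index, shard, from_node, to_node):
    return {"move": {"index": index, "shard": shard,
                     "from_node": from_node, "to_node": to_node}}

def cancel(index, shard, node):
    return {"cancel": {"index": index, "shard": shard, "node": node}}

body = json.dumps({"commands": [
    move("shop", 1, "es_node_one", "es_node_two"),
    cancel("shop", 0, "es_node_one"),
]})
print(body)
```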
There is one more thing we would like to discuss when it comes to shard and replica allocation: handling rolling restarts. When an Elasticsearch node is restarted, it may take some time for it to rejoin the cluster. During this time, the rest of the cluster may decide to rebalance and move shards around. When we know we are doing a rolling restart, for example, to update Elasticsearch to a new version or to install a plugin, we may want to tell Elasticsearch about it. The procedure for restarting each node should be as follows:
First, before you do any maintenance, you should stop the allocation by sending the following command:
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient" : {
    "cluster.routing.allocation.enable" : "none"
  }
}'
This will tell Elasticsearch to stop allocation. After this, we will stop the node we want to do maintenance on and start it again. After it joins the cluster, we can enable the allocation again by running the following:
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient" : {
    "cluster.routing.allocation.enable" : "all"
  }
}'
This will enable the allocation again. This procedure should be repeated for each node we want to perform maintenance on.