Chapter 9. Elasticsearch Cluster in Detail

The previous chapter was fully dedicated to search functionalities that go beyond full text searching. We learned how to use the percolator – a reversed search that allows us to build alerting functionalities on top of Elasticsearch. We learned to use the spatial functionalities of Elasticsearch and we used the suggest API, which allowed us to correct users' spelling mistakes as well as build very efficient autocomplete functionalities. Let's now focus on running and administering Elasticsearch. By the end of this chapter, you will have learned the following topics:

  • How Elasticsearch finds new nodes that should join the cluster
  • What the gateway and recovery modules are
  • How do templates work
  • How to use dynamic templates
  • How to use the Elasticsearch plugin mechanism
  • What are the caches in Elasticsearch and how to tune them
  • How to use the Update Settings API to update Elasticsearch settings on running clusters

Understanding node discovery

When starting your Elasticsearch node, one of the first things that happens is looking for a master node that has the same cluster name and is visible. If a master is found, the node joins an already formed cluster. If no master is found, the node itself is elected as the master (of course, if the configuration allows such behavior). The process of forming a cluster and finding nodes is called discovery. The module responsible for discovery has two main purposes: electing a master and discovering new nodes within a cluster. In this section, we will discuss how we can configure and tune the discovery module.

Discovery types

By default, without installing additional plugins, Elasticsearch allows us to use Zen discovery, which provides us with unicast discovery. Unicast (http://en.wikipedia.org/wiki/Unicast) allows the transmission of a single message over the network to a single host at once. An Elasticsearch node sends the message to the nodes defined in the configuration and waits for a response. When the node is accepted into the cluster, the recovery module kicks in and starts the recovery process if needed, or the master election process if a master has not been elected yet.

Note

Prior to Elasticsearch 2.0, the Zen discovery module allowed us to use multicast discovery. On a multicast capable network, Elasticsearch was able to automatically discover nodes without specifying the IP addresses of the other Elasticsearch servers sharing the same cluster name. This was very error prone and not advised for production use, and thus it was deprecated and moved to a plugin.

Elasticsearch architecture is designed to be peer to peer. When running operations such as indexing or searching, the master node doesn't take part in communication and the relevant nodes communicate with each other directly.

Node roles

Elasticsearch nodes can be configured to work in one of the following roles:

  • Master: The node responsible for maintaining the global cluster state, changing it depending on the needs, and handling the addition and removal of nodes. There can only be a single master node active in a single cluster.
  • Data: The node responsible for holding the data and executing data related operations (indexation and searching) on the shards that are present locally for the node.
  • Client: The node responsible for handling requests. For an indexing request, the client node forwards the request to the appropriate primary shard and, for a search request, it sends it to all the relevant shards and aggregates the results.

By default, each node can work as a master, data, or client node. For example, it can be a data node and a client node at the same time. On large and highly loaded clusters, it is very important to divide the roles of the nodes in the cluster and have each node play only a single role. When dealing with such clusters, you will often see at least three master nodes, multiple data nodes, and a few client only nodes as part of the whole cluster.

Master node

It is the most important node type from the Elasticsearch cluster's point of view. It handles the cluster state, changes it, manages the nodes joining and leaving the cluster, checks the health of the other nodes in the cluster (by running ping requests), and manages the shard relocation operations. If the master is somehow disconnected from the cluster, the remaining nodes will elect a new master from among themselves. All these processes are done automatically on the basis of the configuration values we provide. You usually want the master nodes to only communicate with the other Elasticsearch nodes, using the internal Java communication. To avoid hitting the master nodes by mistake, it is advised to turn off the HTTP module for them in the configuration.

Data node

The data node is responsible for holding the data in the indices. The data nodes are the ones that need the most disk space, because they are loaded with data by the indexing requests and run searches on the data they hold locally. The data nodes, similar to the master nodes, can have the HTTP module disabled.

Client node

The client nodes are in most cases nodes that don't hold any data and are not master nodes. The client nodes are the ones that communicate with the outside world and with all the nodes in the cluster. They forward the data to the appropriate shards and aggregate the search and aggregation results from all the other nodes.

Keep in mind that client nodes can have data as well, but in such a case they will run both the indexing requests and the search requests for the local data and will aggregate the data from the other nodes, which in large clusters may be too much work for a single node.

Configuring node roles

By default, Elasticsearch allows every node to be a master node, a data node, or a client node. However, as we already mentioned, in certain situations you may want to have nodes that only hold data, client nodes that are only used to process requests, and master nodes that only manage the cluster. One such situation is when massive amounts of data need to be handled and the data nodes should be as performant as possible. To tell Elasticsearch what role it should take, we use three Boolean properties set in the elasticsearch.yml configuration file:

  • node.master: When set to true, we tell Elasticsearch that the node is master eligible, which means that it can take the role of a master. However, note that a node automatically stops being master eligible as soon as it is assigned the client role.
  • node.data: When set to true, we tell Elasticsearch that the node can be used to hold data.
  • node.client: When set to true, we tell Elasticsearch that the node should be used as a client.

So, to set a node to only hold data, we should add the following properties to the elasticsearch.yml configuration file:

node.master: false
node.data: true
node.client: false

To set the node to not hold data and only be a master node, we need to instruct Elasticsearch that we don't want the node to hold data. In order to do this, we add the following properties to the elasticsearch.yml configuration file:

node.master: true
node.data: false
node.client: false
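
Similarly, to run a node purely as a client that neither holds data nor is master eligible, a minimal sketch of the configuration could look as follows:

node.master: false
node.data: false
node.client: true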

Setting the cluster's name

If we don't set the cluster.name property in our elasticsearch.yml file, Elasticsearch uses elasticsearch as the default value. This is not a good thing, because each new Elasticsearch node will have the same cluster name and you may want to have multiple clusters in the same network. In such a case, connecting the wrong nodes together is just a matter of time. Because of that, we suggest setting the cluster.name property to some other value of your choice. Usually, it is a good idea to adjust cluster names based on cluster responsibilities.
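
For example, a cluster dedicated to logs could be named after its responsibility by adding a line similar to the following one (the name itself is, of course, just an illustration) to elasticsearch.yml:

cluster.name: logs_production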

Zen discovery

The default discovery method used by Elasticsearch and one that is commonly used in the Elasticsearch world is called Zen discovery. It supports unicast discovery and allows adjusting various parts of its configuration.

Note

Note that there are additional discovery types available as plugins, such as Amazon EC2 discovery, Microsoft Azure discovery, and Google Compute Engine discovery.

Master election configuration

Imagine that you have a cluster that is built of 10 nodes. Everything is working fine until one day your network fails and 3 of your nodes are disconnected from the cluster, but they still see each other. Because of the Zen discovery and master election process, the nodes that got disconnected elect a new master and you end up with two clusters with the same name, each with its own master node. Such a situation is called a split-brain and you must avoid it as much as possible. When split-brain happens, you end up with two (or more) clusters that won't join each other until the network (or any other) problems are fixed. The thing to remember is that split-brain may result in unrecoverable errors, such as data conflicts, in which you end up with data corruption or partial data loss. That's why it is important to avoid such situations at all costs.

In order to prevent split-brain situations, Elasticsearch provides a discovery.zen.minimum_master_nodes property. This property defines the minimum amount of master eligible nodes that should be connected to each other in order to form a cluster. So now let's get back to our cluster; if we set the discovery.zen.minimum_master_nodes property to 50 percent of the total nodes available + 1 (which is 6 in our case), we will end up with a single cluster. Why is that? Before the network failure, we had 10 nodes, which is more than six nodes, and those nodes formed a cluster. After the disconnection of the three nodes, we would still have the first cluster up and running. However, because only three nodes got disconnected and three is less than six, these three nodes wouldn't be allowed to elect a new master and they would wait for reconnection with the original cluster.

Of course, this is also not a perfect scenario. It is advised to have dedicated master eligible nodes that don't work as data or client nodes. To have a quorum in such a case, we need at least three dedicated master eligible nodes, because that allows us to have a single master eligible node offline and still keep the quorum. This is usually enough to keep the cluster in a good shape when it comes to master related features and to be split-brain proof. Of course, in such a case, the discovery.zen.minimum_master_nodes property should be set to 2 and we should have the three master eligible nodes up and running.
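
To illustrate, for the ten node cluster described earlier we would add the following line to elasticsearch.yml, while for the setup with three dedicated master eligible nodes the value would be 2:

discovery.zen.minimum_master_nodes: 6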

Furthermore, Elasticsearch allows us to specify two additional Boolean properties: discovery.zen.master_election.filter_client and discovery.zen.master_election.filter_data. They allow us to tell Elasticsearch to ignore ping requests from the client and data nodes during master election. By default, the first mentioned property is set to true and the second is set to false. This allows Elasticsearch to focus on the master election and not be overloaded with ping requests from the nodes that are not master eligible.

In addition to the mentioned properties, Elasticsearch allows configuring the timeouts related to the master election process. discovery.zen.ping_timeout, which defaults to 3s (three seconds), allows configuring the timeout for slow networks – the higher the value, the lower the chance of failure, but the longer the election process can take. The second property is called discovery.zen.join_timeout and specifies the timeout for the join request sent to the master. It defaults to 20 times the value of the discovery.zen.ping_timeout property.
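
As a sketch, a cluster running on a slow or unreliable network could raise the election related timeouts along the following lines (the values are only illustrative):

discovery.zen.ping_timeout: 10s
discovery.zen.join_timeout: 60s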

Configuring unicast

Because of the way unicast works, we need to specify at least one host that the unicast message should be sent to. To do this, we should add the discovery.zen.ping.unicast.hosts property to our elasticsearch.yml configuration file. Basically, we should specify all the hosts that form the cluster in the discovery.zen.ping.unicast.hosts property (we don't have to specify all the hosts, we just need to provide enough of them so that we are sure at least one will work). For example, if we want our node to send the unicast message to the hosts 192.168.2.1, 192.168.2.2, and 192.168.2.3, we should specify the preceding property in the following way:

discovery.zen.ping.unicast.hosts: 192.168.2.1:9300, 192.168.2.2:9300, 192.168.2.3:9300

One can also define a range of ports Elasticsearch can use. For example, to say that ports from 9300 to 9399 can be used, we specify the following:

discovery.zen.ping.unicast.hosts: 192.168.2.1:[9300-9399], 192.168.2.2:[9300-9399], 192.168.2.3:[9300-9399]

Note that the hosts are separated with a comma character and we've specified the port on which we expect unicast messages.

Fault detection ping settings

In addition to the settings discussed previously, we can also control or alter the default ping configuration. Ping is a signal sent between the nodes to check if they are running and responsive. The master node pings all the other nodes in the cluster and each of the other nodes in the cluster pings the master node. The following properties can be set:

  • discovery.zen.fd.ping_interval: This defaults to 1s (one second) and specifies how often the nodes ping each other
  • discovery.zen.fd.ping_timeout: This defaults to 30s (30 seconds) and defines how long a node will wait for the response to its ping message before considering a node as unresponsive
  • discovery.zen.fd.ping_retries: This defaults to 3 and specifies how many retries should be taken before considering a node as not working

If you experience some problems with your network, or you know that your nodes need more time to see the ping response, you can adjust the preceding values to the ones that are good for your deployment.
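
For example, a deployment on a slow or congested network could relax the fault detection settings along the following lines (again, the values are only illustrative):

discovery.zen.fd.ping_interval: 5s
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 5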

Cluster state updates control

As we have already discussed, the master node is the one responsible for handling the changes of the cluster state and Elasticsearch allows us to control that process. For most use cases, the default settings are more than enough, but you may run into situations where changing the settings is required.

The master node processes a single cluster state command at a time. First, the master node propagates the changes to the other nodes and then it waits for their responses. A cluster state change is not considered finished until enough nodes respond to the master with an acknowledgment. The number of nodes that need to respond is specified by discovery.zen.minimum_master_nodes, which we are already aware of. The maximum time an Elasticsearch node waits for the nodes to respond is 30s by default and is specified by the discovery.zen.commit_timeout property. If not enough nodes respond to the master, the cluster state change is rejected.

Once enough nodes respond to the master publish message, the cluster state change is accepted on the master and the cluster state is changed. Once that is done, the master sends a message to all the nodes saying that the change can be applied. The timeout of this message is again set to 30 seconds and is controlled using the discovery.zen.publish_timeout property.
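
If the default 30 second timeouts are not enough, both of them can be adjusted in elasticsearch.yml, for example as follows (the values are only illustrative):

discovery.zen.commit_timeout: 60s
discovery.zen.publish_timeout: 60s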

Dealing with master unavailability

If a cluster has no master node, whatever the reason may be, it is not fully operational. By default, we can't change the metadata, cluster wide commands will not work, and so on. Elasticsearch allows us to configure the behavior of the nodes when the master node is not elected. To do that, we can use the discovery.zen.no_master_block property, which accepts the values all and write. Setting this property to all means that all the operations on the node will be rejected, that is, the search operations, the write related operations, and the cluster wide operations such as health or mappings retrieval. Setting this property to write means that only the write operations will be rejected – this is the default behavior of Elasticsearch.
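
For example, to reject all operations, including searches, when no master is elected, we would add the following line to elasticsearch.yml:

discovery.zen.no_master_block: all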

Adjusting HTTP transport settings

While discussing the node discovery module and process, we mentioned the HTTP module a few times. We would like to get back to that topic now and discuss a few properties that are useful when configuring and using Elasticsearch.

Disabling HTTP

The first thing is disabling HTTP completely. This is useful to ensure that the master and data nodes won't accept any queries or requests from users. To disable the HTTP transport completely, we just need to add the http.enabled property and set it to false in our elasticsearch.yml file.
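
So, on a dedicated master or data node, we would add the following line to elasticsearch.yml:

http.enabled: false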

HTTP port

Elasticsearch allows us to define the port on which it will listen for HTTP requests. This is done by using the http.port property. It defaults to 9200-9300, which means that Elasticsearch will start from port 9200 and increase the number if the port is not available (so the next instance will use port 9201, and so on). There is also http.publish_port, which is very useful when running Elasticsearch behind a firewall and when the HTTP port is not directly accessible. It defines which port should be used by the clients connecting to Elasticsearch and defaults to the same value as the http.port property.
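
As a sketch, a node behind a firewall that exposes Elasticsearch to clients on a different external port could be configured as follows (the port numbers are just an example):

http.port: 9200
http.publish_port: 8080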

HTTP host

We can also define the host to which Elasticsearch will bind. To specify it, we need to define the http.host property. The default value is the one set by the network module. If needed, we can set the publish host and the bind host separately using the http.publish_host and http.bind_host properties. You usually don't have to specify these properties unless your nodes have non-standard host names or multiple names and you want Elasticsearch to bind to a single one only.
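
For example, to bind the HTTP module to a single local address while publishing a different, externally visible name, we could use something like the following (both values are illustrative):

http.bind_host: 192.168.2.1
http.publish_host: es-node-1.example.com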

You can find the full list of properties allowed for the HTTP module in Elasticsearch official documentation available at https://www.elastic.co/guide/en/elasticsearch/reference/2.2/modules-http.html.
