Your ElasticSearch time machine

Apart from our indices and the data indexed inside them, ElasticSearch needs to hold metadata, such as type mappings and index-level settings. All of that information needs to be stored somewhere so that it survives a restart of the whole cluster. This is why ElasticSearch introduced the gateway module. You can think of it as a safe haven for your cluster data and metadata. Each time you start your cluster, all the needed data is read from the gateway, and each change you make to your cluster is persisted using the gateway module.

The gateway module

ElasticSearch allows us to use different gateway types, which we will discuss in a moment. In order to set the type of gateway we want to use, we need to add the gateway.type property to the elasticsearch.yml configuration file and set it to one of the following values:

  • local: This specifies a local gateway
  • fs: This specifies a shared filesystem gateway
  • hdfs: This specifies the Hadoop distributed filesystem gateway
  • s3: This specifies the Amazon s3 gateway

Local gateway

This default and recommended gateway type stores the indices and their metadata in the local filesystem. In order to use this type of gateway, one should set the gateway.type property to local in the elasticsearch.yml configuration file.
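
Because local is the default, the following line is optional; stating it explicitly simply documents the choice:

gateway.type: local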

Unlike the other gateways, the local gateway does not perform its write operations asynchronously. So whenever a write succeeds, you can be sure that the data was written into the gateway (that is, indexed or stored in the transaction log).

Shared filesystem gateway

The shared filesystem gateway stores the information about indices and metadata in a shared, distributed filesystem that is accessible by all ElasticSearch nodes in the cluster. In order to use this type of gateway, one should set the gateway.type property to fs in the elasticsearch.yml configuration file. In addition to that, we need to set the gateway.fs.location property, which will inform ElasticSearch where the shared filesystem is located. ElasticSearch will append the cluster name to the value provided by the gateway.fs.location property.

The following example uses the shared filesystem gateway:

gateway.type: fs
gateway.fs.location: /shared/elasticsearch/gateway/

In addition to the required properties, one can also change the gateway.fs.concurrent_streams property (which defaults to 5), which controls how many concurrent streams are used to perform the snapshotting operation. Snapshotting is the process of writing changes (such as metadata changes) to the gateway. If you want your cluster to recover faster from the gateway, increase the value of that property (remember, however, that doing so puts more pressure on the CPU and I/O). If you want to reduce the pressure on the nodes during recovery, decrease the value of this property.
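
As a sketch of such tuning, the following configuration raises the number of streams to favor faster recovery. The value 10 is purely illustrative and not a recommendation; the right number depends on how much CPU and I/O headroom your nodes have:

gateway.type: fs
gateway.fs.location: /shared/elasticsearch/gateway/
gateway.fs.concurrent_streams: 10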

Hadoop distributed filesystem gateway

The Hadoop distributed filesystem gateway type stores all the needed information in the HDFS filesystem. In order to use this type of gateway, one should set the gateway.type property to hdfs in the elasticsearch.yml configuration file. In addition to that, we need to set the following properties:

  • gateway.hdfs.uri: This is the URI of the Hadoop cluster
  • gateway.hdfs.path: This is the path where the data will be stored in HDFS

For example, if we have our Hadoop cluster available at hdfs://10.1.2.1:8022 and we want to store the data in /elasticsearch, we would need to place the following configuration entries in elasticsearch.yml:

gateway.type: hdfs
gateway.hdfs.uri: hdfs://10.1.2.1:8022
gateway.hdfs.path: /elasticsearch

In addition to the required properties, one can also change the gateway.hdfs.concurrent_streams property (which defaults to 5) that controls how many concurrent streams are used in order to perform the snapshotting operation.

Plugin needed

In order to use the hdfs gateway type, one needs to install an appropriate plugin: the elasticsearch-hadoop plugin. You can learn more about installing plugins in the Installing plugins section at the end of this chapter.

Amazon s3 gateway

The Amazon s3 gateway type stores all the needed information in an Amazon s3 bucket. In order to use this type of gateway, one should set the gateway.type property to s3 in the elasticsearch.yml configuration file. In addition to that, we can set the following properties:

  • gateway.s3.bucket: This is the name of the s3 bucket
  • gateway.s3.chunk_size: This is the size of a file chunk (defaults to 100m)

We also need to add information about the Amazon Web Services (AWS) authentication and region. In order to do that one should add the following properties:

  • cloud.aws.access_key: This is the AWS access key
  • cloud.aws.secret_key: This is the AWS secret key
  • cloud.aws.region: This is the AWS region (the available values are: us-east-1, us-west-1, ap-southeast-1, and eu-west-1)

This is how the full configuration could look (this should be placed in the elasticsearch.yml configuration file):

gateway.type: s3
gateway.s3.bucket: elasticsearch-cluster-bucket
cloud.aws.access_key: JDSHcnzjhydASDI
cloud.aws.secret_key: NcxnbdJHDSY/r/sdda8273=+_SAD
cloud.aws.region: eu-west-1

In addition to the mandatory properties, one can also change the gateway.s3.concurrent_streams property (which defaults to 5) that controls how many concurrent streams are used in order to perform the snapshotting operation.
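
For instance, a configuration that also sets the optional properties could look as follows. The values 50m and 10 are illustrative assumptions only, and the credentials are placeholders to be replaced with your own:

gateway.type: s3
gateway.s3.bucket: elasticsearch-cluster-bucket
gateway.s3.chunk_size: 50m
gateway.s3.concurrent_streams: 10
cloud.aws.access_key: YOUR_ACCESS_KEY
cloud.aws.secret_key: YOUR_SECRET_KEY
cloud.aws.region: eu-west-1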

Plugin needed

In order to use the s3 gateway type, one needs to install an appropriate plugin: the cloud-aws plugin. You can learn more about installing plugins in the Installing plugins section at the end of this chapter.

Note

Please note that as of ElasticSearch 0.20, all gateway types except the local one are deprecated, which means that they will be removed in the future. The ElasticSearch creators plan to introduce a proper backup and restore API that should be available in the near future.

Recovery control

In addition to choosing the gateway type and configuring type-specific properties, ElasticSearch lets us configure when to start the initial recovery process. Recovery is the process of initializing all the shards and replicas, reading all the data from the transaction log, and applying it to the shards. Basically, it is a process needed to start ElasticSearch.

For example, let's imagine that we have a cluster that consists of 10 ElasticSearch nodes. We should inform ElasticSearch about the number of nodes by setting the gateway.expected_nodes property to that value: 10 in our case. This value is the number of expected nodes that are eligible to hold the data and be selected as a master. ElasticSearch will start the recovery process immediately once the number of nodes in the cluster is equal to that property.

We would also like recovery to be allowed to start once eight nodes have joined the cluster. In order to do that, we should set the gateway.recover_after_nodes property to 8. We could set this property to any value we like. However, we should choose a value that ensures that the newest version of the cluster state snapshot will be available, which usually means that you should start recovery when most of your nodes are available.

There is one more thing: we would like the gateway recovery process to start 10 minutes after the cluster was formed, so we set the gateway.recover_after_time property to 10m. This property tells the gateway module how long to wait before starting recovery once the number of nodes specified by the gateway.recover_after_nodes property have formed the cluster. We may want to do that because we know that our network is quite slow and we want the nodes' communication to be stable.

The previously mentioned properties should be set in the elasticsearch.yml configuration file. With the values from our example, we would end up with the following section in the file:

gateway.recover_after_nodes: 8
gateway.recover_after_time: 10m
gateway.expected_nodes: 10