Backing up

One of the most important tasks for the administrator is to make sure that no data will be lost in the case of a system failure. Elasticsearch, by design, is resilient, and a well-configured cluster of nodes can survive even a few simultaneous disasters. However, even the most properly configured cluster is vulnerable to network splits and partitions, which in some very rare cases can result in data corruption or loss. In such cases, being able to restore the data from a backup is the only solution that can save us from recreating our indices. You probably already know what we want to talk about: the snapshot / restore functionality provided by Elasticsearch. However, as we said earlier, we don't want to repeat ourselves. This is a book for more advanced Elasticsearch users, and the basics of the snapshot and restore API were already described in Elasticsearch Server Second Edition by Packt Publishing and in the official documentation. Instead, we want to focus on the functionalities that were added after the release of Elasticsearch 1.0 and thus omitted in the previous book: the cloud capabilities of the Elasticsearch backup functionality.

Saving backups in the cloud

The central concept of the snapshot / restore functionality is a repository. It is a place where the data, that is, our indices and the related meta information, is safely stored (assuming that the storage is reliable and highly available). The assumption is that every node that is a part of the cluster has access to the repository and can both write to it and read from it. Because of the need for high availability and reliability, Elasticsearch, with the help of additional plugins, allows us to push our data outside of the cluster: to the cloud. There are three places where our repository can be located, at least when using the officially supported plugins:

  • The S3 repository: Amazon Web Services
  • The HDFS repository: Hadoop clusters
  • The Azure repository: Microsoft's cloud platform

Because we didn't discuss any of the plugins related to the snapshot / restore functionality, let's go through them to see where we can push our backup data.

The S3 repository

The S3 repository is a part of the Elasticsearch AWS plugin, so to use S3 as the repository for snapshotting, we need to install the plugin first:

bin/plugin -install elasticsearch/elasticsearch-cloud-aws/2.4.0

After installing the plugin on every Elasticsearch node in the cluster, we need to alter their configuration (the elasticsearch.yml file) so that the AWS access information is available. The example configuration can look like this:

cloud:
  aws:         
    access_key: YOUR_ACCESS_KEY
    secret_key: YOUR_SECRET_KEY

To create the S3 repository that Elasticsearch will use for snapshotting, we need to run a command similar to the following one:

curl -XPUT 'http://localhost:9200/_snapshot/s3_repository' -d '{
 "type": "s3",
 "settings": {
  "bucket": "bucket_name"
 }
}'

The following settings are supported when defining an S3-based repository:

  • bucket: This is the required parameter describing the Amazon S3 bucket to which the Elasticsearch data will be written and from which Elasticsearch will read the data.
  • region: This is the name of the AWS region where the bucket resides. By default, the US Standard region is used.
  • base_path: By default, Elasticsearch puts the data in the root directory. This parameter allows you to change it and alter the place where the data is placed in the repository.
  • server_side_encryption: By default, encryption is turned off. You can set this parameter to true in order to use the AES256 algorithm to store data.
  • chunk_size: By default, this is set to 100m and specifies the size of the data chunk that will be sent. If the snapshot size is larger than chunk_size, Elasticsearch will split the data into smaller chunks that are not larger than the size specified in chunk_size.
  • buffer_size: The size of this buffer is set to 5m (which is the lowest possible value) by default. When the chunk size is greater than the value of buffer_size, Elasticsearch will split it into buffer_size fragments and use the AWS multipart API to send it.
  • max_retries: This specifies the number of retries Elasticsearch will take before giving up on storing or retrieving the snapshot. By default, it is set to 3.

In addition to the preceding properties, we are allowed to set two additional properties that can overwrite the credentials stored in elasticsearch.yml, which will be used to connect to S3. This is especially handy when you want to use several S3 repositories, each with its own security settings:

  • access_key: This overwrites cloud.aws.access_key from elasticsearch.yml
  • secret_key: This overwrites cloud.aws.secret_key from elasticsearch.yml
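
For example, a single repository definition that combines several of the preceding settings, including its own credentials, could look similar to the following one. Note that the repository name, the bucket, the region, and the credentials shown here are only placeholders and need to be adjusted to your own AWS environment:

curl -XPUT 'http://localhost:9200/_snapshot/s3_secure_repository' -d '{
 "type": "s3",
 "settings": {
  "bucket": "bucket_name",
  "region": "eu-west-1",
  "base_path": "backups/production",
  "server_side_encryption": true,
  "access_key": "REPOSITORY_ACCESS_KEY",
  "secret_key": "REPOSITORY_SECRET_KEY"
 }
}'

If the command succeeds, Elasticsearch will respond with the {"acknowledged":true} response, and the stored repository definition can be checked by running the curl -XGET 'http://localhost:9200/_snapshot/s3_secure_repository?pretty' command.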

The HDFS repository

If you use Hadoop and its HDFS (http://wiki.apache.org/hadoop/HDFS) filesystem, a good alternative for backing up the Elasticsearch data is to store it in your Hadoop cluster. As in the case of S3, there is a dedicated plugin for this. To install it, we can use the following command:

bin/plugin -i elasticsearch/elasticsearch-repository-hdfs/2.0.2

Note that there is an additional plugin version that supports Version 2 of Hadoop. In this case, we should append hadoop2 to the plugin version in order to be able to install the plugin. So for Hadoop 2, our command that installs the plugin would look as follows:

bin/plugin -i elasticsearch/elasticsearch-repository-hdfs/2.0.2-hadoop2

There is also a lite version that can be used when Hadoop is installed on the same system as Elasticsearch. In this case, the plugin does not contain the Hadoop libraries, because they are already available to Elasticsearch. To install the lite version of the plugin, the following command can be used:

bin/plugin -i elasticsearch/elasticsearch-repository-hdfs/2.0.2-light

After installing the plugin on each Elasticsearch node (no matter which version of the plugin was used) and restarting the cluster, we can use the following command to create a repository in our Hadoop cluster:

curl -XPUT 'http://localhost:9200/_snapshot/hdfs_repository' -d '{
 "type": "hdfs"
 "settings": {
  "path": "snapshots"
 }
}'

The available settings that we can use are as follows:

  • uri: This is the optional parameter that tells Elasticsearch where HDFS resides. It should have a format like hdfs://HOST:PORT/.
  • path: This is the information about the path where snapshot files should be stored. It is a required parameter.
  • load_defaults: This specifies whether the default parameters from the Hadoop configuration should be loaded. Set it to false if the reading of these settings should be disabled.
  • conf_location: This is the name of the Hadoop configuration file to be loaded. By default, it is set to extra-cfg.xml.
  • chunk_size: This specifies the size of the chunk that Elasticsearch will use to split the snapshot data; by default, it is set to 10m. If you want the snapshotting to be faster, you can use smaller chunks and more streams to push the data to HDFS.
  • conf.<key>: Here, key can be any Hadoop argument. The value provided using this property will be merged with the Hadoop configuration.
  • concurrent_streams: By default, this is set to 5 and specifies the number of concurrent streams used by a single node to write and read to HDFS.
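
To illustrate how these settings fit together, a repository definition that points at a concrete HDFS cluster could look similar to the following one. The repository name, the HDFS URI, the path, and the configuration values used here are only example values and have to match your Hadoop setup:

curl -XPUT 'http://localhost:9200/_snapshot/hdfs_custom_repository' -d '{
 "type": "hdfs",
 "settings": {
  "uri": "hdfs://namenode.example.com:8020/",
  "path": "elasticsearch/snapshots",
  "conf.dfs.replication": "2",
  "concurrent_streams": 10
 }
}'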

The Azure repository

The last of the repositories we wanted to mention is Microsoft's Azure cloud. Just as with Amazon S3, we are able to use a dedicated plugin to push our indices and metadata to Microsoft's cloud services. To do this, we need to install a plugin, which we can do by running the following command:

bin/plugin -install elasticsearch/elasticsearch-cloud-azure/2.4.0

The configuration is also similar to the Amazon S3 plugin configuration. Our elasticsearch.yml file should contain the following section:

cloud:
  azure:         
    storage_account: YOUR_ACCOUNT
    storage_key: YOUR_SECRET_KEY

After Elasticsearch is configured, we need to create the actual repository, which we do by running the following command:

curl -XPUT 'http://localhost:9200/_snapshot/azure_repository' -d '{
 "type": "azure"
}'

The following settings are supported by the Elasticsearch Azure plugin:

  • container: As with the bucket in Amazon S3, every piece of information must reside in a container. This setting defines the name of the container in the Microsoft Azure space. The default value is elasticsearch-snapshots.
  • base_path: This allows us to change the place where Elasticsearch will put the data. By default, Elasticsearch puts the data in the root directory.
  • chunk_size: This is the maximum chunk size used by Elasticsearch (set to 64m by default, which is also the maximum value allowed). You can lower it if you want the data to be split into smaller chunks.
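
Just as with the other repository types, these settings are passed in the settings object when the repository is created. For example, a definition that uses a custom container and base path could look as follows. The names used here are only examples; also remember that Azure container names allow only lowercase letters, digits, and dashes:

curl -XPUT 'http://localhost:9200/_snapshot/azure_custom_repository' -d '{
 "type": "azure",
 "settings": {
  "container": "production-backups",
  "base_path": "elasticsearch_cluster",
  "chunk_size": "32m"
 }
}'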