In the previous chapter, we focused on Elasticsearch nodes and cluster configuration. We started by discussing the node discovery process: what it is and how to configure it. We discussed the gateway and recovery modules and tuned them to match our needs. We used templates and dynamic templates to manage data structure easily and learned how to install plugins to extend the functionality of Elasticsearch. Finally, we learned about the caches of Elasticsearch and how to update index and cluster settings using a dedicated API. By the end of this chapter, you will have learned the following topics:
A good piece of software is one that can handle exceptional situations such as hardware failure or human error. Even though a cluster of a few servers is less vulnerable to hardware problems, bad things can still happen. For example, let's imagine that you need to restore your indices. One possible solution is to reindex all your data from a primary data store such as a SQL database. But what will you do if it takes too long or, even worse, the only data store is Elasticsearch? Before Elasticsearch 1.0, creating backups of indices was not easy. The procedure included stopping indexing, flushing the data to disk, shutting down the cluster, and, finally, copying the data to a backup device.
Fortunately, we can now take snapshots, and this section shows how this functionality works.
A snapshot keeps all the data related to the cluster as of the time the snapshot creation starts, including information about the cluster state and indices. Before we create the first snapshot, a snapshot repository must be created. Each repository is identified by its name and should define the following aspects:
name: A unique name of the repository; we will need it later.
type: The type of the repository. The possible values are fs (a repository on a shared file system) and url (a read-only repository available via URL).
settings: Additional information needed depending on the repository type.
Now, let's create a file system repository. Before this, we have to make sure that the directory for our backups fulfils two requirements. The first is related to security. Every repository has to be placed in a path defined in the Elasticsearch configuration file as path.repo. For example, our elasticsearch.yml includes a line similar to the following one:
path.repo: ["/tmp/es_backup_folder", "/tmp/backup/es"]
The second requirement says that every node in the cluster should be able to access the directory we set for the repository.
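The first requirement can be checked before a repository is registered. The following is a minimal sketch, assuming a hypothetical helper named is_under_path_repo (it is not part of Elasticsearch). It compares path prefixes textually only, so it does not resolve symbolic links or .. components:

```shell
# Hypothetical pre-flight check, not an Elasticsearch tool.
# Usage: is_under_path_repo LOCATION ROOT [ROOT...]
# Succeeds when LOCATION equals, or lies below, one of the ROOT directories
# (the values listed in path.repo).
is_under_path_repo() {
  location=$1
  shift
  for root in "$@"; do
    case "$location/" in
      "$root"/*) return 0 ;;   # textual prefix match on whole path components
    esac
  done
  return 1
}
```

With the path.repo line shown above, calling is_under_path_repo /tmp/es_backup_folder/cluster1 /tmp/es_backup_folder /tmp/backup/es succeeds, while a location such as /var/backups is rejected.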
So now, let's create a new file system repository by running the following command:
curl -XPUT localhost:9200/_snapshot/backup -d '{ "type": "fs", "settings": { "location": "/tmp/es_backup_folder/cluster1" } }'
The preceding command creates a repository named backup, which stores the backup files in the directory given by the location attribute. Elasticsearch responds with the following information:
{"acknowledged":true}
At the same time, es_backup_folder is created on the local file system, without any content yet.
As we said, the second repository type is url. It requires a url parameter instead of location, which points to the address where the repository resides, for example, an HTTP address. As in the previous case, the address should be defined in the repositories.url.allowed_urls parameter in the Elasticsearch configuration. The parameter allows the use of wildcards in the address.
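As an illustration of the url type, the following sketch registers a read-only repository. The repository name backup_ro and the example.org address are assumptions made for this example only; substitute the address where your snapshot files are actually published:

```shell
# Assumed line in elasticsearch.yml (hypothetical host; wildcards are allowed):
#   repositories.url.allowed_urls: ["http://example.org/es_backups/*"]

# Request body for a read-only url repository (hypothetical address).
url_repo_body='{ "type": "url", "settings": { "url": "http://example.org/es_backups/cluster1/" } }'

# Register the repository under the hypothetical name backup_ro.
register_url_repo() {
  curl -XPUT 'localhost:9200/_snapshot/backup_ro' -d "$url_repo_body"
}
```

Calling register_url_repo against a running node sends the request. Because the repository is read-only, it can be used for listing and restoring snapshots but not for creating them.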
You can also store snapshots in Amazon S3, HDFS, or Azure using the additional plugins available. To learn about these, please visit the following pages:
Now that we have our first repository, we can see its definition using the following command:
curl -XGET localhost:9200/_snapshot/backup?pretty
We can also check all the repositories by running a command like the following:
curl -XGET localhost:9200/_snapshot/_all?pretty
Or simply, we can use this:
curl -XGET localhost:9200/_snapshot/?pretty
If you want to delete a snapshot repository, the standard DELETE command helps:
curl -XDELETE localhost:9200/_snapshot/backup?pretty
By default, Elasticsearch includes all the indices and cluster settings (except the transient ones) when creating a snapshot. You can create any number of snapshots, and each holds the information available at the time it was created. Snapshots are incremental: only new information is copied. Elasticsearch knows which segments are already stored in the repository and doesn't have to save them again.
To create a new snapshot, we need to choose a unique name and use the following command:
curl -XPUT 'localhost:9200/_snapshot/backup/bckp1'
The preceding command defines a new snapshot named bckp1 (you can only have one snapshot with a given name; Elasticsearch will check its uniqueness) and data is stored in the previously defined backup repository. The command returns an immediate response, which looks as follows:
{"accepted":true}
The preceding response means that the snapshot process has started and continues in the background. If you would like the response to be returned only when the actual snapshot is created, you can add the wait_for_completion=true parameter as shown in the following example:
curl -XPUT 'localhost:9200/_snapshot/backup/bckp2?wait_for_completion=true&pretty'
The response to the preceding command shows the status of a created snapshot:
{
  "snapshot" : {
    "snapshot" : "bckp2",
    "version_id" : 2000099,
    "version" : "2.2.0",
    "indices" : [ "news" ],
    "state" : "SUCCESS",
    "start_time" : "2016-01-07T21:21:43.740Z",
    "start_time_in_millis" : 1446931303740,
    "end_time" : "2016-01-07T21:21:44.750Z",
    "end_time_in_millis" : 1446931304750,
    "duration_in_millis" : 1010,
    "failures" : [ ],
    "shards" : {
      "total" : 5,
      "failed" : 0,
      "successful" : 5
    }
  }
}
As you can see, Elasticsearch presents information about the time taken by the snapshot process, its status, and the indices affected.
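A backup script can act on this response instead of relying on a person to read it. The following is a rough sketch, assuming a hypothetical helper named snapshot_succeeded. It matches the state field textually with grep rather than parsing the JSON, so treat it as an illustration, not a robust check:

```shell
# Hypothetical helper: reads a snapshot response on standard input and
# succeeds only if it contains "state" : "SUCCESS" (spacing may vary).
snapshot_succeeded() {
  grep -Eq '"state"[[:space:]]*:[[:space:]]*"SUCCESS"'
}

# Create the bckp2 snapshot, wait for completion, and check the result.
# Defined as a function so that nothing is sent until it is called.
snapshot_and_check() {
  curl -s -XPUT 'localhost:9200/_snapshot/backup/bckp2?wait_for_completion=true' \
    | snapshot_succeeded
}
```

A script could then run snapshot_and_check || echo "snapshot bckp2 did not succeed" and raise an alert on failure.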
The snapshot command also accepts the following additional parameters:
indices: The names of the indices we want to include in the snapshot.
ignore_unavailable: When this is set to false (the default), Elasticsearch will return an error if any index listed using the indices parameter is missing. When set to true, Elasticsearch will just ignore the missing indices during backup.
include_global_state: When this is set to true (the default), the cluster state is also written to the snapshot (except for the transient settings).
partial: By default, the success of the snapshot operation depends on the availability of all the shards. If any of the shards is not available, the snapshot operation will fail. Setting partial to true causes Elasticsearch to save only the available shards and omit the lost ones.
An example of using additional parameters can look as follows:
curl -XPUT 'localhost:9200/_snapshot/backup/bckp3?wait_for_completion=true&pretty' -d '{ "indices": "b*", "include_global_state": false }'
Now that we have our snapshots done, we will also learn how to restore data from a given snapshot. As we said earlier, a snapshot can be addressed by its name. We can list all the snapshots using the following command:
curl -XGET 'localhost:9200/_snapshot/backup/_all?pretty'
The response returned by Elasticsearch to the preceding command shows the list of all available backups. Every list item is similar to the following:
{
  "snapshot" : {
    "snapshot" : "bckp2",
    "version_id" : 2000099,
    "version" : "2.2.0",
    "indices" : [ "news" ],
    "state" : "SUCCESS",
    "start_time" : "2016-01-07T21:21:43.740Z",
    "start_time_in_millis" : 1446931303740,
    "end_time" : "2016-01-07T21:21:44.750Z",
    "end_time_in_millis" : 1446931304750,
    "duration_in_millis" : 1010,
    "failures" : [ ],
    "shards" : {
      "total" : 5,
      "failed" : 0,
      "successful" : 5
    }
  }
}
The repository we created earlier is called backup. To restore a snapshot named bckp1 from our snapshot repository, run the following command:
curl -XPOST 'localhost:9200/_snapshot/backup/bckp1/_restore'
During the execution of this command, Elasticsearch takes the indices defined in the snapshot and creates them with the data from the snapshot. However, if the index already exists and is not closed, the command will fail. In this case, you may find it convenient to only restore certain indices, for example:
curl -XPOST 'localhost:9200/_snapshot/backup/bckp1/_restore?pretty' -d '{ "indices": "c*"}'
The preceding command restores only the indices that begin with the letter c. The other available parameters are as follows:
ignore_unavailable: When set to false (the default behavior), this parameter causes Elasticsearch to fail the restore process if any of the expected indices is not available.
include_global_state: When set to true, which is also the default behavior, this parameter causes Elasticsearch to restore the global state included in the snapshot.
rename_pattern: This parameter allows the renaming of the index during a restore operation. Thanks to this, the restored index will have a different name. The value of this parameter is a regular expression that matches the source index name. If the pattern matches the name of the index, name substitution will occur. In the pattern, use groups delimited by parentheses; these groups can then be referenced in the rename_replacement parameter.
rename_replacement: This parameter, along with rename_pattern, defines the target index name. Using the dollar sign and a number, you can recall the appropriate group from rename_pattern.
For example, with rename_pattern=products_(.*), only the indices with names that begin with products_ will be renamed; the rest of the index name is captured and used during replacement. rename_pattern=products_(.*) together with rename_replacement=items_$1 causes the products_cars index to be restored to an index called items_cars.
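The renaming described above can be sketched as follows. The helper names preview_rename and restore_renamed are hypothetical, and the products_* index selection is an assumption added for this example. The preview uses sed, where \1 plays the role of $1; sed and Elasticsearch use different regular expression dialects, so the preview is only a rough guide for simple patterns such as this one:

```shell
# Preview locally how an index name would be rewritten (hypothetical helper).
preview_rename() {
  printf '%s\n' "$1" | sed -E 's/products_(.*)/items_\1/'
}

# Restore body: select the products_* indices and rename them on the way in.
# Single quotes keep $1 literal so the shell does not expand it.
restore_body='{ "indices": "products_*", "rename_pattern": "products_(.*)", "rename_replacement": "items_$1" }'

# Send the restore request; defined as a function so nothing runs until called.
restore_renamed() {
  curl -XPOST 'localhost:9200/_snapshot/backup/bckp1/_restore?pretty' -d "$restore_body"
}
```

Running preview_rename products_cars prints items_cars, which matches the index name described above.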
Elasticsearch leaves snapshot repository management up to you. Currently, there is no automatic clean-up process. But don't worry; this is simple. For example, let's remove our previously taken snapshot:
curl -XDELETE 'localhost:9200/_snapshot/backup/bckp1?pretty'
And that's all. The command causes the snapshot named bckp1 from the backup repository to be deleted.
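Because clean-up is manual, it is often scripted. The following is a minimal sketch, assuming you already have a list of snapshot names, one per line and ordered from oldest to newest; how you obtain and order that list is up to you. The helper names snapshots_to_delete and delete_snapshots are hypothetical:

```shell
# Print every name from standard input except the last $1 (the newest ones).
# Assumes the input is ordered oldest first.
snapshots_to_delete() {
  awk -v keep="$1" '
    { names[NR] = $0 }
    END { for (i = 1; i <= NR - keep; i++) print names[i] }
  '
}

# Delete each snapshot named on standard input from the backup repository.
delete_snapshots() {
  while read -r name; do
    curl -XDELETE "localhost:9200/_snapshot/backup/$name?pretty"
  done
}
```

For example, printf 'bckp1\nbckp2\nbckp3\n' | snapshots_to_delete 2 prints bckp1, and piping that output into delete_snapshots removes it from the repository.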