We can use Amazon Elastic MapReduce (EMR) to start an Apache HBase cluster on the Amazon infrastructure to store large quantities of data in a column-oriented data store. We can use the data stored in Amazon EMR HBase clusters as the input and output of EMR MapReduce computations as well. We can incrementally back up the data stored in Amazon EMR HBase clusters to Amazon S3 for data persistence. We can also start an EMR HBase cluster by restoring the data from a previous S3 backup.
In this recipe, we start an Apache HBase cluster on the Amazon EC2 cloud using Amazon EMR, perform several simple operations on the newly created HBase cluster, and back up the HBase data into Amazon S3 before shutting down the cluster. Then we start a new HBase cluster, restoring the HBase data backups of the original HBase cluster.
You should have the Amazon EMR Command Line Interface (CLI) installed and configured to manually back up HBase data. Refer to the Creating an Amazon EMR job flow using the Command Line Interface recipe in this chapter for more information on installing and configuring the EMR CLI.
The following steps show how to deploy an Apache HBase cluster on Amazon EC2 using Amazon EMR:
Create an S3 bucket to store the HBase data backups. This recipe assumes the bucket name to be c10-data.
If you do not have a usable key pair, go to the EC2 console (https://console.aws.amazon.com/ec2) to create one. To create a key pair, log in to the EC2 dashboard, select a region, and click on Key Pairs under the Network and Security menu. Click on the Create Key Pair button in the Key Pairs window and provide a name for the new key pair. Download and save the private key file (PEM format) to a safe location.
The following steps show you how to connect to the master node of the deployed HBase cluster to start the HBase shell.
> ssh -i ec2.pem hadoop@<master-public-DNS>
Start the HBase shell using the hbase shell command. Create a table named test in your HBase installation and insert a sample entry to the table using the put command. Use the scan command to view the contents of the table.

> hbase shell
.........
hbase(main):001:0> create 'test','cf'
0 row(s) in 2.5800 seconds

hbase(main):002:0> put 'test','row1','cf:a','value1'
0 row(s) in 0.1570 seconds

hbase(main):003:0> scan 'test'
ROW                COLUMN+CELL
 row1              column=cf:a, timestamp=1347261400477, value=value1
1 row(s) in 0.0440 seconds

hbase(main):004:0> quit
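Conceptually, the create, put, and scan operations above act on a sparse, sorted map: each table maps a row key to a set of column family:qualifier cells. The following Python sketch models only that data model for illustration; it is not the HBase client API, and the ToyHBaseTable class is an invention of this sketch, though the table, row, and column names match the shell session.

```python
# Illustrative only: a toy model of HBase's sparse, sorted map data model,
# mirroring the create/put/scan shell session. NOT the HBase client API.
from collections import defaultdict


class ToyHBaseTable:
    """A table maps row key -> {column family:qualifier -> value}."""

    def __init__(self, name, column_families):
        self.name = name
        self.column_families = set(column_families)
        self.rows = defaultdict(dict)  # row key -> {column -> value}

    def put(self, row, column, value):
        # Columns must belong to a column family declared at table creation.
        family = column.split(":", 1)[0]
        if family not in self.column_families:
            raise ValueError("unknown column family: " + family)
        self.rows[row][column] = value

    def scan(self):
        # HBase scans return cells ordered by row key, then column.
        for row in sorted(self.rows):
            for column in sorted(self.rows[row]):
                yield row, column, self.rows[row][column]


table = ToyHBaseTable("test", ["cf"])       # create 'test','cf'
table.put("row1", "cf:a", "value1")          # put 'test','row1','cf:a','value1'
print(list(table.scan()))                    # scan 'test'
# [('row1', 'cf:a', 'value1')]
```

The key design point this mirrors is that HBase rows are sparse: a row stores only the cells that were actually written, and new qualifiers can be added within a column family without any schema change.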
The following step will back up the data stored in an Amazon EMR HBase cluster.
Retrieve the name of the job flow (for example, j-FDMXCBZP9P85) from the EMR console. Replace <job_flow_name> in the following command with the retrieved job flow name. Change the backup directory path (s3://c10-data/hbase-manual) according to your backup data bucket.

> ./elastic-mapreduce --jobflow <job_flow_name> --hbase-backup --backup-dir s3://c10-data/hbase-manual
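In addition to one-off manual backups, the EMR CLI of the time could also schedule periodic HBase backups. The flag names below (--hbase-schedule-backup, --full-backup-time-interval, --full-backup-time-unit) and the s3://c10-data/hbase-backups path are assumptions based on the EMR CLI documentation of the era and may differ in your CLI version; this Python snippet merely assembles the invocation for illustration and does not call AWS.

```python
def hbase_scheduled_backup_command(job_flow_id, backup_dir, interval, unit):
    """Assemble an EMR CLI call that schedules periodic full HBase backups.

    Flag names are assumptions based on the EMR CLI docs of the time;
    verify them against ./elastic-mapreduce --help before use.
    """
    return [
        "./elastic-mapreduce",
        "--jobflow", job_flow_id,
        "--hbase-schedule-backup",
        "--full-backup-time-interval", str(interval),
        "--full-backup-time-unit", unit,  # e.g. "minutes", "hours", "days"
        "--backup-dir", backup_dir,
    ]


cmd = hbase_scheduled_backup_command(
    "j-FDMXCBZP9P85", "s3://c10-data/hbase-backups", 1, "days")
print(" ".join(cmd))
```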
Now, we will start a new Amazon EMR HBase cluster by restoring data from a backup.
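The new cluster is created with the EMR CLI's restore options. Based on the EMR CLI documentation of the time, this combined --create and --hbase with --hbase-restore and --backup-dir pointing at the earlier backup; the exact flag names may differ in your CLI version, and the cluster name used here is a placeholder. The Python sketch below only assembles the invocation for illustration:

```python
def hbase_restore_cluster_command(cluster_name, backup_dir):
    """Assemble an EMR CLI call that starts a new HBase cluster and
    restores its data from an S3 backup.

    The --hbase-restore/--backup-dir flags are assumptions from the EMR
    CLI docs of the time; verify with ./elastic-mapreduce --help.
    """
    return [
        "./elastic-mapreduce", "--create",
        "--name", cluster_name,
        "--hbase",
        "--hbase-restore",
        "--backup-dir", backup_dir,
    ]


restore_cmd = hbase_restore_cluster_command(
    "hbase-restored", "s3://c10-data/hbase-manual")
print(" ".join(restore_cmd))
```

Once the new cluster is running, connect to its master node and start the HBase shell as before to confirm the restored data.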
Use the list command to list the tables in HBase and the scan 'test' command to view the contents of the test table.

> hbase shell
.........
hbase(main):001:0> list
TABLE
test
1 row(s) in 1.4870 seconds

hbase(main):002:0> scan 'test'
ROW                COLUMN+CELL
 row1              column=cf:a, timestamp=1347318118294, value=value1
1 row(s) in 0.2030 seconds