Deploying an Apache HBase Cluster on Amazon EC2 cloud using EMR

We can use Amazon Elastic MapReduce (EMR) to start an Apache HBase cluster on the Amazon infrastructure and store large quantities of data in a column-oriented data store. The data stored in Amazon EMR HBase clusters can also be used as the input and output of EMR MapReduce computations. We can incrementally back up the data stored in Amazon EMR HBase clusters to Amazon S3 for data persistence, and we can also start an EMR HBase cluster by restoring the data from a previous S3 backup.

In this recipe, we start an Apache HBase cluster on the Amazon EC2 cloud using Amazon EMR, perform several simple operations on the newly created HBase cluster, and back up the HBase data to Amazon S3 before shutting down the cluster. We then start a new HBase cluster, restoring the HBase data backup from the original HBase cluster.

Getting ready

You should have the Amazon EMR Command Line Interface (CLI) installed and configured to manually back up HBase data. Refer to the Creating an Amazon EMR job flow using the Command Line Interface recipe in this chapter for more information on installing and configuring the EMR CLI.
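You can quickly verify the CLI installation by listing your job flows. The following assumes the legacy elastic-mapreduce Ruby client distributed by Amazon; the invocation may differ for other CLI versions:

> ./elastic-mapreduce --list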

How to do it...

The following steps show how to deploy an Apache HBase cluster on Amazon EC2 using Amazon EMR:

  1. Create an S3 bucket to store the HBase backups. We assume the S3 bucket name for the HBase data backups is c10-data.
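    If you prefer the command line, the bucket can also be created with an S3 command-line client. The following is a sketch assuming the AWS CLI (aws) is installed and configured; any S3 client of your choice works equally well:
    > aws s3 mb s3://c10-data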
  2. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce. Click on the Create New Job Flow button to create a new EMR MapReduce job flow. Provide a name for your job flow. Select the Run your own application option under Create a Job Flow. Select the HBase option from the drop-down menu below that. Click on Continue.
  3. Configure your Apache HBase cluster in the Specify Parameters tab. Select No for the Restore from Backup option. Select Yes for the Schedule Regular Backups and Consistent Backup options. Specify a Backup Frequency for the automatically scheduled incremental data backups and provide a path inside the bucket we created in step 1 as the Backup Location. Click on Continue.
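    As an alternative to the console, the HBase cluster can also be launched using the Amazon EMR CLI. The following is a minimal sketch based on the legacy elastic-mapreduce client's documented options (the name, instance count, and instance type are example values); verify the exact flags using ./elastic-mapreduce --help for your CLI version:
    > ./elastic-mapreduce --create --alive --name hbase-cluster --hbase --num-instances 3 --instance-type m1.large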
  4. Select a key pair in the Amazon EC2 Key Pair drop-down box. Make sure you have the private key for the selected EC2 key pair downloaded to your computer.

    Note

    If you do not have a usable key pair, go to the EC2 console (https://console.aws.amazon.com/ec2) to create a key pair. To create a key pair, log in to the EC2 dashboard, select a region, and click on Key Pairs under the Network and Security menu. Click on the Create Key Pair button in the Key Pairs window and provide a name for the new key pair. Download and save the private key file (PEM format) to a safe location.
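    A key pair can also be created from the command line. The following sketch assumes the AWS CLI is installed and configured; hbase-emr is a hypothetical key pair name:
    > aws ec2 create-key-pair --key-name hbase-emr --query 'KeyMaterial' --output text > hbase-emr.pem
    > chmod 400 hbase-emr.pem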

  5. Configure the EC2 instances for the job flow and the log paths for the MapReduce computations in the next two tabs. Note that Amazon EMR does not support the use of EC2 Small and Medium instances with HBase clusters. Click on Continue in Bootstrap Options. Review your job flow in the Review tab and click on Create Job Flow to launch the instances and create your Apache HBase cluster.

    Note

    Amazon will charge you for the compute and storage resources used from the moment you click on Create Job Flow in the above step. Refer to the Saving money by using EC2 Spot Instances recipe to find out how you can save money by using Amazon EC2 Spot Instances.

The following steps show you how to connect to the master node of the deployed HBase cluster and start the HBase shell:

  1. Go to the Amazon EMR console (https://console.aws.amazon.com/elasticmapreduce). Select the job flow for the HBase cluster to view more information about the job flow.
  2. Retrieve the Master Public DNS Name value from the information pane.
  3. Use the Master Public DNS Name and the EC2 PEM-based key (selected in step 4) to connect to the master node of the HBase cluster.
    > ssh -i ec2.pem hadoop@<master-public-DNS-name>
    
  4. Start the HBase shell using the hbase shell command. Create a table named test in your HBase installation and insert a sample entry into the table using the put command. Use the scan command to view the contents of the table.
    > hbase shell
    .........
    
    hbase(main):001:0> create 'test','cf'
    0 row(s) in 2.5800 seconds
    
    hbase(main):002:0> put 'test','row1','cf:a','value1'
    0 row(s) in 0.1570 seconds
    
    hbase(main):003:0> scan 'test'
    ROW                   COLUMN+CELL                                              
     row1                 column=cf:a, timestamp=1347261400477, value=value1       
    1 row(s) in 0.0440 seconds
    
    hbase(main):004:0> quit
    

    The following step shows how to back up the data stored in an Amazon EMR HBase cluster.

  5. Execute the following command using the Amazon EMR CLI to manually back up the data stored in an EMR HBase cluster. Retrieve the job flow ID (for example, j-FDMXCBZP9P85) from the EMR console and replace <job_flow_name> with it. Change the backup directory path (s3://c10-data/hbase-manual) according to your backup data bucket.
    > ./elastic-mapreduce --jobflow <job_flow_name> --hbase-backup --backup-dir s3://c10-data/hbase-manual
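
    In addition to the one-off backup above, the EMR CLI also documented options to schedule periodic incremental backups. The following is a sketch using the legacy client's flag names; verify them against your CLI version:
    > ./elastic-mapreduce --jobflow <job_flow_name> --hbase-schedule-backup --incremental-backup-time-interval 24 --incremental-backup-time-unit hours --backup-dir s3://c10-data/hbase-scheduled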
    
  6. Select the job flow in the EMR console and click on Terminate.

    Now, we will start a new Amazon EMR HBase cluster by restoring data from a backup.

  7. Create a new job flow by clicking on the Create New Job Flow button in the EMR console. Provide a name for your job flow. Select the Run your own application option under Create a Job Flow. Select the HBase option from the drop-down menu below that. Click on Continue.
  8. Configure the EMR HBase cluster to restore data from the previous data backup in the Specify Parameters tab. Select Yes for the Restore from Backup option and provide the backup directory path you used for the manual backup in step 5 (s3://c10-data/hbase-manual) in the Backup Location textbox. Select Yes for the Schedule Regular Backups and Consistent Backup options. Specify a Backup Frequency for the automatically scheduled incremental data backups and provide a path inside the bucket we created in step 1 as the location for the new backups. Click on Continue.
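    The restore-at-launch configuration can also be expressed using the EMR CLI. A sketch, assuming the legacy client's documented --hbase-restore option; the cluster name is an example value:
    > ./elastic-mapreduce --create --alive --name hbase-restored --hbase --hbase-restore --backup-dir s3://c10-data/hbase-manual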
  9. Repeat steps 4 and 5 of the cluster creation sequence to select a key pair and configure the EC2 instances, and then repeat steps 1 and 2 above to retrieve the Master Public DNS Name of the new cluster's master node.
  10. Start the HBase shell by logging in to the master node of the new HBase cluster. Use the list command to list the tables in HBase and the scan 'test' command to view the contents of the test table.
    > hbase shell
    .........
    
    hbase(main):001:0> list
    TABLE                                                                          
    test                                                                           
    1 row(s) in 1.4870 seconds
    
    hbase(main):002:0> scan 'test'
    ROW                   COLUMN+CELL                                              
     row1                 column=cf:a, timestamp=1347318118294, value=value1       
    1 row(s) in 0.2030 seconds
    
  11. Terminate your job flow using the EMR console by selecting the job flow and clicking on the Terminate button.
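    Job flows can also be terminated from the EMR CLI; a sketch assuming the legacy client (check ./elastic-mapreduce --help for the exact form):
    > ./elastic-mapreduce --jobflow <job_flow_name> --terminate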

See also

  • The Installing HBase recipe of Chapter 5, Hadoop Eco-System, and the Using Apache Whirr to deploy an Apache HBase cluster in a cloud environment recipe in this chapter.