Connecting to your EMR cluster

Once you have provisioned the EMR cluster, you should see its state change from Starting to Bootstrapping and finally to Running. If no jobs are currently executing, the cluster may move into a Waiting state as well. At this point, you can start using the EMR cluster to run your various jobs and analyses. Before that, however, here's a quick introduction to a few ways in which you can connect to your running EMR cluster.
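
If you prefer the command line, you can also poll the cluster's state using the AWS CLI. The following is a minimal sketch; it assumes that your credentials and default region are already configured and that <CLUSTER_ID> is your cluster's ID (retrieving the ID itself is covered in the CLI steps later in this section):
# aws emr describe-cluster --cluster-id <CLUSTER_ID> --query 'Cluster.Status.State' --output text
This should print a value such as STARTING, BOOTSTRAPPING, RUNNING, or WAITING.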

First up, connecting to the master node using SSH. An SSH connection to the master node can be used for monitoring the cluster, viewing Hadoop's log files, or even running an interactive shell for Hive or Pig programming:

  1. To do so, log in to your Amazon EMR dashboard and select your newly created cluster's name from the Cluster list page. This will display the cluster's Details page, where you can manage as well as monitor your cluster.
  2. Next, copy the Master public DNS address. Once copied, open PuTTY and paste the copied public DNS name into the Host Name (or IP Address) field.
  3. Convert the key pair that you associated with this EMR cluster into a PuTTY private key (.ppk) file and load that private key in PuTTY by selecting the Auth option present under the SSH section (a sketch of this conversion follows this list).
  4. Once done, click on Open to establish the connection. At the security alert, accept the server's host key and type in hadoop as the username when prompted. You should get SSH access into your cluster's master node now!
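
As a reference for step 3, here is one way to perform the key conversion from the command line. This is a sketch that assumes you have the puttygen utility (from the putty-tools package) installed on a Linux machine and that your EMR key pair is saved as <PRIVATEKEY.pem>; the resulting .ppk file is what you load in PuTTY's Auth section:
# puttygen <PRIVATEKEY.pem> -O private -o <PRIVATEKEY.ppk>
On Windows, the same conversion can be performed interactively with the PuTTYgen GUI by loading the .pem file and then saving the private key as a .ppk file.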

The same task can be performed using the AWS CLI as well:

  1. From the Terminal, first type in the following command to retrieve the running cluster's ID. The cluster ID will be in the format j-XXXXXXXX:
# aws emr list-clusters 
  2. To list the instances running in your cluster, use the cluster ID obtained from the previous command's output in the following command:
# aws emr list-instances --cluster-id <CLUSTER_ID>
Copy the master instance's PublicDnsName value from the output of this command. You can then use the following set of commands to get access to your master node (a consolidated sketch that combines these lookups follows this list).
  3. Ensure that the cluster's private key has the necessary permissions:
# chmod 400 <PRIVATEKEY.pem>
  4. Once done, SSH to the master node using the following command:
# ssh hadoop@<PUBLIC_DNS_NAME> -i <PRIVATEKEY.pem>
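
As a consolidated sketch of the same lookup, the following commands use the AWS CLI's --query filters to pick out the first active cluster's ID and its master public DNS name. The assumption that the first entry in the list is the cluster you want is mine, so adjust the filters to suit your environment:
# aws emr list-clusters --active --query 'Clusters[0].Id' --output text
# aws emr describe-cluster --cluster-id <CLUSTER_ID> --query 'Cluster.MasterPublicDnsName' --output text
# ssh hadoop@<PUBLIC_DNS_NAME> -i <PRIVATEKEY.pem>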

You can additionally connect to the various application web interfaces, such as Hue or the Hadoop HDFS NameNode, using a few simple steps:

  1. To get started, you will once again require the public DNS name of your master node. You can obtain that from the EMR dashboard or by using the CLI steps we just walked through.
  2. Next, using PuTTY, paste the public DNS name into the Host Name (or IP Address) field as done earlier. Browse to and load the private key using the Auth option as well.
  3. Under the SSH option from PuTTY's navigation pane, select Tunnels.
  4. Fill in the required details as mentioned in the following list:
    • Set the Source port field to 8157
    • Select the Dynamic and Auto options
  5. Once completed, select Add and finally Open the connection.

This form of tunnelling or port forwarding is essential as the web interfaces can only be viewed from the master node's local web server. Once completed, launch your favorite browser and view the respective web interfaces, as given here:

  • For accessing Hue, type in the following in your web browser:
http://<PUBLIC_DNS_NAME>:8888/
  • For accessing the Hadoop HDFS NameNode, type in the following:
http://<PUBLIC_DNS_NAME>:50070/
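
To quickly verify that the tunnel works before setting up your browser, you can send a request through the SOCKS proxy from the command line. This is a sketch that assumes the tunnel from the preceding steps is listening on local port 8157 and that curl is installed:
# curl --socks5-hostname localhost:8157 http://<PUBLIC_DNS_NAME>:8888/
If the tunnel is healthy, this should return an HTTP response from Hue (typically a redirect to its login page) rather than a connection error.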

You can even use the CLI to create a tunnel. To do so, substitute the public DNS name and the private key values in the following command:

# ssh -i <PRIVATEKEY.pem> -N -D 8157 hadoop@<PUBLIC_DNS_NAME> 
The -D flag indicates that the port forwarding is dynamic.
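
Note that with dynamic forwarding, whether created through PuTTY or the preceding ssh command, your browser still has to be told to use the tunnel. This is typically done by configuring a SOCKS proxy on localhost:8157, either through the browser's proxy settings or with a proxy management extension such as FoxyProxy.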