Getting started with Amazon EMR

With the basics covered, in this section we will be working with the Amazon EMR dashboard to create our very first cluster. However, before we get going, here's a small list of prerequisite steps that we need to complete first.

To begin with, we need to create an Amazon S3 bucket that will store the output and the logs generated by EMR, as well as some additional script and software files:

  1. From the AWS Management Console, filter and select the Amazon S3 service by using the Filter option. Alternatively, launch the Amazon S3 dashboard by navigating to this URL: https://s3.console.aws.amazon.com/s3/.
  2. Next, select the Create bucket option. In the Create bucket wizard, provide a suitable Bucket name and select an appropriate Region to create the bucket in. For this use case, the EMR cluster and the S3 bucket are both created in the US East (Ohio) region; however, you can select an alternative based on your requirements. Click on Next to continue with the process.
  3. On the Set properties page, you can optionally provide some tags for your bucket for cost allocation and tracking purposes. Click Next to continue.
  4. On the Set permissions page, ensure that no public read access is granted to the bucket. Click on Next to review the settings and finally, select Create bucket to complete the process.
  5. Once the bucket is created, use the Create folder option to create dedicated folders for storing the logs, output, as well as some additional scripts that we might use in the near future. Here is a representational screenshot of the bucket after you have completed all of the previous steps:
  6. With the bucket created and ready for use, the only prerequisite left is a key pair that you can use to SSH into your EC2 instances. Ensure that the key pair is created in the same region (US East (Ohio) in this case) as your EMR cluster.
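
For readers who prefer to script these prerequisites, the console steps above can be approximated with the AWS SDK for Python (boto3). This is a minimal sketch, not the book's own method: the folder names and the helper function names are hypothetical, and `create_prerequisites` assumes that boto3 is installed and that AWS credentials are configured. The request parameters are built separately so that they can be inspected without calling AWS.

```python
REGION = "us-east-2"  # US East (Ohio), the region used in this section
FOLDERS = ("logs/", "output/", "scripts/")  # hypothetical folder names


def bucket_requests(bucket_name, region=REGION):
    """Build the keyword arguments for each S3 call without calling AWS."""
    return {
        "create_bucket": {
            "Bucket": bucket_name,
            # Required for regions other than us-east-1.
            "CreateBucketConfiguration": {"LocationConstraint": region},
        },
        "put_public_access_block": {
            "Bucket": bucket_name,
            # Keeps the bucket private, matching the Set permissions step.
            "PublicAccessBlockConfiguration": {
                "BlockPublicAcls": True,
                "IgnorePublicAcls": True,
                "BlockPublicPolicy": True,
                "RestrictPublicBuckets": True,
            },
        },
        # A "folder" in S3 is a zero-byte object whose key ends in "/".
        "put_object": [{"Bucket": bucket_name, "Key": key} for key in FOLDERS],
    }


def create_prerequisites(bucket_name, key_name, region=REGION):
    """Create the bucket, its folders, and an EC2 key pair (calls AWS)."""
    import boto3  # imported here so bucket_requests() works without boto3

    s3 = boto3.client("s3", region_name=region)
    requests = bucket_requests(bucket_name, region)
    s3.create_bucket(**requests["create_bucket"])
    s3.put_public_access_block(**requests["put_public_access_block"])
    for folder in requests["put_object"]:
        s3.put_object(**folder)

    ec2 = boto3.client("ec2", region_name=region)
    key_pair = ec2.create_key_pair(KeyName=key_name)
    return key_pair["KeyMaterial"]  # private key text; save it to a .pem file
```

Calling `bucket_requests("my-emr-bucket")` only returns dictionaries, which makes it easy to review the settings before `create_prerequisites` sends anything to AWS. The bucket name here is a placeholder; S3 bucket names must be globally unique.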

Now that the prerequisites are out of the way, we can finally get started with our EMR cluster setup!

  1. From the AWS Management Console, filter and select the Amazon EMR service by using the Filter option. Alternatively, launch the Amazon EMR dashboard by navigating to this URL: https://us-east-2.console.aws.amazon.com/elasticmapreduce/home.
  2. Since this is the first time we are creating an EMR cluster, select the Create cluster option to get started.
  3. You can configure your EMR cluster in two ways: the fast and easy Quick Options page, which is shown by default, and the Advanced options page, where you can select and configure the individual items for your cluster. In this case, we will go ahead and select Go to advanced options.
  4. The Advanced options page provides a four-step wizard that guides us through configuring a fully functional EMR cluster. The first step is where you select and customize the software that you wish to install on your EMR cluster.
  5. From the Release drop-down list, select the EMR release that you would like to work with. The latest version at the time of writing this book is emr-5.11.1. Each release contains several distributed applications available for installation on your cluster. For example, emr-5.11.1, which is a 2018 release, contains Hadoop v2.7.3, Flink v1.3.2, Ganglia v3.7.2, HBase v1.3.1, and many other such applications.
For a complete list of available EMR releases and their associated software versions, go to https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-components.html.
  6. In this case, I have gone ahead and selected the basic applications that we will require for this scenario, namely Hadoop, Hive, and Hue. Feel free to select other applications as per your requirements.
  7. The next couple of sections are optional; however, it is important to know their purpose:
    • AWS Glue Data Catalog settings: With EMR version 5.8.0 and above, you can optionally configure Spark SQL to use the AWS Glue Data Catalog as its external Hive metastore.
    • Edit software settings: You can use this option to override the default configuration settings for certain applications. This is achieved by providing a configuration object in JSON form. You can either type it in directly using Enter configuration or point to a file using Load JSON from S3.
    • Add steps: The final optional parameter on the Software Configuration page is Add steps. As discussed briefly earlier in this chapter, a step is essentially a unit of work that we submit to the cluster. This can be something as trivial as loading input data from S3, or processing and running a MapReduce job on the data. We will explore steps in more detail later in this chapter, so leave this field at its default value and select Next to continue with the process.
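
To make the software settings above more concrete, here is a small sketch of what the application list and a configuration object can look like when written out in Python and serialized to JSON. The `spark-hive-site` classification and the Glue client factory class follow the AWS documentation for using the Glue Data Catalog with Spark SQL; the helper function names are hypothetical. Note that this particular override only has an effect if Spark is among the selected applications, which is not the case in this section's walkthrough.

```python
import json

RELEASE_LABEL = "emr-5.11.1"  # the release selected in this section


def applications(*names):
    """Express the selected applications in the form the EMR API expects."""
    return [{"Name": name} for name in names]


def glue_catalog_configuration(classification="spark-hive-site"):
    """Configuration object that points the metastore client at AWS Glue."""
    return [
        {
            "Classification": classification,
            "Properties": {
                "hive.metastore.client.factory.class":
                    "com.amazonaws.glue.catalog.metastore."
                    "AWSGlueDataCatalogHiveClientFactory"
            },
        }
    ]


if __name__ == "__main__":
    # The printed JSON is the kind of object that can be pasted into
    # Enter configuration, or uploaded to S3 for Load JSON from S3.
    print(json.dumps(glue_catalog_configuration(), indent=2))
    print(applications("Hadoop", "Hive", "Hue"))
```

Keeping the configuration as data rather than as a hand-edited file makes it straightforward to reuse the same object from the console and from scripts.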
  8. The second step in the Advanced options wizard is configuring the cluster's hardware, or the instance configurations, as well as the cluster's networking.

EMR provides two options for configuring instances, instance fleets and instance groups, both explained briefly here:

    • Instance fleets: Instance fleets allow you to specify a target capacity for the instances present in a cluster. This option gives you the widest variety of provisioning choices: you can mix instance types for your nodes and even combine different purchasing options. For each instance fleet that you create, you establish a target capacity for on-demand instances as well as for spot instances.
You can have only one instance fleet per node type (master, core, task).
    • Instance groups: Instance groups, on the other hand, do not offer many configurable options per node type. Within an instance group, every node uses the same instance type and the same purchasing option. Once these settings are configured during the cluster's creation, they cannot be altered; however, you can always add more instances as you see fit.
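
The difference between the two options is easier to see side by side. The sketch below writes out both shapes as they are passed to the EMR API through boto3 (`InstanceGroups` and `InstanceFleets` inside the `Instances` parameter). The instance types, counts, and capacities are placeholder values chosen for illustration, and the function names are hypothetical.

```python
def uniform_instance_groups(instance_type="m4.large", core_count=2):
    """Instance groups: one instance type and one purchasing option per group."""
    return [
        {
            "Name": "Master",
            "InstanceRole": "MASTER",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            "Name": "Core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]


def core_instance_fleet(on_demand_capacity=1, spot_capacity=2):
    """Instance fleet: target capacities filled from a mix of instance types."""
    return {
        "Name": "Core fleet",
        "InstanceFleetType": "CORE",
        "TargetOnDemandCapacity": on_demand_capacity,
        "TargetSpotCapacity": spot_capacity,
        "InstanceTypeConfigs": [
            # WeightedCapacity says how many capacity units one instance counts for.
            {"InstanceType": "m4.large", "WeightedCapacity": 1},
            {"InstanceType": "m4.xlarge", "WeightedCapacity": 2},
        ],
    }
```

A group fixes the instance type and asks for a number of instances, whereas a fleet fixes the capacity targets and lets EMR choose among the listed instance types to meet them.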
  9. For this particular use case, we are going to go ahead and select Uniform instance groups, as depicted in the following screenshot:
  10. Next, from the Network drop-down list, select the appropriate VPC in which you wish to launch your EMR cluster. You can alternatively choose to create a new VPC specifically for EMR, using the adjoining Create a VPC option.
  11. Similarly, select the appropriate subnet from the EC2 Subnet drop-down list.
  12. Finally, assign a value for the Root device EBS volume size that will be provisioned for each instance in the cluster. You can provide values between 10 GB and 100 GB.
  13. Using the edit options provided, you can additionally configure the Instance type, the Instance count, and the Purchasing option for each node type, as depicted in the following screenshot. Note that these options are provided because we selected instance groups as our preferred mode of instance configuration. The options will vary if the Instance Fleet option is selected:
  14. You can additionally choose to enable autoscaling for the Core and Task nodes by selecting the Not enabled option under the Auto scaling column. You can also add further task instance groups by selecting the Add task instance group option. Once done, select Next to proceed with the setup.
  15. The third step in the Advanced options wizard provides general configurations that you can set based on your requirements. To start off, provide a suitable Cluster name and then select the Logging option for your EMR cluster. Use the folder option to browse to our newly created S3 bucket, as shown in the following screenshot:
  16. You can additionally enable the Termination protection option to guard against accidental termination of your cluster.
  17. Moving on, the final configuration item left on the cluster's General Options page is Bootstrap Actions. Bootstrap actions, as the name implies, are scripts that you wish to execute on your cluster's instances at boot time. Because they also run on every instance that is later added to a running cluster, this feature comes in very handy when you scale out an existing cluster.
Bootstrap actions are executed as the hadoop user by default. You can switch to root privileges by using the sudo command.

There are two types of Bootstrap actions that you can execute on your instances:

    • Run if: The Run if action executes an action when an instance-specific value is found in either the instance.json or the job-flow.json file. This is a predefined bootstrap action and comes in very handy when you only want to execute the action on a particular type of instance, for example, execute the bootstrap action only if the instance type is master.
    • Custom action: Custom actions leverage your own scripts to perform a customized bootstrap action.
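
The two types above can be sketched in the form that the EMR API accepts for its `BootstrapActions` parameter. The `run-if` script path and the `instance.isMaster=true` condition follow the example in the AWS documentation; the function names, action names, and the custom script path are hypothetical placeholders.

```python
# Predefined script published by AWS; taken from the EMR documentation example.
RUN_IF_SCRIPT = "s3://elasticmapreduce/bootstrap-actions/run-if"


def run_if_master(name, command):
    """Run `command` only on the master node, using the predefined run-if script."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": RUN_IF_SCRIPT,
            # First argument is the condition, second is the command to run.
            "Args": ["instance.isMaster=true", command],
        },
    }


def custom_action(name, script_s3_path, args=()):
    """Run your own script, stored in S3, on every instance at boot time."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {"Path": script_s3_path, "Args": list(args)},
    }


if __name__ == "__main__":
    actions = [
        run_if_master("Echo on master", "echo running on master node"),
        # Hypothetical script uploaded to the bucket from the prerequisites.
        custom_action("Install tools", "s3://my-emr-bucket/scripts/install.sh"),
    ]
    for action in actions:
        print(action)
```

The only structural difference between the two is where the script comes from: Run if always points at the predefined script and takes the condition as its first argument, while a custom action points at a script that you supply.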
  18. To create a bootstrap action, select the Configure and add option under Add Bootstrap Action. Make sure the Run if action is selected before proceeding.
  19. This will bring up the Add Bootstrap Action dialog as depicted in the following screenshot. Type in a suitable Name for your Run if action. Since the Run if action is a predefined bootstrap action, the script's location is not an editable field. You can, however, add Optional arguments for the script, as shown here. In this case, the Run if action will only echo the message if the instance is a master:
  20. Click on Add once done. Similarly, you can add your custom bootstrap actions as well, by placing the executable scripts in the Amazon S3 bucket that we created during the prerequisite phase of this chapter and providing that path here.
  21. Moving on to the final step in this cluster creation process, on the Security Options page, you can review the various permissions, roles, authentication, and encryption settings that the cluster will use once it's deployed. Start off by selecting the EC2 key pair that we created at the start of this chapter. You can additionally opt to change the Permissions or use the default ones provided.
  22. Once done, click on Create cluster to complete the process.

The cluster's creation takes a couple of minutes, depending on the number of instances selected for the cluster, as well as the software identified to be installed. Once done, you can use the EMR dashboard to view the cluster's health status and other vital information.
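
As a recap, the choices made in the wizard map fairly directly onto a single `run_job_flow` request in boto3. The following is a minimal sketch under several assumptions: the cluster name, bucket, key pair, subnet ID, and instance type are placeholders; the default roles `EMR_DefaultRole` and `EMR_EC2_DefaultRole` are assumed to exist in the account already; and `create_and_wait` requires boto3 and configured AWS credentials. It is not the method used in this chapter, which relies on the console.

```python
def cluster_request(name, bucket, key_name, subnet_id, instance_type="m4.large"):
    """Build the run_job_flow arguments that mirror the wizard's four steps."""
    return {
        "Name": name,
        "LogUri": f"s3://{bucket}/logs/",  # Logging folder in the S3 bucket
        "ReleaseLabel": "emr-5.11.1",
        "Applications": [{"Name": n} for n in ("Hadoop", "Hive", "Hue")],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
                 "InstanceType": instance_type, "InstanceCount": 2},
            ],
            "Ec2KeyName": key_name,
            "Ec2SubnetId": subnet_id,
            "KeepJobFlowAliveWhenNoSteps": True,  # keep running without steps
            "TerminationProtected": True,  # Termination protection option
        },
        "EbsRootVolumeSize": 10,  # in GB, the minimum allowed by the wizard
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed to exist already
        "ServiceRole": "EMR_DefaultRole",  # assumed to exist already
        "VisibleToAllUsers": True,
    }


def create_and_wait(request, region="us-east-2"):
    """Submit the request and block until the cluster is up (calls AWS)."""
    import boto3  # imported here so cluster_request() works without boto3

    emr = boto3.client("emr", region_name=region)
    cluster_id = emr.run_job_flow(**request)["JobFlowId"]
    emr.get_waiter("cluster_running").wait(ClusterId=cluster_id)
    cluster = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]
    return cluster_id, cluster["Status"]["State"]
```

The state returned by `describe_cluster` is the same status that the EMR dashboard displays, so the script and the console can be used interchangeably to check on the cluster.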
