Concepts and terminologies

Before we get started with Amazon EMR, it is important to understand some of its key concepts and terminology, starting with clusters and nodes:

  • Clusters: Clusters are the core component of Amazon EMR. A cluster is a group of EC2 instances that together process your workloads. Each instance within a cluster is termed a node, and each node has a particular role to perform within the cluster.
  • Nodes: Amazon EMR distinguishes between cluster instances by assigning each of them one of these three roles:
    • Master node: An instance that is responsible for the overall management, coordination, and monitoring of your cluster. The master node distributes data and tasks among the other nodes in the cluster.
    • Core node: Core nodes are managed by the master node; they are primarily used to run tasks and store data on your Hadoop Distributed File System (HDFS). A core node can also host additional software components of the Hadoop applications installed on the cluster.
    • Task node: Task nodes are designed purely to run tasks. They do not store data on HDFS or host the additional software components that core nodes do, and they are optional when it comes to a cluster's deployment.
  • Steps: Steps are simple tasks or jobs that are submitted to a cluster for processing. Each step contains instructions on how a particular job is to be performed. Steps can be ordered such that one step fetches the input data from Amazon S3, a second step runs a Pig or Hive query against it, and a third step stores the output data to, say, Amazon DynamoDB. If one step fails, the subsequent steps are cancelled by default; however, you can override this behavior by configuring your steps to ignore failures and continue processing, as sketched in the example below.
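
The step definition above maps naturally onto the EMR API. Here is a minimal sketch of submitting a step with Python and boto3, assuming boto3 is installed, AWS credentials are configured, and a cluster already exists; the region, cluster ID, and S3 script path are hypothetical placeholders:

```python
import boto3

# Create an EMR client (the region here is an assumption; use your own).
emr = boto3.client('emr', region_name='us-east-1')

response = emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',  # hypothetical cluster ID
    Steps=[
        {
            'Name': 'Run a Hive query against input data in S3',
            # CANCEL_AND_WAIT cancels any subsequent steps if this one
            # fails; use CONTINUE to ignore the failure and keep going.
            'ActionOnFailure': 'CANCEL_AND_WAIT',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': [
                    'hive-script', '--run-hive-script',
                    '--args', '-f',
                    's3://my-bucket/scripts/query.q',  # placeholder path
                ],
            },
        },
    ],
)
print(response['StepIds'])  # IDs you can poll to track step status
```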

Apart from these concepts, you will also need to brush up on the Apache Hadoop framework and its terminology. Here's a quick look at some of the Apache frameworks and applications that you will come across while working with Amazon EMR:

  • Storage: A big part of EMR is how the data is actually stored and retrieved. The following are some of the storage options that are provided to you while using Amazon EMR:
    • Hadoop Distributed File System (HDFS): As the name suggests, HDFS is a distributed and scalable filesystem that stores data across the underlying node instances. By default, the data is replicated across the instances in the cluster, which provides high availability and data resiliency in case of an instance failure. You can read more about HDFS at: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html.
    • EMR File System (EMRFS): EMRFS is an implementation of the Hadoop filesystem that lets you store and access data directly on Amazon S3, just as you would with a normal filesystem.
    • Local filesystem: Apart from HDFS, each instance within the cluster is also provided with a small block of pre-attached ephemeral disks, referred to as the local filesystem. You can use this local filesystem to store additional software or applications required by your Hadoop frameworks.
  • Frameworks: As mentioned before, Amazon EMR provides two data processing frameworks that you can leverage based on your processing needs: Apache Hadoop MapReduce and Apache Spark:
    • Apache Hadoop MapReduce: MapReduce is by far the most commonly used and widely known programming model for building distributed applications. The open source model relies on a Mapper function that maps the data to sets of key-value pairs and a Reducer function that combines these key-value pairs, applies some additional processing, and finally generates the desired output; a minimal word-count sketch appears after this list. To learn more about MapReduce and how you can leverage it, check out this URL: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html.
    • Apache Spark: Apache Spark is a fast, in-memory data processing engine that developers can use for streaming, machine learning, or SQL workloads requiring fast iterative access to datasets. It is a cluster framework similar to Apache Hadoop; however, Spark builds directed acyclic graphs (DAGs) of the processing steps and keeps working data in memory (see the PySpark sketch after this list). You can read more about Spark at https://spark.apache.org/.
  • Applications and programs: Along with the standard data processing frameworks, Amazon EMR also provides additional applications and programs that you can leverage to build native distributed applications. Here's a quick look at a couple of them:
    • YARN: Yet Another Resource Negotiator is a part of the Hadoop framework and handles resource management and job scheduling for your cluster.
    • Hive: Hive is a distributed data warehousing application that leverages a SQL-like language (HiveQL) to query extremely large datasets stored on the HDFS filesystem.
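
To make the Mapper/Reducer split concrete, here is a minimal word-count sketch written as two Python scripts in the Hadoop Streaming style; the file names mapper.py and reducer.py are just illustrative conventions:

```python
# mapper.py -- reads raw text from stdin and emits one
# tab-separated (word, 1) pair per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f'{word}\t1')
```

```python
# reducer.py -- Hadoop sorts the mapper output by key, so all
# counts for the same word arrive on consecutive lines.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t')
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f'{current_word}\t{current_count}')
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f'{current_word}\t{current_count}')
```

You can test this pipeline locally with cat input.txt | python mapper.py | sort | python reducer.py before running it on a cluster.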
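
The same word count is considerably shorter in Spark. The following PySpark sketch assumes it runs on an EMR cluster where EMRFS resolves s3:// paths; the bucket names are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('WordCount').getOrCreate()

# EMRFS lets Spark read from Amazon S3 just as it would from HDFS.
lines = spark.sparkContext.textFile('s3://my-bucket/input/')  # placeholder

counts = (lines.flatMap(lambda line: line.split())   # split lines into words
               .map(lambda word: (word, 1))          # pair each word with 1
               .reduceByKey(lambda a, b: a + b))     # sum counts per word

counts.saveAsTextFile('s3://my-bucket/output/')  # placeholder
spark.stop()
```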

There are many other applications and programs made available by Amazon EMR, such as Apache Pig, Apache HBase, Apache ZooKeeper, and so on. In the next section, we will look at how to leverage these concepts and terminologies to create our very own Amazon EMR cluster, so let's get busy!
