EMR's architecture

Hadoop's core feature is data locality, that is, taking compute to where the data is. AWS disrupts this concept by separating storage and compute. AWS has multiple storage options, including the following:

  • Amazon S3: S3 is general-purpose object storage.
  • Amazon Redshift: This is a distributed cloud data warehouse.
  • Amazon DynamoDB: This is a NoSQL database.
  • Amazon Aurora: This is a cloud-based relational database.

Amazon S3 is the cheapest and most reliable cloud storage available, and this makes it the first choice, unless there is a compelling reason not to do so. EMR also supports attaching elastic block storage (EBS) volumes to compute instances (EC2) in order to provide a lower latency option.

Which option to choose depends upon what type of cluster is being created. There are two types of clusters:

  • Persistent cluster: It runs 24 x 7. Here, there is a continuous analysis of data for use cases such as fraud detection in the financial industry or clickstream analytics in ad tech. For these purposes, HDFS mounted on EBS is a good choice. 
  • Transient cluster: Here, workloads are run inconsistently, for example, genome sequencing or holiday surge in retail. In this case, the cluster is only spawned when needed, making Elastic Map Reduce File System (EMRFS) based on S3 a better choice.  
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset