Let's look at the options shown in step 3:
- Cluster name: This is where you provide an appropriate name for the cluster.
- S3 folder: This is the S3 location (bucket and folder) where the log files for this cluster will be stored.
- Launch mode:
- Cluster: The cluster will continue to run until you terminate it.
- Step execution: EMR runs the steps (jobs) you define at launch and then terminates the cluster automatically once they complete.
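As a rough sketch, the name, S3 folder, and launch-mode options map onto flags of the `aws emr create-cluster` CLI command. The cluster name, bucket path, release label, and instance settings below are placeholder values, not ones from this walkthrough:

```shell
# Sketch only (placeholder names and paths; requires AWS credentials).
# --log-uri            -> the "S3 folder" for cluster logs
# --no-auto-terminate  -> "Cluster" launch mode (runs until you terminate it);
#                         for step execution you would pass --steps ... --auto-terminate instead
aws emr create-cluster \
    --name "my-demo-cluster" \
    --log-uri "s3://my-bucket/emr-logs/" \
    --no-auto-terminate \
    --use-default-roles \
    --release-label emr-5.30.0 \
    --applications Name=Hadoop \
    --instance-type m5.xlarge \
    --instance-count 3
```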
- Software configuration:
- Vendor: This lets you choose between Amazon's EMR distribution, built on open source Hadoop, and MapR's distribution.
- Release: This is the EMR release version, which determines the versions of the bundled applications.
- Applications:
- Core Hadoop: This is focused on SQL-style querying (for example, via Hive).
- HBase: This is focused on NoSQL-oriented workloads.
- Presto: This is focused on ad-hoc query processing.
- Spark: This is focused on in-memory distributed processing with Apache Spark.
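The release and application choices above correspond to the `--release-label` and `--applications` flags of the AWS CLI. A hedged sketch, with a placeholder cluster name and an example release label:

```shell
# Sketch only: choose an EMR release and the applications to install.
# Each Name= entry is one of the application options listed above.
aws emr create-cluster \
    --name "spark-presto-cluster" \
    --release-label emr-5.30.0 \
    --applications Name=Spark Name=Presto Name=HBase \
    --use-default-roles \
    --instance-type m5.xlarge \
    --instance-count 3
```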
- Hardware configuration:
- Instance type: This topic will be covered in detail in the next section.
- Number of instances: This refers to the number of nodes in the cluster. One of them will be the master node and the rest will be slave nodes.
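In CLI terms, the hardware options above correspond to the `--instance-type` and `--instance-count` flags. A minimal sketch, assuming a placeholder name and an arbitrary instance type:

```shell
# Sketch only: a 4-node cluster (1 master + 3 slaves) of m5.xlarge instances.
aws emr create-cluster \
    --name "four-node-cluster" \
    --release-label emr-5.30.0 \
    --applications Name=Hadoop \
    --use-default-roles \
    --instance-type m5.xlarge \
    --instance-count 4
```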
- Security and access:
- EC2 key pair: You can associate an EC2 key pair with the cluster that you can use to connect to it via SSH.
- Permissions: You can allow other users besides the default Hadoop user to submit jobs.
- EMR role: This allows EMR to call other AWS services, such as EC2, on your behalf.
- EC2 instance profile: This provides access to other AWS services, such as S3 and DynamoDB, via the EC2 instances that are launched by EMR.
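The security and access options map onto the CLI as well: the key pair and instance profile are passed via `--ec2-attributes`, and the EMR role via `--service-role`. A sketch assuming a placeholder key-pair name and the default EMR role names:

```shell
# Sketch only: attach an EC2 key pair for SSH access, and name the EMR role
# (service role) and EC2 instance profile explicitly (defaults shown).
aws emr create-cluster \
    --name "secure-cluster" \
    --release-label emr-5.30.0 \
    --applications Name=Hadoop \
    --service-role EMR_DefaultRole \
    --ec2-attributes KeyName=my-key-pair,InstanceProfile=EMR_EC2_DefaultRole \
    --instance-type m5.xlarge \
    --instance-count 3
```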