Hadoop on the cloud is a new deployment option that allows organizations to create and customize Hadoop clusters on virtual machines, using the computing resources of virtual instances and deployment scripts. Similar to the on-premise full custom option, this gives businesses full control of the cluster. In addition, it offers flexibility and many advantages, for example, capacity on demand, decreased staff costs, storage services, and technical support. Finally, it offers fast time to value: we can deploy our infrastructure in the Amazon cloud and start analyzing our data very quickly, because we don't need to set up hardware and software and we don't need many technical resources. One of the most popular Hadoop cloud services is Amazon Elastic MapReduce (EMR).
With Hunk we can interactively explore, analyze, and visualize data stored in Amazon EMR and Amazon S3. The integrated offering lets AWS and Splunk customers:
In this chapter, the reader will learn how to run Amazon EMR and deploy Hunk on top of it. In addition, the reader will create virtual indexes and use the Amazon S3 file system.
In this section, we will learn about Amazon EMR and Simple Storage Service (S3). We will also try these services out by creating an EMR cluster and an S3 bucket.
Amazon EMR is a Hadoop framework in the cloud offered as a managed service. It is used by thousands of customers, who have launched millions of EMR clusters for a variety of big data use cases, including log analysis, web indexing, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics. EMR lets us process any type of big data without maintaining our own big data infrastructure.
As with any other Amazon service, EMR is easy to run by filling in a few option forms: we enter the cluster name, the size, and the types of node in the cluster. In two minutes, this creates a fully running EMR cluster that is ready to process data. EMR removes the headache of maintaining clusters and version compatibility; Amazon takes care of all the tasks involved in running and supporting Hadoop.
Let's start an EMR cluster that we can connect to Hunk:
Set the cluster name to emr-cluster-packtpub. In addition, we can switch off Logging and Termination protection.
We can learn more about how to plan the capacity for EMR here: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-instances.html.
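The same cluster settings can also be expressed in code rather than through the console forms. The following is a minimal sketch, assuming the boto3 library and the default EMR roles are available in the account. The release label, instance type, and node count are illustrative assumptions and do not come from this chapter; only the cluster name and the disabled logging and termination protection follow the steps above.

```python
def build_cluster_request(name, release_label, instance_type, node_count):
    """Build the keyword arguments for boto3's EMR run_job_flow call."""
    return {
        "Name": name,
        # No "LogUri" key: leaving it out keeps logging switched off.
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Hadoop"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": node_count,
            # Keep the cluster running so that it can be queried later.
            "KeepJobFlowAliveWhenNoSteps": True,
            # Termination protection switched off, as in the console steps.
            "TerminationProtected": False,
        },
        # Default role names; assumed to exist already in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request, region_name):
    """Send the request to AWS and return the new cluster (job flow) ID."""
    import boto3  # imported here so the request can be built without boto3

    emr = boto3.client("emr", region_name=region_name)
    return emr.run_job_flow(**request)["JobFlowId"]


request = build_cluster_request(
    name="emr-cluster-packtpub",
    release_label="emr-5.36.0",  # assumption: any release available to us
    instance_type="m5.xlarge",   # assumption: not specified in the chapter
    node_count=3,                # assumption: one master and two core nodes
)
```

Calling `launch_cluster(request, "us-east-1")` would start a billable cluster, so the sketch only builds the request and leaves the launch as an explicit, separate step.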
If we want to pay for Hunk by the hour, we should add Hunk as an additional application:
Amazon S3 provides developers and IT teams with secure, durable, and highly scalable object storage. Amazon S3 is easy to use, with a simple web service interface to store and retrieve any amount of data from anywhere on the Web. We can easily write and retrieve objects, which can range in size from a few bytes to terabytes, and we can work with an unlimited number of files. Finally, it is backed by the AWS SLA.
Amazon S3 can be a data provider for Hunk. We will use two files with two weeks' worth of all HTTP requests to the ClarkNet WWW server; ClarkNet is a full Internet access provider for the Metro Baltimore-Washington DC area. The files can be found in the attachment to this chapter or as a direct download from http://ita.ee.lbl.gov/html/contrib/ClarkNet-HTTP.html. Let's create a bucket and upload them:
Create a bucket named my-web-logs.
Upload the files clarknet_access_log_Aug28 and clarknet_access_log_Sep4.
Amazon Elastic MapReduce can be seen as both a complement and a competitor to Hadoop. EMR can run on top of a Hadoop HDFS cluster, but it can also run directly on top of AWS S3. There are several advantages to using S3 and EMR together. First of all, using Amazon EMR and S3 gives us full native support for accessing data; in other words, we are provided with a full distributed file system and full support from EMR. EMR runs on top of S3, and S3 works as the data store. In addition, this allows us to avoid the complexity of Hadoop and HDFS management, which is not easy to maintain when Hadoop runs on-premise.
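The bucket creation and file upload from this section can also be scripted. The sketch below assumes a boto3 S3 client is passed in; the helper name `upload_logs` and the local file paths are hypothetical. It returns the `s3://` locations of the uploaded objects, which is the form in which data stored in S3 is addressed from a cluster. Note that bucket names are global, so a name such as my-web-logs may already be taken and might need a unique suffix.

```python
import os


def upload_logs(s3_client, bucket, paths, region_name="us-east-1"):
    """Create the bucket, upload each file, and return the s3:// URIs."""
    if region_name == "us-east-1":
        # us-east-1 is the default location and takes no location constraint.
        s3_client.create_bucket(Bucket=bucket)
    else:
        s3_client.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": region_name},
        )

    uris = []
    for path in paths:
        key = os.path.basename(path)  # object key = file name without folders
        s3_client.upload_file(path, bucket, key)
        uris.append(f"s3://{bucket}/{key}")
    return uris


# Usage (not executed here, because it needs AWS credentials and boto3):
#
#   import boto3
#   s3 = boto3.client("s3", region_name="us-east-1")
#   upload_logs(s3, "my-web-logs",
#               ["clarknet_access_log_Aug28", "clarknet_access_log_Sep4"])
```

Passing the client in as a parameter keeps the helper easy to try out with a stand-in object before touching a real account.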
Moreover, EMR is elastic; it is easy to resize clusters dynamically on demand. Finally, it uses a pay-for-what-you-use model. For example:
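To make the pay-for-what-you-use idea concrete, here is a small arithmetic sketch. The hourly rate is a hypothetical placeholder, not a real AWS price, and the calculation deliberately ignores storage, data transfer, and any other charges. It only compares a cluster that runs for a short job with one that is kept running for a 30-day month.

```python
def cluster_cost(node_count, hours, hourly_rate_per_node):
    """Cost of running node_count nodes for the given number of hours."""
    return node_count * hours * hourly_rate_per_node


# Hypothetical placeholder rate in dollars per node-hour (not an AWS price).
HYPOTHETICAL_RATE = 0.25

# A 10-node cluster started for a 3-hour job and terminated afterwards.
on_demand = cluster_cost(10, 3, HYPOTHETICAL_RATE)

# The same 10 nodes kept running around the clock for 30 days.
always_on = cluster_cost(10, 24 * 30, HYPOTHETICAL_RATE)

print(f"3-hour job: ${on_demand:.2f}")   # 3-hour job: $7.50
print(f"30 days on: ${always_on:.2f}")   # 30 days on: $1800.00
```

Under these assumed numbers, paying only for the hours the job needs costs a small fraction of keeping equivalent capacity running all month.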
Furthermore, EMR and S3 are very popular with thousands of customers; they have a big ecosystem and a very large community.