Integrating Hunk with EMR and S3 is a sensible proposition. If we connect the vast amounts of data stored in HDFS or S3 with the rich capabilities of Hunk, we can build a full analytics solution in the cloud for data of any type and size:
Fundamentally, we have a three-tier architecture. The first tier is data storage based on HDFS or S3. The next one is the compute or processing framework, provided by EMR. Finally, the visualization, data discovery, analytics, and app development framework is provided by Hunk.
The traditional method for hosting Hunk in the cloud is to buy a standard license and then provision a virtual machine in much the same way as on-premises. The instance then has to be manually configured to point to the correct Hadoop or AWS cluster. This method is also called Bring Your Own License (BYOL).
On the other hand, Splunk and Amazon offer another method, in which Hunk instances can be provisioned automatically in AWS. This includes automatic discovery of EMR data sources, which allows instances to be brought online in a matter of minutes. To take advantage of this, Hunk instances are billed at an hourly rate. Let's try both methods.
We have already run an EMR cluster. In addition, we should load data into S3 or HDFS.
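As a minimal sketch, loading data into S3 can be done with the AWS CLI; the bucket name and prefix below are hypothetical placeholders, and the upload commands themselves are shown commented out because they require configured AWS credentials:

```shell
# Hypothetical bucket and prefix -- substitute your own.
BUCKET="hunk-demo-logs"
PREFIX="weblogs"

# With AWS credentials configured, the upload would be:
#   aws s3 mb "s3://${BUCKET}"
#   aws s3 cp access.log "s3://${BUCKET}/${PREFIX}/access.log"

# This is the S3 path we will later point a virtual index at:
echo "s3://${BUCKET}/${PREFIX}/"
```

Keeping the path in variables like this makes it easy to reuse the same location later when configuring the provider and the virtual index.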
Let's find and run the Hunk AMI:
There is detailed information about Amazon Machine Images at: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html.
It is important to create an instance with enough resources to accommodate the expected workload and search concurrency. For instance, a c3.2xlarge (high-CPU instance) provides a good starting point.
It is important that the Hunk instance can communicate with all EMR cluster nodes. To do this, we have to edit Security Groups in the EC2 Management page to make sure traffic is allowed to and from all ports.
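The same rules can be added from the AWS CLI instead of the EC2 Management page. The sketch below assumes two hypothetical security group IDs, one for the Hunk instance and one for the EMR nodes; the actual `authorize-security-group-ingress` calls are commented out because they require configured AWS credentials:

```shell
# Hypothetical security group IDs -- substitute your own.
HUNK_SG="sg-11111111"     # security group of the Hunk instance
EMR_SG="sg-22222222"      # security group of the EMR cluster nodes

# With credentials configured, the rules allowing all traffic in both
# directions would be added like this (--protocol -1 means all):
#   aws ec2 authorize-security-group-ingress \
#       --group-id "${EMR_SG}" --protocol -1 --source-group "${HUNK_SG}"
#   aws ec2 authorize-security-group-ingress \
#       --group-id "${HUNK_SG}" --protocol -1 --source-group "${EMR_SG}"

echo "ingress: ${HUNK_SG} <-> ${EMR_SG}"
```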
If we are using S3 for our data directories, we have to set up a Hunk working directory in HDFS. This improves processing speed and allows us to keep our directory read-only, if desired.
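Creating that working directory boils down to two HDFS commands. The path below is a hypothetical choice, and the `hadoop fs` calls are shown commented out since they must run on a cluster node (for example, the EMR master):

```shell
# Hypothetical HDFS working directory for Hunk -- substitute your own.
WORKDIR="/user/hunk/workdir"

# On the EMR master, the directory would be created and made writable:
#   hadoop fs -mkdir -p "${WORKDIR}"
#   hadoop fs -chmod 777 "${WORKDIR}"

# This path is later supplied to the provider as its HDFS working
# directory (the vix.splunk.home.hdfs setting).
echo "${WORKDIR}"
```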
We should attach our Hunk instance to the ElasticMapReduce-master security group so that it can reach the EMR cluster. In addition, we should attach it to Hunk-Group in order to access the web interface. As a result, we have a running EMR cluster and a Hunk instance with security configured. Next, we need the Public DNS: in the EC2 console, choose our Hunk instance and copy its Public DNS. We can also copy the Instance ID, which serves as the initial password for Hunk:
Then paste it into a browser, appending :8000, the default Hunk web port. We then see the Hunk web interface:
Since we chose the BYOL model, we need to add a license file. Go to Settings | Licensing and click on Add license to upload the license file:
By default we can use the trial license for 60 days.
Let's configure the data provider so that we can create a virtual index and start exploring our log file stored in S3. In the provider settings, we supply the following paths:

Java Home: /opt/java/latest/
Hadoop Home: /opt/hadoop/apache/hadoop-X.X.X
File System: s3n://<AWS key>:<AWS secret>@<s3 bucket path>

After the provider, we need a new virtual index. On the Virtual Indexes tab, click New Virtual Index. Add a unique name and the full S3 path to the logs (optionally, we can use the Whitelist field if there are many log types in that path), and then click Save:
We can connect Hunk instances via SSH using our key pair and set up a data provider via configuration files. For a step-by-step guide, see: http://docs.splunk.com/Documentation/Hunk/latest/Hunk/Setupavirtualindex.
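For illustration, the resulting stanzas in indexes.conf might look roughly like the following sketch; the provider name emr-provider and index name weblogs are hypothetical, while the vix.* settings are standard Hunk virtual-index options:

```ini
# Hypothetical provider and virtual index stanzas; adapt names and
# paths to your environment.

[provider:emr-provider]
vix.family = hadoop
vix.env.JAVA_HOME = /opt/java/latest/
vix.env.HADOOP_HOME = /opt/hadoop/apache/hadoop-X.X.X
vix.fs.default.name = s3n://<AWS key>:<AWS secret>@<s3 bucket path>
vix.splunk.home.hdfs = /user/hunk/workdir

[weblogs]
vix.provider = emr-provider
vix.input.1.path = /weblogs/...
```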
In order to connect the instance via the Terminal, we can use the following command:
ssh -i <private key> ec2-user@<public DNS>
If we don't have a Hunk license, we can use Hunk on a pay-as-you-go basis. In order to use this method, we should add Hunk as an additional application during the configuration of EMR clusters (see the Setting up an Amazon EMR cluster section).
Here, we have two options for provisioning Hunk.
We can go to http://aws.amazon.com/cloudformation/ and create a new stack. Then, we should configure Hunk as usual.
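The same stack can also be launched from the AWS CLI. In the sketch below, the stack name and template URL are hypothetical placeholders, and the `cloudformation` calls are commented out because they require configured AWS credentials:

```shell
# Hypothetical stack name and template URL -- substitute your own.
STACK="hunk-stack"
TEMPLATE="https://example.com/hunk-template.json"

# With credentials configured, the stack would be created and waited on:
#   aws cloudformation create-stack \
#       --stack-name "${STACK}" --template-url "${TEMPLATE}"
#   aws cloudformation wait stack-create-complete --stack-name "${STACK}"

echo "stack: ${STACK}"
```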