Hadoop has become an enterprise standard for large organizations mining data and implementing big data strategies. The use of Hadoop at larger scale is set to become the new standard for practical, result-driven data mining applications. However, extracting data from Hadoop in order to explore it and find business insights is a challenging task. Hadoop provides cheap storage for any kind of data but is, unfortunately, inflexible for data analytics. There are plenty of tools that can add flexibility and interactivity to analytics tasks, but they come with many restrictions.
Hunk avoids the main drawbacks of big data analytics and offers rich functionality and interactivity for analytics.
In this chapter you will learn how to deploy Hunk on top of Hadoop in order to start discovering it. In addition, we will load data into Hadoop and explore it via Hunk, using the Search Processing Language (SPL). Finally, we will learn about Hunk security.
In order to start exploring Hadoop data, we have to install Hunk on top of our Hadoop cluster. Hunk is easy to install and configure. Let's learn how to deploy Hunk version 6.2.1 on top of an existing CDH cluster. It's assumed that your VM is up and running.
Type `ls -la` to see the list of files in your home directory:

```
[cloudera@quickstart ~]$ cd ~
[cloudera@quickstart ~]$ ls -la | grep hunk
-rw-r--r-- 1 root root 113913609 Mar 23 04:09 hunk-6.2.1-249325-Linux-x86_64.tgz
```
Extract the Hunk archive into /opt:

```
cd /opt
sudo tar xvzf /home/cloudera/hunk-6.2.1-249325-Linux-x86_64.tgz -C /opt
```
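You can quickly confirm that the archive unpacked where we expect:

```
# The distribution lands in /opt/hunk; bin/ holds the splunk launcher.
ls /opt/hunk/bin
```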
Set the `SPLUNK_HOME` environment variable. This variable has already been added to the profile:

```
export SPLUNK_HOME=/opt/hunk
```
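If the variable is missing from your profile (for example, on a fresh image), you can add it yourself; a minimal sketch, assuming the bash shell:

```
# Persist SPLUNK_HOME for future shell sessions.
echo 'export SPLUNK_HOME=/opt/hunk' >> ~/.bash_profile
source ~/.bash_profile
echo $SPLUNK_HOME   # should print /opt/hunk
```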
Create `splunk-launch.conf`. This is the basic properties file used by the Hunk service. We don't have to change anything special, so let's use the default settings:

```
sudo cp /opt/hunk/etc/splunk-launch.conf.default /opt/hunk/etc/splunk-launch.conf
```
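If you are curious what those defaults look like, the file is plain text and the default copy consists mostly of comments:

```
# Show only the active (non-comment, non-blank) settings.
grep -v '^#' /opt/hunk/etc/splunk-launch.conf | grep -v '^$'
```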
Run Hunk using the following command:
```
sudo /opt/hunk/bin/splunk start --accept-license
```
Here is the sample output from the first run:
```
This appears to be your first time running this version of Splunk.
Copying '/opt/hunk/etc/openldap/ldap.conf.default' to '/opt/hunk/etc/openldap/ldap.conf'.
Generating RSA private key, 1024 bit long modulus
[... some output lines were deleted to reduce the amount of log text ...]
Waiting for web server at http://127.0.0.1:8000 to be available.... Done
If you get stuck, we're here to help.
Look for answers here: http://docs.splunk.com
The Splunk web interface is at http://vm-cluster-node1.localdomain:8000
```
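The same launcher script can be used later to check on or restart the service; `status` and `restart` are standard Splunk CLI subcommands:

```
# Check whether splunkd and the web interface are running.
sudo /opt/hunk/bin/splunk status
# Restart after configuration changes.
sudo /opt/hunk/bin/splunk restart
```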
Now you can access the Hunk UI at http://localhost:8000 in the browser on your virtual machine.
We need to accomplish two tasks: provide a technical connector to the underlying data storage, and create a virtual index for the data on that storage.
Log in to http://localhost:8000. The system will ask you to change the default admin user's password; I have set it to admin.
Right now we are ready to set up the integration between Hadoop and Hunk. First, we need to specify how Hunk connects to the current Hadoop installation; we are using the most recent approach: YARN with MR2. Then we have to point virtual indexes at the data stored in Hadoop:
Let's fill in the form to create a data provider. The data provider component is used to interact with frameworks such as Hadoop. You should set the necessary properties to make sure the provider correctly reads data from the underlying datasource. (We will also create a data provider for MongoDB later in this book.) You don't have to install anything special: the Cloudera VM used as the base for this example already carries all the necessary software, and Java JDK 1.7 is on board.
| Property name | Value |
| --- | --- |
| Name | |
| Java home | |
| Hadoop home | |
| Hadoop version | Hadoop 2.x with YARN (MR2) |
| Filesystem | |
| Resource Manager Address | |
| Resource Scheduler Address | |
| HDFS Working Directory | /user/hunk |
| Job Queue | default |
You don't have to modify any other properties. On this VM the HDFS working directory has been created for you in advance; if you ever need to create it yourself, use this command:

```
sudo -u hdfs hadoop fs -mkdir -p /user/hunk
```
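You can verify that the directory exists and, if the account running Hunk needs write access, hand it ownership; the cloudera user below is an assumption based on the quickstart VM:

```
# Confirm the working directory is present in HDFS.
sudo -u hdfs hadoop fs -ls /user
# Optionally give ownership to the account that runs Hunk (assumed: cloudera).
sudo -u hdfs hadoop fs -chown -R cloudera:cloudera /user/hunk
```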
You should see the following screen, if you did everything correctly:
Let's briefly discuss what we have done. Hunk keeps its MapReduce libraries in /opt/hunk/bin/jars and, when you run a search, submits a job to the cluster with a command along the lines of:

```
sudo bash /usr/bin/hadoop jar "/opt/hunk/bin/jars/SplunkMR-s6.0-hy2.0.jar" "com.splunk.mr.SplunkMR"
```
We left the Job Queue set to default, since we are not discussing cluster utilization and load balancing.

Now it's time to create a virtual index. We are going to add a dataset of AVRO files to the virtual index as example data. We will work with that index later in Chapter 6, Discovering Hunk Integration Apps.
A virtual index is metadata: it tells Hunk where the data is located and which provider should be used to read it. The goal of a virtual index is to declare access to data, which can be structured or unstructured. A virtual index is immutable; you can only read data through this type of index. In short, the data provider tells Hunk how to read data, and the virtual index declares the data's properties.
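For reference, both the provider and the virtual index end up as stanzas in an indexes.conf file under $SPLUNK_HOME/etc. Here is a minimal sketch of what such a configuration can look like, based on the vix.* settings documented for Hunk; the stanza names, host names, and paths below are illustrative assumptions matching the Cloudera quickstart VM, not values taken from this chapter:

```
# Data provider stanza (names and addresses are illustrative assumptions).
[provider:hadoop-provider]
vix.family = hadoop
vix.env.JAVA_HOME = /usr/java/jdk1.7.0_67-cloudera
vix.env.HADOOP_HOME = /usr/lib/hadoop
vix.fs.default.name = hdfs://quickstart.cloudera:8020
vix.mapreduce.framework.name = yarn
vix.yarn.resourcemanager.address = quickstart.cloudera:8032
vix.yarn.resourcemanager.scheduler.address = quickstart.cloudera:8030
vix.splunk.home.hdfs = /user/hunk

# Virtual index stanza pointing at the provider above.
[hunk_avro]
vix.provider = hadoop-provider
vix.input.1.path = /user/hunk/...
```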
| Property name | Value |
| --- | --- |
| Name | |
| Path to data in HDFS | |
Select the file part-m-00000.avro by clicking on it. The Next button will be activated after you pick a file.

Pay attention to the Time column and to the field named time_interval in the Event column. The time_interval field keeps the timestamp of the record. Hunk should automatically use that field as the time field, which allows you to run a search query over a time range. This is a typical pattern for reading time-series data.
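Once the virtual index is saved, you can sanity-check the time handling by running a time-bounded search from the command line; a minimal sketch, assuming the index was named hunk_avro (a hypothetical placeholder). The CLI will prompt for your admin credentials:

```
# Count events from the last seven days, bucketed per day.
/opt/hunk/bin/splunk search 'index=hunk_avro earliest=-7d@d latest=now | timechart span=1d count'
```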