Data is the new oil.
—Clive Humby
Data is important in our lives, and we use data all the time. For example, experimental data saved in a file, student or employee records saved in a database, sales figures saved in a spreadsheet, as well as common Word, Excel, and PowerPoint files, sound files, and movie files, are all traditional data files, which can be stored, analyzed, and displayed using a standard personal computer. The sizes of traditional data files range from kilobytes (KB, 2¹⁰ bytes), megabytes (MB, 2²⁰ bytes), and gigabytes (GB, 2³⁰ bytes) to, sometimes, terabytes (TB, 2⁴⁰ bytes). However, with the rapid expansion of the Internet and the increase in mobile users, data stores can now exceed a petabyte (PB, 2⁵⁰ bytes) in size. These data stores are called big data. Big data is too large and too complicated to be stored and analyzed using traditional computer hardware and software.
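Because these binary units come up throughout this chapter, the following small Java snippet (a hypothetical illustration, not one of the chapter's examples) shows how each unit is a power of two:

```java
public class DataUnits {
    public static void main(String[] args) {
        // Binary unit sizes as powers of two (long literals to avoid int overflow)
        long kb = 1L << 10; // kilobyte, 2^10 bytes
        long mb = 1L << 20; // megabyte, 2^20 bytes
        long gb = 1L << 30; // gigabyte, 2^30 bytes
        long tb = 1L << 40; // terabyte, 2^40 bytes
        long pb = 1L << 50; // petabyte, 2^50 bytes
        System.out.println("1 KB = " + kb + " bytes");
        System.out.println("1 MB = " + mb + " bytes");
        System.out.println("1 GB = " + gb + " bytes");
        System.out.println("1 TB = " + tb + " bytes");
        System.out.println("1 PB = " + pb + " bytes");
    }
}
```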
According to Wikipedia, the world's technological per-capita data capacity has doubled roughly every 40 months since the 1980s. As of 2012, about 2.5 exabytes (2.5×10¹⁸ bytes) of data were generated every day. An International Data Corporation (IDC) report predicted that the volume of global data would grow exponentially from 4.4 zettabytes (4.4×10²¹ bytes) in 2013 to 44 zettabytes in 2020, and that there will be 163 zettabytes of data by 2025, as illustrated in Figure 11.1.
Java, as a modern high-level programming language, is well suited for big data applications. Java is the language of Hadoop, the most popular big data software framework, and many key Hadoop modules, such as MapReduce, run on Java Virtual Machines (JVMs). Java runs on many devices and platforms. Java provides built-in data structures, such as the stack, and automates garbage collection and memory allocation. Java has rich networking functionality. Finally, Java has strong security compliance and is a secure programming language.
Several sources are generating big data today.
The three defining characteristics of Big Data are its volume, velocity, and variety, sometimes described as “the three Vs of Big Data.” Here's a quick look at each:
Big data is one of the most important emerging digital technologies in the modern world, and big data analysis can have many benefits. For example, by analyzing the information kept in a social network such as Facebook, you can better understand users' responses to product advertisements and to social, economic, and political issues. By analyzing the information on e-commerce websites, product companies and retail organizations can better plan their production according to the preferences and product perceptions of their consumers. By analyzing the medical histories of millions of patients, doctors can make easier and earlier diagnoses, and hospitals can provide better and quicker service.
Hadoop is an open-source software framework developed by the Apache Software Foundation in the mid-2000s for the purpose of working with big data. Hadoop allows users to store and process large datasets in a distributed environment across clusters of computers using simple programming models. Hadoop is designed to scale up from a single computer to thousands of computers, each having its own local computation and storage. A Hadoop cluster consists of a master node and a number of slave nodes. Hadoop can handle various forms of structured and unstructured data, which makes big data analysis more flexible. Hadoop uses a namesake distributed file system that's designed to provide rapid data access across the nodes in a cluster. Hadoop also has fault-tolerant capabilities so that applications can continue to run even if individual nodes fail. Thanks to these advantages, Hadoop has become a key data management platform for big data analytics, such as predictive analytics, data mining, and machine learning applications. Hadoop is mostly written in Java, with some native code in C and command-line utilities written as shell scripts.
To date, financial companies have used Hadoop to build applications for assessing risk, building investment models, and creating trading algorithms. Retailers use Hadoop for analyzing structured and unstructured data to better understand and serve their customers. Telecommunications companies use Hadoop-powered analytics for predictive maintenance on their infrastructure, as well as for supporting customer-facing operations. By analyzing customer behavior and billing statements, they can offer new services to existing customers. Hadoop has been used extensively in many other applications and is a de facto standard in big data.
Hadoop consists of several key components: Hadoop Distributed File System (HDFS), MapReduce, Hadoop Common, and Hadoop Yet Another Resource Negotiator (YARN).
Hadoop Distributed File System (HDFS) provides data storage. It is designed to run on low-cost hardware with high fault tolerance. In HDFS, files are split up into blocks that are replicated to the DataNodes. By default, each block is 64 MB in size (128 MB in Hadoop 2) and is replicated to three DataNodes in the cluster. HDFS uses TCP/IP sockets for communication, and nodes in the cluster use remote procedure calls (RPCs) to communicate with each other.
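As a rough illustration of how block splitting and replication multiply storage, the following plain-Java sketch computes how many blocks a file occupies and how many block replicas the cluster ends up storing. The class and method names are invented for this example; they are not part of the Hadoop API.

```java
public class HdfsBlockMath {
    // Number of blocks a file of the given size occupies (ceiling division)
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;      // 64 MB block size
        int replication = 3;                     // default replication factor
        long fileSize = 1L * 1024 * 1024 * 1024; // a hypothetical 1 GB file
        long blocks = blockCount(fileSize, blockSize);
        // Each block is stored on `replication` DataNodes
        System.out.println(blocks + " blocks, "
                + (blocks * replication) + " replicas stored");
    }
}
```

For a 1 GB file with 64 MB blocks, this yields 16 blocks and, at a replication factor of 3, 48 stored block replicas across the cluster.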
HDFS has five services: NameNode, Secondary NameNode, JobTracker, DataNode, and TaskTracker.
The first three services are master services, and the last two are slave services. Master services run on the master node, and slave services run on both the master node and the slave nodes.
MapReduce provides data processing. It is a software framework, written in Java, that can be used to create applications for processing large amounts of data. Like HDFS, MapReduce is built to be fault tolerant and to work in large-scale cluster environments. MapReduce splits the input data into smaller tasks (map tasks) that can be executed in parallel. The output from the map tasks is then reduced, and the results are saved to HDFS. For example, suppose you want to create a word count or concordance, that is, to count the number of times each word is used across a set of documents. MapReduce divides this job into two phases: the map phase counts the words in each document, and the reduce phase then aggregates the per-document counts into word counts spanning the entire collection.
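The two phases just described can be sketched in plain Java, without the Hadoop API. This hypothetical WordCountSketch class (the names are invented for illustration) shows only the logic that Hadoop would distribute across a cluster:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountSketch {
    // Map phase: count the words in a single document
    static Map<String, Integer> map(String document) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : document.toLowerCase().split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Reduce phase: aggregate per-document counts into collection-wide counts
    static Map<String, Integer> reduce(List<Map<String, Integer>> perDocument) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> counts : perDocument) {
            counts.forEach((word, n) -> totals.merge(word, n, Integer::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> documents = Arrays.asList("big data is big", "data is the new oil");
        // In Hadoop, each map task would run on a different node, in parallel
        List<Map<String, Integer>> mapped =
                documents.stream().map(WordCountSketch::map).collect(Collectors.toList());
        Map<String, Integer> totals = reduce(mapped);
        System.out.println(totals.get("data")); // "data" appears once per document: prints 2
    }
}
```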
MapReduce is Hadoop's native batch processing engine, which involves the following basic steps:
Hadoop Common is a set of shared Java utilities and libraries, which are required by other Hadoop modules. These Java libraries provide file system and OS-level abstractions and contain the necessary Java files and scripts required to start Hadoop.
Hadoop YARN is a framework for job scheduling and cluster resource management. YARN makes it possible to run much more diverse workloads on a Hadoop cluster.
Figure 11.2 shows the schematic diagram of a multinode Hadoop cluster. A typical small Hadoop cluster includes a single master node and multiple slave (worker) nodes. The master node runs a JobTracker, a TaskTracker, a NameNode, and a DataNode. A slave or worker node runs both a DataNode and a TaskTracker.
Hadoop runs only on Linux-like operating systems, and it requires Java Runtime Environment (JRE) 1.6 or higher. It also requires that Secure Shell (SSH) be set up between the nodes in the cluster for the standard Hadoop startup and shutdown scripts. In this chapter's hands-on example, we are going to set up Hadoop on a two-node Raspberry Pi 3 cluster, as illustrated in Figure 11.3 and summarized in Table 11.1. The Raspberry Pi provides an excellent, cheap solution for building your own Hadoop data cluster for learning and practicing purposes. Even though the data this system will handle won't be very “big,” it will demonstrate the structure of a cluster.
Table 11.1: The Hadoop Configuration on the Two Raspberry Pi Nodes
NAME | IP ADDRESS | HADOOP SERVICES |
Node1 | 192.168.1.139 | NameNode, Secondary NameNode, JobTracker, DataNode, TaskTracker |
Node2 | 192.168.1.79 | DataNode, TaskTracker |
Please refer to Chapter 7 section 7.6 for Raspberry Pi installation and configuration.
Once the Raspberry Pi is up and running, open a Raspberry Pi terminal window; the rest of the procedure will be done through the terminal window.
Type sudo nano /etc/hostname in the terminal window, and change its content to the following:
node1
This will change the hostname of the Raspberry Pi to node1. Here, nano is a Linux text editor, and sudo instructs the system to modify the file as a superuser. You can also use other Linux text editors, such as vi, vim, pico, emacs, or sublime.
Type sudo nano /etc/hosts in the terminal window, and append the following line to its content. This will set the IP address of node1.
192.168.1.139 node1
Java should be preinstalled with Raspbian; you can double-check this by typing java -version in a Raspberry Pi terminal window:
java -version
java version "1.8.0_65"
Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
Java HotSpot(TM) Client VM (build 25.65-b01, mixed mode)
Now restart Raspberry Pi by typing the following command:
sudo reboot
In this section, you'll take the following steps to install and configure Hadoop on your cluster:
To install Hadoop, you will need to prepare a Hadoop user account and a Hadoop group. From a Raspberry Pi terminal window, type the following commands to create a Hadoop group called hadoop, add a user called hduser to the group, and make hduser a superuser or administrator. The second command will prompt you for a password and other information for hduser; provide a password, and accept the default values for the rest of the information.
$sudo addgroup hadoop
$sudo adduser --ingroup hadoop hduser
$sudo adduser hduser sudo
Type the following commands to create a pair of SSH RSA keys with a blank password so that Hadoop nodes will be able to talk with each other without prompting for a password:
$su hduser
$mkdir ~/.ssh
$ssh-keygen -t rsa -P ""
$cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys
$exit
Type the following commands to verify that hduser can log in over SSH:
$su hduser
$ssh localhost
$exit
Type the following commands to download and install Hadoop version 2.9.2:
$cd ~/
$wget http://mirror.vorboss.net/apache/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz
$sudo mkdir /opt
$sudo tar -xvzf hadoop-2.9.2.tar.gz -C /opt/
$cd /opt
$sudo mv hadoop-2.9.2 hadoop
$sudo chown -R hduser:hadoop hadoop
Type sudo nano /etc/bash.bashrc to add the following lines at the end of the bash.bashrc file:
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_HOME=/opt/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_INSTALL/bin
Then type source /etc/bash.bashrc (or open a new terminal window) to apply the changes. Now switch to hduser to verify that the hadoop executable is accessible outside the /opt/hadoop/bin folder:
$su hduser
$hadoop version
hduser@node1 /home/hduser $ hadoop version
Hadoop 2.9.2
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 826afbeae31ca687bc2f8471dc841b66ed2c6704
Compiled by ajisaka on 2018-11-13T12:42Z
Compiled with protoc 2.5.0
From source with checksum 3a9939967262218aa556c684d107985
This command was run using /opt/hadoop/share/hadoop/common/hadoop-common-2.9.2.jar
Type sudo nano /opt/hadoop/etc/hadoop/hadoop-env.sh, and then append the following lines:
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_HEAPSIZE=250
Now move to the /opt/hadoop/etc/hadoop/ directory, and use a text editor to edit the following configuration files as a superuser:
core-site.xml<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://node1:9000</value>
</property>
</configuration>
mapred-site.xml<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
hdfs-site.xml<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/opt/hadoop/hadoop_data/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/opt/hadoop/hadoop_data/hdfs/datanode</value>
</property>
</configuration>
yarn-site.xml<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
Type the following commands to create and format an HDFS file system:
$sudo mkdir -p /opt/hadoop/hadoop_data/hdfs/namenode
$sudo mkdir -p /opt/hadoop/hadoop_data/hdfs/datanode
$sudo chown hduser:hadoop /opt/hadoop/hadoop_data/hdfs -R
$sudo chmod 750 /opt/hadoop/hadoop_data/hdfs
$cd $HADOOP_INSTALL
$hdfs namenode -format
Type the following commands to switch to hduser and start the Hadoop services. The jps command displays all the Java services that are running:
$su hduser
$cd $HADOOP_HOME/sbin
$./start-dfs.sh
$./start-yarn.sh
$jps
2082 NameNode
2578 ResourceManager
2724 Jps
2344 SecondaryNameNode
2683 NodeManager
2189 DataNode
When you are finished, stop the Hadoop services with the following commands:
$./stop-dfs.sh
$./stop-yarn.sh
Type in the following commands to run an example program provided with Hadoop, named pi, which estimates the value of pi:
$cd $HADOOP_INSTALL/bin
$./hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar pi 16 1000
Let's run another example provided with Hadoop, named wordcount. Enter the following command to copy the LICENSE.txt file from the /opt/hadoop/ folder in the local file system to Hadoop's distributed file system as license.txt in the root folder, that is, the / folder:
$hdfs dfs -copyFromLocal /opt/hadoop/LICENSE.txt /license.txt
Enter the following command to list the contents of the / folder in the Hadoop distributed file system:
$hdfs dfs -ls /
Enter the following commands to run the wordcount example and save the result into a folder called license-out.txt:
$cd /opt/hadoop/bin
$./hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar wordcount /license.txt /license-out.txt
Figure 11.4 shows the output from running wordcount on the file license.txt. Because the output is long, only the beginning (top) and the end (bottom) of the output are shown.
Type in the following commands to copy the license-out.txt folder from the Hadoop distributed file system to the local file system, and open license-out.txt/part-r-00000 with nano:
$hdfs dfs -copyToLocal /license-out.txt ~/
$nano ~/license-out.txt/part-r-00000
Figure 11.5 shows the previous commands and all the word count results in nano.
If you don't want to open the license-out.txt/part-r-00000 file in an editor, you can also use the following command to simply display its content:
$cat ~/license-out.txt/part-r-00000
The following command shows how to delete a directory with nonempty files in it:
$hdfs dfs -rm -r /license-out.txt
You can view Hadoop and its application cluster with a web browser, as shown in Figure 11.6.
http://node1:50070
http://node1:8088
To add the second node, node2, to the Hadoop cluster, get another Raspberry Pi, ideally an identical one. Then you can simply clone node1's SD card for node2 by using Win32 Disk Imager, as shown in Figure 11.7, which is available from the following address:
https://sourceforge.net/projects/win32diskimager/files/latest/download
Once node2 is up and running, remember to edit the /etc/hosts file to include the IP addresses of both nodes, as shown next. Change the IP addresses to match yours accordingly. You will also need to reconfigure SSH for node2.
$sudo nano /etc/hosts
192.168.1.139 node1
192.168.1.79 node2
Finally, type in the following commands to delete the old HDFS storage and set the permissions on node1:
$sudo rm -rf /opt/hadoop/hadoop_data
$sudo mkdir -p /opt/hadoop/hadoop_data/hdfs/namenode
$sudo mkdir -p /opt/hadoop/hadoop_data/hdfs/datanode
$sudo chown hduser:hadoop /opt/hadoop/hadoop_data/hdfs -R
$sudo chmod 750 /opt/hadoop/hadoop_data/hdfs
Type the similar commands on node2; because node2 runs only a DataNode, it needs only the datanode directory:
$sudo rm -rf /opt/hadoop/hadoop_data
$sudo mkdir -p /opt/hadoop/hadoop_data/hdfs/datanode
$sudo chown hduser:hadoop /opt/hadoop/hadoop_data/hdfs -R
$sudo chmod 750 /opt/hadoop/hadoop_data/hdfs
See the following resources for more information about Hadoop:
http://hadoop.apache.org/docs/stable/
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html
https://www.javatpoint.com/hadoop-tutorial
You can also write your own Java programs for Hadoop. Example 11.1 shows a simple Java program that does word counting, as described earlier. It has three classes: the MyMapper class represents the Map phase of Hadoop MapReduce and implements the map() method to count the words in each document; the MyReducer class represents the Reduce phase and implements the reduce() method to aggregate the per-document counts into word counts for all the documents; and the main class invokes the MyMapper and MyReducer classes and displays the final results.
To compile the previous Java Hadoop program, you need to include the Hadoop class path into the compilation class path. From the Raspberry Pi Terminal window, type in the following command to display the content of the Hadoop class path:
$hadoop classpath
/opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/hdfs:/opt/hadoop/share/hadoop/hdfs/lib/*:/opt/hadoop/share/hadoop/hdfs/*:/opt/hadoop/share/hadoop/yarn:/opt/hadoop/share/hadoop/yarn/lib/*:/opt/hadoop/share/hadoop/yarn/*:/opt/hadoop/share/hadoop/mapreduce/lib/*:/opt/hadoop/share/hadoop/mapreduce/*:/usr/lib/jvm/jdk-8-oracle-arm32-vfp-hflt/jre//lib/tools.jar:/opt/hadoop/contrib/capacity-scheduler/*.jar
Type in the following command to compile the Java program, which includes the Hadoop class path:
$javac -cp $(hadoop classpath) MyWordCounter.java
Enter the following command to list all the class files that have been generated:
$ls -la My*.class
-rw-r--r-- 1 hduser hadoop 1682 Apr 10 04:45 MyMapper.class
-rw-r--r-- 1 hduser hadoop 1690 Apr 10 04:45 MyReducer.class
-rw-r--r-- 1 hduser hadoop 1728 Apr 10 04:45 MyWordCounter.class
-rw-r--r-- 1 hduser hadoop 1734 Apr 10 04:41 'MyWordCounter$MyMapper.class'
-rw-r--r-- 1 hduser hadoop 1743 Apr 10 04:41 'MyWordCounter$MyReducer.class'
Then use the following command to package all the class files into a JAR file called Test.jar:
$jar -cvf Test.jar My*.class
Now enter the following command to run the Test.jar file on the /license.txt file and save the result in a folder named /license-out.txt, exactly as we did previously.
$hadoop jar Test.jar MyWordCounter /license.txt /license-out.txt
Similarly, type in the following commands to copy the license-out.txt folder from the Hadoop distributed file system to the local file system, and open license-out.txt/part-r-00000 with nano:
$hdfs dfs -copyToLocal /license-out.txt ~/
$nano ~/license-out.txt/part-r-00000
Finally, type in the following command to delete the /license-out.txt directory:
$hdfs dfs -rm -r /license-out.txt
Example 11.2 shows a variation of the previous Java program; it counts the number of characters in documents. The main difference is in the MyMapper class, which counts characters rather than words. The MyReducer class and the main class are largely the same. Also, unlike the previous example, both the MyMapper class and the MyReducer class extend the MapReduceBase class, which is common in earlier versions of Hadoop. You can compile and run this program exactly as you did the previous one.
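To illustrate the character-counting logic outside the Hadoop API, here is a hypothetical plain-Java sketch (the class and method names are invented for this illustration):

```java
import java.util.Arrays;
import java.util.List;

public class CharCountSketch {
    // Map phase: count the characters in one document, excluding whitespace
    static long countChars(String document) {
        return document.replaceAll("\\s", "").length();
    }

    public static void main(String[] args) {
        List<String> documents = Arrays.asList("big data", "hadoop");
        // Reduce phase: sum the per-document character counts
        long total = documents.stream().mapToLong(CharCountSketch::countChars).sum();
        System.out.println(total); // "bigdata" (7) + "hadoop" (6) = 13
    }
}
```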
Let's look at another Java Hadoop program. Assume, for the purpose of a simple example, that a text file named marks.txt contains student marks for four different subjects, as shown in the following excerpt. In each line the data are tab-separated: the first item is the student's given name, and the second through fifth items are the student's marks in the respective subjects.
Tony 56 76 83 42
William 33 91 82 73
Alan 76 39 65 89
Tom 51 68 77 52
John 88 54 94 98
Example 11.3 shows another Java Hadoop program, which reads the student marks from the file and computes the total marks for each student. Again, there are three classes. The MyMapper class parses each line and emits the student's name together with the marks; the MyReducer class sums the marks for each student; and the main class is much the same as before.
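The per-student totals can also be sketched in plain Java, without the Hadoop API. This MarksSketch class is a hypothetical illustration of the same logic, not the chapter's Example 11.3:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MarksSketch {
    // Sum the tab-separated marks on one line; the first field is the student's name
    static long totalMarks(String line) {
        String[] fields = line.split("\t");
        long total = 0;
        for (int i = 1; i < fields.length; i++) {
            total += Long.parseLong(fields[i]);
        }
        return total;
    }

    public static void main(String[] args) {
        // Two sample lines in the same format as marks.txt
        String[] lines = {
            "Tony\t56\t76\t83\t42",
            "William\t33\t91\t82\t73"
        };
        Map<String, Long> totals = new LinkedHashMap<>();
        for (String line : lines) {
            totals.put(line.split("\t")[0], totalMarks(line));
        }
        System.out.println(totals); // {Tony=257, William=279}
    }
}
```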
Type in the following commands to compile and to run the Java program:
$javac -cp $(hadoop classpath) MyMarker.java
$jar -cvf Test.jar My*.class
$hdfs dfs -copyFromLocal ~/marks.txt /marks.txt
$hadoop jar Test.jar MyMarker /marks.txt /marks-out.txt
$hdfs dfs -copyToLocal /marks-out.txt ~/
$cat ~/marks-out.txt/part-r-00000
Finally, type in the following command to delete the /marks-out.txt
directory:
$hdfs dfs -rm -r /marks-out.txt
This chapter introduced the concept of big data, the sources of big data, the three Vs of big data, and the benefits of big data. It also introduced Hadoop, the open source software for big data, and showed how to download, set up, and use Hadoop software on Raspberry Pi.
Q11.1. | What is big data? |
Q11.2. | Explain the terms kilobyte, megabyte, gigabyte, terabyte, petabyte, exabyte, and zettabyte. |
Q11.3. | What are the sources of big data? |
Q11.4. | What are the three Vs of big data? |
Q11.5. | What is Hadoop? |
Q11.6. | What are the key components of Hadoop? |
Q11.7. | What are HDFS, MapReduce, Hadoop Common, and Hadoop YARN? |
Q11.8. | What are the five Hadoop services? |
Q11.9. | How do you set up SSH for Hadoop? |
Q11.10. | How do you start and stop Hadoop? |