CHAPTER 11
Java Programming for Big Data Applications

Data is the new oil.

—Clive Humby

11.1 What Is Big Data?

Data is important in our lives, and we use data all the time. For example, experimental data saved in files, student or employee records saved in a database, sales figures saved in a spreadsheet, as well as common Word, Excel, and PowerPoint files, sound files, and movie files are all traditional data files, which can be stored, analyzed, and displayed using a standard personal computer. The sizes of traditional data files range from kilobytes (KB, 2¹⁰ bytes) and megabytes (MB, 2²⁰ bytes) to gigabytes (GB, 2³⁰ bytes), and sometimes terabytes (TB, 2⁴⁰ bytes). However, with the rapid expansion of the Internet and the increase in mobile users, data can now be measured in sizes larger than a petabyte (PB, 2⁵⁰ bytes). Such data stores are called big data. Big data is too large and too complicated to be stored and analyzed using traditional computer hardware and software.

According to Wikipedia, the world's technological per-capita data capacity has doubled roughly every 40 months since the 1980s. As of 2012, about 2.5 exabytes (2.5×10¹⁸ bytes) of data were generated every day. An International Data Corporation (IDC) report predicted that the volume of global data would grow exponentially from 4.4 zettabytes (4.4×10²¹ bytes) to 44 zettabytes between 2013 and 2020, and that there will be 163 zettabytes of data by 2025, as illustrated in Figure 11.1.

Graph of global data volume (vertical axis) against year (horizontal axis), with the curve rising steeply after 2015.

Figure 11.1: The global data volume in zettabytes according to an IDC report prediction

Plotted according to the data from https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf

Java, as a modern high-level programming language, is well suited for big data applications. Java is the language of Hadoop, the most popular big data software framework, and many of Hadoop's key modules, such as MapReduce, run on Java Virtual Machines (JVMs). Java can run on many devices and platforms. Java provides built-in data structures, such as stacks, that make data easy to manipulate and recover quickly. Java has automatic garbage collection and memory management, rich networking functionality, and strong security features, making it a secure programming language.

11.2 Sources of Big Data

Several sources are generating big data today.

  • Social Media Web Sites Web sites such as Facebook, Twitter, Snapchat, WhatsApp, YouTube, Google, Yahoo, and others generate massive amounts of data each day, from their billions of users around the world.
  • E-commerce Web Sites Online shopping web sites such as Amazon, eBay, Alibaba, and the like also generate tons of data, which can be analyzed to understand consumer shopping habits and sales predictions.
  • Telecommunication Companies Telecommunication giants such as Verizon, AT&T, China Mobile, Nippon, EE, Vodafone, and Telefonica also generate huge amounts of data by storing communication records and customer information.
  • Stock Market Stock markets around the world generate huge amounts of data by storing daily transactions.
  • Internet of Things (IoT) With billions and billions of devices connected to it, the IoT also generates an enormous amount of data every day from sensor-enabled devices.
  • Hospitals Hospitals generate massive amounts of data from patient records.
  • Banks Banks generate massive amounts of data from the transactions of their customers.
  • Weather Stations Weather stations generate a massive amount of data from satellite images for weather forecasting.
  • Government Departments Government departments also hold personal information about millions and millions of citizens.

11.3 The Three Vs of Big Data

The three defining characteristics of Big Data are its volume, velocity, and variety, sometimes described as “the three Vs of Big Data.” Here's a quick look at each:

  • Volume The volume of big data is massive, typically ranging from tens of terabytes to hundreds of petabytes. For example, Facebook has 2 billion users, WeChat has 1 billion users, YouTube has 1 billion users, WhatsApp has 1 billion users, Instagram has 1 billion users, Alibaba has 600 million users, Twitter has 300 million users, and Snapchat has 180 million users. These users generate billions of images, posts, videos, tweets, and so on, every single day.
  • Velocity The speed at which big data is generated is also breathtaking. Facebook, for example, sees about 317,000 status updates, 147,000 photo uploads, 54,000 shared links, and 400 new sign-ups every 60 seconds. Big data is also growing at an increasing speed; according to the IDC report mentioned earlier, the amount of data is estimated to double every two years.
  • Variety Big data comes in many forms. It includes structured data, such as the tables stored in relational databases and spreadsheets; semistructured data, such as XML documents; and unstructured data, such as plain text, pictures, videos, handwritten notes, drawings, voice recordings, and measurement data. Most of these forms of data require additional preprocessing to derive meaning and to support metadata.

11.4 Benefits of Big Data

Big data is really one of the most important emerging digital technologies in the modern world. Big data analysis can have many benefits. For example, by analyzing the information kept in a social network like Facebook, you can better understand users' responses to product advertisements and to social, economic, and political issues. By analyzing the information on e-commerce web sites, product companies and retail organizations can better plan their production according to the preferences and product perceptions of their consumers. By analyzing the previous medical history of millions of patients, doctors can make easier and earlier diagnoses, and hospitals can provide better and quicker service.

11.5 What Is Hadoop?

Hadoop is an open-source software framework developed by the Apache Software Foundation in the mid-2000s for the purpose of working with big data. Hadoop allows users to store and process large datasets in a distributed environment across clusters of computers using simple programming models. Hadoop is designed to scale up from a single computer to thousands of computers, each having its own local computation and storage. A Hadoop cluster consists of a master node and a number of slave nodes. Hadoop can handle various forms of structured and unstructured data, which makes big data analysis more flexible. Hadoop uses a namesake distributed file system that's designed to provide rapid data access across the nodes in a cluster. Hadoop also has fault-tolerant capabilities so that applications can continue to run even if individual nodes fail. Thanks to these advantages, Hadoop has become a key data management platform for big data analytics, such as predictive analytics, data mining, and machine learning applications. Hadoop is mostly written in Java, with some native code in C and command-line utilities written as shell scripts.

To date, financial companies have used Hadoop to build applications for assessing risk, building investment models, and creating trading algorithms. Retailers use Hadoop for analyzing structured and unstructured data to better understand and serve their customers. Telecommunications companies use Hadoop-powered analytics for predictive maintenance on their infrastructure, as well as for supporting customer-facing operations. By analyzing customer behavior and billing statements, they can offer new services to existing customers. Hadoop has been used extensively in many other applications and is a de facto standard in big data.

11.6 Key Components of Hadoop

Hadoop consists of several key components: Hadoop Distributed File System (HDFS), MapReduce, Hadoop Common, and Hadoop Yet Another Resource Negotiator (YARN).

11.6.1 HDFS

Hadoop Distributed File System (HDFS) provides data storage. It is designed to run on low-cost hardware with high fault tolerance. In HDFS, files are split into blocks that are replicated across DataNodes. By default, each block is 128 MB in size (64 MB in the older Hadoop 1.x releases) and is replicated to three DataNodes in the cluster. HDFS uses TCP/IP sockets for communication, and nodes in the cluster use remote procedure calls (RPC) to communicate with one another.

HDFS has five services.

  • NameNode (master service)
  • Secondary NameNode (master service)
  • JobTracker (master service)
  • DataNode (slave service)
  • TaskTracker (slave service)

The first three services are master services, and the last two are slave services. Master services run on the master node, and slave services run on both the master node and the slave nodes. (The JobTracker and TaskTracker come from the classic MapReduce version 1 architecture; in Hadoop 2 and later, their roles are largely taken over by YARN's ResourceManager and NodeManager, which is why those names appear in the jps output later in this chapter.)

  • The NameNode is the master, and the DataNodes are its corresponding slaves. A Hadoop cluster has a single NameNode and a cluster of DataNodes. The NameNode can track the files and manage the file system. The NameNode contains detailed information including the total number of blocks, the locations of specific blocks (to determine what parts of the data are stored in which node), where the replications are stored, and the like.
  • DataNodes store data as blocks for the clients to read and write. DataNodes are slave daemons. DataNodes send a heartbeat message to the NameNode every three seconds to show they are alive. If the NameNode does not receive a heartbeat from a DataNode for two minutes, it will assume that DataNode is dead.
  • The Secondary NameNode, also known as the checkpoint node, is a helper node for the NameNode. It periodically updates the file system metadata by fetching relevant files (such as edits and fsimage) from the NameNode and merges them.
  • JobTracker receives the requests for MapReduce execution from the client. JobTracker communicates with the NameNode to know about the location of the data. The NameNode provides the metadata to the JobTracker.
  • The TaskTracker takes the task and the code from the JobTracker. The TaskTracker will apply the code to the file, a process known as mapping.
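
Beyond the command-line tools used later in this chapter, Java programs can talk to HDFS directly through the org.apache.hadoop.fs API. The following minimal sketch, which assumes the fs.defaultFS address hdfs://node1:9000 configured later in this chapter and an illustrative file name /hello.txt, writes a small file to HDFS and reads it back.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the NameNode (assumed address; see core-site.xml later).
        conf.set("fs.defaultFS", "hdfs://node1:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/hello.txt");            // illustrative file name
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("Hello, HDFS!\n");          // NameNode records the metadata,
        }                                              // DataNodes store the blocks

        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);  // print the file contents
        }
        fs.close();
    }
}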

11.6.2 MapReduce

MapReduce provides data processing. It is a software framework written in Java that can be used to create applications for processing large amounts of data. Like HDFS, MapReduce is built to be fault tolerant and to work in large-scale cluster environments. MapReduce splits the input data into smaller chunks that map tasks can process in parallel. The output of the map tasks is then passed to reduce tasks, and the results are saved to HDFS. For example, suppose you want to create a word count, or concordance, that is, to count the number of times each word is used across a set of documents. MapReduce divides this job into two phases: the map phase and the reduce phase. The map phase counts the words in each document, and the reduce phase then aggregates the per-document counts into word counts spanning the entire collection.

MapReduce is Hadoop's native batch processing engine. It involves the following basic steps; a small plain-Java illustration of the idea follows the list:

  1. Reading data from HDFS.
  2. Dividing the data into small chunks and distributing them among the nodes.
  3. Applying the computation on each node.
  4. Saving the intermediate results to HDFS.
  5. Redistributing the intermediate results to group by key.
  6. Reducing the values for each key by summarizing and combining the results from each node.
  7. Saving the final results to HDFS.
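
To make the map and reduce phases concrete before turning to the Hadoop API in Section 11.8, here is a small plain-Java illustration of the word-count idea, with no Hadoop involved: the map step turns each document into (word, 1) pairs, and the reduce step groups the pairs by word and sums the counts. The class name and sample documents are made up for illustration.

import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountConcept {
    // "Map" step: emit a (word, 1) pair for every word in one document.
    static List<SimpleEntry<String, Integer>> map(String document) {
        List<SimpleEntry<String, Integer>> pairs = new ArrayList<>();
        for (String word : document.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // "Reduce" step: group the pairs by word and sum the counts per word.
    static Map<String, Integer> reduce(List<SimpleEntry<String, Integer>> pairs) {
        Map<String, Integer> totals = new HashMap<>();
        for (SimpleEntry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        String[] documents = { "big data is big", "data is the new oil" };
        List<SimpleEntry<String, Integer>> allPairs = new ArrayList<>();
        for (String doc : documents) {
            allPairs.addAll(map(doc));          // map phase: one call per document
        }
        System.out.println(reduce(allPairs));   // reduce phase: combined word counts
    }
}

In Hadoop, the same two steps run as distributed map and reduce tasks, with the framework handling the grouping by key between the phases.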

11.6.3 Hadoop Common

Hadoop Common is a set of shared Java utilities and libraries, which are required by other Hadoop modules. These Java libraries provide file system and OS-level abstractions and contain the necessary Java files and scripts required to start Hadoop.

11.6.4 Hadoop YARN

Hadoop YARN is a framework for job scheduling and cluster resource management. YARN makes it possible to run much more diverse workloads on a Hadoop cluster.

11.6.5 Overview of a Hadoop Cluster

Figure 11.2 shows the schematic diagram of a multinode Hadoop cluster. A typical small Hadoop cluster includes a single master node and multiple slave (worker) nodes. The master node consists of a JobTracker, a TaskTracker, a NameNode, and a DataNode. A slave (worker) node consists of both a DataNode and a TaskTracker.


Figure 11.2: A multinode Hadoop cluster

Adapted from https://en.wikipedia.org/wiki/Apache_Hadoop

11.7 Implementing Hadoop on a Raspberry Pi Cluster

Hadoop runs only on Linux-like operating systems, and it requires Java Runtime Environment (JRE) 1.6 or higher. It also requires that Secure Shell (SSH) be set up between the nodes in the cluster for the standard Hadoop startup and shutdown scripts. In this chapter's hands-on example, we are going to set up Hadoop on a two-node Raspberry Pi 3 cluster, as illustrated in Figure 11.3 and summarized in Table 11.1. The Raspberry Pi provides an excellent low-cost platform for building your own Hadoop cluster for learning and practice. Even though the data this system will handle won't be very “big,” it will demonstrate the structure of a cluster.

Schematic diagram of the NameNode (node1) and DataNode (node2) connected through a broadband router to a computer, with their IP addresses shown.

Figure 11.3: Hadoop on a two-node Raspberry Pi cluster

Table 11.1: The Hadoop Configuration on the Two Raspberry Pi Nodes

NAME     IP ADDRESS       HADOOP SERVICES
Node1    192.168.1.139    NameNode, Secondary NameNode, JobTracker, DataNode, TaskTracker
Node2    192.168.1.79     DataNode, TaskTracker

11.7.1 Raspberry Pi Installation and Configuration

Please refer to Chapter 7 section 7.6 for Raspberry Pi installation and configuration.

Once the Raspberry Pi is up and running, open a Raspberry Pi terminal window; the rest of the procedure will be done through the terminal window.

Type sudo nano /etc/hostname in the terminal window, and change its content to the following:

node1

This will change the hostname of the Raspberry Pi to node1. Here, nano is a Linux text editor, and sudo instructs the system to modify the file as a superuser. You can also use other Linux text editors, such as vi, vim, pico, emacs, or Sublime Text.

Type sudo nano /etc/hosts in the terminal window, and append the following to its content. This will set the IP address of node1.

192.168.1.139           node1

Java should come preinstalled with Raspbian; you can double-check this by typing java -version in a Raspberry Pi terminal window.

java -version
  
java version "1.8.0_65"
Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
Java HotSpot(TM) Client VM (build 25.65-b01, mixed mode)

Now restart the Raspberry Pi by typing the following command:

sudo reboot

11.7.2 Hadoop Installation and Configuration

In this section, you'll take the following steps to install and configure Hadoop on your cluster:

  1. Prepare the Hadoop user account and group.
  2. Configure SSH.
  3. Download and install Hadoop.
  4. Configure the environment variables.
  5. Configure Hadoop.
  6. Start and stop Hadoop services.
  7. Test Hadoop.
  8. View Hadoop in a web browser.

Prepare the Hadoop User Account and Group

To install Hadoop, you need to prepare a Hadoop user account and a Hadoop group. From a Raspberry Pi terminal window, type the following commands to create a Hadoop group called hadoop, add a user called hduser to the group, and make hduser a superuser (administrator). The second command will prompt you for a password and other information for hduser; provide a password and accept the default values for the rest of the information.

$sudo addgroup hadoop
$sudo adduser --ingroup hadoop hduser
$sudo adduser hduser sudo

Configure SSH

Type the following commands to create a pair of SSH RSA keys with a blank password so that Hadoop nodes will be able to talk with each other without prompting for a password:

$su hduser
$mkdir ~/.ssh
$ssh-keygen -t rsa -P ""
$cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys
$exit

Type the following commands to verify that hduser can log in to SSH:

$su hduser
$ssh localhost
$exit

Download and Install Hadoop

Type the following commands to download and install Hadoop version 2.9.2:

$cd ~/
$wget http://mirror.vorboss.net/apache/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz
$sudo mkdir /opt
$sudo tar -xvzf hadoop-2.9.2.tar.gz -C /opt/
$cd /opt
$sudo mv hadoop-2.9.2 hadoop
$sudo chown -R hduser:hadoop hadoop

Configure Environment Variables

Type sudo nano /etc/bash.bashrc to add the following lines at the end of the bashrc file:

export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_HOME=/opt/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_INSTALL/bin

Then type source /etc/bash.bashrc (or simply log out and log back in) to apply the changes. Now switch to hduser to verify that the Hadoop executable is accessible outside the /opt/hadoop/bin folder:

$su hduser
$hadoop version
 
hduser@node1 /home/hduser $ hadoop version
Hadoop 2.9.2
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 826afbeae31ca687bc2f8471dc841b66ed2c6704
Compiled by ajisaka on 2018-11-13T12:42Z
Compiled with protoc 2.5.0
From source with checksum 3a9939967262218aa556c684d107985
This command was run using /opt/hadoop/share/hadoop/common/hadoop-common-2.9.2.jar

Type sudo nano /opt/hadoop/etc/hadoop/hadoop-env.sh, and then append the following lines:

export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_HEAPSIZE=250

Configure Hadoop

Now move to the /opt/hadoop/etc/hadoop/ directory and use a text editor to edit the following configuration files as a superuser:

core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://node1:9000</value>
  </property>
</configuration>

mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/opt/hadoop/hadoop_data/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/opt/hadoop/hadoop_data/hdfs/datanode</value>
  </property>
</configuration>

yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

Type the following commands to create and format an HDFS file system:

$sudo mkdir -p /opt/hadoop/hadoop_data/hdfs/namenode
$sudo mkdir -p /opt/hadoop/hadoop_data/hdfs/datanode
$sudo chown hduser:hadoop /opt/hadoop/hadoop_data/hdfs -R
$sudo chmod 750 /opt/hadoop/hadoop_data/hdfs
 
$cd $HADOOP_INSTALL
$hdfs namenode -format

Start and Stop Hadoop Services

Type the following commands to switch to hduser and start Hadoop services. The command jps will display all the services running:

$su hduser
$cd $HADOOP_HOME/sbin
$./start-dfs.sh
$./start-yarn.sh
$jps

2082 NameNode
2578 ResourceManager
2724 Jps
2344 SecondaryNameNode
2683 NodeManager
2189 DataNode

$./stop-dfs.sh
$./stop-yarn.sh

Test Hadoop

Type in the following commands to run one of the example programs provided with Hadoop, named pi, which estimates the value of pi:

$cd $HADOOP_INSTALL/bin
$./hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar pi 16 1000

Let's run another example provided with Hadoop, named wordcount.

Enter the following command to copy the LICENSE.txt file from the /opt/hadoop/ folder on the local file system to the Hadoop distributed file system as license.txt in the root folder; that is, the / folder.

$hdfs dfs -copyFromLocal /opt/hadoop/LICENSE.txt /license.txt

Enter the following command to list the contents of the / folder of the Hadoop distributed file system.

$hdfs dfs -ls /

Enter the following commands to run the wordcount example and save the result into a folder called license-out.txt.

$cd /opt/hadoop/bin
$./hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar wordcount /license.txt /license-out.txt

Figure 11.4 shows the output from running wordcount on the file license.txt. Because the output is long, only the beginning (top) and the end (bottom) of the output are shown.


Figure 11.4: The output from running the wordcount example on the file license.txt

Type in the following commands to copy the license-out.txt output folder from the Hadoop distributed file system to the local file system, and open license-out.txt/part-r-00000 with nano.

$hdfs dfs -copyToLocal /license-out.txt ~/
$nano ~/license-out.txt/part-r-00000

Figure 11.5 shows the previous commands and all the word count results with nano.


Figure 11.5: The commands to copy the license-out.txt file to local file system and to open license-out.txt/part-r-00000 with nano (top) and result of opening the license-out.txt/part-r-00000 file with the nano text editor (bottom)

If you don't want to open the license-out.txt/part-r-00000 file in an editor, you can also use the following command to simply display its content.

$cat ~/license-out.txt/part-r-00000

The following command shows how to delete a directory that is not empty:

$hdfs dfs -rm -r /license-out.txt

View Hadoop in a Web Browser

You can view Hadoop and its application cluster with a web browser, as shown in Figure 11.6.


Figure 11.6: The web view of Hadoop software (top) and the application cluster (bottom) with a web browser

http://node1:50070 (the HDFS NameNode web interface)

http://node1:8088 (the YARN ResourceManager web interface)

To add the second node, node2, to the Hadoop cluster, get another Raspberry Pi, ideally an identical one. You can then simply clone node1's SD card for node2 by using Win32 Disk Imager, as shown in Figure 11.7.

Screen capture of the Win32 Disk Imager window with Raspberry Pi SD Card image file selected.

Figure 11.7: Cloning the Raspberry Pi SD Card using Win32 Disk Imager

https://sourceforge.net/projects/win32diskimager/files/latest/download

Once node2 is up and running, remember to edit the /etc/hosts file to include the IP address of node2, as shown next. Change the IP addresses to match your own network. You will also need to reconfigure SSH for node2.

$sudo nano /etc/hosts

192.168.1.139 node1
192.168.1.76 node2

Finally, type in the following commands to delete HDFS storage and add permissions on node1:

$sudo rm -rf /opt/hadoop/hadoop_data
$sudo mkdir -p /opt/hadoop/hadoop_data/hdfs/namenode
$sudo mkdir -p /opt/hadoop/hadoop_data/hdfs/datanode
$sudo chown hduser:hadoop /opt/hadoop/hadoop_data/hdfs -R
$sudo chmod 750 /opt/hadoop/hadoop_data/hdfs

Type in similar commands on node2:

$sudo rm -rf /opt/hadoop/hadoop_data
$sudo mkdir -p /opt/hadoop/hadoop_data/hdfs/datanode
$sudo chown hduser:hadoop /opt/hadoop/hadoop_data/hdfs -R
$sudo chmod 750 /opt/hadoop/hadoop_data/hdfs

See the following resources for more information about Hadoop:

http://hadoop.apache.org/docs/stable/

https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html

https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html

https://www.javatpoint.com/hadoop-tutorial

https://howtodoinjava.com/hadoop/hadoop-big-data-tutorial/

https://www.guru99.com/bigdata-tutorials.html

11.8 Java Hadoop Example

You can also write your own Java programs for Hadoop. Example 11.1 shows a simple Java program that performs word counting, as described earlier. It has three classes: the MyMapper class represents the Map phase of Hadoop MapReduce and implements the map() method, which counts the words in each document; the MyReducer class represents the Reduce phase and implements the reduce() method, which aggregates the per-document counts into word counts for all the documents; and the main class sets up the job with the MyMapper and MyReducer classes and writes out the final results.
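
Because the full listing of Example 11.1 is not reproduced here, the following is a minimal sketch of what such a word-count program can look like using the standard org.apache.hadoop.mapreduce API. The class names (MyWordCounter, MyMapper, MyReducer) follow the text, but the exact code in Example 11.1 may differ, for instance in whether the mapper and reducer are nested or top-level classes.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyWordCounter {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class MyMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "my word count");
        job.setJarByClass(MyWordCounter.class);
        job.setMapperClass(MyMapper.class);
        job.setCombinerClass(MyReducer.class);   // optional local aggregation on each mapper
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // e.g. /license.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // e.g. /license-out.txt
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The optional combiner simply runs the reducer locally on each map output, which cuts down the amount of data shuffled across the network.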

To compile the previous Java Hadoop program, you need to include the Hadoop class path into the compilation class path. From the Raspberry Pi Terminal window, type in the following command to display the content of the Hadoop class path:

$hadoop classpath
/opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/hdfs:/opt/hadoop/share/hadoop/hdfs/lib/*:/opt/hadoop/share/hadoop/hdfs/*:/opt/hadoop/share/hadoop/yarn:/opt/hadoop/share/hadoop/yarn/lib/*:/opt/hadoop/share/hadoop/yarn/*:/opt/hadoop/share/hadoop/mapreduce/lib/*:/opt/hadoop/share/hadoop/mapreduce/*:/usr/lib/jvm/jdk-8-oracle-arm32-vfp-hflt/jre//lib/tools.jar:/opt/hadoop/contrib/capacity-scheduler/*.jar

Type in the following command to compile the Java program, which includes the Hadoop class path:

$javac -cp $(hadoop classpath) MyWordCounter.java

Enter the following command to list all the class files that have been generated:

$ls -la My*.class
-rw-r--r-- 1 hduser hadoop 1682 Apr 10 04:45  MyMapper.class
-rw-r--r-- 1 hduser hadoop 1690 Apr 10 04:45  MyReducer.class
-rw-r--r-- 1 hduser hadoop 1728 Apr 10 04:45  MyWordCounter.class
-rw-r--r-- 1 hduser hadoop 1734 Apr 10 04:41 'MyWordCounter$MyMapper.class'
-rw-r--r-- 1 hduser hadoop 1743 Apr 10 04:41 'MyWordCounter$MyReducer.class'  

Then use the following command to include all the class files into a JAR file called Test.jar:

$jar -cvf Test.jar My*.class

Now enter the following command to run the Test.jar file on the /license.txt file and save the result in a folder named /license-out.txt, exactly as we did previously.

$hadoop jar Test.jar MyWordCounter /license.txt /license-out.txt

Similarly, type in the following commands to copy the license-out.txt output folder from the Hadoop distributed file system to the local file system, and open license-out.txt/part-r-00000 with nano:

$hdfs dfs -copyToLocal /license-out.txt ~/
$nano ~/license-out.txt/part-r-00000

Finally, type in the following command to delete the /license-out.txt directory:

$hdfs dfs -rm -r /license-out.txt

Example 11.2 shows a variation of the previous Java program. It counts the number of characters in documents. The main difference is in the MyMapper class, which counts characters rather than words; the MyReducer class and the main class are quite similar. Also, unlike the previous example, both the MyMapper class and the MyReducer class extend the MapReduceBase class, which was common in earlier versions of Hadoop (the org.apache.hadoop.mapred API). You can compile and run this program in exactly the same way as the previous one.
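
Example 11.2 is likewise not reproduced here. As a rough sketch under the assumptions just described, a character-counting mapper written against the older org.apache.hadoop.mapred API (extending MapReduceBase) might look like the following; the reducer and driver would be analogous to those in the word-count sketch.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Old-style (org.apache.hadoop.mapred) mapper that emits one count per character.
public class MyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text character = new Text();

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        // Emit (character, 1) for every character in the line; the reducer
        // then sums the counts per character, just as in the word count.
        for (char c : line.toCharArray()) {
            character.set(String.valueOf(c));
            output.collect(character, one);
        }
    }
}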

Let's look at another Java Hadoop program. Assume, for the purpose of a simple example, that a text file named marks.txt contains the marks of students in four different subjects, as shown in the following excerpt. In each line the data are tab-separated: the first item is the student's given name, and the second to fifth items are the student's marks in the respective subjects.

Tony         56   76   83   42
William      33   91   82   73
Alan         76   39   65   89
Tom          51   68   77   52
John         88   54   94   98

Example 11.3 shows another Java Hadoop program, which reads the student marks from the file and computes the total marks for each student. Again, there are three classes: the MyMapper class parses each line and emits the student's name with each of that student's marks, the MyReducer class sums the marks for each student, and the main class is similar to the previous examples.
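
Example 11.3 is not reproduced here either; the following is a hedged sketch of how such a mark-totaling program could be written, reusing the MyMarker class name from the commands below. The splitting on whitespace and the handling of malformed lines are illustrative assumptions rather than the book's exact code.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyMarker {

    // Map phase: split each line and emit (name, mark) pairs.
    public static class MyMapper extends Mapper<Object, Text, Text, IntWritable> {
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\\s+");
            if (fields.length < 2) return;              // skip malformed lines
            Text name = new Text(fields[0]);
            for (int i = 1; i < fields.length; i++) {
                context.write(name, new IntWritable(Integer.parseInt(fields[i])));
            }
        }
    }

    // Reduce phase: sum all marks for each student.
    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable v : values) total += v.get();
            context.write(key, new IntWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "total marks");
        job.setJarByClass(MyMarker.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // e.g. /marks.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // e.g. /marks-out.txt
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}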

Type in the following commands to compile and to run the Java program:

$javac -cp $(hadoop classpath) MyMarker.java
$jar -cvf Test.jar My*.class
$hdfs dfs -copyFromLocal ~/marks.txt /marks.txt
$hadoop jar Test.jar MyMarker /marks.txt /marks-out.txt
$hdfs dfs -copyToLocal /marks-out.txt ~/
$cat ~/marks-out.txt/part-r-00000

Finally, type in the following command to delete the /marks-out.txt directory:

$hdfs dfs -rm -r /marks-out.txt

11.9 Summary

This chapter introduced the concept of big data, the sources of big data, the three Vs of big data, and the benefits of big data. It also introduced Hadoop, the open-source software framework for big data; showed how to download, set up, and use Hadoop on a Raspberry Pi cluster; and demonstrated how to write your own Java MapReduce programs for Hadoop.

11.10 Chapter Review Questions

Q11.1. What is big data?
Q11.2. Explain the terms kilobyte, megabyte, gigabyte, terabyte, petabyte, exabyte, and zettabyte.
Q11.3. What are the sources of big data?
Q11.4. What are the three Vs of big data?
Q11.5. What is Hadoop?
Q11.6. What are the key components of Hadoop?
Q11.7. What are HDFS, MapReduce, Hadoop Common, and Hadoop YARN?
Q11.8. What are the five Hadoop services?
Q11.9. How do you set up SSH for Hadoop?
Q11.10. How do you start and stop Hadoop?