Apache Spark is a cluster computing platform for processing large distributed datasets. Data processing in Spark is both fast and easy, thanks to its optimized parallel computation engine and its flexible, unified API. The core abstraction in Spark is the Resilient Distributed Dataset (RDD). By extending the MapReduce model, Spark's core API makes analytics jobs easier to write. On top of the core API, Spark offers an integrated set of high-level libraries for specialized tasks such as graph processing or machine learning. In particular, GraphX is the library for graph-parallel processing in Spark.
This chapter will introduce you to Spark and GraphX by building a social network and exploring the links between people in the network. In addition, you will learn to use the Scala Build Tool (SBT) to build and run a Spark program. By the end of this chapter, you will know how to:
In the following section, we will go through the Spark installation process in detail. Spark is written in Scala and runs on the Java Virtual Machine (JVM). Before installing Spark, you should first have the Java Development Kit 7 (JDK) installed on your computer.
Make sure you install JDK instead of Java Runtime Environment (JRE). You can download it from http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html.
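A quick way to check which of the two you have is the following sketch: the JDK ships the javac compiler, while a JRE does not.

```shell
# A JDK ships the javac compiler; a JRE alone does not.
# This reports which one is currently on your PATH.
if command -v javac >/dev/null 2>&1; then
  echo "JDK detected"
else
  echo "JDK not found (a JRE alone is not enough)"
fi
```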
Next, download the latest release of Spark from the project website https://spark.apache.org/downloads.html. Perform the following three steps to get Spark installed on your computer:

1. Download the prebuilt package spark-1.4.1-bin-hadoop2.6.tgz and place it into a directory on your computer.
2. Unpack the archive and rename the extracted folder to spark-1.4.1.
3. Change into the Spark home folder, and then list the installed files and subdirectories:

```shell
tar -xf spark-1.4.1-bin-hadoop2.6.tgz
mv spark-1.4.1-bin-hadoop2.6 spark-1.4.1
cd spark-1.4.1
ls
```
That's it! You now have Spark and its libraries installed on your computer. Note the following files and directories in the spark-1.4.1 home folder:

- core: This directory contains the source code for the core components and API of Spark
- bin: This directory contains the executable files that are used to submit and deploy Spark applications, or to interact with Spark in a Spark shell
- graphx, mllib, sql, and streaming: These are Spark libraries that provide a unified interface to do different types of data processing, namely graph processing, machine learning, queries, and stream processing
- examples: This directory contains demos and examples of Spark applications

It is often convenient to create shortcuts to the Spark home folder and the Spark examples folder. In Linux or Mac, open or create the ~/.bash_profile file in your home folder and insert the following lines:
```shell
export SPARKHOME="/[Where you put Spark]/spark-1.4.1/"
export SPARKSCALAEX="$SPARKHOME/examples/src/main/scala/org/apache/spark/examples/"
```
Then, execute the following command for the previous shortcuts to take effect:

```shell
source ~/.bash_profile
```
As a result, you can quickly access these folders in the terminal or Spark shell. For example, the example named LiveJournalPageRank.scala can be accessed with:

```shell
$SPARKSCALAEX/graphx/LiveJournalPageRank.scala
```
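To make the expansion concrete, here is a self-contained sketch; the install prefix /opt/spark-1.4.1 is an assumption, so substitute wherever you actually placed Spark:

```shell
# Illustration only: /opt/spark-1.4.1 is an assumed install location.
export SPARKHOME="/opt/spark-1.4.1"
export SPARKSCALAEX="$SPARKHOME/examples/src/main/scala/org/apache/spark/examples"

# The shortcut expands to the full path of the example source file:
echo "$SPARKSCALAEX/graphx/LiveJournalPageRank.scala"
# prints /opt/spark-1.4.1/examples/src/main/scala/org/apache/spark/examples/graphx/LiveJournalPageRank.scala
```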