In this section, we will learn to set up Spark:
Each storage backend serves a specific purpose depending on the nature of the data to be handled. The MySQL RDBMS is used for standard tabular, processed information that can be easily queried using SQL. As we will be processing a lot of JSON-type data from various APIs, the easiest way to store it is in a document-oriented store. For real-time and time-series-related information, Cassandra is best suited as a columnar database.
The following diagram gives a view of the environment we will build and use throughout the book:
Setting up a clean new VirtualBox environment on Ubuntu 14.04 is the safest way to create a development environment that does not conflict with existing libraries and can be later replicated in the cloud using a similar list of commands.
In order to set up an environment with Anaconda and Spark, we will create a VirtualBox virtual machine running Ubuntu 14.04.
Let's go through the steps of using VirtualBox with Ubuntu:
PySpark currently runs only on Python 2.7. (There are requests from the community to upgrade to Python 3.3.) To install Anaconda, follow these steps:
Replace the 2.x.x in the command with the version number of the downloaded installer file:

# install anaconda 2.x.x
bash Anaconda-2.x.x-Linux-x86[_64].sh

Accept the license terms and confirm the installation location (it defaults to ~/anaconda). When prompted, let the installer prepend the Anaconda binary location to your PATH.
Spark runs on the JVM and requires the JDK (short for Java Development Kit) rather than just the JRE (short for Java Runtime Environment), as we will build apps with Spark. The recommended version is Java 7 or higher. Java 8 is the most suitable, as it brings to Java many of the functional programming techniques available in Scala and Python.
To install Java 8, follow these steps:
# install oracle java 8
$ sudo apt-get install software-properties-common
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer
Set the JAVA_HOME environment variable and ensure that the Java program is on your PATH. Check that JAVA_HOME is properly set:

# check that java_home is set
$ echo $JAVA_HOME
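If the variable comes back empty, it can be exported from ~/.bashrc. A minimal sketch follows; the path /usr/lib/jvm/java-8-oracle is an assumption based on where the webupd8team PPA installer usually places Java, so adjust it to your machine:

```shell
# append JAVA_HOME to the shell profile
# (the java path below is assumed from the webupd8team/java PPA layout)
echo 'export JAVA_HOME=/usr/lib/jvm/java-8-oracle' >> ~/.bashrc
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc
# reload the profile so the current shell picks up the change
source ~/.bashrc
```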
Head over to the Spark download page at http://spark.apache.org/downloads.html.
The Spark download page offers the possibility to download earlier versions of Spark and different package and download types. We will select the latest release, pre-built for Hadoop 2.6 and later; this is easier than building Spark from source. After downloading, move the file to the ~/spark directory under your home directory.
Download the latest release of Spark, Spark 1.5.2, released on November 9, 2015:
This can also be accomplished by running:
# download spark
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.5.2-bin-hadoop2.6.tgz
Next, we'll extract the files and clean up:
# extract, clean up, move the unzipped files under the spark directory
$ tar -xf spark-1.5.2-bin-hadoop2.6.tgz
$ rm spark-1.5.2-bin-hadoop2.6.tgz
$ sudo mv spark-* spark
Now, we can run the Spark Python interpreter with:
# run spark
$ cd ~/spark
$ ./bin/pyspark
You should see something like this:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Python version 2.7.6 (default, Mar 22 2014 22:59:56)
SparkContext available as sc.
>>>
The interpreter will have already provided us with a Spark context object, sc, which we can see by running:
>>> print(sc)
<pyspark.context.SparkContext object at 0x7f34b61c4e50>
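The sc object is the entry point for all Spark operations. As a quick sanity check, typed at the >>> prompt, we can distribute a small list across the local "cluster" and compute over it; the session below is a sketch of what local mode returns:

```
>>> rdd = sc.parallelize(range(10))
>>> rdd.map(lambda x: x * x).sum()
285
```

The parallelize call turns a Python list into an RDD, map applies the squaring function to each partition in parallel, and sum triggers the actual computation.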
We will work with IPython Notebook for a friendlier user experience than the console.
You can launch IPython Notebook by using the following command:
$ IPYTHON_OPTS="notebook --pylab inline" ./bin/pyspark
Launch PySpark with IPython Notebook from the examples/AN_Spark directory, where the Jupyter or IPython Notebooks are stored:
# cd to /home/an/spark/spark-1.5.0-bin-hadoop2.6/examples/AN_Spark
# launch command using python 2.7 and the spark-csv package:
$ IPYTHON_OPTS='notebook' /home/an/spark/spark-1.5.0-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.2.0
# launch command using python 3.4 and the spark-csv package:
$ IPYTHON_OPTS='notebook' PYSPARK_PYTHON=python3 /home/an/spark/spark-1.5.0-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.2.0
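Once the notebook is running with the spark-csv package on the classpath, a CSV file can be loaded into a DataFrame through the sqlContext object that the shell provides. A minimal sketch of a notebook session follows; the file name data.csv and its header row are assumptions for illustration:

```
>>> df = sqlContext.read.format('com.databricks.spark.csv') \
...                 .options(header='true', inferSchema='true') \
...                 .load('data.csv')
>>> df.printSchema()
```

The header option tells spark-csv to use the first row as column names, and inferSchema asks it to guess column types instead of treating everything as strings.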