Setting up the Spark powered environment

In this section, we will learn to set up Spark:

  • Create a segregated development environment in a virtual machine running Ubuntu 14.04, so it does not interfere with any existing system.
  • Install Spark 1.5.2 with its dependencies, namely the Java 8 JDK.
  • Install the Anaconda Python 2.7 environment with all the required libraries such as Pandas, Scikit-Learn, Blaze, and Bokeh, and enable PySpark, so it can be accessed through IPython Notebooks.
  • Set up the backend or data stores of our environment. We will use MySQL as the relational database, MongoDB as the document store, and Cassandra as the columnar database.

Each storage backend serves a specific purpose depending on the nature of the data to be handled. The MySQL RDBMS is used for standard tabular, processed information that can be easily queried using SQL. As we will be processing a lot of JSON-type data from various APIs, the easiest way to store it is in a document store, hence MongoDB. For real-time and time-series-related information, Cassandra is best suited as a columnar database.
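
As a minimal sketch on Ubuntu 14.04, the relational and document stores can be pulled straight from the standard repositories; Cassandra is not in the stock repositories and is installed from the Apache Cassandra (or DataStax) package repository, which has to be added first (not shown here):

# install the relational and document stores from the stock Ubuntu 14.04 repositories
$ sudo apt-get install mysql-server
$ sudo apt-get install mongodb
# Cassandra requires the Apache Cassandra (or DataStax) repository to be added beforehand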

The following diagram gives a view of the environment we will build and use throughout the book:

[Figure: overview of the Spark-powered environment]

Setting up an Oracle VirtualBox with Ubuntu

Setting up a clean new VirtualBox environment on Ubuntu 14.04 is the safest way to create a development environment that does not conflict with existing libraries and can be later replicated in the cloud using a similar list of commands.

In order to set up an environment with Anaconda and Spark, we will create a VirtualBox virtual machine running Ubuntu 14.04.

Let's go through the steps of using VirtualBox with Ubuntu:

  1. Oracle VirtualBox VM is free and can be downloaded from https://www.virtualbox.org/wiki/Downloads. The installation is pretty straightforward.
  2. After installing VirtualBox, let's open the Oracle VM VirtualBox Manager and click the New button.
  3. We'll give the new VM a name, and select Type Linux and Version Ubuntu (64 bit).
  4. You need to download the ISO from the Ubuntu website and allocate sufficient RAM (4 GB recommended) and disk space (20 GB recommended). We will use the Ubuntu 14.04.1 LTS release, which is found here: http://www.ubuntu.com/download/desktop.
  5. Once the installation is complete, it is advisable to install the VirtualBox Guest Additions (from the VirtualBox menu, with the new VM running) by going to Devices | Insert Guest Additions CD image. Skipping the Guest Additions on a Windows host leaves you with a very limited user interface and reduced window sizes.
  6. Once the Guest Additions installation completes, reboot the VM, and it will be ready to use. It is helpful to enable the shared clipboard by selecting the VM, clicking Settings, and then going to General | Advanced | Shared Clipboard and selecting Bidirectional. A quick sanity check of the new VM follows this list.
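
Once the VM is up, a quick check from a terminal confirms the release and the resources allocated to it (a minimal sketch; the exact output depends on your VM settings):

# verify the Ubuntu release and the resources allocated to the VM
$ lsb_release -a
$ free -h
$ df -h /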

Installing Anaconda with Python 2.7

PySpark currently runs only on Python 2.7. (There are requests from the community to upgrade to Python 3.3.) To install Anaconda, follow these steps:

  1. Download the Anaconda Installer for Linux 64-bit Python 2.7 from http://continuum.io/downloads#all.
  2. After downloading the Anaconda installer, open a terminal and navigate to the directory or folder where the installer has been saved. From here, run the following command, replacing the 2.x.x in the command with the version number of the downloaded installer file:
    # install anaconda 2.x.x
    bash Anaconda-2.x.x-Linux-x86[_64].sh
    
  3. After accepting the license terms, you will be asked to specify the install location (which defaults to ~/anaconda).
  4. After the self-extraction is finished, add the anaconda binary directory to your PATH environment variable (a quick verification sketch follows):
    # add anaconda to PATH
    export PATH="$HOME/anaconda/bin:$PATH"
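
To confirm that Anaconda's interpreter is the one being picked up and that the required libraries (Pandas, Scikit-Learn, Blaze, and Bokeh) are available, a quick check looks like this (a minimal sketch using the standard conda tooling):

# check that anaconda's python comes first on the PATH and that the key libraries are present
$ which python
$ python --version
$ conda list | grep -iE 'pandas|scikit-learn|blaze|bokeh'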
    

Installing Java 8

Spark runs on the JVM and requires the JDK (short for Java Development Kit) rather than just the JRE (short for Java Runtime Environment), as we will build apps with Spark. The recommended version is Java 7 or higher. Java 8 is the most suitable, as it includes many of the functional programming features available in Scala and Python.

To install Java 8, follow these steps:

  1. Install Oracle Java 8 using the following commands:
    # install oracle java 8
    $ sudo apt-get install software-properties-common
    $ sudo add-apt-repository ppa:webupd8team/java
    $ sudo apt-get update
    $ sudo apt-get install oracle-java8-installer
    
  2. Set the JAVA_HOME environment variable and ensure that the java program is on your PATH (a sketch for making these settings persistent follows).
  3. Check that JAVA_HOME is properly set:
    # check JAVA_HOME
    $ echo $JAVA_HOME
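
To make these settings persistent across sessions, they can be appended to ~/.bashrc. The path below is an assumption based on the default location used by the oracle-java8-installer package; adjust it if your installation reports a different one:

# persist JAVA_HOME and PATH (the /usr/lib/jvm/java-8-oracle path is an assumption)
$ echo 'export JAVA_HOME=/usr/lib/jvm/java-8-oracle' >> ~/.bashrc
$ echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc
$ source ~/.bashrc
$ java -version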
    

Installing Spark

Head over to the Spark download page at http://spark.apache.org/downloads.html.

The Spark download page lets you download earlier versions of Spark as well as different package and download types. We will select the latest release, prebuilt for Hadoop 2.6 and later; using a prebuilt package is the easiest way to install Spark, rather than building it from source. Move the file to the directory ~/spark under your home directory.

Download the latest release of Spark—Spark 1.5.2, released on November 9, 2015:

  1. Select the Spark release 1.5.2 (Nov 09 2015).
  2. Choose the package type Prebuilt for Hadoop 2.6 and later.
  3. Choose the download type Direct Download.
  4. Download Spark: spark-1.5.2-bin-hadoop2.6.tgz.
  5. Verify this release using the 1.5.2 signatures and checksums (a checksum sketch follows the download command below).

This can also be accomplished by running:

# download spark
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.5.2-bin-hadoop2.6.tgz
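
To carry out step 5, you can compute the checksums of the downloaded archive locally and compare them with the values published on the Spark download page (a minimal sketch using standard coreutils):

# optionally verify the archive against the checksums published on the download page
$ md5sum spark-1.5.2-bin-hadoop2.6.tgz
$ sha512sum spark-1.5.2-bin-hadoop2.6.tgz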

Next, we'll extract the files and clean up:

# extract, clean up, move the unzipped files under the spark directory
$ tar -xf spark-1.5.2-bin-hadoop2.6.tgz
$ rm spark-1.5.2-bin-hadoop2.6.tgz
$ sudo mv spark-* spark

Now, we can run the Spark Python interpreter with:

# run spark
$ cd ~/spark
$ ./bin/pyspark

You should see something like this:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/
Using Python version 2.7.6 (default, Mar 22 2014 22:59:56)
SparkContext available as sc.
>>> 

The interpreter will have already provided us with a Spark context object, sc, which we can see by running:

>>> print(sc)
<pyspark.context.SparkContext object at 0x7f34b61c4e50>
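
As an additional sanity check of the installation, you can run one of the examples bundled with the prebuilt distribution; SparkPi estimates the value of pi and prints the result (a minimal sketch, assuming Spark was moved to ~/spark as above):

# run a bundled example to sanity-check the installation
$ ~/spark/bin/run-example SparkPi 10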

Enabling IPython Notebook

We will work with IPython Notebook for a friendlier user experience than the console.

You can launch IPython Notebook by using the following command:

$ IPYTHON_OPTS="notebook --pylab inline"  ./bin/pyspark

Launch PySpark with IPython Notebook from the directory examples/AN_Spark, where the Jupyter or IPython Notebooks are stored:

# cd to /home/an/spark/spark-1.5.2-bin-hadoop2.6/examples/AN_Spark
# launch command using python 2.7 and the spark-csv package:
$ IPYTHON_OPTS='notebook' /home/an/spark/spark-1.5.2-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.2.0

# launch command using python 3.4 and the spark-csv package:
$ IPYTHON_OPTS='notebook' PYSPARK_PYTHON=python3 \
  /home/an/spark/spark-1.5.2-bin-hadoop2.6/bin/pyspark --packages com.databricks:spark-csv_2.11:1.2.0