Installing and getting started with Spark

Spark is considered Apache Hadoop's successor; therefore, it is best to install and run Spark on a Linux-based system, although you can also try it on Windows and Mac OS. You can also configure your Eclipse environment to work with Spark as a Maven project on any OS and bundle your application as a JAR file with all its dependencies. Secondly, you can run an application from the Spark shell (more specifically, the Scala shell) in much the same way as you would work interactively in SQL or R.

The third way is from the command line (Windows) or Terminal (Linux/Mac OS). First, you write your ML application in Scala or Java and package it as a JAR file with the required dependencies. Then the JAR file can be submitted to a cluster to run as a Spark job.
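
For example, once the JAR is ready, it can be handed to Spark with the spark-submit script that ships with the Spark distribution (installed in Step 3 below). This is only a minimal sketch: the class name, master URL, and JAR path are hypothetical placeholders for your own application:

$ spark-submit --class com.example.SparkMLApp \
    --master local[4] \
    /path/to/your-ml-app.jar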

We will show how to develop and deploy a Spark ML application in these three ways. However, the very first prerequisite is to prepare your Spark application development environment. You can install and configure Spark on a number of operating systems, including:

  • Windows (XP/7/8/10)
  • Mac OS X (10.4.7+)
  • Linux distribution (including Debian, Ubuntu, Fedora, RHEL, CentOS, and so on)

Note

Please check the Spark website at https://spark.apache.org/documentation.html for Spark version and OS-related documentation. The following steps show you how to install and configure Spark on Ubuntu 14.04 (64-bit). Please note that Spark 2.0.0 runs on Java 7+, Python 2.6+/3.4+, and R 3.1+. For the Scala API, Spark 2.0.0 uses Scala 2.11. Therefore, you will need to use a compatible Scala version (2.11.x).

Step 1: Java installation

Java installation should be considered one of the mandatory requirements for installing Spark, since both the Java- and Scala-based APIs require a Java Virtual Machine to be installed on the system. Try the following command to verify the Java version:

$ java -version 

If Java is already installed on your system, you should see the following message:

java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)

If you do not have Java installed on your system, make sure you install it before proceeding to the next step. Please note that to take advantage of lambda expression support, it is recommended that you install Java 8 (preferably both the JDK and the JRE), although Java 7 is sufficient for Spark 1.6.2 and earlier releases:

$ sudo apt-add-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer

After installing, don't forget to set JAVA_HOME. Just apply the following commands (we assume Java is installed at /usr/lib/jvm/java-8-oracle):

$ echo "export JAVA_HOME=/usr/lib/jvm/java-8-oracle" >> ~/.bashrc  
$ echo "export PATH=$PATH:$JAVA_HOME/bin" >> ~/.bashrc

You can also add these environment variables manually to the .bashrc file located in your home directory. If you cannot find the file, it is probably hidden; in your file manager, go to the View tab and enable Show hidden files.

Step 2: Scala installation

Spark itself is written in Scala; therefore, you should have Scala installed on your system. Checking this is straightforward using the following command:

$ scala -version

If Scala is already installed on your system, you should get the following message on the terminal:

Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL

Note that at the time of writing, we used the latest version of Scala, that is, 2.11.8. If you do not have Scala installed on your system, make sure you install it before proceeding to the next step. You can download the latest version of Scala from the Scala website at http://www.scala-lang.org/download/. After the download has finished, you should find the Scala tar file in the download folder:

  1. Extract the Scala tar file from its location, or type the following command in the terminal:
        $ tar -xvzf scala-2.11.8.tgz 
    
  2. Now move the Scala distribution to an appropriate location (for example, /usr/local/scala) using the following commands, or do it manually:
        $ cd ~/Downloads/ 
        $ sudo mkdir -p /usr/local/scala 
        $ sudo mv scala-2.11.8 /usr/local/scala/ 
    
  3. Set the Scala home:
    $ echo "export SCALA_HOME=/usr/local/scala/scala-2.11.8" >> 
            ~/.bashrc  
    $ echo "export PATH=$PATH:$SCALA_HOME/bin" >> ~/.bashrc
    
  4. After installation has been completed, you should verify it using the following command:
        $ scala -version
    
  5. If Scala has successfully been configured on your system, you should get the following message on your terminal:
    Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL
    

Step 3: Installing Spark

Download the latest version of Spark from the Apache Spark website at https://spark.apache.org/downloads.html. For this installation step, we used the latest stable Spark release, version 2.0.0, pre-built for Hadoop 2.7 and later. After the download has finished, you will find the Spark tar file in the download folder:

  1. Extract the Spark tar file from its location, or type the following command in the terminal:
        $ tar -xvzf spark-2.0.0-bin-hadoop2.7.tgz  
    
  2. Now move the Spark distribution to an appropriate location (for example, /usr/local/spark) using the following commands, or do it manually:
        $ cd ~/Downloads/ 
        $ sudo mkdir -p /usr/local/spark 
        $ sudo mv spark-2.0.0-bin-hadoop2.7 /usr/local/spark/ 
    
  3. To set the Spark home after installing Spark, just apply the following commands:
    $ echo "export SPARK_HOME=/usr/local/spark/spark-2.0.0-bin-hadoop2.7" >>
          ~/.bashrc  
    $ echo "export PATH=$PATH:$SPARK_HOME/bin" >> ~/.bashrc
    

Step 4: Making all the changes permanent

Source the ~/.bashrc file using the following command to make the changes permanent:

$ source ~/.bashrc

If you execute the $ vi ~/.bashrc command, you will see entries like the following in your .bashrc file:

export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export PATH=$PATH:$JAVA_HOME/bin
export SCALA_HOME=/usr/local/scala/scala-2.11.8
export PATH=$PATH:$SCALA_HOME/bin
export SPARK_HOME=/usr/local/spark/spark-2.0.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
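
As an optional sanity check after sourcing, you can confirm that the new variables are visible in your shell; the expected values below assume the installation paths used in this section:

$ echo $JAVA_HOME
/usr/lib/jvm/java-8-oracle
$ echo $SCALA_HOME
/usr/local/scala/scala-2.11.8
$ echo $SPARK_HOME
/usr/local/spark/spark-2.0.0-bin-hadoop2.7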

Step 5: Verifying the Spark installation

The verification of the Spark installation is shown in the following screenshot:


Figure 6: The Spark shell confirms the successful Spark installation.

Write the following command to open the Spark shell to verify if Spark has been configured successfully:

$ spark-shell

If Spark is installed successfully, you should see the following message (Figure 6).

The Spark web UI will start on localhost at port 4040, more precisely at http://localhost:4040/ (Figure 7). Just navigate there to make sure it is really running:


Figure 7: Spark is running as a local web server.

Well done! Now you are ready to start writing Scala code in the Spark shell.
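
For instance, here is a minimal sketch of a first session; the variable name and the sample range are purely illustrative, and sc is the SparkContext that the Spark shell creates for you automatically:

scala> val numbers = sc.parallelize(1 to 100)   // distribute a local range as an RDD
scala> numbers.filter(_ % 2 == 0).count()       // keep the even numbers and count them
res0: Long = 50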
