Spark is widely regarded as Apache Hadoop's successor. It is therefore preferable to install and work with Spark on a Linux-based system, although you can also try it on Windows and Mac OS. It is also possible to configure your Eclipse environment to work with Spark as a Maven project on any OS and bundle your applications as a JAR file with all the dependencies. A second option is to run an application from the Spark shell (the Scala shell, to be more specific), in the same fashion as SQL or R programming:
The third way is from the command line (Windows) or terminal (Linux/Mac OS). First, you write your ML application in Scala or Java and package it as a JAR file with the required dependencies. The JAR file can then be submitted to a cluster to compute a Spark job.
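As a sketch of that third option, a packaged JAR is handed to the spark-submit script that ships with Spark. The class name, JAR path, and master URL below are placeholders, not part of any real application:

```shell
# Hypothetical example: submit a packaged ML application to a cluster.
# com.example.MyMLApp, my-ml-app.jar, and the master URL are placeholders
# to be replaced with your own application's details.
$ spark-submit \
    --class com.example.MyMLApp \
    --master spark://master-host:7077 \
    /path/to/my-ml-app.jar arg1 arg2
```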
We will show how to develop and deploy a Spark ML application in all three ways. However, the very first prerequisite is to prepare your Spark application development environment. You can install and configure Spark on a number of operating systems, including:
Please check the Spark website at https://spark.apache.org/documentation.html for the Spark version and OS-related documentation. The following steps show you how to install and configure Spark on Ubuntu 14.04 (64-bit). Please note that Spark 2.0.0 runs on Java 7+, Python 2.6+/3.4+, and R 3.1+. For the Scala API, Spark 2.0.0 uses Scala 2.11; therefore, you will need a compatible Scala version (2.11.x).
Step 1: Java installation
Java installation is one of the mandatory requirements for installing Spark, since the Java- and Scala-based APIs require a Java Virtual Machine to be installed on the system. Try the following command to verify the Java version:
$ java -version
If Java is already installed on your system, you should see the following message:
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
In case you do not have Java installed on your system, make sure you install it before proceeding to the next step. Please note that to take advantage of lambda expression support, it is recommended to install Java 8 (preferably both the JDK and JRE), although for Spark 1.6.2 and earlier releases Java 7 is sufficient:
$ sudo apt-add-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer
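If you script the installation, you may want to check the installed Java version programmatically rather than by eye. The helper below is our own sketch (the function name and sed pattern are not part of Spark or the JDK); it pulls the major version out of a `java -version`-style string:

```shell
#!/bin/sh
# Extract the Java major version (e.g. 7 or 8) from a
# 'java version "1.x.y_z"' string, as printed by `java -version`.
parse_java_major() {
  printf '%s\n' "$1" | sed -n 's/.*version "1\.\([0-9][0-9]*\)\..*/\1/p'
}

parse_java_major 'java version "1.8.0_101"'   # prints 8
parse_java_major 'java version "1.7.0_80"'    # prints 7
```

A setup script could compare the result against 7 or 8 and abort early if the JVM is too old.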
After installing, don't forget to set JAVA_HOME. Just apply the following commands (we assume Java is installed at /usr/lib/jvm/java-8-oracle):
$ echo "export JAVA_HOME=/usr/lib/jvm/java-8-oracle" >> ~/.bashrc
$ echo "export PATH=$PATH:$JAVA_HOME/bin" >> ~/.bashrc
Alternatively, you can add these environment variables manually to the .bashrc file located in your home directory. If you cannot find the file, it is probably hidden, so it needs to be shown. Just go to the View tab of the file manager and enable Show hidden files.
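Note that with double quotes, the shell expands $PATH and $JAVA_HOME at the moment you run echo; single quotes defer the expansion until the file is sourced. The append-and-source flow can be rehearsed with a throwaway file standing in for your real ~/.bashrc (the Java path is the assumed install location from this section):

```shell
#!/bin/sh
# Use a temporary file in place of ~/.bashrc for this dry run.
RC=$(mktemp)

# Single quotes keep $JAVA_HOME and $PATH unexpanded in the file,
# so they are resolved when the file is sourced, not when written.
echo 'export JAVA_HOME=/usr/lib/jvm/java-8-oracle' >> "$RC"
echo 'export PATH=$PATH:$JAVA_HOME/bin' >> "$RC"

. "$RC"               # same effect as `source ~/.bashrc`
echo "$JAVA_HOME"     # prints /usr/lib/jvm/java-8-oracle
```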
Step 2: Scala installation
Spark itself is written in Scala; therefore, you should have Scala installed on your system. Checking this is straightforward using the following command:
$ scala -version
If Scala is already installed on your system, you should get the following message on the terminal:
Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL
Note that during the writing of this installation, we used the latest version of Scala, that is, 2.11.8. In case you do not have Scala installed on your system, make sure you install it before proceeding to the next step; you can download the latest version of Scala from the Scala website at http://www.scala-lang.org/download/. After the download has finished, you should find the Scala tar file in the download folder:
1. Extract the Scala tar file from its location, or type the following command to extract it from the terminal:
$ tar -xvzf scala-2.11.8.tgz
2. Move the extracted Scala distribution to a working directory (for example, /usr/local/scala) with the following commands, or do it manually:
$ cd /home/Downloads/
$ sudo mkdir -p /usr/local/scala
$ sudo mv scala-2.11.8 /usr/local/scala
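The extract-and-move steps can be rehearsed end to end without touching /usr/local, using a dummy archive in a temporary directory (the file names mirror the Scala tarball above; the `local/scala` target is a stand-in for /usr/local/scala):

```shell
#!/bin/sh
# Dry run of the tar-extract-and-move steps in a scratch directory.
WORK=$(mktemp -d)
cd "$WORK"

# Fabricate a stand-in for the downloaded scala-2.11.8.tgz archive.
mkdir scala-2.11.8
echo "dummy" > scala-2.11.8/README
tar -czf scala-2.11.8.tgz scala-2.11.8
rm -r scala-2.11.8

# The two steps from the section, against a local target directory
# standing in for /usr/local/scala:
tar -xzf scala-2.11.8.tgz
mkdir -p local/scala
mv scala-2.11.8 local/scala

ls local/scala/scala-2.11.8/README   # the files arrived intact
```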
$ echo "export SCALA_HOME=/usr/local/scala/scala-2.11.8" >> ~/.bashrc
$ echo "export PATH=$PATH:$SCALA_HOME/bin" >> ~/.bashrc
After the installation has completed, verify it by checking the Scala version again:
$ scala -version
Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL
Step 3: Installing Spark
Download the latest version of Spark from the Apache Spark website at https://spark.apache.org/downloads.html. For this installation step, we used the latest stable Spark release, version 2.0.0, pre-built for Hadoop 2.7 and later. After the download has finished, you will find the Spark tar file in the download folder:
1. Extract the Spark tar file from its location, or type the following command to extract it from the terminal:
$ tar -xvzf spark-2.0.0-bin-hadoop2.7.tgz
2. Move the extracted Spark distribution to a working directory (for example, /usr/local/spark) with the following commands, or do it manually:
$ cd /home/Downloads/
$ sudo mkdir -p /usr/local/spark
$ sudo mv spark-2.0.0-bin-hadoop2.7 /usr/local/spark
$ echo "export SPARK_HOME=/usr/local/spark/spark-2.0.0-bin-hadoop2.7" >> ~/.bashrc
$ echo "export PATH=$PATH:$SPARK_HOME/bin" >> ~/.bashrc
Step 4: Making all the changes permanent
Source the ~/.bashrc file using the following command to make the changes permanent:
$ source ~/.bashrc
If you execute the $ vi ~/.bashrc command, you will see the following entries in your bashrc file:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/lib/jvm/java-8-oracle/bin:/usr/lib/jvm/java-8-oracle/db/bin:/usr/lib/jvm/java-8-oracle/jre/bin:/usr/lib/jvm/java-8-oracle/bin
export SCALA_HOME=/usr/local/scala/scala-2.11.8
export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/lib/jvm/java-8-oracle/bin:/usr/lib/jvm/java-8-oracle/db/bin:/usr/lib/jvm/java-8-oracle/jre/bin:/bin
export SPARK_HOME=/usr/local/spark/spark-2.0.0-bin-hadoop2.7
export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/lib/jvm/java-8-oracle/bin:/usr/lib/jvm/java-8-oracle/db/bin:/usr/lib/jvm/java-8-oracle/jre/bin:/bin
Step 5: Verifying the Spark installation
The verification of the Spark installation is shown in the following screenshot. Write the following command to open the Spark shell and verify whether Spark has been configured successfully:
$ spark-shell
If Spark is installed successfully, you should see the following message (Figure 6). The Spark web UI will start on localhost at port 4040, more precisely at http://localhost:4040/ (Figure 7). Just browse there to make sure it is really running:
Well done! Now you are ready to start writing Scala code in the Spark shell.
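As a first smoke test, a Spark 2.0 shell exposes a prebuilt SparkSession named spark and a SparkContext named sc, so a couple of one-liners typed at the scala> prompt will confirm the session works (these need a running Spark installation, so they are shown here as a sketch rather than executed):

```shell
# Typed at the scala> prompt of spark-shell:
#   spark.version                     // prints the Spark version string
#   sc.parallelize(1 to 100).sum()    // runs a tiny job over a local RDD
```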