In Chapter 4, Parallel Collections and Futures, we discovered how to use parallel collections for "embarrassingly" parallel problems: problems that can be broken down into a series of tasks that require no (or very little) communication between the tasks.
Apache Spark provides behavior similar to Scala parallel collections (and much more), but, instead of distributing tasks across different CPUs on the same computer, it allows the tasks to be distributed across a computer cluster. This provides arbitrary horizontal scalability, since we can simply add more computers to the cluster.
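To make the analogy concrete, here is a sketch contrasting the two approaches. The parallel collection splits work across the cores of a single machine; the Spark version expresses the same computation over a cluster. The Spark lines assume a SparkContext named sc, as provided by the Spark shell introduced later in this chapter:

```
// Parallel collection: the computation is split across
// the CPU cores of a single machine.
val localSum = (1 to 1000).par.map { _ * 2 }.reduce { _ + _ }

// Spark RDD: the same computation, but the partitions can
// live on different machines in a cluster. Assumes a
// SparkContext `sc`, as provided by the Spark shell.
val distributedSum = sc.parallelize(1 to 1000).map { _ * 2 }.reduce { _ + _ }
```

Both expressions evaluate to the same result; the difference lies entirely in where the work runs.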
In this chapter, we will learn the basics of Apache Spark and use it to explore a set of emails, extracting features with a view to building a spam filter. We will explore several ways of actually building a spam filter in Chapter 12, Distributed Machine Learning with MLlib.
In previous chapters, we included dependencies by specifying them in a build.sbt file and relying on SBT to fetch them from the Maven Central repositories. For Apache Spark, it is more common to download the source code or pre-built binaries explicitly, since Spark ships with many command line scripts that greatly facilitate launching jobs and interacting with a cluster.
Head over to http://spark.apache.org/downloads.html and download Spark 1.5.2, choosing the "pre-built for Hadoop 2.6 or later" package. You can also build Spark from source if you need customizations, but we will stick to the pre-built version since it requires no configuration.
Clicking Download will download a tarball, which you can unpack with the following command:
$ tar xzf spark-1.5.2-bin-hadoop2.6.tgz
This will create a spark-1.5.2-bin-hadoop2.6 directory. To verify that Spark works correctly, navigate to spark-1.5.2-bin-hadoop2.6/bin and launch the Spark shell using ./spark-shell. This is just a Scala shell with the Spark libraries loaded.
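As a quick smoke test, you can sum the integers from 1 to 1000 using the SparkContext that the shell pre-defines as sc. A sketch of the session follows; the res numbering and the surrounding log output will vary:

```
scala> val rdd = sc.parallelize(1 to 1000)

scala> rdd.reduce { _ + _ }
res0: Int = 500500
```

If this prints 500500, Spark is running correctly on your machine.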
You may want to add the bin/ directory to your system path. This will let you call the scripts in that directory from anywhere on your system, without having to reference the full path. On Linux or Mac OS, you can add variables to the system path by entering the following line in your shell configuration file (.bash_profile on Mac OS, and .bashrc or .bash_profile on Linux):
export PATH=/path/to/spark/bin:$PATH
The changes will take effect in new shell sessions. On Windows (if you use PowerShell), you need to enter this line in the profile.ps1 file in the WindowsPowerShell folder in Documents:
$env:Path += ";C:\path\to\spark\bin"
If this worked correctly, you should be able to open a Spark shell in any directory on your system by just typing spark-shell in a terminal.
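To check that the path change took effect, you can open a new terminal and ask the shell where it finds spark-shell. The installation directory below is an assumption; substitute wherever you unpacked the tarball:

```shell
# Assumed install location; replace with your actual Spark directory.
export PATH="$HOME/spark-1.5.2-bin-hadoop2.6/bin:$PATH"

# Prints the resolved location if the script is on the PATH,
# or a fallback message otherwise.
command -v spark-shell || echo "spark-shell not on PATH"
```

If command -v prints nothing useful, double-check that the exported directory matches the location where you extracted the tarball.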