Building and running standalone programs

So far, we have interacted exclusively with Spark through the Spark shell. In the section that follows, we will build a standalone application and launch a Spark program either locally or on an EC2 cluster.

Running Spark applications locally

The first step is to write the build.sbt file, as you would for any standard Scala project. The Spark binary that we downloaded is built against Scala 2.10, so our application needs to be compiled against Scala 2.10 as well (to run against Scala 2.11, you need to compile Spark from source; this is not difficult to do: just follow the instructions at http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211).

// build.sbt file

name := "spam_mi"

scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.4.1"
)
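
For reference, a minimal standalone Spark program might look like the following. This is just an illustrative skeleton, not the chapter's spam_mi code; it lives under src/main/scala/:

// src/main/scala/SpamMain.scala (illustrative skeleton only)

import org.apache.spark.{SparkConf, SparkContext}

object SpamMain {
  def main(args: Array[String]) {
    // Unlike in the Spark shell, we must create the SparkContext ourselves
    val conf = new SparkConf().setAppName("spam_mi")
    val sc = new SparkContext(conf)

    // A trivial job: distribute a range of integers and sum it
    val total = sc.parallelize(1 to 1000).reduce { _ + _ }
    println(s"Sum: $total")

    sc.stop()
  }
}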

We then run sbt package to compile and build a jar of our program. The jar will be built in target/scala-2.10/, and called spam_mi_2.10-0.1-SNAPSHOT.jar. You can try this with the example code provided for this chapter.

We can then run the jar locally using the spark-submit shell script, available in the bin/ folder in the Spark installation directory:

$ spark-submit target/scala-2.10/spam_mi_2.10-0.1-SNAPSHOT.jar
... runs the program

The resources allocated to Spark can be controlled by passing arguments to spark-submit. Use spark-submit --help to see the full list of arguments.
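
For instance, to run the program locally on four cores with 2 GB of memory for the driver (the values here are purely illustrative):

$ spark-submit --master local[4] --driver-memory 2G \
    target/scala-2.10/spam_mi_2.10-0.1-SNAPSHOT.jar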

If the Spark program has dependencies (for instance, on other Maven packages), it is easiest to bundle them into the application jar using the SBT assembly plugin. Let's imagine that our application depends on breeze-viz. The build.sbt file now looks like this:

// build.sbt

name := "spam_mi"

scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.5.2" % "provided",
  "org.scalanlp" %% "breeze" % "0.11.2",
  "org.scalanlp" %% "breeze-viz" % "0.11.2",
  "org.scalanlp" %% "breeze-natives" % "0.11.2"
)

SBT assembly is an SBT plugin that builds fat jars: jars that contain not only the program itself, but all the dependencies for the program.

Note that we marked Spark as "provided" in the list of dependencies, which means that Spark itself will not be included in the jar (it is provided by the Spark environment anyway). To include the SBT assembly plugin, create a file called assembly.sbt in the project/ directory, with the following line:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.0")

You will need to re-start SBT for the changes to take effect. You can then create the assembly jar using the assembly command in SBT. This will create a jar called spam_mi-assembly-0.1-SNAPSHOT.jar in the target/scala-2.10 directory. You can run this jar using spark-submit.
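
Putting it all together, building and running the assembly jar looks something like this:

$ sbt assembly
...
$ spark-submit target/scala-2.10/spam_mi-assembly-0.1-SNAPSHOT.jar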

Reducing logging output and Spark configuration

Spark is, by default, very verbose: the default log level is set to INFO. To avoid important messages getting lost in the noise, it is useful to change the log level to WARN. To change the default log level system-wide, go into the conf directory of your Spark installation. You should find a file called log4j.properties.template. Copy this file to log4j.properties and look for the following line:

log4j.rootCategory=INFO, console

Change this line to:

log4j.rootCategory=WARN, console
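
Note that, from Spark 1.4 onwards, you can also override the log level for a single application programmatically, through the SparkContext:

// Inside a Spark program, after creating the SparkContext
sc.setLogLevel("WARN")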

There are several other configuration files in the conf directory that you can use to alter Spark's default behavior. For a full list of configuration options, head over to http://spark.apache.org/docs/latest/configuration.html.
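
For instance, application-wide defaults can be set by copying conf/spark-defaults.conf.template to conf/spark-defaults.conf and adding properties such as the following (the values are illustrative):

# conf/spark-defaults.conf (copied from spark-defaults.conf.template)
# Illustrative values only; adjust to your environment
spark.master            local[4]
spark.executor.memory   2g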

Running Spark applications on EC2

Running Spark locally is useful for testing, but the whole point of using a distributed framework is to run programs that harness the power of several different computers. We can set Spark up on any set of computers that can communicate with each other over a network. In general, we also need to set up a distributed file system such as HDFS, so that we can share input files across the cluster. For the purpose of this example, we will set Spark up on an Amazon EC2 cluster.

Spark comes with a shell script, ec2/spark-ec2, that sets up an EC2 cluster and installs Spark and HDFS on it. You will need an account with Amazon Web Services (AWS) to follow these examples (https://aws.amazon.com). You will also need your AWS access key and secret key, which you can find through the Account / Security Credentials / Access Credentials menu in the AWS web console. You must make these available to the spark-ec2 script through environment variables. Inject them into your current session as follows:

$ export AWS_ACCESS_KEY_ID=ABCDEF...
$ export AWS_SECRET_ACCESS_KEY=2dEf...

You can also write these lines into the configuration script for your shell (your .bashrc file, or equivalent) to avoid having to re-enter them every time you run the spark-ec2 script. We discussed environment variables in Chapter 6, Slick – A Functional Interface for SQL.

You will also need to create a key pair by clicking on Key Pairs in the EC2 web console, creating a new key pair and downloading the certificate file. I will assume you named the key pair test_ec2 and the certificate file test_ec2.pem. Make sure that the key pair is created in the N. Virginia region (by choosing the correct region in the upper right corner of the EC2 Management console), to avoid having to specify the region explicitly in the rest of this chapter. You will need to set access permissions on the certificate file to user-readable only:

$ chmod 400 test_ec2.pem

We are now ready to launch the cluster. Navigate to the ec2 directory and run:

$ ./spark-ec2 -k test_ec2 -i ~/path/to/certificate/test_ec2.pem -s 2 launch test_cluster

This will create a cluster called test_cluster with a master and two slaves. The number of slaves is set through the -s command line argument. The cluster will take a while to start up, but you can verify that the instances are launching correctly by looking at the Instances window in the EC2 Management Console.

The setup script supports many options for customizing the type of instances, the number of hard drives and so on. You can explore these options by passing the --help command line option to spark-ec2.
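
For instance, a launch command that customizes the instance type and region might look like this (the option names are taken from the script's --help output; check your Spark version for the exact set of options supported):

$ ./spark-ec2 -k test_ec2 -i test_ec2.pem -s 2 \
    --instance-type=m3.large --region=us-east-1 launch test_cluster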

The life cycle of the cluster can be controlled by passing different commands to the spark-ec2 script, such as:

# shut down 'test_cluster'
$ ./spark-ec2 stop test_cluster

# start 'test_cluster'
$ ./spark-ec2 -i test_ec2.pem start test_cluster

# destroy 'test_cluster'
$ ./spark-ec2 destroy test_cluster
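
The same script also lets you log in to the master node of a running cluster, which is useful for inspecting HDFS or submitting jobs from within the cluster:

# log into the master node of 'test_cluster'
$ ./spark-ec2 -k test_ec2 -i test_ec2.pem login test_cluster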

For more detail on using Spark on EC2, consult the official documentation at http://spark.apache.org/docs/latest/ec2-scripts.html#running-applications.
