Scala Build Tool revisited

Previously, we used the Scala console to interact with Spark. If we want to build a standalone application instead, manually managing the third-party library dependencies becomes unwieldy. Remember that first we had to download the JAR files for GraphStream and BreezeViz, as well as those of the libraries they depend on. Then, we had to put them in the /lib folder and specify this list of JAR files when we submitted the Spark application using the --jars option. This process becomes extremely cumbersome when the application reuses many third-party libraries, which may in turn depend on several other libraries. Fortunately, we can automate this process with SBT. Let's see how to manage the library dependencies, and how to create an uber JAR or assembly JAR with SBT. If you already know how to do this, feel free to skip this section and go ahead to the next chapter.

Organizing build definitions

SBT offers flexibility and power in defining builds and tracking library dependencies. In addition, SBT makes the build process reproducible and interactive. Despite this flexibility, learning all of its features can be daunting for an unfamiliar user. Instead, we will focus on the essentials.

First, SBT assumes the same directory structure as Maven for the Spark project's source files, which is as follows:

src/
  main/
    resources/
<files to include in main jar here>
    scala/
<main Scala sources>
  test/
    resources/
<files to include in test jar here>
    scala/
<test Scala sources>
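This layout can be scaffolded from the project base directory with a couple of mkdir commands (a minimal sketch; adjust if your project needs extra folders such as src/main/java):

```shell
# Create the Maven-style source layout that SBT expects,
# relative to the project base directory.
mkdir -p src/main/resources src/main/scala
mkdir -p src/test/resources src/test/scala
```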

These paths are relative to the project's base directory. On the other hand, build definitions can be put in different files, and can be organized recursively within the project structure. Specifically, there are three places where we can put build definitions:

  • A multi-project .sbt build file is recommended when multiple related projects share common settings and dependencies, which can then be defined in a single build.
  • Bare .sbt build files are useful for simple projects. Each .sbt build file defines a list of build and project settings.
  • The .scala build files are combined with the .sbt files to form the complete build definition. Prior to SBT 0.13, this was the old way to share common settings between multiple projects.

In this book, we will work on simple projects, and the bare .sbt build files will suffice. For details about the mentioned options, refer to the tutorial at http://www.scala-sbt.org/0.13/tutorial/Basic-Def.html.

Managing library dependencies

We can manage library dependencies manually or automatically. In manual mode, we will have to download all libraries in the dependency graph, and then manually copy them in the lib folder. In automatic mode, SBT handles all the work for us by leveraging Apache Ivy mechanisms behind the scenes. With this second method, we need to define three important settings in an SBT build file:

  • Dependencies: These are the libraries that our application depends on
  • Resolvers: These are the repository locations where SBT will look for the JAR files of these libraries
  • SBT plugin settings: These configure any plugins used to extend the build

    Note

    The third set of settings is needed if we want to extend the build definitions using SBT plugins. For example, we will use the sbt-assembly plugin to package a Spark application, along with the JAR files it depends on, into a single "uber JAR" file. For this, we need to specify some extra settings, such as the uber JAR name, as well as the options for creating it.

Once we have declared these settings, SBT will take care of the rest for us. Let's look at a concrete example to make sense of all this. We are going to build a Spark application that loads and visualizes food ingredient networks. Earlier in this chapter, we used the Spark shell and managed the dependencies manually. This time, we will create a standalone application and handle the dependencies automatically.

A preview of the steps

As a preview, here are the steps that we will take to build the Spark application:

  1. Create the plugins.sbt file inside the /project folder. Specify the sbt-assembly plugin in that file.
  2. Create a build.sbt file in the base directory, and declare the project settings.
  3. Specify the library dependencies and resolvers.
  4. Set up the sbt-assembly plugin.
  5. Use the SBT commands to assemble the uber JAR.

Step 1 – Enable the sbt-assembly plugin

First, let's enable the sbt-assembly plugin. This plugin creates a single deployable uber JAR that contains our built application and all the libraries that it depends on (except some that we will intentionally exclude from the build). So, let's create the plugins.sbt file inside a new /project folder. The filename is not important, but the file has to be inside the /project folder. Then, add this line to the file:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.12.0")

Step 2 – Create a build.sbt file

Now, create another .sbt file and put it in the base directory. Let's give it a meaningful name, say, build.sbt. As mentioned before, this single file will suffice for our simple project. For more complex ones, it is okay to split the definitions across multiple .sbt files.

As we did in Chapter 1, Getting Started with Spark and GraphX, the first things we define in build.sbt are the project settings, that is, the project name, its version, and the Scala version under which we will build the project. Add the following lines to build.sbt:

name := "Simple Visualization"

version := "1.0"

scalaVersion := "2.10.4"

The build.sbt file defines a sequence of build settings. Each element in the sequence is a key-value pair of type Setting[T], where T is the expected value type. Each line in build.sbt is then a Scala expression, which becomes one element of the sequence of type Seq[Setting[_]]. For instance, in the expression name := "Simple Visualization", the left-hand side name is a key of type SettingKey[String]. Each key has a method called :=, which returns a Setting[T]. In our example, the return type of the full expression name := "Simple Visualization" is thus Setting[String]. In fact, this Scala expression is syntactic sugar for the method call name.:=("Simple Visualization").
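To make this desugaring concrete, here is a tiny self-contained model of the key/setting machinery. This is illustrative only; sbt's real SettingKey and Setting classes are far richer:

```scala
// A minimal model of SBT's setting machinery, for illustration only.
case class Setting[T](key: String, value: T)

case class SettingKey[T](key: String) {
  // := is an ordinary method that returns a Setting[T]
  def :=(v: T): Setting[T] = Setting(key, v)
}

object BuildModel {
  val name = SettingKey[String]("name")

  def main(args: Array[String]): Unit = {
    // The infix form is sugar for the explicit method call:
    val infix    = name := "Simple Visualization"
    val explicit = name.:=("Simple Visualization")
    // Case classes compare structurally, so both forms build equal settings
    println(infix == explicit)
  }
}
```

Running BuildModel prints true, confirming that the infix expression and the explicit method call produce the same Setting[String].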

Tip

Do not forget to add empty lines between each setting. Since SBT uses a domain-specific language, the empty lines are mandatory to delineate the build expressions. Starting with release 0.13.7, these blank lines are no longer needed.

Step 3 – Declare library dependencies and resolvers

To manage the third-party libraries, we will need to attach these libraries to the key called libraryDependencies in build.sbt. Since an application depends on more than one library, the value type corresponding to libraryDependencies is a sequence. Therefore, libraryDependencies accepts the append method += to append a dependency, or the concatenate method ++= to add a list of dependencies. However, it does not accept the operator :=.
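For example, += appends a single dependency, whereas ++= appends a whole Seq of them (the Scala standard modules below are shown purely for illustration):

```scala
// Append a single dependency with +=
libraryDependencies += "org.scala-lang" % "scala-reflect" % "2.10.4"

// Append several dependencies at once with ++=
libraryDependencies ++= Seq(
  "org.scala-lang" % "scala-compiler" % "2.10.4"
)
```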

Our application depends on Spark Core, GraphX, GraphStream, and Breeze libraries. In build.sbt, we will attach a list of dependencies, which are as follows:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.0" % "provided",
  "org.apache.spark" %% "spark-graphx" % "1.1.0" % "provided",
  "org.graphstream" % "gs-core" % "1.2+",
  "org.graphstream" % "gs-ui" % "1.2+",
  "org.scalanlp" % "breeze-viz_2.10" % "0.9",
  "org.scalanlp" % "breeze_2.10" % "0.9"
)

Each sequence element on the right-hand side is a Scala expression that returns a ModuleID object. Each ModuleID object is constructed like this: groupID % artifactID % revision. The groupID, artifactID, and revision objects are all String objects.

In short, the % method creates the ModuleID objects from the passed strings, then we attach those ModuleID objects to the setting key libraryDependencies.

Tip

Each dependency must match the version of Scala that you are using. For libraries that are cross-built with SBT, such as spark-core and spark-graphx, we can use the operator %% instead of %, as in groupID %% artifactID % revision. This will pick the right JAR for the dependency, built with the same version of Scala that you are using.
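Concretely, with scalaVersion set to 2.10.4, the following two declarations resolve to the same artifact; %% simply appends the Scala binary version suffix for us:

```scala
// Cross-built form: %% appends the Scala binary version ("_2.10")
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"

// Equivalent explicit form with the suffix written out by hand
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.1.0"
```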

We can also add configuration information to the ModuleID like this:

groupID % artifactID % revision % configuration

For example, in "org.apache.spark" %% "spark-core" % "1.1.0" % "provided", the provided configuration informs the sbt-assembly plugin to exclude this JAR file when packaging the uber JAR.

Sometimes, there are pathological cases where two libraries depend on different versions of the same library, and SBT cannot resolve the conflict on its own. For instance, if you try to package the application with the preceding build.sbt definition, you will get a deduplication error like this:

[error] (*:assembly) deduplicate: different file contents found in the following:
~/.ivy2/cache/org.jfree/jfreechart/jars/jfreechart-1.0.14.jar:org/jfree/chart/ChartPanel.class
~/.ivy2/cache/jfree/jfreechart/jars/jfreechart-1.0.13.jar:org/jfree/chart/ChartPanel.class

This error occurs because both the GraphStream and BreezeViz libraries depend on the Java libraries JFreeChart and JCommon. However, BreezeViz is rarely maintained and is stuck with the jfreechart-1.0.13 library. To fix this, we have to exclude one of each pair of duplicate JARs. To exclude specific JARs from the dependency graph of a given library, we call the exclude or excludeAll method on the ModuleID object. In our case, we replace the "org.scalanlp" % "breeze-viz_2.10" % "0.9" expression with:

("org.scalanlp" % "breeze-viz_2.10" % "0.9").
    exclude("jfree","jfreechart").
    exclude("jfree","jcommon")

The exclude method returns a new ModuleID object that does not include the passed libraries in the final build.
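Alternatively, the excludeAll method mentioned above takes ExclusionRule objects and can match every artifact from an organization at once. A sketch of the same exclusion written with excludeAll:

```scala
// Exclude everything published under the "jfree" organization,
// covering both jfreechart and jcommon in one rule
("org.scalanlp" % "breeze-viz_2.10" % "0.9").excludeAll(
  ExclusionRule(organization = "jfree")
)
```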

After setting the dependencies, we have to tell SBT where it can download them. This is similarly done by attaching a sequence of repositories to the resolvers key as follows:

resolvers ++= Seq(
    "Akka Repository" at "http://repo.akka.io/releases/",
    "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots",
    "Sonatype Releases" at "http://oss.sonatype.org/content/repositories/releases")

Each repository is declared in the form name at location, where at is a method invoked on the two String objects. By default, SBT combines these declared resolvers with the default ones, such as Maven Central and the local Ivy repository.
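A file-based location can be declared the same way. For example, assuming the local Maven repository lives at its usual default path, it could be added as an extra resolver like this:

```scala
// Point SBT at the local Maven repository (~/.m2/repository)
resolvers += "Local Maven Repository" at
  "file://" + Path.userHome.absolutePath + "/.m2/repository"
```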

Step 4 – Set up the sbt-assembly plugin

Next, let's configure the settings of the sbt-assembly plugin. Put the following in build.sbt:

jarName in assembly := "graph-Viz-assembly.jar"

This configures the name of the uber JAR or assembly JAR to graph-Viz-assembly.jar.

We also need to exclude all the classes from the Scala language distribution. To do this, we tell SBT to exclude all the JARs that either start with "scala-", or are part of the Scala distribution:

assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)

After this step, build.sbt will finally look like this:

name := "Simple Visualization"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.0" % "provided",
  "org.apache.spark" %% "spark-graphx" % "1.1.0" % "provided",
  "org.graphstream" % "gs-core" % "1.2+",
  "org.graphstream" % "gs-ui" % "1.2+",
  ("org.scalanlp" % "breeze-viz_2.10" % "0.9").exclude("jfree","jfreechart").exclude("jfree","jcommon"),
  "org.scalanlp" % "breeze_2.10" % "0.9"
)

resolvers ++= Seq(
  "Akka Repository" at "http://repo.akka.io/releases/",
  "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots",
  "Sonatype Releases" at "http://oss.sonatype.org/content/repositories/releases")

// Configure the jar name used with the assembly plug-in
jarName in assembly := "graph-Viz-assembly.jar"

// Exclude Scala library (JARs that start with scala- and are included in the binary Scala distribution) 
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)

Step 5 – Create the uber JAR

All that needs to be done now is to run the assembly command (here preceded by clean to start from a fresh build) in the console to build the uber JAR. This must be done with the current directory set to the project base directory:

sbt clean assembly

This will create the uber JAR within the target/scala-2.10/ folder. You can look inside the built uber JAR to list all the classes that it contains:

jar tf target/scala-2.10/graph-Viz-assembly.jar

Finally, we can submit the built application with the spark-submit script by passing the assembly JAR this time:

../../bin/spark-submit --class com.github.giocode.graphxbook.SimpleGraphVizApp --master local target/scala-2.10/graph-Viz-assembly.jar

Running tasks with SBT commands

SBT provides several useful commands for interacting with the build in the SBT console. These are listed as follows:

  • clean: This removes the files that were previously produced by the build, such as generated sources, compiled classes, and task caches
  • console: This starts the Scala shell with the project classes on the classpath
  • compile: This command compiles the sources
  • update: The execution of this command resolves and retrieves the dependencies, if required
  • package: This builds and produces a deployable JAR
  • assembly: This builds an uber JAR using the sbt-assembly plugin