Building the Uber JAR

The first step in deploying our Spark application on a cluster is to bundle it into a single Uber JAR, also known as the assembly JAR. In this recipe, we'll look at how to use the SBT assembly plugin to generate the assembly JAR. We'll be using this assembly JAR in subsequent recipes when we run Spark in distributed mode. Alternatively, we could point Spark at our dependent JARs using the spark.driver.extraClassPath property (https://spark.apache.org/docs/1.3.1/configuration.html#runtime-environment). However, this becomes inconvenient once the number of dependent JARs grows large.
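For reference, wiring dependent JARs through that property looks roughly like the following spark-submit call; the class name and paths here are purely illustrative:

spark-submit \
  --class com.packt.scalada.SomeSparkApp \
  --conf "spark.driver.extraClassPath=/opt/deps/dep1.jar:/opt/deps/dep2.jar" \
  --conf "spark.executor.extraClassPath=/opt/deps/dep1.jar:/opt/deps/dep2.jar" \
  our-spark-app.jar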

How to do it...

The goal of building the assembly JAR is to produce a single fat JAR that contains all the dependencies along with our Spark application. Refer to the following screenshot, which shows the innards of an assembly JAR. You can see not only the application's files in the JAR, but also all the packages and files of the dependent libraries:

[Screenshot: the contents of an assembly JAR, showing application classes alongside the packages of the dependent libraries]
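At the console, a quick way to take the same peek inside the JAR is the standard jar tool; this assumes the JAR name we set later in this recipe and SBT's default output directory for Scala 2.10:

jar tf target/scala-2.10/scalada-learning-assembly.jar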

The assembly JAR can easily be built in SBT using the SBT assembly plugin (https://github.com/sbt/sbt-assembly).

In order to install the sbt-assembly plugin, let's add the following line to our project/assembly.sbt:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

Next, the most common issue that we face while trying to build the assembly JAR (or Uber JAR) is the problem of duplicates: duplicate transitive dependency JARs, or simply duplicate files at the same location (such as MANIFEST.MF) in different bundled JARs. The easiest way to figure out where a conflicting JAR comes from is to install the sbt-dependency-graph plugin (https://github.com/jrudolph/sbt-dependency-graph) and check which two trees bring it in.

In order to add the sbt-dependency-graph plugin, let's add the following line to our project/plugins.sbt:

addSbtPlugin("net.virtual-void" % "sbt-dependency-graph" % "0.7.5")

Let's try to build the Uber JAR. When we issue the following command from the root of the project, we get an error that tells us that we have duplicate files in our JAR:
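sbt assembly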

Let's see an example of a duplicate error message that we might face:

deduplicate: different file contents found in the following:
/Users/Gabriel/.ivy2/cache/org.apache.xmlbeans/xmlbeans/jars/xmlbeans-2.3.0.jar:org/w3c/dom/DOMStringList.class
/Users/Gabriel/.ivy2/cache/xml-apis/xml-apis/jars/xml-apis-1.4.01.jar:org/w3c/dom/DOMStringList.class
deduplicate: different file contents found in the following:
/Users/Gabriel/.ivy2/cache/org.apache.xmlbeans/xmlbeans/jars/xmlbeans-2.3.0.jar:org/w3c/dom/TypeInfo.class
/Users/Gabriel/.ivy2/cache/xml-apis/xml-apis/jars/xml-apis-1.4.01.jar:org/w3c/dom/TypeInfo.class
deduplicate: different file contents found in the following:
/Users/Gabriel/.ivy2/cache/org.apache.xmlbeans/xmlbeans/jars/xmlbeans-2.3.0.jar:org/w3c/dom/UserDataHandler.class
/Users/Gabriel/.ivy2/cache/xml-apis/xml-apis/jars/xml-apis-1.4.01.jar:org/w3c/dom/UserDataHandler.class

This happens most commonly if:

  • Two different libraries in our sbt dependencies depend on the same external library (or on libraries that bundle classes under the same package)
  • We have explicitly stated a transitive dependency as a separate dependency in sbt

Whatever the case, it is always advisable to go through the entire dependency tree and trim it down. The second case, for instance, looks like the sketch below.
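Here is a hypothetical build.sbt fragment that triggers exactly that situation: epic_2.10 is already pulled in transitively by epic-parser-en-span_2.10 (as we'll see shortly), so declaring it again invites duplicates:

libraryDependencies ++= Seq(
  // pulls in epic_2.10 transitively
  "org.scalanlp" % "epic-parser-en-span_2.10" % "2015.2.19",
  // redundant explicit declaration of the transitive dependency
  "org.scalanlp" % "epic_2.10" % "0.3.1"
)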

Transitive dependency stated explicitly in the SBT dependency

A simpler way is to export the dependency tree in an ASCII tree format and eyeball it to find the two places where the xmlbeans JAR is pulled in. The sbt-dependency-graph plugin lets us do just that. Once we have installed the plugin as per the instructions, we can export and inspect the dependency tree:

sbt dependency-tree > deptree.txt
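Since the exported tree is plain text, a quick search pinpoints the conflicting artifact, for example:

grep -n "xmlbeans" deptree.txt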

The dependencies can also be visualized as an actual graph (though we lose the text search capability that way), and sbt-dependency-graph helps us here too. We can export the same tree as a .dot file using this command:

sbt dependency-dot

The task writes a .dot file into our target directory, which can be opened using Graphviz (http://www.graphviz.org/). Refer to the following screenshot to see what the visualization of a .dot file in Graphviz looks like:

[Screenshot: Graphviz visualization of the exported dependency graph]
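If you prefer the command line to the Graphviz GUI, dot can render the file directly to an image; the filename below assumes the plugin's default output name, which may differ in your setup:

dot -Tpng target/dependencies-compile.dot -o depgraph.png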

As we can see in lines 96 and 573 of the dependency tree (refer to https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter6-scalingup/depgraph_xmlbeans_duplicate.txt, whose screenshot is shown here), xmlbeans gets imported twice: once in the tree rooted at org.scalanlp:epic-parser-en-span_2.10:2015.2.19, and once in the tree rooted at org.scalanlp:epic_2.10:0.3.1. If you look at the second level of the epic-parser tree, you will notice that it is the epic library itself.

So, we can resolve this error by removing org.scalanlp:epic_2.10:0.3.1 from the list of dependencies in our build.sbt file.

Two different libraries depend on the same external library

Even after we have removed the epic library, we still see some issues with the xercesImpl and xml-apis JARs. When we analyze the dependency tree, we see that two of epic's dependent libraries depend on xerces, xml-apis, and the Scala library itself!

[Screenshot: dependency tree showing two libraries pulling in the same external JARs]

We notice that the epic library has a dependency on the Scala library, but we also know that the Scala library should already be available on the master and the worker nodes. We can therefore exclude the Scala library from the bundle altogether using the assemblyOption key:

assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)

Next, in order to exclude the xml-apis library from the epic library, we use the exclude function:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided",
  "com.databricks" %% "spark-csv" % "1.0.3",
  ("org.scalanlp" % "epic-parser-en-span_2.10" % "2015.2.19").
    exclude("xml-apis", "xml-apis")
)
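If the same unwanted artifact arrives through more than one dependency, sbt's excludeAll with an ExclusionRule achieves the same effect in one shot; a minimal sketch:

("org.scalanlp" % "epic-parser-en-span_2.10" % "2015.2.19").
  excludeAll(ExclusionRule(organization = "xml-apis"))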

As for the rest of the conflicting files, we can use the assembly plugin's merge strategies to resolve the conflicts. Since we are merging the contents of multiple JARs, there is a distinct possibility of identically named files turning up at the same path, for example, MANIFEST.MF. The sbt-assembly plugin provides various strategies to resolve such conflicts when the contents of the files at the same location don't match. The default strategy is to throw an error, but we can customize it to suit our needs.

In our merge strategy, we concatenate the contents of application.conf files if multiple JARs carry one, use the first matching class/file in classpath order for the org.cyberneko.html package, and discard all manifest files. For everything else, we apply the default strategy:

assemblyMergeStrategy in assembly := {
  case "application.conf"                            => MergeStrategy.concat
  case PathList("org", "cyberneko", "html", xs @ _*) => MergeStrategy.first
  case m if m.toLowerCase.endsWith("manifest.mf")    => MergeStrategy.discard
  case f                                             => (assemblyMergeStrategy in assembly).value(f)
}

The entire build.sbt looks like this:

organization := "com.packt"

name := "chapter6-scalingup"

scalaVersion := "2.10.4"

val sparkVersion = "1.4.1"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided",
  "com.databricks" %% "spark-csv" % "1.0.3",
  ("org.scalanlp" % "epic-parser-en-span_2.10" % "2015.2.19").
    exclude("xml-apis", "xml-apis")
)

assemblyJarName in assembly := "scalada-learning-assembly.jar"

assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)

assemblyMergeStrategy in assembly := {
  case "application.conf"                            => MergeStrategy.concat
  case PathList("org", "cyberneko", "html", xs @ _*) => MergeStrategy.first
  case m if m.toLowerCase.endsWith("manifest.mf")    => MergeStrategy.discard
  case f                                             => (assemblyMergeStrategy in assembly).value(f)
}

So finally, when we run sbt assembly, scalada-learning-assembly.jar is created. If you would like the JAR name to be derived from the name and version in build.sbt instead (sbt-assembly's default is <name>-assembly-<version>.jar), just delete the assemblyJarName key from build.sbt:

> sbt clean assembly
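As a preview of the recipes that follow, submitting this assembly JAR to a cluster looks roughly like this; the main class and master URL below are placeholders:

spark-submit \
  --class com.packt.scalada.SomeSparkApp \
  --master spark://master-host:7077 \
  target/scala-2.10/scalada-learning-assembly.jar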