The first step for deploying our Spark application on a cluster is to bundle it into a single Uber JAR, also known as the assembly JAR. In this recipe, we'll look at how to use the SBT assembly plugin to generate the assembly JAR. We'll use this assembly JAR in subsequent recipes when we run Spark in distributed mode. We could alternatively set dependent JARs using the spark.driver.extraClassPath property (https://spark.apache.org/docs/1.3.1/configuration.html#runtime-environment). However, for a large number of dependent JARs, this is inconvenient.
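To make that inconvenience concrete, here is a minimal sketch of how a value for that property might be put together by hand. The JAR locations are hypothetical, and how the resulting string is handed to Spark is deliberately left out; the point is only that every dependent JAR has to be listed and kept up to date manually.

```scala
import java.io.File

object ExtraClassPathSketch {
  // Hypothetical locations of dependent JARs on the driver machine.
  val dependentJars: Seq[String] = Seq(
    "/opt/libs/spark-csv_2.10-1.0.3.jar",
    "/opt/libs/epic-parser-en-span_2.10-2015.2.19.jar"
  )

  // Join the entries with the platform's classpath separator
  // (":" on Unix-like systems, ";" on Windows).
  def classPath(jars: Seq[String]): String =
    jars.mkString(File.pathSeparator)

  def main(args: Array[String]): Unit =
    println("spark.driver.extraClassPath=" + classPath(dependentJars))
}
```

With dozens of transitive dependencies, this list grows quickly, which is why a single assembly JAR is usually easier to manage.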
The goal of building the assembly JAR is to produce a single Fat JAR that contains all dependencies and our Spark application. Refer to the following screenshot, which shows the innards of an assembly JAR. You can see not only the application's files in the JAR, but also all the packages and files of the dependent libraries:
The assembly JAR can easily be built in SBT using the SBT assembly plugin (https://github.com/sbt/sbt-assembly).
In order to install the sbt-assembly plugin, let's add the following line to our project/assembly.sbt:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")
Next, the most common issue that we face while trying to build the assembly JAR (or Uber JAR) is the problem of duplicates: duplicate transitive dependency JARs, or simply duplicate files located at the same path (such as MANIFEST.MF) in different bundled JARs. The easiest way to find the source of a conflict is to install the sbt-dependency-graph plugin (https://github.com/jrudolph/sbt-dependency-graph) and check which two trees bring in the conflicting JAR.
In order to add the sbt-dependency-graph plugin, let's add the following line to our project/plugins.sbt:
addSbtPlugin("net.virtual-void" % "sbt-dependency-graph" % "0.7.5")
Let's try to build the Uber JAR using sbt assembly. When we issue this command from the root of the project, we get an error that tells us that we have duplicate files in our JAR.
Let's see an example of a duplicate error message that we might face:
deduplicate: different file contents found in the following:
/Users/Gabriel/.ivy2/cache/org.apache.xmlbeans/xmlbeans/jars/xmlbeans-2.3.0.jar:org/w3c/dom/DOMStringList.class
/Users/Gabriel/.ivy2/cache/xml-apis/xml-apis/jars/xml-apis-1.4.01.jar:org/w3c/dom/DOMStringList.class
deduplicate: different file contents found in the following:
/Users/Gabriel/.ivy2/cache/org.apache.xmlbeans/xmlbeans/jars/xmlbeans-2.3.0.jar:org/w3c/dom/TypeInfo.class
/Users/Gabriel/.ivy2/cache/xml-apis/xml-apis/jars/xml-apis-1.4.01.jar:org/w3c/dom/TypeInfo.class
deduplicate: different file contents found in the following:
/Users/Gabriel/.ivy2/cache/org.apache.xmlbeans/xmlbeans/jars/xmlbeans-2.3.0.jar:org/w3c/dom/UserDataHandler.class
/Users/Gabriel/.ivy2/cache/xml-apis/xml-apis/jars/xml-apis-1.4.01.jar:org/w3c/dom/UserDataHandler.class
This happens most commonly if sbt dependencies depend on the same external library (or on libraries that have bundled classes with the same package).
Whatever the case, it is always recommended to go through the entire dependency tree to trim it down.
A simpler way is to export the dependency tree in an ASCII tree format and eyeball it to find the two instances where the xmlbeans JAR is referred to. The sbt-dependency-graph plugin lets us do that. Once we have installed the plugin as per the instructions, we can export and inspect the dependency tree:
sbt dependency-tree > deptree.txt
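Instead of eyeballing the exported file, the search can also be scripted. The following is a minimal sketch, assuming the tree was exported to deptree.txt in the current directory as above; the object and method names are made up for illustration.

```scala
import scala.io.Source

object DepTreeSearch {
  // Returns (1-based line number, trimmed line) for every line
  // that contains the given text.
  def find(lines: Iterator[String], needle: String): List[(Int, String)] =
    lines.zipWithIndex.collect {
      case (line, idx) if line.contains(needle) => (idx + 1, line.trim)
    }.toList

  def main(args: Array[String]): Unit = {
    // Assumes the file produced by the export command above.
    val source = Source.fromFile("deptree.txt")
    try {
      find(source.getLines(), "xmlbeans").foreach {
        case (number, line) => println(number + ": " + line)
      }
    } finally {
      source.close()
    }
  }
}
```

The reported line numbers can then be used to walk up each branch of the tree and see which top-level dependency pulls in the JAR.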
The dependencies can also be rendered as an actual graph, although this loses the text search capability. The sbt-dependency-graph plugin supports that too. We can export the same tree as a .dot file using this command:
sbt dependency-dot > depdot.dot
It outputs a depdot.dot file in our target directory, which can be opened using Graphviz (http://www.graphviz.org/). Refer to the following screenshot to see what the visualization of a .dot file in Graphviz looks like:
As we can see in lines 96 and 573 of the dependency tree (refer to https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter6-scalingup/depgraph_xmlbeans_duplicate.txt; its screenshot is given), there are two instances of the import of xmlbeans: once in the tree that leads to org.scalanlp:epic-parser-en-span_2.10:2015.2.19, and once in the tree that leads to org.scalanlp:epic_2.10:0.3.1. If you look at the second level of the epic-parser library's tree, you will see that it is the epic library itself.
So, we can resolve this error by removing org.scalanlp:epic_2.10:0.3.1 from the list of dependencies in our build.sbt file.
Even after we have removed the epic library, we still see some issues with the xercesImpl and xml-apis JARs. When we analyze the dependency tree, we see that two dependent libraries of epic depend on xerces, the XML API, and the Scala library itself!
We notice that the epic library has a dependency on the Scala library, but we also know that the Scala library should already be available on the master and the worker nodes. We can therefore exclude the Scala library from the bundle altogether using the assemblyOption key:
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
Next, in order to exclude the xml-apis library from the epic library, we use the exclude method:
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided",
  "com.databricks" %% "spark-csv" % "1.0.3",
  ("org.scalanlp" % "epic-parser-en-span_2.10" % "2015.2.19").
    exclude("xml-apis", "xml-apis")
)
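As a possible variation, sbt also lets us exclude by organization rather than by a specific module, using excludeAll with an ExclusionRule. The fragment below is only a sketch of that alternative, not the approach used in this recipe; it would replace the epic-parser entry in the preceding list rather than be added alongside it.

```scala
// build.sbt fragment (sketch): exclude every artifact published under the
// "xml-apis" organization from the transitive dependencies of epic-parser.
libraryDependencies += ("org.scalanlp" % "epic-parser-en-span_2.10" % "2015.2.19")
  .excludeAll(ExclusionRule(organization = "xml-apis"))
```

Excluding by organization is broader than exclude("xml-apis", "xml-apis"), so it is worth re-checking the dependency tree afterwards to confirm that nothing else was dropped unintentionally.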
As for the rest of the conflicting files, we can use the assembly plugin's merge strategy to resolve the conflict. Since we are merging the contents of multiple JARs, there is a distinct possibility of a similarly named file being available on the same path, for example, MANIFEST.MF. The sbt-assembly plugin provides various strategies to resolve conflicts if the contents of the files in the same location don't match. The default strategy is to throw an error, but we can customize the strategy to suit our needs.
In the merge strategy, we append the contents of application.conf if there are multiple conf files in the JARs, use the first matching class/file in the order of the class path for the org.cyberneko.html package, and discard all the manifest files. For all others, we apply the default strategy:
assemblyMergeStrategy in assembly := {
  case "application.conf" => MergeStrategy.concat
  case PathList("org", "cyberneko", "html", xs @ _*) => MergeStrategy.first
  case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
  case f => (assemblyMergeStrategy in assembly).value(f)
}
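The cases are tried from top to bottom, and the first one that matches a file's path decides its strategy. The following standalone sketch mimics that ordering with plain strings so that it can be run outside sbt. It is an approximation for illustration only: the strategy names are just labels, and the prefix check stands in for the plugin's PathList extractor.

```scala
object MergeRuleSketch {
  // Hypothetical stand-in for the merge strategy above: returns the name
  // of the strategy that the first matching case would select.
  def strategyFor(path: String): String = path match {
    case "application.conf"                         => "concat"
    case p if p.startsWith("org/cyberneko/html/")   => "first"
    case p if p.toLowerCase.endsWith("manifest.mf") => "discard"
    case _                                          => "default"
  }

  def main(args: Array[String]): Unit = {
    val samples = Seq(
      "application.conf",
      "org/cyberneko/html/HTMLScanner.class",
      "META-INF/MANIFEST.MF",
      "org/w3c/dom/TypeInfo.class"
    )
    samples.foreach(p => println(p + " -> " + strategyFor(p)))
  }
}
```

Because the catch-all case comes last, any path that is not explicitly handled falls through to the plugin's default behavior, which is why the order of the cases matters.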
The entire build.sbt looks like this:
organization := "com.packt"

name := "chapter6-scalingup"

scalaVersion := "2.10.4"

val sparkVersion = "1.4.1"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided",
  "com.databricks" %% "spark-csv" % "1.0.3",
  ("org.scalanlp" % "epic-parser-en-span_2.10" % "2015.2.19").
    exclude("xml-apis", "xml-apis")
)

assemblyJarName in assembly := "scalada-learning-assembly.jar"

assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)

assemblyMergeStrategy in assembly := {
  case "application.conf" => MergeStrategy.concat
  case PathList("org", "cyberneko", "html", xs @ _*) => MergeStrategy.first
  case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
  case f => (assemblyMergeStrategy in assembly).value(f)
}
Finally, when we run sbt assembly, scalada-learning-assembly.jar is created. If you would like the JAR name to be derived from the name and version declared in build.sbt, just delete the assemblyJarName key from build.sbt and rebuild:
> sbt clean assembly
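As a rough illustration of what the derived name looks like, consider the following fragment. The version value is hypothetical (the build.sbt in this recipe does not set one), and the resulting file name is based on the plugin's documented default pattern of name, "-assembly-", and version, so treat it as an expectation to verify rather than a guarantee.

```scala
// build.sbt fragment (sketch); the version below is a hypothetical value.
name := "chapter6-scalingup"

version := "0.1.0"

// With no assemblyJarName set, the assembly JAR is expected to be named
// chapter6-scalingup-assembly-0.1.0.jar
```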