Let's understand the difference between the Spark shell and Spark applications, and how they are created and submitted.
Spark lets you access your datasets through a simple yet specialized Spark shell for Scala, Python, R, and SQL. Users do not need to create a full application to explore data; they can start exploring it with commands that can be converted into programs later, which provides higher developer productivity. A Spark application, by contrast, is a complete program that creates a SparkContext and is submitted with the spark-submit command.
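As an illustration of interactive exploration, here is a minimal sketch of a pyspark shell session; the input path is a hypothetical example:

# In the pyspark shell, a SparkContext is already available as sc:
lines = sc.textFile("/data/input/sample.txt")  # hypothetical input path
lines.filter(lambda line: "ERROR" in line).count()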
Scala programs are generally written using the Scala IDE or IntelliJ IDEA and compiled with SBT. Java programs are generally written in Eclipse and compiled with Maven. Python and R programs can be written in any text editor, or in IDEs such as Eclipse. Once Scala and Java programs are written, they are compiled into a JAR and executed with the spark-submit command, as shown later in this section. Since Python and R are interpreted languages, their scripts are executed directly with the spark-submit command. Spark 2.0 is built with Scala 2.11, so Scala 2.11 is needed to build Spark applications in Scala.
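To make the shell-versus-application distinction concrete, here is a minimal sketch of a complete PySpark application; the file name my_app.py and the input path are hypothetical:

# my_app.py -- a complete Spark application, submitted with:
#   spark-submit my_app.py
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("Word Count")
sc = SparkContext(conf=conf)

# Count word occurrences in a text file (hypothetical input path)
counts = (sc.textFile("/data/input/sample.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

print(counts.take(10))
sc.stop()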
The first step in any Spark program is to create a SparkContext, which provides an entry point to the Spark API. Set configuration properties by passing a SparkConf object to SparkContext, as shown in the following Python code:
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("spark://masterhostname:7077")
        .setAppName("My Analytical Application")
        .set("spark.executor.memory", "2g"))
sc = SparkContext(conf=conf)
SparkConf is the primary configuration mechanism in Spark, and an instance is required when creating a new SparkContext. A SparkConf instance contains string key/value pairs for the configuration options that the user wants to use to override the defaults. SparkConf settings can be hardcoded in the application code, passed on the command line, or read from configuration files, as shown in the following code:
from pyspark import SparkConf, SparkContext

# Construct a conf
conf = SparkConf()
conf.set("spark.app.name", "My Spark App")
conf.set("spark.master", "local[4]")
conf.set("spark.ui.port", "36000")  # Override the default port

# Create a SparkContext with this configuration
sc = SparkContext(conf=conf)
The spark-submit script is used to launch Spark applications on a cluster with any cluster resource manager.
spark-submit can set configuration values dynamically and inject them into the environment when the application is launched (that is, when a new SparkConf is constructed). Therefore, user applications can simply construct an 'empty' SparkConf and pass it directly to the SparkContext constructor when launched with spark-submit. The spark-submit tool provides built-in flags for the most common Spark configuration parameters, as well as a generic --conf flag that accepts any Spark configuration value, as shown in the following:
[cloudera@quickstart ~]$ spark-submit --class com.example.loganalytics.MyApp \
  --master yarn \
  --name "Log Analytics Application" \
  --executor-memory 2G \
  --num-executors 50 \
  --conf spark.shuffle.spill=false \
  myApp.jar /data/input /data/output
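On the application side, here is a minimal sketch of the 'empty' SparkConf pattern described above; every setting is expected to arrive from the spark-submit flags rather than from the code:

from pyspark import SparkConf, SparkContext

# No properties are hardcoded here; the master, application name,
# executor memory, and so on are injected by spark-submit at launch time.
conf = SparkConf()
sc = SparkContext(conf=conf)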
In the case of multiple configuration parameters, put all of them in a file and pass the file to the application using --properties-file:
[cloudera@quickstart ~]$ spark-submit --class com.example.MyApp \
  --properties-file my-config-file.conf myApp.jar

## Contents of my-config-file.conf ##
spark.master            spark://5.6.7.8:7077
spark.app.name          "My Spark App"
spark.ui.port           36000
spark.executor.memory   2g
spark.serializer        org.apache.spark.serializer.KryoSerializer
Application dependency JARs included with the --jars option are automatically shipped to the worker nodes. For Python, the equivalent --py-files option can be used to distribute .egg, .zip, and .py libraries to the executors. Note that JARs and files are copied to the working directory of each SparkContext on the executor nodes. It is always better to bundle all code dependencies into the application JAR when creating it, which is easily done with Maven or SBT.
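Dependencies can also be shipped programmatically. Here is a minimal sketch using SparkContext.addPyFile, the in-code counterpart of --py-files; the library path is a hypothetical example:

from pyspark import SparkContext

sc = SparkContext(appName="Dependency Example")
# Distribute a Python library to every executor (hypothetical path);
# modules inside the archive become importable in tasks.
sc.addPyFile("/home/cloudera/libs/mylib.zip")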
To get a complete list of options for spark-submit, use the following command:
[cloudera@quickstart ~]$ spark-submit --help
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Spark configuration precedence, from highest to lowest, is as follows:
1. Properties set with the set() function on a SparkConf object.
2. Flags passed to spark-submit or spark-shell.
3. Values in the spark-defaults.conf properties file.
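For example, a value hardcoded with set() takes precedence over the same property passed on the command line; a minimal sketch, assuming the application is launched with a conflicting --name flag:

from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.set("spark.app.name", "NameFromCode")  # highest precedence
sc = SparkContext(conf=conf)

# Prints "NameFromCode" even if spark-submit was given --name "NameFromFlag"
print(sc.getConf().get("spark.app.name"))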
Some of the important configuration parameters for submitting applications are listed in the following table: