Let's understand the difference between the Spark shell and Spark applications, and how they are created and submitted.
Spark lets you access your datasets through a simple yet specialized Spark shell for Scala, Python, R, and SQL. Users do not need to create a full application to explore data; they can start exploring it with commands that can be converted into programs later, which provides higher developer productivity. A Spark application, by contrast, is a complete program that creates a SparkContext and is submitted with the spark-submit command.
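As an illustration of interactive exploration, here is a minimal sketch of a pyspark shell session; the input path is a hypothetical example:

# In the pyspark shell, a SparkContext is already available as sc:
lines = sc.textFile("/data/input/sample.txt")  # hypothetical input path
lines.filter(lambda line: "ERROR" in line).count()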
Scala programs are generally written using the Scala IDE or IntelliJ IDEA and compiled with SBT. Java programs are generally written in Eclipse and compiled with Maven. Python and R programs can be written in any text editor, or in IDEs such as Eclipse. Once Scala and Java programs are written, they are compiled into a JAR and executed with the spark-submit command, as shown later in this section. Since Python and R are interpreted languages, their scripts are executed directly with the spark-submit command. Spark 2.0 is built with Scala 2.11, so Scala 2.11 is needed to build Spark applications in Scala.
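To make the shell-versus-application distinction concrete, here is a minimal sketch of a complete PySpark application; the file name my_app.py and the input path are hypothetical:

# my_app.py -- a complete Spark application, submitted with:
#   spark-submit my_app.py
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("Word Count")
sc = SparkContext(conf=conf)

# Count word occurrences in a text file (hypothetical input path)
counts = (sc.textFile("/data/input/sample.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

print(counts.take(10))
sc.stop()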
The first step in any Spark program is to create a SparkContext, which provides an entry point to the Spark API. Set configuration properties by passing a SparkConf object to SparkContext, as shown in the following Python code:
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("spark://masterhostname:7077")
        .setAppName("My Analytical Application")
        .set("spark.executor.memory", "2g"))
sc = SparkContext(conf=conf)
SparkConf is the primary configuration mechanism in Spark, and an instance is required when creating a new SparkContext. A SparkConf instance contains string key/value pairs for the configuration options that the user wants to use to override the defaults. SparkConf settings can be hardcoded in the application code, passed on the command line, or read from configuration files, as shown in the following code:
from pyspark import SparkConf, SparkContext

# Construct a conf
conf = SparkConf()
conf.set("spark.app.name", "My Spark App")
conf.set("spark.master", "local[4]")
conf.set("spark.ui.port", "36000")  # Override the default port

# Create a SparkContext with this configuration
sc = SparkContext(conf=conf)
The spark-submit script is used to launch Spark applications on a cluster with any cluster resource manager.
spark-submit can set configuration values dynamically and inject them into the environment when the application is launched (that is, when a new SparkConf is constructed). Therefore, user applications can simply construct an 'empty' SparkConf and pass it directly to the SparkContext constructor when launched with spark-submit. The spark-submit tool provides built-in flags for the most common Spark configuration parameters, as well as a generic --conf flag that accepts any Spark configuration value, as shown in the following:
[cloudera@quickstart ~]$ spark-submit --class com.example.loganalytics.MyApp \
  --master yarn \
  --name "Log Analytics Application" \
  --executor-memory 2G \
  --num-executors 50 \
  --conf spark.shuffle.spill=false \
  myApp.jar /data/input /data/output
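On the application side, here is a minimal sketch of the 'empty' SparkConf pattern described above; every setting is expected to arrive from the spark-submit flags rather than from the code:

from pyspark import SparkConf, SparkContext

# No properties are hardcoded here; the master, application name,
# executor memory, and so on are injected by spark-submit at launch time.
conf = SparkConf()
sc = SparkContext(conf=conf)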
In the case of multiple configuration parameters, put all of them in a file and pass the file to the application using --properties-file:
[cloudera@quickstart ~]$ spark-submit --class com.example.MyApp \
  --properties-file my-config-file.conf myApp.jar

## Contents of my-config-file.conf ##
spark.master            spark://5.6.7.8:7077
spark.app.name          "My Spark App"
spark.ui.port           36000
spark.executor.memory   2g
spark.serializer        org.apache.spark.serializer.KryoSerializer
Application dependency JARs included with the --jars option are automatically shipped to the worker nodes. For Python, the equivalent --py-files option can be used to distribute .egg, .zip, and .py libraries to the executors. Note that JARs and files are copied to the working directory of each SparkContext on the executor nodes. It is always better to bundle all code dependencies into the application JAR when creating it, which is easily done with Maven or SBT.
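Dependencies can also be shipped programmatically. Here is a minimal sketch using SparkContext.addPyFile, the in-code counterpart of --py-files; the library path is a hypothetical example:

from pyspark import SparkContext

sc = SparkContext(appName="Dependency Example")
# Distribute a Python library to every executor (hypothetical path);
# modules inside the archive become importable in tasks.
sc.addPyFile("/home/cloudera/libs/mylib.zip")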
To get a complete list of options for spark-submit, use the following command:
[cloudera@quickstart ~]$ spark-submit --help
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Spark configuration precedence, from highest to lowest, is as follows:
1. Properties set with the set() function on a SparkConf object.
2. Flags passed to spark-submit or spark-shell.
3. Values in the spark-defaults.conf properties file.
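For example, a value hardcoded with set() takes precedence over the same property passed on the command line; a minimal sketch, assuming the application is launched with a conflicting --name flag:

from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.set("spark.app.name", "NameFromCode")  # highest precedence
sc = SparkContext(conf=conf)

# Prints "NameFromCode" even if spark-submit was given --name "NameFromFlag"
print(sc.getConf().get("spark.app.name"))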
Some of the important configuration parameters for submitting applications are listed in the following table: