Lifecycle of a Spark program

The following steps explain the lifecycle of a Spark application with the standalone resource manager:

  1. The user submits a Spark application using the spark-submit command.
  2. spark-submit launches the driver program on the same node (in client mode) or on the cluster (in cluster mode) and invokes the main method specified by the user (a minimal driver sketch follows this list).
  3. The driver program contacts the cluster manager to ask for resources to launch executor JVMs, based on the configuration parameters supplied.
  4. The cluster manager launches executor JVMs on worker nodes.
  5. The driver process scans through the user application and, based on the RDD actions and transformations in the program, creates an operator graph.
  6. When an action (such as collect) is called, the graph is submitted to the DAG scheduler, which divides the operator graph into stages.
  7. A stage comprises tasks based on partitions of the input data. The DAG scheduler pipelines operators together to optimize the graph; for instance, many map operators can be scheduled in a single stage. This optimization is key to Spark's performance. The final result of the DAG scheduler is a set of stages.
  8. The stages are passed on to the task scheduler, which launches tasks via the cluster manager (Spark standalone, YARN, or Mesos). The task scheduler does not know about dependencies among stages.
  9. Tasks are run on executor processes to compute and save results.
  10. When the driver's main method exits, or when it calls SparkContext.stop(), it terminates the executors and releases the resources from the cluster manager.
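
The sketch below shows what such a driver program might look like in PySpark. It is only an illustration of the steps above; the file name, application name, executor memory setting, and input path are assumptions rather than values from this chapter:

# lifecycle_demo.py (hypothetical name): a minimal driver program launched by spark-submit
from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    # Steps 2-3: the driver creates a SparkContext, which contacts the cluster
    # manager for executor resources based on the configuration supplied
    conf = SparkConf().setAppName("LifecycleDemo") \
                      .set("spark.executor.memory", "1g")
    sc = SparkContext(conf=conf)

    # Steps 5-9: transformations build the operator graph; the count action
    # submits the graph to the DAG scheduler and tasks run on the executors
    lines = sc.textFile("file:///tmp/input.txt")
    print(lines.filter(lambda l: l.strip() != "").count())

    # Step 10: stopping the SparkContext terminates the executors and
    # releases the resources held by the cluster manager
    sc.stop()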

Figure 3.8 illustrates the scheduling process of a Spark program:


Figure 3.8: Spark scheduling process

Each task performs the same steps internally, as the sketch after this list illustrates:

  • Fetches its input, either from data storage (for input RDDs), from an existing RDD, or from shuffle outputs
  • Performs the operation or transformation to compute the RDD it represents
  • Writes output to a shuffle, external storage, or back to the driver
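
As a rough illustration of these destinations (the paths below are placeholders, not values from the text), the following snippet writes to a shuffle via reduceByKey, to external storage via saveAsTextFile, and back to the driver via collect:

# the map is pipelined with the read; no output is written yet
pairs = sc.textFile("file:///tmp/keys.txt").map(lambda line: (line, 1))

# reduceByKey writes map output to a shuffle and reads it back in the next stage
counts = pairs.reduceByKey(lambda a, b: a + b)

# write the result to external storage ...
counts.saveAsTextFile("file:///tmp/key_counts")

# ... or bring it back to the driver
print(counts.collect())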

Let's understand the terminology used in Spark before we drill down further into the lifecycle of a Spark program:

  • Application: A program started by spark-shell or spark-submit.
  • Driver program: A driver program runs on the driver and is responsible for creating the SparkContext.
  • Cluster manager: Responsible for allocating and managing resources such as CPU and memory. The cluster resource managers are the standalone manager, Mesos, and YARN.
  • Worker node: Any node that can run application code in the cluster. A worker is a JVM, and with the standalone cluster manager you can have multiple workers running on the same machine.
  • Executor: An executor JVM is responsible for executing tasks sent by the driver JVM. Typically, every application has multiple executors running on multiple workers.
  • DAG: A DAG (Directed Acyclic Graph) represents an acyclic data flow. For every Spark job, a DAG of multiple stages is created and executed, whereas a MapReduce application always has exactly two stages (Map and Reduce).
  • Job: Each action in an application, such as collect, count, or saveAsTextFile, is executed as a job, which consists of multiple stages and multiple tasks.
  • Stage: Each job is performed in a single stage or in multiple stages, depending on the complexity of the operations, for example whether data shuffling is necessary. A stage has multiple tasks.
  • Task: A unit of work that is sent from the driver to the executors. A task is performed on every partition of the RDD; so, if the RDD has 10 partitions, 10 tasks will be performed.
  • Pipelining: In some cases, the physical set of stages does not correspond 1:1 to the logical RDD graph. Pipelining occurs when RDDs can be computed from their parents without data movement. For example, when a user calls map and filter sequentially, they can be collapsed into a single transformation that first maps and then filters each element. Complex RDD graphs, however, are split into multiple stages by the DAG scheduler (see the sketch after this list).
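
A quick way to see pipelining and stage boundaries is an RDD's toDebugString() output, which prints the lineage with indentation changes at shuffle boundaries. The sketch below uses a placeholder input path; the flatMap and filter are pipelined into one stage, while reduceByKey starts a new one:

# flatMap and filter involve no data movement, so they are pipelined into one stage
words = sc.textFile("file:///tmp/input.txt") \
          .flatMap(lambda line: line.split(' ')) \
          .filter(lambda w: w != '')

# reduceByKey requires a shuffle, so the DAG scheduler starts a new stage here
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# print the lineage; the indentation change marks the shuffle boundary
print(counts.toDebugString())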

Spark's event timeline and DAG visualizations are made easy through the Spark UI from version 1.4 onwards. Let's execute the following code to view the DAG visualization of a job and its stages:

from operator import add

# read the file, split it into words, and count each word's occurrences
lines = sc.textFile("file:///home/cloudera/spark-2.0.0-bin-hadoop2.7/README.md")
counts = lines.flatMap(lambda x: x.split(' ')) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(add)

# collect() is the action that triggers the job
output = counts.collect()
for (word, count) in output:
    print("%s: %i" % (word, count))

Figure 3.9 shows the visual DAG of the job and its stages for the preceding word count code. The job is split into two stages because reduceByKey causes a data shuffle.


Figure 3.9: Spark Job's DAG visualization.

Figure 3.10 shows the event timeline for Stage 0, which indicates the time taken for each of the tasks.


Figure 3.10: Spark job event timeline within Stage 0.

Spark execution summary

In a nutshell, Spark execution can be summarized as follows, with a short sketch of the first two points after the list:

  • User code defines a DAG (Directed Acyclic Graph) of RDDs
  • Actions force translation of the DAG to an execution plan
  • Tasks are scheduled and executed on a cluster
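
The laziness behind the first two points can be seen directly in the shell; the numbers below are arbitrary:

# the transformation only records the DAG of RDDs; no job runs yet
squares = sc.parallelize(range(1, 1001)).map(lambda x: x * x)

# the action forces the DAG to be translated into stages and tasks, which
# are then scheduled and executed on the cluster
print(squares.reduce(lambda a, b: a + b))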