The following steps explain the lifecycle of a Spark application with the standalone resource manager, and Figure 3.8 illustrates the scheduling process of a Spark program:

1. The user submits a Spark application with the spark-submit command, which launches the driver program.
2. The driver creates a SparkContext, which asks the cluster manager for resources and launches executor JVMs on the worker nodes.
3. When an action is called, the DAG scheduler splits the RDD graph into stages of tasks, and the task scheduler launches those tasks on the executors.
4. When the driver's main() method exits or it calls SparkContext.stop(), it will terminate the executors and release resources from the cluster manager.
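To make these steps concrete, here is a minimal sketch of such an application (the app name, file name, and master URL are illustrative assumptions, not taken from the text): the driver creates a SparkContext, the action triggers tasks on the executors, and SparkContext.stop() releases the resources.

from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="LifecycleDemo")           # driver asks the cluster manager for executors
    data = sc.parallelize(range(1, 1001), numSlices=4)   # RDD with 4 partitions -> 4 tasks per stage
    total = data.map(lambda x: x * 2).sum()              # the action triggers a job that runs on the executors
    print("Sum of doubled values:", total)
    sc.stop()                                            # terminate executors and release cluster resources

# Submitted to a standalone cluster with, for example:
#   spark-submit --master spark://<master-host>:7077 lifecycle_demo.py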
Let's understand the terminology used in Spark before we drill down further into the lifecycle of a Spark program: each action (such as collect()) triggers a job; the DAG scheduler divides the job into stages; and each stage consists of tasks, one per partition of the data, which run on the executors.

In some cases, the physical set of stages will not be an exact 1:1 correspondence to the logical RDD graph. Pipelining occurs when an RDD can be computed from its parents without data movement. For example, when a user calls map and filter sequentially, the two can be collapsed into a single transformation that first maps and then filters each element. Complex RDD graphs, however, are split into multiple stages by the DAG scheduler at shuffle boundaries.

Each task performs the same steps internally: it fetches its input (from storage, an existing RDD, or shuffle outputs), performs the operations needed to compute the RDD it represents, and writes its output to a shuffle, to external storage, or back to the driver.
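As a quick illustration of pipelining and stage boundaries, the following sketch (illustrative values, meant to be run in the pyspark shell where sc is available) chains narrow transformations that are pipelined into one stage and then adds a reduceByKey that forces a new stage; toDebugString() prints the lineage with the shuffle boundary visible in its indentation:

nums = sc.parallelize(range(100), 4)

# map and filter are narrow transformations: no data movement, so they are
# pipelined into a single stage and applied element by element within each task.
pipelined = nums.map(lambda x: x * 2).filter(lambda x: x % 3 == 0)

# reduceByKey needs a shuffle, so the DAG scheduler starts a new stage here.
shuffled = pipelined.map(lambda x: (x % 5, x)).reduceByKey(lambda a, b: a + b)

# Print the RDD lineage; the indentation marks the stage boundary introduced
# by the shuffle (PySpark may return this string as bytes).
print(shuffled.toDebugString())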
The Spark UI provides event timeline and DAG visualizations from version 1.4 onwards. Let's execute the following code to view the DAG visualization of a job and its stages:
from operator import add

# Read the file, split each line into words, pair each word with 1,
# and sum the counts per word; reduceByKey triggers a shuffle.
lines = sc.textFile("file:///home/cloudera/spark-2.0.0-bin-hadoop2.7/README.md")
counts = lines.flatMap(lambda x: x.split(' ')) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(add)
output = counts.collect()
for (word, count) in output:
    print("%s: %i" % (word, count))
Figure 3.9 shows the visual DAG for the job and its stages for the word count code above. The job is split into two stages because reduceByKey triggers a shuffle of data across partitions.
Figure 3.10 shows the event timeline for Stage 0, which indicates the time taken by each task.
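Note that this driver UI (available on port 4040 by default) only exists while the application is running. Here is a small sketch, assuming a local event-log directory, of enabling event logging so the same visualizations remain available in the Spark history server after the application finishes:

from pyspark import SparkConf, SparkContext

# spark.eventLog.enabled and spark.eventLog.dir are standard Spark settings;
# the directory below is an assumed local path and must exist before starting.
conf = (SparkConf()
        .setAppName("WordCountWithEventLog")
        .set("spark.eventLog.enabled", "true")
        .set("spark.eventLog.dir", "file:///tmp/spark-events"))

sc = SparkContext(conf=conf)
# ... run the word count shown above, then stop the context ...
sc.stop()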