Partitioning and shuffling

We have seen how Apache Spark can handle distributed computing much better than Hadoop. We have also seen the inner workings, mainly the fundamental data structure known as the Resilient Distributed Dataset (RDD). RDDs are immutable collections representing datasets, with built-in reliability and failure recovery. RDDs operate on data not as a single blob; rather, they manage and operate on the data in partitions spread across the cluster. Hence, the concept of data partitioning is critical to the proper functioning of Apache Spark jobs, and can have a big effect on performance as well as on how cluster resources are utilized.

An RDD consists of partitions of data, and all operations are performed on those partitions. Transformations, for example, are functions executed by an executor on the specific partition of data being operated on. However, not every operation can be completed by executors working on their partitions in isolation. Operations such as aggregations (seen in the preceding section) require data to be moved across the cluster in a phase known as shuffling. In this section, we will look deeper into the concepts of partitioning and shuffling.
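As a quick illustration of a shuffle, consider a small word-count-style aggregation (the pair RDD below is a made-up example; the exact ordering of the collected result depends on how the keys are hash-partitioned):

scala> val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
pairs: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[121] at parallelize at <console>:25

scala> pairs.reduceByKey(_ + _).collect()

Here, reduceByKey first combines values locally within each partition, and then shuffles the partial sums across the cluster so that all values for a given key end up in the same partition before the final result is produced.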

Let's start by looking at a simple RDD of integers by executing the following code. SparkContext's parallelize function creates an RDD from a sequence of integers. Then, using the getNumPartitions function, we can get the number of partitions in this RDD.

scala> val rdd_one = sc.parallelize(Seq(1,2,3))
rdd_one: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[120] at parallelize at <console>:25

scala> rdd_one.getNumPartitions
res202: Int = 8
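The default partition count comes from the spark.default.parallelism setting, which in a local shell typically equals the number of available cores. You can also set the number of partitions explicitly by passing a second argument to parallelize; a sketch of a similar shell session (the ParallelCollectionRDD identifier and res number will differ in your session):

scala> val rdd_two = sc.parallelize(Seq(1, 2, 3), 6)
rdd_two: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[121] at parallelize at <console>:25

scala> rdd_two.getNumPartitions
res203: Int = 6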

The RDD can be visualized as shown in the following diagram, which shows the 8 partitions in the RDD:
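To see how the elements are actually laid out across those partitions, you can use the glom function, which turns each partition into an array. With only three elements spread across eight partitions, most of the resulting arrays will be empty:

scala> rdd_one.glom().collect()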

The number of partitions is important because it directly determines the number of tasks that will run the RDD's transformations. If the number of partitions is too small, then we will use only a few CPUs/cores over a lot of data, resulting in slower performance and an underutilized cluster. On the other hand, if the number of partitions is too large, then you will use more resources than you actually need, and in a multi-tenant environment this could starve other jobs, run by you or others on your team, of resources.
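When the partition count of an existing RDD turns out to be a poor fit, it can be changed with repartition and coalesce. A sketch in a similar shell session (note that repartition always performs a full shuffle, while coalesce can reduce the partition count without one):

scala> rdd_one.repartition(16).getNumPartitions
res204: Int = 16

scala> rdd_one.coalesce(2).getNumPartitions
res205: Int = 2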
