Understanding resilient distributed dataset - RDD

Though RDD is getting replaced with DataFrame/DataSet-based APIs, there are still a lot of APIs that have not been migrated yet. In this recipe, we will look at how the concept of lineage works in RDD.

Externally, RDD is a distributed, immutable collection of objects. Internally, it consists of the following five parts:

  • Set of partitions (rdd.getPartitions)
  • List of dependencies on parent RDDs (rdd.dependencies)
  • Function to compute a partition, given its parents
  • Partitioner, which is optional (rdd.partitioner)
  • Preferred location of each partition, which is optional (rdd.preferredLocations)

The first three are needed for an RDD to be recomputed in case data is lost. When combined, it is called lineage. The last two parts are optimizations.

A set of partitions is how data is divided into nodes. In the case of HDFS, it means InputSplits, which are mostly the same as the block (except when a record crosses block boundaries; in that case, it will be slightly bigger than a block).

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset