Understanding resilient distributed dataset

Though RDDs are being replaced by DataFrame/Dataset-based APIs, many APIs have not been migrated yet. In this recipe, we will look at how the concept of lineage works in RDDs.

Externally, an RDD is a distributed, immutable collection of objects. Internally, it consists of the following five parts:

Set of partitions (rdd.partitions, backed by the protected getPartitions method)
List of dependencies on parent RDDs (rdd.dependencies)
Function to compute a partition, given its parents
Partitioner, which is optional (rdd.partitioner)
Preferred location of each partition, which is optional (rdd.preferredLocations)

The first three are needed to recompute an RDD if data is lost. Together, they are called the lineage. The last two parts are optimizations.
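To make the role of the first three parts concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's implementation: the names ToyRDD, parallelize, and materialized are hypothetical, and keeping computed partitions in a dictionary is an assumption made only so that data loss can be simulated. It shows how a lost partition can be rebuilt from the partition index, the parent dependency, and the compute function alone.

```python
from typing import Callable, Dict, List


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; not Spark's implementation)."""

    def __init__(self, num_partitions: int, dependencies: List["ToyRDD"],
                 compute: Callable[[int], list]) -> None:
        self.num_partitions = num_partitions  # part 1: the set of partitions
        self.dependencies = dependencies      # part 2: parent RDDs
        self._compute = compute               # part 3: computes one partition
        # Assumption for illustration: keep computed partitions in a dict so
        # that "losing" one can be simulated by deleting its entry.
        self.materialized: Dict[int, list] = {}

    def map(self, fn: Callable) -> "ToyRDD":
        parent = self
        # The child records its parent and how to derive each partition.
        return ToyRDD(parent.num_partitions, [parent],
                      lambda i: [fn(x) for x in parent.partition(i)])

    def partition(self, index: int) -> list:
        # Recompute from lineage whenever the partition is missing.
        if index not in self.materialized:
            self.materialized[index] = self._compute(index)
        return self.materialized[index]

    def collect(self) -> list:
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


def parallelize(data: list, num_partitions: int) -> ToyRDD:
    size = -(-len(data) // num_partitions)  # ceiling division
    chunks = [data[i * size:(i + 1) * size] for i in range(num_partitions)]
    return ToyRDD(num_partitions, [], lambda i: list(chunks[i]))


base = parallelize([1, 2, 3, 4, 5, 6], 3)
doubled = base.map(lambda x: x * 2)
first_result = doubled.collect()

del doubled.materialized[1]          # simulate losing partition 1
recomputed = doubled.partition(1)    # rebuilt from the parent via lineage
```

Only the missing partition is recomputed; the other partitions of doubled are left as they were. The partitioner and preferred locations play no part in this recovery, which is why they count as optimizations.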

A set of partitions describes how the data is divided across nodes. In the case of HDFS, partitions correspond to InputSplits, which are mostly the same as blocks (except when a record crosses a block boundary; in that case, the split is slightly bigger than a block).
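The boundary case can be illustrated with a short sketch in plain Python. It models a convention commonly used by line-oriented readers and is an assumption for illustration, not Hadoop's actual code: a split that does not start at byte 0 skips its first line, and every split finishes the last record it started even if that record runs past the split's end. The function read_split and the tiny 9-byte block size are hypothetical.

```python
from typing import List


def read_split(data: bytes, start: int, end: int) -> List[bytes]:
    """Return the newline-delimited records owned by the byte range [start, end)."""
    pos = start
    if start != 0:
        # The first (possibly partial) line belongs to the previous split.
        newline = data.find(b"\n", start)
        pos = len(data) if newline == -1 else newline + 1
    records = []
    # Read every record that starts at or before `end`, reading past the
    # boundary if needed to complete the last one.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        records.append(data[pos:stop])
        pos = stop + 1
    return records


data = b"alpha\nbravo\ncharlie\ndelta\n"
block_size = 9  # unrealistically small, so that records cross block boundaries

splits = [
    read_split(data, start, min(start + block_size, len(data)))
    for start in range(0, len(data), block_size)
]
```

Here the record bravo starts inside the first block but ends in the second, so the first split reads a little beyond its block, and each record is still read exactly once.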
