130 | Big Data Simplified
S. No. Pair RDD Transformations and Meaning
Example:
val keysWithValuesList = Array("foo=A", "foo=A", "foo=A", "foo=A", "foo=B", "bar=C",
"bar=D", "bar=D")
val data = sc.parallelize(keysWithValuesList)
// Create key-value pairs
val kv = data.map(_.split("=")).map(v => (v(0), v(1))).cache()
val initialCount = 0
val addToCounts = (n: Int, v: String) => n + 1
val sumPartitionCounts = (p1: Int, p2: Int) => p1 + p2
val countByKey = kv.aggregateByKey(initialCount)(addToCounts, sumPartitionCounts)
4. sortByKey([ascending], [numTasks])
When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset
of (K, V) pairs sorted by keys in ascending or descending order, as specified in the
Boolean ascending argument.
Example:
val data = spark.sparkContext.parallelize(Seq(("maths",52), ("english",75), ("science",82),
("computer",65), ("maths",85)))
val sorted = data.sortByKey()
sorted.foreach(println)
5. join(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), it returns a dataset of (K, (V,W))
pairs with all pairs of elements for each key. Outer joins are supported through
leftOuterJoin, rightOuterJoin and fullOuterJoin.
Example:
val data = spark.sparkContext.parallelize(Array(('A',1),('b',2),('c',3)))
val data2 = spark.sparkContext.parallelize(Array(('A',4),('A',6),('b',7),('c',3),('c',8)))
val result = data.join(data2)
println(result.collect().mkString(","))
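The outer-join variants mentioned above can be sketched with the same data and data2 pair RDDs; per the standard PairRDDFunctions signatures, missing values on the optional side come back as None:

val leftJoined = data.leftOuterJoin(data2)
// keeps every key from data; right-side values are wrapped in Option
// element type: (Char, (Int, Option[Int]))
val fullJoined = data.fullOuterJoin(data2)
// keeps keys from both datasets; both sides are wrapped in Option
// element type: (Char, (Option[Int], Option[Int]))
println(leftJoined.collect().mkString(","))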
Difference between groupByKey() and reduceByKey()—groupByKey():
On applying groupByKey() to a dataset of (K, V) pairs, the data is shuffled according to the
key K into another RDD (Ref. Figure 6.8). In this transformation, a huge amount of
unnecessary data is transferred over the network.
Spark provides the provision to save data to disk when more data is shuffled onto a
single executor machine than can fit in memory.
val words = Array("one", "two", "two", "three", "three", "three")
val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))
val wordCountsWithGroup = wordPairsRDD.groupByKey().map(t =>
(t._1, t._2.sum)).collect()
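For contrast, the same word count written with reduceByKey() (a sketch reusing the words array above) combines values per key within each partition before the shuffle, avoiding most of the network transfer just described:

// reduceByKey performs a map-side combine, so only one partial count
// per key leaves each partition during the shuffle
val wordCountsWithReduce = sc.parallelize(words)
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .collect()

Both approaches produce the same counts, but reduceByKey() shuffles far less data, which is why it is generally preferred for aggregations.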