Parallelizing a collection

A collection can be parallelized by calling parallelize() on it inside the driver program. When the driver parallelizes a collection, it splits the collection into partitions and distributes those partitions across the cluster.

The following example creates an RDD from a sequence of numbers using the SparkContext and the parallelize() function. The parallelize() function essentially splits the sequence of numbers into a distributed collection, otherwise known as an RDD.

scala> val rdd_one = sc.parallelize(Seq(1,2,3))
rdd_one: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd_one.take(10)
res0: Array[Int] = Array(1, 2, 3)
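
The number of partitions can also be set explicitly by passing a second argument (numSlices) to parallelize(), and inspected with getNumPartitions. The following sketch assumes the same spark-shell session, where sc is already bound; the exact RDD id and res numbers shown depend on your session history:

scala> val rdd_two = sc.parallelize(Seq(1, 2, 3, 4, 5, 6), 4)
rdd_two: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24

scala> rdd_two.getNumPartitions
res1: Int = 4

If numSlices is omitted, Spark falls back to a default partition count derived from the cluster configuration (spark.default.parallelism).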