How to do it...

Specify the number of partitions when loading a file into an RDD with the following steps:

  1. Start the Spark shell:
        $ spark-shell
  2. Load the RDD with a custom number of partitions passed as the second parameter:
        scala> sc.textFile("hdfs://localhost:9000/user/hduser/words",10)
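You can verify that the file was split into the requested number of partitions by inspecting the RDD's partitions array. The following is a minimal sketch; words is an illustrative name for the RDD loaded above:
        scala> val words = sc.textFile("hdfs://localhost:9000/user/hduser/words",10)
        scala> words.partitions.length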

Another approach is to change the default parallelism by performing the following steps:

  1. Start the Spark shell with the new value of default parallelism:
        $ spark-shell --conf spark.default.parallelism=10
As a rule of thumb, set the number of partitions to two to three times the number of cores to maximize parallelism.
  2. Check the default value of parallelism:
        scala> sc.defaultParallelism
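With spark.default.parallelism set to 10, any RDD created without an explicit partition count picks up this default. The following minimal sketch (nums is an illustrative name) should report 10 partitions:
        scala> val nums = sc.parallelize(1 to 100)
        scala> nums.partitions.length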
You can also reduce the number of partitions using the RDD method coalesce(numPartitions), where numPartitions is the final number of partitions you would like; coalesce merges existing partitions and avoids a full shuffle. If you want the data to be fully reshuffled over the network, call the RDD method repartition(numPartitions) instead, which always performs a shuffle.
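The following is a minimal sketch of both calls, reusing the illustrative words RDD from the first set of steps (fewer and more are illustrative names):
        scala> val fewer = words.coalesce(2)      // merge down to 2 partitions without a full shuffle
        scala> val more = words.repartition(20)   // full shuffle into 20 partitions
        scala> fewer.partitions.length
        scala> more.partitions.length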