How to do it...

Specify the number of partitions when loading a file into an RDD with the following steps:

  1. Start the Spark shell:
        $ spark-shell
  2. Load the RDD with a custom number of partitions passed as the second parameter:
        scala> sc.textFile("hdfs://localhost:9000/user/hduser/words",10)
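You can verify that the file was split into the requested number of partitions by inspecting the RDD's partitions array. The following is a minimal sketch; words is an illustrative name for the RDD loaded above:
        scala> val words = sc.textFile("hdfs://localhost:9000/user/hduser/words",10)
        scala> words.partitions.length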

Another approach is to change the default parallelism by performing the following steps:

  1. Start the Spark shell with the new value of default parallelism:
        $ spark-shell --conf spark.default.parallelism=10
As a rule of thumb, set the number of partitions to two to three times the number of cores to maximize parallelism.
  2. Check the default value of parallelism:
        scala> sc.defaultParallelism
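With spark.default.parallelism set to 10, any RDD created without an explicit partition count picks up this default. The following minimal sketch (nums is an illustrative name) should report 10 partitions:
        scala> val nums = sc.parallelize(1 to 100)
        scala> nums.partitions.length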
You can also reduce the number of partitions using the RDD method coalesce(numPartitions), where numPartitions is the final number of partitions you would like; coalesce merges existing partitions and avoids a full shuffle. If you want the data to be fully reshuffled over the network, call the RDD method repartition(numPartitions) instead, which always performs a shuffle.
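The following is a minimal sketch of both calls, reusing the illustrative words RDD from the first set of steps (fewer and more are illustrative names):
        scala> val fewer = words.coalesce(2)      // merge down to 2 partitions without a full shuffle
        scala> val more = words.repartition(20)   // full shuffle into 20 partitions
        scala> fewer.partitions.length
        scala> more.partitions.length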