Let's do the classic word count, which counts the number of occurrences of each word. In this recipe, we will load the data from HDFS:
- Create the words directory by using the following command:
$ mkdir words
- Create the sh.txt text file and enter "to be or not to be" in it:
$ echo "to be or not to be" > words/sh.txt
- Upload the words directory to HDFS (a relative path lands in your HDFS home directory, such as /user/hduser):
$ hdfs dfs -put words
- Start the Spark shell:
$ spark-shell
- Load the words directory as a dataset:
scala> val words = spark.read.textFile("hdfs://localhost:9000/user/hduser/words")
The textFile method loads the data as a Dataset of lines. By default, Spark creates one partition for each InputSplit class, which roughly corresponds to one block. If you want more partitions, repartition the resulting Dataset, for example spark.read.textFile(...).repartition(10). A higher partition count works really well for compute-intensive jobs, such as in machine learning. Note that when reading, Spark will not create fewer partitions than there are input splits, because one input partition cannot span multiple splits; to reduce the partition count after loading, use coalesce.
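To see how many partitions a Dataset actually has, you can inspect its underlying RDD. A minimal sketch for spark-shell, reusing the path from this recipe (the counts shown are what you would typically see for a single small file):

```scala
// In spark-shell the `spark` session is already available.
// Load the words directory and check the default partition count.
val words = spark.read.textFile("hdfs://localhost:9000/user/hduser/words")
println(words.rdd.getNumPartitions)   // typically 1 for one small file

// Ask for more partitions by repartitioning the resulting Dataset.
val words10 = words.repartition(10)
println(words10.rdd.getNumPartitions) // 10
```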
- Count the number of lines:
scala> words.count
- Split each line into words, treating any run of non-word characters as a separator:
scala> val wordsFlatMap = words.flatMap(_.split("\\W+"))
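The intended pattern here is \W+, which matches one or more non-word characters, so it splits on spaces and punctuation alike; in a Scala string literal the backslash must be escaped, hence "\\W+". A quick check in plain Scala (no Spark needed) on the sample line:

```scala
// "\\W+" in a string literal is the regex \W+ (one or more
// non-word characters), used here as the split delimiter.
val tokens = "to be or not to be".split("\\W+")
println(tokens.mkString(","))  // to,be,or,not,to,be
```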
- Convert each word to the pair (word,1), that is, output 1 as the value for each occurrence of a word as the key:
scala> val wordsMap = wordsFlatMap.map( w => (w,1))
- Count the occurrences:
scala> val wordCount = wordsMap.groupByKey(_._1).count.toDF("word","count")
- Print the dataset:
scala> wordCount.show
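Putting it together, the whole pipeline can be pasted into spark-shell as one block. For the sample line "to be or not to be", the resulting counts are to: 2, be: 2, or: 1, not: 1 (the row order in show is not guaranteed):

```scala
// End-to-end word count over the HDFS directory used in this recipe.
val words = spark.read.textFile("hdfs://localhost:9000/user/hduser/words")
val wordCount = words
  .flatMap(_.split("\\W+"))  // split each line into words
  .map(w => (w, 1))          // word -> (word, 1)
  .groupByKey(_._1)          // group the pairs by the word itself
  .count()                   // occurrences per key
  .toDF("word", "count")     // name the columns for display
wordCount.show()
```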