Let's do the classic word count, which counts the number of occurrences of each word. In this recipe, we will load the data from HDFS:
- Create the words directory by using the following command:
$ mkdir words
- Create the sh.txt text file and enter "to be or not to be" in it:
$ echo "to be or not to be" > words/sh.txt
- Upload the words directory to HDFS (a relative path lands in your HDFS home directory, such as /user/hduser):
$ hdfs dfs -put words
- Start the Spark shell:
$ spark-shell
- Load the words directory as a dataset:
scala> val words = spark.read.textFile("hdfs://localhost:9000/user/hduser/words")
The textFile method loads the data as a Dataset of lines. By default, Spark creates one partition for each InputSplit class, which roughly corresponds to one block. If you want more partitions, repartition the resulting Dataset, for example spark.read.textFile(...).repartition(10). A higher partition count works really well for compute-intensive jobs, such as in machine learning. Note that when reading, Spark will not create fewer partitions than there are input splits, because one input partition cannot span multiple splits; to reduce the partition count after loading, use coalesce.
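To see how many partitions a Dataset actually has, you can inspect its underlying RDD. A minimal sketch for spark-shell, reusing the path from this recipe (the counts shown are what you would typically see for a single small file):

```scala
// In spark-shell the `spark` session is already available.
// Load the words directory and check the default partition count.
val words = spark.read.textFile("hdfs://localhost:9000/user/hduser/words")
println(words.rdd.getNumPartitions)   // typically 1 for one small file

// Ask for more partitions by repartitioning the resulting Dataset.
val words10 = words.repartition(10)
println(words10.rdd.getNumPartitions) // 10
```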
- Count the number of lines:
scala> words.count
- Split each line into words, treating any run of non-word characters as a separator:
scala> val wordsFlatMap = words.flatMap(_.split("\\W+"))
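The intended pattern here is \W+, which matches one or more non-word characters, so it splits on spaces and punctuation alike; in a Scala string literal the backslash must be escaped, hence "\\W+". A quick check in plain Scala (no Spark needed) on the sample line:

```scala
// "\\W+" in a string literal is the regex \W+ (one or more
// non-word characters), used here as the split delimiter.
val tokens = "to be or not to be".split("\\W+")
println(tokens.mkString(","))  // to,be,or,not,to,be
```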
- Convert each word to the pair (word,1), that is, output 1 as the value for each occurrence of a word as the key:
scala> val wordsMap = wordsFlatMap.map( w => (w,1))
- Count the occurrences:
scala> val wordCount = wordsMap.groupByKey(_._1).count.toDF("word","count")
- Print the dataset:
scala> wordCount.show
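Putting it together, the whole pipeline can be pasted into spark-shell as one block. For the sample line "to be or not to be", the resulting counts are to: 2, be: 2, or: 1, not: 1 (the row order in show is not guaranteed):

```scala
// End-to-end word count over the HDFS directory used in this recipe.
val words = spark.read.textFile("hdfs://localhost:9000/user/hduser/words")
val wordCount = words
  .flatMap(_.split("\\W+"))  // split each line into words
  .map(w => (w, 1))          // word -> (word, 1)
  .groupByKey(_._1)          // group the pairs by the word itself
  .count()                   // occurrences per key
  .toDF("word", "count")     // name the columns for display
wordCount.show()
```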