How to do it...

Let's build a word count, which counts the number of occurrences of each word. In this recipe, we will load the data from HDFS:

  1. Create the words directory by using the following command:
$ mkdir words
  2. Create the sh.txt text file and enter "to be or not to be" in it:
$ echo "to be or not to be" > words/sh.txt
  3. Upload the words directory to HDFS:
$ hdfs dfs -put words
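This copies the words directory into your HDFS home directory (/user/hduser in this setup, which matches the path used in step 5). If you want to confirm the upload, a quick check is:
$ hdfs dfs -ls words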
  4. Start the Spark shell:
$ spark-shell
  5. Load the words directory as a dataset:
scala> val words = spark.read.textFile("hdfs://localhost:9000/user/hduser/words")
The spark.read.textFile method does not take a partition count directly, but you can repartition the resulting dataset, for example, spark.read.textFile(...).repartition(10). By default, Spark creates one partition for each InputSplit class, which roughly corresponds to one block. You can ask for a higher number of partitions; this works really well for compute-intensive jobs, such as in machine learning. As one partition cannot contain more than one block, having fewer partitions than blocks is not allowed.
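As a minimal sketch (using the same HDFS path as above and an assumed target of 10 partitions), you can raise the partition count and then inspect it as follows:
scala> val words10 = spark.read.textFile("hdfs://localhost:9000/user/hduser/words").repartition(10)
scala> words10.rdd.getNumPartitions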
  6. Count the number of lines:
scala> words.count
  7. Divide the line (or lines) into multiple words:
scala> val wordsFlatMap = words.flatMap(_.split("\\W+"))
  8. Convert each word to (word, 1), that is, output 1 as the value for each occurrence of a word as the key:
scala> val wordsMap = wordsFlatMap.map(w => (w, 1))
  9. Count the occurrences:
scala> val wordCount = wordsMap.groupByKey(_._1).count.toDF("word", "count")
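As a side note, an equivalent sketch using the untyped DataFrame API is shown below; it relies on value, the default column name that textFile assigns to a Dataset[String]:
scala> val wordCountDF = wordsFlatMap.groupBy("value").count.withColumnRenamed("value", "word")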
  10. Print the dataset:
scala> wordCount.show
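For the input "to be or not to be", the output should contain the following rows (the row order may vary from run to run):
+----+-----+
|word|count|
+----+-----+
| not|    1|
|  be|    2|
|  or|    1|
|  to|    2|
+----+-----+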