We are now ready to check that everything is working. The obligatory word count will be put to the test on the first chapter of this book.
The code we will be running is listed here:
# Word count on 1st Chapter of the Book using PySpark

# import regex module
import re
# import add from operator module
from operator import add

# read input file
file_in = sc.textFile('/home/an/Documents/A00_Documents/Spark4Py 20150315')

# count lines
print('number of lines in file: %s' % file_in.count())

# add up lengths of each line
chars = file_in.map(lambda s: len(s)).reduce(add)
print('number of characters in file: %s' % chars)

# Get words from the input file
words = file_in.flatMap(lambda line: re.split('\W+', line.lower().strip()))

# words of more than 3 characters
words = words.filter(lambda x: len(x) > 3)

# set count 1 per word
words = words.map(lambda w: (w, 1))

# reduce phase - sum count for all the words
words = words.reduceByKey(add)
In this program, we first read the file from the directory /home/an/Documents/A00_Documents/Spark4Py 20150315 into file_in.
We then introspect the file by counting the number of lines and the total number of characters (summing the length of each line).
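The two counts can be mimicked in plain Python, without Spark, on an illustrative list of lines standing in for the RDD contents:

```python
from operator import add
from functools import reduce

# Illustrative lines standing in for the contents of file_in (made up for illustration)
lines = ["An Anaconda of a book", "Spark and Data"]

# Equivalent of file_in.count(): the number of lines
num_lines = len(lines)

# Equivalent of file_in.map(lambda s: len(s)).reduce(add): total characters
num_chars = reduce(add, map(len, lines))

print(num_lines, num_chars)   # 2 35
```

The reduce(add, ...) call plays the role of Spark's reduce(add) on the mapped line lengths, summing them into a single total.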
We split the input file into words and convert them to lowercase. For our word count purpose, we choose words longer than three characters in order to prevent shorter, much more frequent words such as the, and, and for from skewing the count in their favor. Generally, they are considered stop words and should be filtered out in any language processing task.
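The split-and-filter step can be sketched in plain Python on a single sample line (the sample text is made up for illustration):

```python
import re

# A sample line standing in for one line of the input file
line = "The Spark framework, and the data it processes, for example"

# Split on runs of non-word characters after lowercasing, as in the PySpark job
tokens = re.split(r'\W+', line.lower().strip())

# Keep only words longer than three characters; short stop words such as
# 'the', 'and', and 'for' are dropped by the length filter
long_words = [w for w in tokens if len(w) > 3]
print(long_words)   # ['spark', 'framework', 'data', 'processes', 'example']
```

Note that a length filter is only a rough proxy for a true stop-word list: it also drops short content words and keeps long stop words such as "therefore".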
At this stage, we are getting ready for the MapReduce steps: we map a value of 1 to each word, then reduce by summing the counts for each unique word.
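What reduceByKey(add) computes can be mimicked in plain Python on an illustrative word list:

```python
from operator import add

# Illustrative list of filtered words (made up for illustration)
words = ['spark', 'data', 'spark', 'anaconda', 'data', 'spark']

# Map phase: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# Reduce phase: sum the counts per key, as reduceByKey(add) does
counts = {}
for w, n in pairs:
    counts[w] = add(counts.get(w, 0), n)

print(counts)   # {'spark': 3, 'data': 2, 'anaconda': 1}
```

In Spark, the same per-key summation happens in parallel across partitions, with partial sums combined on each node before being merged.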
Here are illustrations of the code in the IPython Notebook. The first 10 cells preprocess the word count on the dataset, which is retrieved from the local file directory.
Swap the word count tuples into the format (count, word) in order to sort by count, which is now the primary key of the tuple:
# create tuple (count, word) and sort in descending order
words = words.map(lambda x: (x[1], x[0])).sortByKey(False)

# take top 20 words by frequency
words.take(20)
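The same swap-and-sort can be sketched in plain Python: swapping each (word, count) tuple to (count, word) makes the count the sort key, mirroring sortByKey(False):

```python
# Illustrative (word, count) pairs (made up for illustration)
word_counts = [('spark', 3), ('data', 2), ('anaconda', 1)]

# Swap to (count, word) so that the count becomes the primary sort key
swapped = [(count, word) for word, count in word_counts]

# Sort in descending order of count, as sortByKey(False) does
top = sorted(swapped, reverse=True)
print(top)   # [(3, 'spark'), (2, 'data'), (1, 'anaconda')]
```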
In order to display our result, we create the tuple (count, word) and display the top 20 most frequently used words in descending order.
Let's create a histogram function:
# create function for histogram of most frequent words
%matplotlib inline
import matplotlib.pyplot as plt

def histogram(words):
    # extract counts and words into separate lists for plotting
    count = [x[1] for x in words]
    word = [x[0] for x in words]
    plt.barh(range(len(count)), count, color='grey')
    plt.yticks(range(len(count)), word)

# Change order of tuple from (count, word) to (word, count)
words = words.map(lambda x: (x[1], x[0]))
words.take(25)

# display histogram
histogram(words.take(25))
Here, we visualize the most frequent words by plotting them in a bar chart. We first have to swap the tuples from the original (count, word) back to (word, count).
So here you have it: the most frequent words used in the first chapter are Spark, followed by Data and Anaconda.