Building our first app with PySpark

We are now ready to check that everything is working fine. The obligatory word count will be put to the test by processing the first chapter of this book.

The code we will be running is listed here:

# Word count on 1st Chapter of the Book using PySpark

# import regex module
import re
# import add from operator module
from operator import add


# read input file
file_in = sc.textFile('/home/an/Documents/A00_Documents/Spark4Py 20150315')

# count lines
print('number of lines in file: %s' % file_in.count())

# add up lengths of each line
chars = file_in.map(lambda s: len(s)).reduce(add)
print('number of characters in file: %s' % chars)

# Get words from the input file
words = file_in.flatMap(lambda line: re.split(r'\W+', line.lower().strip()))
# words of more than 3 characters
words = words.filter(lambda x: len(x) > 3)
# set count 1 per word
words = words.map(lambda w: (w,1))
# reduce phase - sum count all the words
words = words.reduceByKey(add)

In this program, we are first reading the file from the directory /home/an/Documents/A00_Documents/Spark4Py 20150315 into file_in.

We are then introspecting the file by counting the number of lines and the total number of characters, obtained by adding up the lengths of all the lines.

We are splitting the input file into words and converting them to lower case. For our word count purpose, we are keeping only words longer than three characters in order to prevent shorter, much more frequent words such as the, and, and for from skewing the count in their favor. Generally, these are considered stop words and should be filtered out in any language processing task.
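As a hypothetical refinement that is not part of the program above, the length-based filter could be replaced by an explicit stop-word set applied to the RDD of words produced by flatMap; the stop_words set here is an assumed, hand-picked example:

# hypothetical variant: filter against an explicit stop-word set
# instead of the length > 3 rule (stop_words is an assumed, hand-picked set)
stop_words = {'the', 'and', 'for', 'with', 'that', 'this'}
# also drop the empty strings that re.split can produce
words_no_stop = words.filter(lambda w: w and w not in stop_words)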

At this stage, we are getting ready for the MapReduce steps. We map each word to a value of 1 and then reduce by key, summing the counts of every unique word.
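To make these map and reduce steps concrete, here is a minimal, self-contained sketch that is not part of the book's program; it runs the same pipeline on a small in-memory sample created with sc.parallelize (the sample sentences are made up for illustration):

# minimal sketch of the map/reduce steps on a toy in-memory sample
import re
from operator import add

sample = sc.parallelize(['spark is fast', 'spark runs on anaconda'])
pairs = (sample
         .flatMap(lambda line: re.split(r'\W+', line.lower().strip()))
         .filter(lambda x: len(x) > 3)   # drop short words
         .map(lambda w: (w, 1))          # map each word to a count of 1
         .reduceByKey(add))              # sum the counts of each unique word
print(pairs.collect())
# [('spark', 2), ('fast', 1), ('runs', 1), ('anaconda', 1)] (order may vary)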

Here are illustrations of the code in the IPython Notebook. The first 10 cells are preprocessing the word count on the dataset, which is retrieved from the local file directory.


Swap the word count tuples into the format (count, word) in order to sort by count, which now becomes the key of the tuple:

# create tuple (count, word) and sort in descending
words = words.map(lambda x: (x[1], x[0])).sortByKey(False)

# take top 20 words by frequency
words.take(20)

In order to display our result, we are creating the tuple (count, word) and displaying the top 20 most frequently used words in descending order of count.
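As a side note, and not how the book proceeds, the same top-20 list could be obtained in a single step with takeOrdered. In this sketch, word_counts is a stand-in name for the (word, count) pairs produced by reduceByKey, before the swap above:

# hypothetical alternative: take the 20 pairs with the highest counts
# directly from the (word, count) RDD, ordering by descending count
top20 = word_counts.takeOrdered(20, key=lambda x: -x[1])
print(top20)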


Let's create a histogram function:

# create function for histogram of most frequent words

%matplotlib inline
import matplotlib.pyplot as plt

def histogram(words):
    # materialize counts and words as lists so they can be indexed and measured
    count = list(map(lambda x: x[1], words))
    word = list(map(lambda x: x[0], words))
    plt.barh(range(len(count)), count, color='grey')
    plt.yticks(range(len(count)), word)

# Change order of tuple (word, count) from (count, word) 
words = words.map(lambda x:(x[1], x[0]))
words.take(25)

# display histogram
histogram(words.take(25))

Here, we visualize the most frequent words by plotting them in a bar chart. To do so, we have to first swap the tuple back from (count, word) to (word, count).


So here you have it: the most frequent words used in the first chapter are Spark, followed by Data and Anaconda.
