12.2 Database Programming in Python
12.3 Functional Language Approach
Solutions to Practice Problems
IN THIS CHAPTER, we introduce several approaches to handle the vast amounts of data that are created, stored, accessed, and processed in today's computing applications.
We start by introducing relational databases and the language used to access them, SQL. Unlike many of the programs we have developed so far in this book, real-world application programs usually make heavy use of databases to store and access data. This is because databases store data in a way that enables easy and efficient access to the data. For this reason, it is important to develop an early appreciation of the benefits of databases and how to make effective use of them.
The amount of data generated by web crawlers, scientific experiments, or the stock markets is so vast that no single computer can process it effectively. Instead, a joint effort by multiple compute nodes—whether computers, processors, or cores—is necessary. We introduce an approach to develop parallel programs that make effective use of the multiple cores of a modern microprocessor. We then use this to develop the MapReduce framework, an approach for processing data developed by Google that can scale from a few cores on a personal computer to hundreds of thousands of cores in a server farm.
Data that is processed by a program exists only while the program executes. In order for data to persist beyond the execution of the program—so it can be processed later by some other program, for example—the data must be stored in a file.
So far in this book, we have been using standard text files to store data persistently. The advantage of text files is that they are general purpose and easy to work with. Their disadvantage is that they have no structure; they have, in particular, no structure that permits data to be efficiently accessed and processed.
In this section, we introduce a special type of file, called a database file or simply a database, that stores data in a structured way. The structure makes the data in a database file amenable to efficient processing, including efficient insertion, update, deletion, and, especially, access. A database is a far more appropriate data storage approach than a general text file in many applications, and it is important to know how to work with databases.
In Chapter 11, we developed a web crawler—a program that visits web page after web page by following hyperlinks. The crawler scans the content of each visited web page and outputs information about it, including all the hyperlink URLs contained in the web page and the frequency of every word in the page. If we ran the crawler on the set of linked web pages shown in Figure 12.1, with each page containing names of some world cities with indicated frequencies, the hyperlink URLs would be output in this format:
URL       Link
one.html  two.html
one.html  three.html
two.html  four.html
…
The first two lines, for example, indicate that page one.html contains links to pages two.html and three.html.
The crawler would output the frequency of every word in every web page in this format:
URL       Word     Freq
one.html  Beijing  3
one.html  Paris    5
one.html  Chicago  5
two.html  Bogota   3
…
So page one.html contains three occurrences of ‘Beijing’, five of ‘Paris’, and five of ‘Chicago’.
Suppose we are interested in analyzing the data set collected by the crawler. We might, for example, be interested in making queries such as:
Answering each of these questions on the data set produced by the crawler would be quite cumbersome. The text file format of the data set would require the file to be read into a string, and then ad hoc string operations would have to be used to retrieve the relevant data. For example, to answer question 1., we would have to find all the lines in the file containing word X, split each line into words (i.e., strings separated by blanks), collect the first word in every line, and then eliminate duplicate URLs.
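The string-processing steps just described can be sketched in Python. The helper name and the one-record-per-line "URL Word Freq" format are assumptions for illustration, matching the crawler output shown earlier:

```python
# Hypothetical helper illustrating the ad hoc text-file approach;
# assumes the crawler's "URL Word Freq" one-record-per-line format
def pages_containing(lines, x):
    'return the URLs of pages containing word x, without duplicates'
    urls = []
    for line in lines:
        fields = line.split()             # split the line into words
        if len(fields) == 3 and fields[1] == x:
            if fields[0] not in urls:     # eliminate duplicate URLs
                urls.append(fields[0])
    return urls

data = ['one.html Beijing 3', 'one.html Paris 5', 'two.html Bogota 3']
print(pages_containing(data, 'Paris'))    # ['one.html']
```

Even this simple query takes a dozen lines of ad hoc string handling, and every new question would need its own custom code.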
An alternative approach would be to save the information gathered by the crawler into a database file rather than a general-purpose text file. A database file stores data in a structured way that enables efficient access and processing of the data.
Structured means that data in a database file is stored in one or more tables. Each table is identified by a name, such as Customers or Products, and consists of columns and rows. Each column has a name and contains data of a specific type: string, integer, real (float), and so on. Each row of the table contains data corresponding to one database record.
In our example, the information obtained by the crawler on the web pages shown in Figure 12.1 could be stored in the two database tables shown in Figure 12.2. The first table, called Hyperlinks, has columns named Url and Link. Each row (record) in that table has a string X in column Url and a string Y in column Link and refers to a hyperlink with URL Y in web page X. The second table, called Keywords, has columns named Url, Word, and Freq. Each record consists of strings X and Y in columns Url and Word, respectively, and integer Z in column Freq, and corresponds to word Y appearing in the web page with URL X with frequency Z.
With data stored in database tables, we can make data queries using a special database programming language.
Database files are not read from or written to by an application program through the usual file input/output interface; in fact, they are typically not accessed directly at all. Instead, the application program sends commands to a special type of server program, called a database engine or a database management system, that manages the database and accesses the database file on the application's behalf.
The commands accepted by database engines are statements written in a query language, the most popular of which is called Structured Query Language, typically referred to as SQL. Next we introduce a small subset of SQL that we can use to write programs that can make use of databases, when databases are the right choice for data storage.
The SQL statement SELECT is used to make queries into a database. In its simplest form, this statement retrieves a column of a database table. For example, to retrieve column Link from table Hyperlinks, you would use:
SELECT Link FROM Hyperlinks
The result of executing this statement is stored in a result table (also called a result set), illustrated in Figure 12.3(a).
We use uppercase characters to highlight the SQL statement keywords; SQL statements are not case-sensitive so we could use lowercase characters. In general, the SQL statement SELECT retrieves a subset of columns from the table and has this format:
SELECT Column(s) FROM TableName
For example, to select the content of columns Url and Word from table Keywords, you would use:
SELECT Url, Word FROM Keywords
The result table that is obtained is shown in Figure 12.3(b). In order to retrieve all the columns of table Hyperlinks, the wildcard symbol * may be used:
SELECT * FROM Hyperlinks
The result table obtained is the original table Hyperlinks shown in Figure 12.2(a).
When we made the query
SELECT Link FROM Hyperlinks
the result set we obtained included multiple copies of the same link. If we wanted to retrieve only the distinct links in column Link, we could use the SQL DISTINCT keyword
SELECT DISTINCT Link FROM Hyperlinks
and we would obtain the result table shown in Figure 12.3(c).
Getting Your Hands Dirty with SQL
In the next section, we introduce the sqlite3 Python Standard Library module. It provides an application programming interface (API) that enables Python programs to access database files and execute SQL commands on them.
If you cannot wait and want to try running the SQL queries we just described, you can use the SQLite command-line shell. It is a stand-alone program that allows you to interactively execute SQL statements against a database file. You will, however, first need to download the precompiled shell binary from:
www.sqlite.org/download.html
Save the binary executable in a directory that contains the database file you want to work with. We illustrate next the usage of the SQLite command-line shell on database file links.db (whose two tables are shown in Figure 12.2), so we save the executable in the folder containing that file.
To run the SQLite command-line shell, first open the command-line shell of your system. Then change to the directory containing the sqlite3 executable and run the command shown to access the database file links.db:
> ./sqlite3 links.db
SQLite version 3.7.7.1
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite>
(This code works on Unix/Linux/Mac OS X systems; on MS Windows, you should use the command sqlite3.exe links.db.)
At the sqlite> prompt, you can now execute SQL statements against the database file links.db. The only additional requirement is that your SQL statement must be followed by a semicolon (;). For example:
sqlite> SELECT Url, Word FROM Keywords;
one.html|Beijing
one.html|Paris
one.html|Chicago
two.html|Bogota
two.html|Beijing
…
five.html|Nairobi
five.html|Bogota
sqlite>
(A few lines of output have been omitted.) You can use the SQLite command-line shell to execute every SQL statement described in this section.
In order to answer a question such as “In which pages does word X appear?” we need to make a database query that selects only some records in a table (i.e., those that satisfy a certain condition). The SQL WHERE clause can be added to the SELECT statement to conditionally select records. For example, to select the URLs of web pages containing ‘Paris’, you would use
SELECT Url FROM Keywords WHERE Word = 'Paris'
The result set returned is illustrated in Figure 12.4(a). Note that string values in SQL also use quotes as delimiters, just as in Python. In general, the format of the SELECT statement with the WHERE clause is:
SELECT column(s) FROM table WHERE column operator value
The condition column operator value restricts the rows to which the SELECT statement is applied to only those that satisfy the condition. Operators that may appear in the condition are shown in Table 12.1. Conditions can be enclosed in parentheses, and logical operators AND and OR can be used to combine two or more conditions. Note: The format of the WHERE clause is slightly different when the BETWEEN operator is used; it is
WHERE column BETWEEN value1 AND value2
Suppose we would like the result set in Figure 12.4(a) to be ordered by the frequency of the word ‘Paris’ in the web page. In other words, suppose the question is “What is the ranking of web pages containing word X, based on the number of occurrences of string X in the page?” To order the records in the result set by a specific column value, the SQL keyword ORDER BY can be used:
SELECT Url, Freq FROM Keywords WHERE Word = 'Paris' ORDER BY Freq DESC
This statement returns the result set shown in Figure 12.4(b). The keyword ORDER BY is followed by a column name; the records selected will be ordered based on the values in that column. The default ordering is increasing; in the statement, we used the keyword DESC (which stands for “descending”) to obtain an ordering that puts the page with the most occurrences of ‘Paris’ first.
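If you would rather experiment from Python than from the SQLite shell, the same query can be tried against a small throwaway in-memory database using the sqlite3 Standard Library module (introduced later in this chapter). The rows below are a made-up sample, not the full data set of Figure 12.2:

```python
import sqlite3

# A throwaway in-memory database with made-up sample rows
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE Keywords (Url text, Word text, Freq int)')
for row in [('one.html', 'Paris', 5), ('two.html', 'Paris', 1),
            ('four.html', 'Paris', 2), ('one.html', 'Beijing', 3)]:
    cur.execute('INSERT INTO Keywords VALUES (?, ?, ?)', row)

# WHERE keeps only the 'Paris' rows; ORDER BY ... DESC ranks them
cur.execute("""SELECT Url, Freq FROM Keywords
               WHERE Word = 'Paris' ORDER BY Freq DESC""")
ranking = cur.fetchall()
print(ranking)   # [('one.html', 5), ('four.html', 2), ('two.html', 1)]
con.close()
```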
Write an SQL query that returns:
To answer queries such as “How many pages contain the word Paris?” we need a way to count the number of records obtained through a query. SQL has built-in functions for this purpose. The SQL function COUNT(), when applied to a result table, returns the number of rows in it:
SELECT COUNT(*) FROM Keywords WHERE Word = 'Paris'
The result table obtained, shown in Figure 12.5(a), contains just one column and one record. Note that the column no longer corresponds to a column of the table on which we made the query.
To answer “What is the total number of occurrences of word Paris across all web pages?” we need to add up the values in column Freq of every row of table Keywords containing ‘Paris’ in the Word column. The SQL function SUM() can be used for this as shown next:
SELECT SUM(Freq) FROM Keywords WHERE Word = 'Paris'
The result table is illustrated in Figure 12.5(b).
Suppose you now want to know “How many outgoing links does each web page have?” To answer this, you need to add up the number of links for each distinct Url value. The SQL clause GROUP BY groups the records of a table that have the same value in the specified column. The next query will group the rows of table Hyperlinks by Url value and then count the number of rows in each group:
SELECT COUNT(*) FROM Hyperlinks GROUP BY Url
We modify this query slightly to also include the Web page URL:
SELECT Url, COUNT(*) FROM Hyperlinks GROUP BY Url
The result of this query is shown in Figure 12.5(c).
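As a sketch, here is the GROUP BY query run against a tiny in-memory Hyperlinks table built with Python's sqlite3 module (introduced later in this chapter); the three rows are illustrative only:

```python
import sqlite3

# A tiny in-memory Hyperlinks table (rows are illustrative)
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE Hyperlinks (Url text, Link text)')
for row in [('one.html', 'two.html'), ('one.html', 'three.html'),
            ('two.html', 'four.html')]:
    cur.execute('INSERT INTO Hyperlinks VALUES (?, ?)', row)

# GROUP BY collects the rows of each page; COUNT(*) counts each group
# (ORDER BY makes the output order predictable)
cur.execute('SELECT Url, COUNT(*) FROM Hyperlinks GROUP BY Url ORDER BY Url')
counts = cur.fetchall()
print(counts)   # [('one.html', 2), ('two.html', 1)]
con.close()
```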
For each question, write an SQL query that answers it:
The result tables for questions (c) and (d) should include the URLs of the web pages.
Suppose we want to know “What web pages have a link to a page containing word ‘Bogota’?” This question requires a lookup of both tables Keywords and Hyperlinks. We would need to look up Keywords to find the set S of URLs of pages containing word ‘Bogota’, and then look up Hyperlinks to find the URLs of pages with links to pages in S.
The SELECT statement can be used on multiple tables. To understand the behavior of SELECT when used on multiple tables, we develop a few examples. First, the query
SELECT * FROM Hyperlinks, Keywords
returns a table containing 104 records, each a combination of a record in Hyperlinks and a record in Keywords. This table, shown in Figure 12.6 and referred to as a cross join, has five named columns corresponding to the two columns of table Hyperlinks and three columns of table Keywords.
It is, of course, possible to conditionally select some records in the cross join. For example, the next query selects the 16 records (2 of which are shown in Figure 12.6) out of the 104 in the cross join that contain ‘Bogota’ in column Word of table Keywords:
SELECT * FROM Hyperlinks, Keywords WHERE Keywords.Word = 'Bogota'
Do pay attention to the syntax of this last SQL query. In a query that refers to columns in multiple tables, you must add the table name and a dot before a column name. This is to avoid confusion if columns in different tables have the same name. To refer to column Word of table Keywords, we must use the notation Keywords.Word.
Here is another example. The next query picks up only those records in the cross join whose Hyperlinks.Url and Keywords.Url values match:
SELECT * FROM Hyperlinks, Keywords WHERE Hyperlinks.Url = Keywords.Url
The result of this query is shown in Figure 12.7.
Conceptually, the table in Figure 12.7 consists of records that associate a hyperlink (from Hyperlinks.Url to Hyperlinks.Link) to a word appearing in the web page pointed to by the hyperlink (i.e., the web page with URL Hyperlinks.Link).
Now, our original question was “What web pages have a link to a page containing ‘Bogota’?” To answer this question, we need to select the records in the cross join whose Keywords.Word value is ‘Bogota’ and whose Keywords.Url value is equal to the value of Hyperlinks.Link. Figure 12.8 shows these records.
To pick up all the URLs of web pages with a link to a page containing ‘Bogota’, we thus need to make the query shown in Figure 12.9.
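A query of this kind can be sketched and tested from Python with the sqlite3 module; the toy rows below are hypothetical, and the exact query appears in Figure 12.9:

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE Hyperlinks (Url text, Link text)')
cur.execute('CREATE TABLE Keywords (Url text, Word text, Freq int)')
for row in [('one.html', 'two.html'), ('three.html', 'two.html'),
            ('one.html', 'four.html')]:
    cur.execute('INSERT INTO Hyperlinks VALUES (?, ?)', row)
for row in [('two.html', 'Bogota', 3), ('four.html', 'Paris', 2)]:
    cur.execute('INSERT INTO Keywords VALUES (?, ?, ?)', row)

# Keep the cross-join records whose Keywords.Word is 'Bogota' and whose
# Keywords.Url matches Hyperlinks.Link; report only the source page URL
cur.execute("""SELECT Hyperlinks.Url FROM Hyperlinks, Keywords
               WHERE Keywords.Word = 'Bogota'
               AND Hyperlinks.Link = Keywords.Url
               ORDER BY Hyperlinks.Url""")
pages = cur.fetchall()
print(pages)   # [('one.html',), ('three.html',)]
con.close()
```

Here pages one.html and three.html both link to two.html, the only page containing ‘Bogota’, so both are reported.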
Of course, before we can make queries to a database, we need to create the tables and insert records into it. When a database file is created, it will be empty and contain no table. The SQL statement CREATE TABLE is used to create a table and has this format:
CREATE TABLE TableName (
    Column1 dataType,
    Column2 dataType,
    …
)
We spread the statement across multiple lines and indent the column definitions for visual appeal, nothing else. We could have also written the whole statement in one line.
For example, to define the table Keywords, we would do:
CREATE TABLE Keywords (
    Url text,
    Word text,
    Freq int
)
The CREATE TABLE statement explicitly specifies the name and data type of each column of the table. Columns Url and Word are of type text, which corresponds to the Python str data type. Column Freq stores integer data. Table 12.2 lists a few SQL data types and the corresponding Python data types.
The SQL statement INSERT is used to insert a new record (i.e., row) into a database table. To insert a complete row, with a value for every column of the database, this format is used:
INSERT INTO TableName VALUES (value1, value2, …)
For example, to insert the first row of table Keywords, you would do
INSERT INTO Keywords VALUES ('one.html', 'Beijing', 3)
The SQL statement UPDATE is used to modify the data in a table. Its general format is
UPDATE TableName SET column1 = value1 WHERE column2 = value2
If we wanted to update the frequency count of ‘Bogota’ in page two.html, we would update the table Keywords in this way:
UPDATE Keywords SET Freq = 4 WHERE Url = 'two.html' AND Word = 'Bogota'
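These statements can be tried end to end from Python using the sqlite3 module (introduced in the next section); an in-memory database keeps the experiment self-contained:

```python
import sqlite3

con = sqlite3.connect(':memory:')   # throwaway in-memory database
cur = con.cursor()
cur.execute('CREATE TABLE Keywords (Url text, Word text, Freq int)')
cur.execute("INSERT INTO Keywords VALUES ('two.html', 'Bogota', 3)")
cur.execute("""UPDATE Keywords SET Freq = 4
               WHERE Url = 'two.html' AND Word = 'Bogota'""")
cur.execute("SELECT Freq FROM Keywords WHERE Word = 'Bogota'")
freq = cur.fetchall()
print(freq)   # [(4,)]
con.close()
```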
SQL is specifically designed to access and process data stored in a relational database, that is, a collection of data items stored in tables that can be accessed and processed in various ways. The term relational refers to the mathematical concept of a relation, which is a set of pairs of items or, more generally, tuples of items. A table can thus be viewed as a mathematical relation.
In this text, we have been writing SQL statements in an ad hoc fashion. The advantage of viewing tables through the prism of mathematics is that the power of abstraction and mathematics can be brought to bear to understand what can be computed using SQL and how. Relational algebra is a branch of mathematics that has been developed for precisely this purpose.
There are good online resources if you would like to learn more about SQL, including
www.w3schools.com/sql/default.asp
With a bit of SQL under our belt, we can now write applications that store data in databases and/or make database queries. In this section, we show how to store the data grabbed by a web crawler into a database and then mine the database in the context of a simple search engine application. We start by introducing the database API we will use to access the database files.
The Python Standard Library includes a database API module sqlite3 that provides Python developers a simple, built-in API for accessing database files. Unlike typical database APIs, the sqlite3 module is not an interface to a separate database engine program. It is an interface to a library of functions called SQLite that accesses the database files directly.
SQLite versus Other Database Management Systems
Application programs do not typically read from and write to database files directly. Instead, they send SQL commands to a database engine or, more formally, a relational database management system (RDBMS). An RDBMS manages the database and accesses the database files on the application's behalf.
The first RDBMS was developed at the Massachusetts Institute of Technology in the early 1970s. Significant RDBMSs in use today include commercial ones by IBM, Oracle, Sybase, and Microsoft as well as open source ones such as Ingres, Postgres, and MySQL. All these engines run as independent programs outside of Python. In order to access them, you must use an API (i.e., a Python module) that provides classes and functions that allow Python applications to send SQL statements to the engine.
SQLite, however, is a library of functions that implements an SQL database engine executing in the context of the application rather than independently of it. SQLite is extremely lightweight and is used by many applications, including the Firefox and Opera browsers, Skype, and the Apple iOS and Google Android operating systems, to store data locally. For this reason, SQLite is said to be the most widely used database engine.
We now demonstrate the usage of the sqlite3 database API by going over the steps necessary to store word frequencies and hyperlink URLs scanned from a web page into a database. First, we need to create a connection to the database file, which is somewhat equivalent to opening a text file:
>>> import sqlite3
>>> con = sqlite3.connect('web.db')
The function connect() is a function in module sqlite3 that takes as input the name of a database file (in the current working directory) and returns an object of type Connection, a type defined in the module sqlite3. The Connection object is associated with the database file. In the statement, if database file web.db exists in the current working directory, the Connection object con will represent it; otherwise, a new database file web.db is created.
Once we have a Connection object associated with the database, we need to create a cursor object, which is responsible for executing SQL statements. The method cursor() of the Connection class returns an object of type Cursor:
>>> cur = con.cursor()
A Cursor object is the workhorse of database processing. It supports the method execute(), which takes an SQL statement as a string and executes it. For example, to create the database table Keywords, you would just pass the SQL statement, as a string, to the execute() method:
>>> cur.execute("""CREATE TABLE Keywords (Url text, Word text, Freq int)""")
Now that we've created table Keywords, we can insert records into it. The SQL INSERT INTO statement is simply passed as an input to the execute() function:
>>> cur.execute("""INSERT INTO Keywords VALUES ('one.html', 'Beijing', 3)""")
In this example, the values inserted into the database (‘one.html’, ‘Beijing’, and 3) are “hardcoded” into the SQL statement string expression. That is not typical: usually, SQL statements executed within a program use values that come from Python variables. In order to construct SQL statements that use Python variable values, we use a technique similar to string formatting called parameter substitution.
Suppose, for example, that we would like to insert a new record into the database, one containing values:
>>> url, word, freq = 'one.html', 'Paris', 5
We construct the SQL statement string expression as usual, but we put a ? symbol as a placeholder wherever a Python variable value should be. This will be the first argument to the execute() method. The second argument is a tuple containing the three variables:
>>> cur.execute("INSERT INTO Keywords VALUES (?, ?, ?)", (url, word, freq))
The value of each tuple variable is mapped to a placeholder as shown in Figure 12.10.
We can also assemble all the values into a tuple object beforehand:
>>> record = ('one.html', 'Chicago', 5)
>>> cur.execute("INSERT INTO Keywords VALUES (?, ?, ?)", record)
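When many records already sit in a Python list, the Cursor class also provides the executemany() method, which runs one parameterized statement once per tuple in the list; the table and rows here are illustrative:

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE Keywords (Url text, Word text, Freq int)')
records = [('one.html', 'Beijing', 3),
           ('one.html', 'Paris', 5),
           ('one.html', 'Chicago', 5)]
# One parameterized INSERT, executed once per tuple in records
cur.executemany('INSERT INTO Keywords VALUES (?, ?, ?)', records)
cur.execute('SELECT COUNT(*) FROM Keywords')
count = cur.fetchall()
print(count)   # [(3,)]
con.close()
```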
Security Issue: SQL Injection
It is possible to construct SQL statement string expressions using format strings and the string format() method. That is, however, insecure, as it is vulnerable to a security attack called an SQL injection attack. You should definitely not use format strings to construct SQL expressions.
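A short experiment makes the danger concrete. The malicious input below is contrived, but the mechanism is exactly what an SQL injection attack exploits:

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE Keywords (Url text, Word text, Freq int)')
cur.execute("INSERT INTO Keywords VALUES ('one.html', 'Paris', 5)")

# A malicious "keyword" crafted by an attacker
word = "x' OR 'a' = 'a"

# UNSAFE: string formatting splices the attack into the SQL itself, so
# the WHERE clause becomes always true and every row leaks out
cur.execute("SELECT Url FROM Keywords WHERE Word = '{}'".format(word))
leaked = cur.fetchall()
print(leaked)   # [('one.html',)] -- the condition matched everything

# SAFE: with a placeholder, the whole input is treated as one string value
cur.execute('SELECT Url FROM Keywords WHERE Word = ?', (word,))
safe = cur.fetchall()
print(safe)     # [] -- no word literally equals the attack string
con.close()
```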
Changes to a database file—including table creation or deletion and row insertions and updates—are not actually written to the database file immediately. They are only recorded temporarily, in memory. In order to ensure that the changes are written to the file, you must commit them by invoking the commit() method on the Connection object:
>>> con.commit()
When you are done working with a database file, you need to close it just as you would close a text file, by invoking the close() method on the Connection object:
>>> con.close()
Implement function webData() that takes as input:
The database file should contain tables named Keywords and Hyperlinks defined as illustrated in Figures 12.2(a) and (b). Your function should insert a row into table Hyperlinks for every link in the list, and a row into table Keywords for every (word, frequency) pair in the dictionary. You should also commit and close the database file when done.
We now show how to make SQL queries from within a Python program. We make queries against database file links.db, which contains the tables Hyperlinks and Keywords shown in Figure 12.2.
>>> import sqlite3
>>> con = sqlite3.connect('links.db')
>>> cur = con.cursor()
To execute an SQL SELECT statement, we just need to pass the statement, as a string, to the cursor's execute() method:
>>> cur.execute('SELECT * FROM Keywords')
The SELECT statement should return a result table. So where is it?
The table is stored in the Cursor object cur itself. If you want it, you need to fetch it, which you can do in several ways. To obtain the selected records as a list of tuples, you can use the fetchall() method (of the Cursor class):
>>> cur.fetchall()
[('one.html', 'Beijing', 3), ('one.html', 'Paris', 5), ('one.html', 'Chicago', 5), ('two.html', 'Bogota', 3), … ('five.html', 'Bogota', 2)]
The other option is to treat the Cursor object cur as an iterator and iterate over it directly:
>>> cur.execute('SELECT * FROM Keywords')
<sqlite3.Cursor object at 0x15f93b0>
>>> for record in cur:
        print(record)

('one.html', 'Beijing', 3)
('one.html', 'Paris', 5)
…
('five.html', 'Bogota', 2)
The second approach has the advantage of being memory efficient because no large list is stored in memory.
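The Cursor class also provides the fetchone() method, which returns the next record of the result table, or None when no records remain; here is a minimal illustration on a throwaway in-memory table:

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE Keywords (Url text, Word text, Freq int)')
cur.execute("INSERT INTO Keywords VALUES ('one.html', 'Paris', 5)")
cur.execute('SELECT * FROM Keywords')
first = cur.fetchone()
print(first)    # ('one.html', 'Paris', 5)
second = cur.fetchone()
print(second)   # None -- the result table is exhausted
con.close()
```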
What if a query uses a value stored in a Python variable? Suppose we would like to learn what web pages contain the value of word, where word is defined as:
>>> word = 'Paris'
Once again, we can use parameter substitution:
>>> cur.execute('SELECT Url FROM Keywords WHERE Word = ?', (word,))
<sqlite3.Cursor object at 0x15f9b30>
The value of word is placed into the SQL query at the placeholder position. Let's check that the query does find all the web pages containing the word ‘Paris’:
>>> cur.fetchall()
[('one.html',), ('two.html',), ('four.html',)]
Let's try an example that uses values of two Python variables. Suppose we want to know the URLs of web pages containing more than n occurrences of word, where:
>>> word, n = 'Beijing', 2
We again use parameter substitution, as illustrated in Figure 12.11:
>>> cur.execute("""SELECT * FROM Keywords WHERE Word = ? AND Freq > ?""", (word, n))
<sqlite3.Cursor object at 0x15f9b30>
Two Cursor Pitfalls
If, after executing the cur.execute() statement, you run
>>> cur.fetchall()
[('one.html', 'Beijing', 3), ('three.html', 'Beijing', 6)]
you will get the expected result table. If, however, you run cur.fetchall() again:
>>> cur.fetchall()
[]
you get nothing. The point is this: The fetchall() method will empty the Cursor object buffer. This is also true if you fetch the records in the result table by iterating over the Cursor object.
Another problem occurs if you execute an SQL query without fetching the result of the previous query:
>>> cur.execute("""SELECT Url FROM Keywords WHERE Word = 'Paris'""")
<sqlite3.Cursor object at 0x15f9b30>
>>> cur.execute("""SELECT Url FROM Keywords WHERE Word = 'Beijing'""")
<sqlite3.Cursor object at 0x15f9b30>
>>> cur.fetchall()
[('one.html',), ('two.html',), ('three.html',)]
The fetchall() call returns the result of the second query only. The result of the first is lost!
A search engine is a server application that takes a keyword from a user and returns the URLs of web pages containing the keyword, ranked according to some criterion. In this practice problem, you are asked to develop a simple search engine that ranks each web page based on the keyword's frequency in it.
Write a search engine application based on the results of a web crawl that stored word frequencies in a database table Keywords just like the one in Figure 12.2(b). The search engine will take a keyword from the user and simply return the web pages containing the keyword, ranked by the frequency of the keyword, in decreasing order.
>>> freqSearch('links.db')
Enter keyword: Paris
URL        FREQ
one.html   5
four.html  2
two.html   1
Enter keyword:
In this section we showcase MapReduce, a framework for data processing developed by Google. Its key feature is that it is scalable, that is, it is able to process very large data sets. It is robust enough to process large data sets using multiple compute nodes, whether the compute nodes are cores on one microprocessor or computers in a cloud computing platform. In fact, we show in the next section how to extend the framework we develop here to utilize all the cores of your computer's microprocessor.
In order to keep our MapReduce implementation as simple as possible, we introduce a new Python construct, list comprehension. Both list comprehension and the MapReduce framework have their origins in the functional programming language paradigm, which we describe briefly.
When you open a text file and use method readlines() to read the file, you will obtain a list of lines. Each line in the list ends with the newline character '\n'. Suppose, for example, that list lines was obtained that way:
>>> lines
['First Line\n', 'Second\n', '\n', 'and Fourth.\n']
In a typical application, the '\n' character gets in the way of processing the lines, and we need to remove it. One way to do this is to use a for loop and the familiar accumulator pattern:
>>> newlines = []
>>> for i in range(len(lines)):
        newlines.append(lines[i][:-1])

In each iteration i of the for loop, the last character of line i (the newline character '\n') is removed and the modified line is added to the accumulator list newlines:

>>> newlines
['First Line', 'Second', '', 'and Fourth.']
There is another way to accomplish the same task in Python:
>>> newlines = [line[:-1] for line in lines]
>>> newlines
['First Line', 'Second', '', 'and Fourth.']
The Python statement [line[:-1] for line in lines] constructs a new list from list lines and is Python's list comprehension construct. Here is how it works. Every item line in list lines is used in order from left to right to generate an item in the new list by applying line[:-1] to line. The order in which the items appear in the new list corresponds to the order in which the corresponding items appear in the original list lines (see Figure 12.12).
More generally, a list comprehension statement has this syntax:
[<expression> for <item> in <sequence/iterator>]
This statement evaluates to a list whose items are obtained by applying <expression>, a Python expression typically involving the variable <item>, to each item of the iterable container <sequence/iterator>. An even more general version may also include an optional conditional expression:
[<expression> for <item> in <sequence/iterator> if <condition>]
In this case, the list obtained has elements that are obtained by applying expression to each item of sequence/iterator for which condition is true.
Let's try a few examples. In the next modification of the last example, the new list will not contain blank strings that correspond to blank lines in the original file:
>>> [line[:-1] for line in lines if line != '\n']
['First Line', 'Second', 'and Fourth.']
In the next example, we construct a list of the even numbers less than 20:

>>> [i for i in range(0, 20, 2)]
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
In the last example, we compute the lengths of the strings in a list:
>>> [len(word) for word in ['hawk', 'hen', 'hog', 'hyena']]
[4, 3, 3, 5]
Let the list of strings words be defined as:
>>> words = ['hawk', 'hen', 'hog', 'hyena']
Write list comprehension statements that use words as the original list to construct lists:
The list in (c) requires some explanation. For every string s of the original list a new list of tuples is created, such that each tuple maps a character of the string s to the string s itself.
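One possible nested comprehension along these lines, sketched on a shorter two-word list rather than the problem's own:

```python
words = ['hen', 'hog']
# For every string s in words, build a list pairing each character with s
result = [[(c, s) for c in s] for s in words]
print(result[0])   # [('h', 'hen'), ('e', 'hen'), ('n', 'hen')]
```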
Functional Programming
List comprehension is a programming construct borrowed from functional programming languages. With origins in the SETL and NPL programming languages, list comprehension became more widely known when it was incorporated into the functional programming language Haskell and, especially, Python.
The functional language paradigm differs from the imperative and object-oriented paradigms in that it has no “statements,” only expressions. A functional language program is an expression, consisting of a function call that passes data and possibly other functions as arguments. Examples of functional programming languages include Lisp, Scheme, ML, Erlang, Scala, and Haskell.
Python is not a functional language, but it borrows a few functional language constructs that help create cleaner, shorter Python programs.
We consider, one last time, the problem of computing the frequency of every word in a string. We have used this example to motivate the dictionary container class and also to develop a very simple search engine. We use this problem now to motivate a new approach, called MapReduce, developed by Google for solving data processing problems.
Suppose we would like to compute the frequency of every word in the list
>>> words = ['two', 'three', 'one', 'three', 'three', 'five', 'one', 'five']
The MapReduce approach for doing this takes three steps.
In the first step, we create a tuple (word, 1) for every word in the list words. The pair (word, 1) is referred to as a (key, value) pair, and the value of 1 for every key word captures the count of that particular instance of a word. Note that there is a (word, 1) pair for every occurrence of word in the original list words.
Each (key, value) pair is stored in its own list, and all these single-element lists are contained in the list intermediate1, as shown in Figure 12.13.
The intermediate step of MapReduce pulls together all (word, 1) pairs with the same word key and creates a new (key, value) pair (word, [1,1,…,1]), where [1,1,…,1] is the list of all the 1 values pulled together. Note that there is a 1 in [1,1,…,1] for every occurrence of word in the original list words. We refer to the list of (key, value) pairs obtained in this intermediate step as intermediate2 (see Figure 12.13).
In the final step, a new pair (word, count) is constructed by adding up all the 1s in every (word, [1,1,…,1]) pair of intermediate2, as shown in Figure 12.13. We call this final list of (key, value) pairs result.
Let's see now how to do these steps in Python. The first step consists of constructing a new list from list words by applying function occurrence() to every word in list words:
Module: ch12.py
def occurrence(word):
    'returns list containing tuple (word, 1)'
    return [(word, 1)]
Using list comprehension, we can express the first step of MapReduce succinctly:
>>> intermediate1 = [occurrence(word) for word in words]
>>> intermediate1
[[('two', 1)], [('three', 1)], [('one', 1)], [('three', 1)],
 [('three', 1)], [('five', 1)], [('one', 1)], [('five', 1)]]
This step is referred to as the Map step of MapReduce, and function occurrence() is said to be the map function of the word frequency problem.
Map Step Returns a List of Tuples
The function occurrence() returns a list containing just one tuple. You may wonder why it does not return just the tuple itself.
The reason is that our goal is not just to solve the word frequency problem. Our goal is to develop a general framework that can be used to solve a range of problems. For problems other than the word frequency problem, the Map step may return more than one tuple. We will see an example of this later in this section. So we insist that the map function returns a list of tuples.
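For instance, a hypothetical map function for counting words over whole lines of text (rather than over individual words) naturally emits several (key, value) pairs per input item:

```python
def occurrencesInLine(line):
    'hypothetical map function: one (word, 1) tuple per word in a line'
    return [(word, 1) for word in line.split()]

# each input item now maps to a list with several (key, value) pairs
intermediate1 = [occurrencesInLine(line)
                 for line in ['three one three', 'five one']]
```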
The intermediate step of MapReduce, called the Partition step, pulls together all pairs
(key, value1), (key, value2), … (key, valuek)
contained in (sublists of) intermediate1 with the same key. For each unique key, a new (key, values) pair is constructed where values is the list [value1, value2,…, valuek]. This step is encapsulated in function partition():
Module: ch12.py
def partition(intermediate1):
    '''intermediate1 is a list containing [(key, value)] lists;
    returns iterable container with a (key, values) tuple for
    every unique key in intermediate1; values is a list that
    contains all values in intermediate1 associated with key
    '''
    dct = {}   # dictionary of (key, values) pairs

    # for every (key, value) pair in every list of intermediate1
    for lst in intermediate1:
        for key, value in lst:
            if key in dct:             # if key already in dictionary dct,
                dct[key].append(value) # add value to list dct[key]
            else:                      # if key not in dictionary dct,
                dct[key] = [value]     # add (key, [value]) to dct

    return dct.items()   # return container of (key, values) tuples
Function partition() takes list intermediate1 and constructs list intermediate2:
>>> intermediate2 = partition(intermediate1)
>>> intermediate2
dict_items([('one', [1, 1]), ('five', [1, 1]), ('two', [1]),
            ('three', [1, 1, 1])])
Finally, the last step consists of constructing a new (key, count) pair from each (key, values) pair of intermediate2 by just accumulating the values in values:
Module: ch12.py
def occurrenceCount(keyVal):
    return (keyVal[0], sum(keyVal[1]))
Again, list comprehension provides a succinct way to perform this step:
>>> [occurrenceCount(x) for x in intermediate2]
[('one', 2), ('five', 2), ('two', 1), ('three', 3)]
This is referred to as the Reduce step of MapReduce. The function occurrenceCount() is referred to as the reduce function for the word frequency problem.
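Putting the three steps together, the whole computation is a single pipeline. The sketch below restates the three functions developed above so that it is self-contained; partition() uses dict.setdefault() in place of the if/else test, with identical behavior:

```python
def occurrence(word):
    'map function: returns list containing tuple (word, 1)'
    return [(word, 1)]

def partition(intermediate1):
    'groups the values of (key, value) pairs by key'
    dct = {}
    for lst in intermediate1:
        for key, value in lst:
            dct.setdefault(key, []).append(value)
    return dct.items()

def occurrenceCount(keyVal):
    'reduce function: sums up the 1s collected for a key'
    return (keyVal[0], sum(keyVal[1]))

words = ['two', 'three', 'one', 'three', 'three', 'five', 'one', 'five']

# Map, Partition, and Reduce chained into one expression
result = [occurrenceCount(x)
          for x in partition([occurrence(w) for w in words])]
```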
The MapReduce approach we used to compute word frequencies may seem like an awkward and strange way to do it. It can be viewed as a more complicated version of the dictionary-based approach we saw in Chapter 6. There are, however, benefits to the MapReduce approach. The first benefit is that the approach is general enough to apply to a range of problems. The second benefit is that it is amenable to an implementation that uses not one but many compute nodes, whether several cores on a central processing unit (CPU) or thousands of machines in a cloud computing system.
We go into the second benefit in more depth in the next section. What we would like to do now is abstract the MapReduce steps so the framework can be applied to a range of problems by simply defining the problem-specific map and reduce functions. In short, we would like to develop a class SeqMapReduce that can be used to compute word frequencies as easily as this:
>>> words = ['two', 'three', 'one', 'three', 'three', 'five', 'one', 'five']
>>> smr = SeqMapReduce(occurrence, occurrenceCount)
>>> smr.process(words)
[('one', 2), ('five', 2), ('two', 1), ('three', 3)]
We can then use the SeqMapReduce object smr to compute frequencies of other things. For example, numbers:
>>> numbers = [2,3,4,3,2,3,5,4,3,5,1]
>>> smr.process(numbers)
[(1, 1), (2, 2), (3, 4), (4, 2), (5, 2)]
Furthermore, by specifying other problem-specific map and reduce functions, we can solve other problems.
These specifications suggest that the class SeqMapReduce should have a constructor that takes the map and reduce functions as input. The method process should take an iterable sequence containing data and perform the Map, Partition, and Reduce steps:
Module: ch12.py
class SeqMapReduce(object):
    'a sequential MapReduce implementation'

    def __init__(self, mapper, reducer):
        'functions mapper and reducer are problem specific'
        self.mapper = mapper
        self.reducer = reducer

    def process(self, data):
        'runs MapReduce on data with mapper and reducer functions'
        intermediate1 = [self.mapper(x) for x in data]    # Map
        intermediate2 = partition(intermediate1)          # Partition
        return [self.reducer(x) for x in intermediate2]   # Reduce
Input to MapReduce Should Be Immutable
Suppose we would like to compute frequencies of sublists in list lists:
>>> lists = [[2,3], [1,2], [2,3]]
It would seem that the same approach we used to count strings and numbers would work:
>>> smr = SeqMapReduce(occurrence, occurrenceCount)
>>> smr.process(lists)
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'list'
So … what happened? The problem is that lists cannot be used as keys of the dictionary inside function partition(). Our approach works only with hashable (i.e., immutable) data types. By changing the lists to tuples, we are back in business:
>>> lists = [(2,3), (1,2), (2,3)]
>>> smr.process(lists)
[((1, 2), 1), ((2, 3), 2)]
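If the data arrives as (possibly nested) lists, a small helper can convert it to hashable tuples first. This helper is a hypothetical addition, not part of the MapReduce framework:

```python
def freeze(item):
    'recursively converts lists into (hashable) tuples; other values pass through'
    if isinstance(item, list):
        return tuple(freeze(x) for x in item)
    return item

lists = [[2, 3], [1, 2], [2, 3]]
hashable = [freeze(lst) for lst in lists]   # [(2, 3), (1, 2), (2, 3)]
```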
We now apply the MapReduce framework to solve the inverted index problem (also referred to as the reverse index problem). There are many versions of this problem. The one we consider is this: Given a bunch of text files, we are interested in finding out which words appear in which file. A solution to the problem could be represented as a mapping that maps each word to the list of files containing it. This mapping is called an inverted index.
For example, suppose we want to construct the inverted index for text files a.txt, b.txt, and c.txt shown in Figure 12.14.
An inverted index would map, say, ‘Paris’ to list [‘a.txt’, ‘c.txt’] and ‘Quito’ to [‘b.txt’]. The inverted index should thus be:
[('Paris', ['c.txt', 'a.txt']), ('Miami', ['a.txt']), ('Cairo', ['c.txt']),
 ('Quito', ['b.txt']), ('Tokyo', ['a.txt', 'b.txt'])]
To use MapReduce to obtain the inverted index, we must define the map and reduce functions that will take the list of file names
['a.txt', 'b.txt', 'c.txt']
and produce the inverted index. Figure 12.15 illustrates how these functions should work.
In the Map phase, the map function creates a list for every file. This list contains a tuple (word, file) for every word word in the file. Function getWordsFromFile() implements the map function:
Module: ch12.py
from string import punctuation

def getWordsFromFile(file):
    '''returns set of tuples (word, file)
    for every word in file'''
    infile = open(file)
    content = infile.read()
    infile.close()

    # remove punctuation (covered in Section 4.1)
    transTable = str.maketrans(punctuation, ' '*len(punctuation))
    content = content.translate(transTable)

    # construct set of (word, file) tuples with no duplicates
    res = set()
    for word in content.split():
        res.add((word, file))
    return res
Note that this map function returns a set, not a list. That is not a problem, because the only requirement is that the returned container be iterable. The reason we use a set is to ensure there are no duplicate (word, file) entries; they are not needed and would only slow down the Partition and Reduce steps.
After the Map step, the partition function will pull together all tuples (word, file) with the same value of word and merge them into one tuple (word, files), where files is the list of all files containing word. In other words, the partition function constructs the inverted index.
This means that the Reduce step does not need to do anything. The reduce function just copies items to the result list, the inverted index.
Module: ch12.py
def getWordIndex(keyVal):
    return keyVal
To compute the inverted index, you only need to do:
File: a.txt, b.txt, c.txt
>>> files = ['a.txt', 'b.txt', 'c.txt']
>>> print(SeqMapReduce(getWordsFromFile, getWordIndex).process(files))
[('Paris', ['c.txt', 'a.txt']), ('Miami', ['a.txt']), ('Cairo', ['c.txt']),
 ('Quito', ['b.txt']), ('Tokyo', ['a.txt', 'b.txt'])]
Develop a MapReduce-based solution constructing an inverted “character index” of a list of words. The index should map every character appearing in at least one of the words to a list of words containing the character. Your work consists of designing the mapper function getChars() and reducer function getCharIndex().
>>> mp = SeqMapReduce(getChars, getCharIndex)
>>> mp.process(['ant', 'bee', 'cat', 'dog', 'eel'])
[('a', ['ant', 'cat']), ('c', ['cat']), ('b', ['bee']), ('e', ['eel', 'bee']),
 ('d', ['dog']), ('g', ['dog']), ('l', ['eel']), ('o', ['dog']),
 ('n', ['ant']), ('t', ['ant', 'cat'])]
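One possible design, sketched here, mirrors getWordsFromFile() and getWordIndex(): the map function emits (character, word) pairs with duplicates removed, and the reduce function is the identity. The short driver below restates the grouping logic of partition() so the sketch can be checked without the SeqMapReduce class:

```python
def getChars(word):
    'map function: returns set of (character, word) tuples, no duplicates'
    return {(c, word) for c in word}

def getCharIndex(keyVal):
    'reduce function: the (char, words) pair is already in final form'
    return keyVal

def partition(intermediate1):
    'groups values by key, as in the Partition step'
    dct = {}
    for lst in intermediate1:
        for key, value in lst:
            dct.setdefault(key, []).append(value)
    return dct.items()

words = ['ant', 'bee', 'cat', 'dog', 'eel']
index = [getCharIndex(x)
         for x in partition([getChars(w) for w in words])]
```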
Today's computing often requires the processing of a tremendous amount of data. A search engine continuously extracts information out of billions of web pages. Particle physics experiments run at the Large Hadron Collider near Geneva, Switzerland, generate petabytes of data per year that must be processed to answer basic questions about the universe. Companies like Amazon, eBay, and Facebook keep track of millions of transactions daily and use them in their data mining applications.
No computer is powerful enough to tackle the type of problems we have just described by itself. Today, large data sets are processed in parallel using many, many processors. In this section, we introduce parallel programming and a Python API that enables us to take advantage of the multiple cores available on most current computers. While the practical details of parallel computing on a distributed system are beyond the scope of this text, the general principles we introduce in this chapter apply to such computing as well.
For several decades and until the mid-2000s, microprocessors on most personal computers had a single core (i.e., processing unit). That meant that only one program could execute at a time on such machines. Starting in the mid-2000s, major microprocessor manufacturers such as Intel and AMD started selling microprocessors with multiple processing units, typically referred to as cores. Almost all personal computers sold today and many wireless devices have microprocessors with two or more cores. The programs we have developed so far have not made use of more than one core. To take advantage of them, we need to use one of the Python parallel programming APIs.
Moore's Law
Intel cofounder Gordon Moore predicted in 1965 that the number of transistors on a microprocessor chip would double about every two years. Amazingly, his prediction has held up so far. Thanks to the exponential increase in transistor density, the processing power of microprocessors, measured in the number of instructions per second, has seen tremendous growth over the last several decades.
Increasing transistor density can improve processing power in two ways. One way uses the fact that if transistors are closer together, instructions can execute more quickly. We can thus reduce the time between the execution of instructions (i.e., increase the processor clock rate). Until the mid-2000s, that was exactly what microprocessor manufacturers were doing.
The problem with increasing the clock rate is that it also increases power consumption, which in turn creates problems such as overheating. The other way to increase processing power is to reorganize the denser transistors into multiple cores that can execute instructions in parallel. This approach also ends up increasing the number of instructions that can be executed per second. Recently, processor manufacturers have begun using this second approach, producing processors with two, four, eight, and even more cores. This fundamental change in the architecture of microprocessors is an opportunity but also a challenge. Writing programs that use multiple cores is more complex than single-core programming.
If your computer has a microprocessor with multiple cores, you can split the execution of some Python programs into several tasks, which can be run in parallel by different cores. One way to do this in Python is by using the Standard Library module multiprocessing.
If you do not know the number of cores on your computer, you can use the function cpu_count() defined in module multiprocessing to find out:
>>> from multiprocessing import cpu_count
>>> cpu_count()
8
Your computer may have fewer cores, or more! With eight cores, you could, theoretically, execute programs eight times faster. To achieve that speed, you would have to split the problem you are solving into eight pieces of equal size and then let each core handle a piece in parallel. Unfortunately, not all problems can be broken into equal-size pieces. But there are problems, especially data processing problems, that can be, and they are motivating this discussion.
We use the class Pool in module multiprocessing to split a problem and execute its pieces in parallel. A Pool object represents a pool of one or more processes, each of which is capable of executing code independently on an available processor core.
A process is typically defined as a “program in execution.” But what does that really mean? When a program executes on a computer, it executes in an “environment” that keeps track of all the program instructions, variables, program stack, the state of the CPU, and so on. This “environment” is created by the underlying operating system to support the execution of the program. This “environment” is what we refer to as a process.
Modern computers are multiprocessing, which means that they can run multiple programs or, more precisely, multiple processes concurrently. The term concurrently does not really mean “at the same time.” On a single-core microprocessor computer architecture, only one process can really be executing at a given point. What concurrently means in that case is that at any given point in time, there are multiple processes (programs in execution), one of which is actually using the CPU and making progress; the other processes are interrupted, waiting for the CPU to be allocated to them by the operating system. In a multicore computer architecture, the situation is different: Several processes can truly run at the same time, on different cores.
We illustrate the usage of the class Pool in a simple example:
Module: parallel.py
from multiprocessing import Pool

if __name__ == '__main__':   # required on platforms that spawn processes
    pool = Pool(2)           # create pool of 2 processes

    animals = ['hawk', 'hen', 'hog', 'hyena']
    res = pool.map(len, animals)   # apply len() to every animals item

    print(res)   # print the list of string lengths
This program uses a pool of two processes to compute the lengths of strings in list animals. When you execute this program in your system's command shell (not the Python interactive shell), you get:
> python parallel.py
[4, 3, 3, 5]
So, in program parallel.py, the map() method applies function len() to every item of list animals and then returns a new list from the values obtained. Expression
pool.map(len, animals)
and the list comprehension expression
[len(x) for x in animals]
really do the same thing and evaluate to the same value. The only difference is how they do it.
In the Pool-based approach, unlike the list comprehension approach, two processes are used to apply the function len() to each item of list animals. If the host computer has at least two cores, the processor can execute the two processes at the same time (i.e., in parallel).
To demonstrate that the two processes execute at the same time, we modify the program parallel.py to explicitly show that different processes handle different items of list animals. To differentiate between processes, we use the convenient fact that every process has a unique integer ID. The ID of a process can be obtained using the getpid() function of the os Standard Library module:
Module: parallel2.py
from multiprocessing import Pool
from os import getpid

def length(word):
    'returns length of string word'

    # print the ID of the process executing the function
    print('Process {} handling {}'.format(getpid(), word))
    return len(word)

# main program
if __name__ == '__main__':   # required on platforms that spawn processes
    pool = Pool(2)
    res = pool.map(length, ['hawk', 'hen', 'hog', 'hyena'])
    print(res)
The function length() takes a string and returns its length, just like len(); it also prints the ID of the process executing the function. When we run the previous program at the command line (not in the Python interactive shell), we get something like:
> python parallel2.py
Process 36715 handling hawk
Process 36716 handling hen
Process 36716 handling hyena
Process 36715 handling hog
[4, 3, 3, 5]
Thus, the process with ID 36715 handled strings ‘hawk’ and ‘hog’, while the process with ID 36716 handled strings ‘hen’ and ‘hyena’. On a computer with multiple cores, the processes can execute completely in parallel.
Why Don't We Run Parallel Programs in the Interactive Shell?
For technical reasons that go beyond the scope of this book, it is not possible, on some operating system platforms, to run programs using Pool in the interactive shell. For this reason, we run all programs that use a pool of processes in the command-line shell of the host operating system.
To change the pool size in parallel2.py, you only need to change the input argument of the Pool constructor. When a pool is constructed with the default Pool() constructor (i.e., when the pool size is not specified), Python decides on its own how many processes to assign; it will not assign more processes than there are cores on the host system.
Write program notParallel.py that is a list comprehension version of parallel2.py. Run it to check how many processes it uses. Then run parallel2.py several times, with a pool size of 1, 3, and then 4. Also run it with the default Pool() constructor.
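One possible notParallel.py is sketched below (your version may differ): it reuses length() from parallel2.py but applies it with a list comprehension, so every item is handled by the single main process.

```python
# notParallel.py -- a list comprehension version of parallel2.py
from os import getpid

def length(word):
    'returns length of string word, printing the ID of the handling process'
    print('Process {} handling {}'.format(getpid(), word))
    return len(word)

# no Pool: the one main process handles every item
res = [length(word) for word in ['hawk', 'hen', 'hog', 'hyena']]
print(res)   # prints [4, 3, 3, 5]
```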
To illustrate the benefit of parallel computing, we consider a computationally intensive problem from number theory. We would like to compare the distribution of prime numbers in several arbitrary ranges of integers. More precisely, we want to count the number of prime numbers in several equal-size ranges of 100,000 large integers.
Suppose one of the ranges is from 12,345,678 up to, but not including, 12,445,678. To find the prime numbers in this range, we can simply iterate through the numbers in the range and check whether each is prime. Function countPrimes() implements this idea using a list comprehension:
Module: primeDensity.py
from os import getpid

def countPrimes(start):
    'returns the number of primes in range [start, start+rng)'

    rng = 100000
    formatStr = 'process {} processing range [{}, {})'
    print(formatStr.format(getpid(), start, start+rng))

    # sum up numbers i in range [start, start+rng) that are prime
    return sum([1 for i in range(start, start+rng) if isprime(i)])
The function isprime() takes a positive integer and returns True if it is prime, False otherwise. It is the solution to Problem 5.36. We use the next program to compute the execution time of function countPrimes():
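The implementation of isprime() is not shown in this section; one straightforward trial-division version, consistent with how it is used here, is sketched below.

```python
def isprime(n):
    'returns True if integer n is prime, False otherwise'
    if n < 2:
        return False
    d = 2
    while d * d <= n:      # a divisor, if any exists, appears by sqrt(n)
        if n % d == 0:
            return False
        d += 1
    return True

print([n for n in range(2, 20) if isprime(n)])   # prints [2, 3, 5, 7, 11, 13, 17, 19]
```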
Module: primeDensity.py
from multiprocessing import Pool
from time import time

if __name__ == '__main__':

    p = Pool()
    # starts is a list of left boundaries of integer ranges
    starts = [12345678, 23456789, 34567890, 45678901,
              56789012, 67890123, 78901234, 89012345]

    t1 = time()                        # start time
    print(p.map(countPrimes, starts))  # run countPrimes()
    t2 = time()                        # end time

    p.close()
    print('Time taken: {} seconds.'.format(t2-t1))
If we modify the line p = Pool() to p = Pool(1), and thus have a pool with only one process, we get this output:
> python primeDensity.py
process 4176 processing range [12345678, 12445678)
process 4176 processing range [23456789, 23556789)
process 4176 processing range [34567890, 34667890)
process 4176 processing range [45678901, 45778901)
process 4176 processing range [56789012, 56889012)
process 4176 processing range [67890123, 67990123)
process 4176 processing range [78901234, 79001234)
process 4176 processing range [89012345, 89112345)
[6185, 5900, 5700, 5697, 5551, 5572, 5462, 5469]
Time taken: 47.84 seconds.
In other words, a single process handled all eight integer ranges and took 47.84 seconds. (The run time will likely be different on your machine.) If we use a pool of two processes, we get a dramatic improvement in running time: 24.60 seconds. So by using two processes running on two cores instead of just one process, we decreased the running time by almost one-half.
A better way to compare sequential and parallel running times is the speedup, that is, the ratio between the sequential and the parallel running times. In this particular case, we have a speedup of 47.84/24.60 ≈ 1.94.
What this means is that with two processes (running on two separate cores), we solved the problem 1.94 times faster, or almost twice as fast. Note that this is, essentially, the best we can hope for: two processes executing in parallel can be at most twice as fast as one process.
With four processes, we get further improvement in running time: 16.78 seconds, which corresponds to a speedup of 47.84/16.78 ≈ 2.85. Note that the best possible speedup with four processes running on four separate cores is 4. With eight processes, we get some further improvement in running time: 14.29 seconds, which corresponds to a speedup of 47.84/14.29 ≈ 3.35. The best possible is, of course, 8.
With a parallel version of list comprehension in hand, we can modify our first, sequential MapReduce implementation into one that runs the Map and Reduce steps in parallel. The only modification to the constructor is the addition of an optional input argument: the desired number of processes.
Module: ch12.py
from multiprocessing import Pool

class MapReduce(object):
    'a parallel implementation of MapReduce'

    def __init__(self, mapper, reducer, numProcs=None):
        'initialize map and reduce functions, and process pool'
        self.mapper = mapper
        self.reducer = reducer
        self.pool = Pool(numProcs)
The method process() is modified so it uses the Pool method map() instead of list comprehension in the Map and Reduce steps.
Module: ch12.py
def process(self, data):
    'runs MapReduce on sequence data'

    intermediate1 = self.pool.map(self.mapper, data)      # Map
    intermediate2 = partition(intermediate1)              # Partition
    return self.pool.map(self.reducer, intermediate2)     # Reduce
We use the parallel implementation of MapReduce to solve a name cross-checking problem. Suppose that tens of thousands of previously classified documents have just been posted on the web and that the documents mention various people. You are interested in finding which documents mention a particular person, and you want to do that for every person named in one or more documents. Conveniently, people's names are capitalized, which helps narrow down the words that can be proper names.
The precise problem we are then going to solve is this: Given a list of URLs (of the documents), we want to obtain a list of pairs (proper, urlList) in which proper is a capitalized word in any document and urlList is a list of URLs of documents containing proper. In order to use MapReduce, we need to define the map and reduce functions.
The map function takes a URL and should produce a list of (key, value) pairs. In this particular problem, there should be a (key, value) pair for every capitalized word in the document that the URL identifies, with the word being the key and the URL being the value. So the map function is then:
Module: crosscheck.py
1 from urllib.request import urlopen
2 from re import findall
3
4 def getProperFromURL(url):
5     '''returns list of items [(word, url)] for every capitalized
6     word in the content of web page associated with url'''
7
8     content = urlopen(url).read().decode()
9     pattern = "[A-Z][A-Za-z'-]*"               # RE for capitalized words
10     propers = set(findall(pattern, content))   # removes duplicates
11
12     res = []                   # for every capitalized word
13     for word in propers:       # create pair (word, url)
14         res.append((word, url))
15     return res
A regular expression, defined in line 9, is used in line 10 to find capitalized words. (To review regular expressions, see Section 11.3.) Duplicate words are removed by converting the list returned by re function findall() to a set; we do that because duplicates are not needed and their removal speeds up the Partition and Reduce steps that follow.
The Partition step of MapReduce takes the output of the Map step and pulls together all the (key, value) pairs with the same key. In this particular problem, the result of the Partition step is a list of pairs (word, urls) for every capitalized word; urls refers to the list of URLs of documents containing word. Since these are exactly the pairs we need, no further processing is required in the Reduce step:
Module: ch12.py
def getWordIndex(keyVal):
    'returns input value'
    return keyVal
How do our sequential and parallel implementations compare? In the next code, we develop a test program that compares the running times of the sequential implementation and a parallel implementation with four processes. (The tests were run on a machine with eight cores.) Instead of classified documents we use, as our test bed, eight novels by Charles Dickens, made publicly available by Project Gutenberg:
Module: ch12.py
from time import time

if __name__ == '__main__':

    urls = [  # URLs of eight Charles Dickens novels
        'http://www.gutenberg.org/cache/epub/2701/pg2701.txt',
        'http://www.gutenberg.org/cache/epub/1400/pg1400.txt',
        'http://www.gutenberg.org/cache/epub/46/pg46.txt',
        'http://www.gutenberg.org/cache/epub/730/pg730.txt',
        'http://www.gutenberg.org/cache/epub/766/pg766.txt',
        'http://www.gutenberg.org/cache/epub/1023/pg1023.txt',
        'http://www.gutenberg.org/cache/epub/580/pg580.txt',
        'http://www.gutenberg.org/cache/epub/786/pg786.txt']

    t1 = time()   # sequential start time
    SeqMapReduce(getProperFromURL, getWordIndex).process(urls)
    t2 = time()   # sequential stop time, parallel start time
    MapReduce(getProperFromURL, getWordIndex, 4).process(urls)
    t3 = time()   # parallel stop time

    print('Sequential: {:5.2f} seconds.'.format(t2-t1))
    print('Parallel:   {:5.2f} seconds.'.format(t3-t2))
Let's run this test:
> python properNames.py
Sequential: 19.89 seconds.
Parallel:   14.81 seconds.
So, with four cores, we decreased the running time by 5.08 seconds, which corresponds to a speedup of 19.89/14.81 ≈ 1.34.
The best possible speedup with four cores is 4. In the previous example, we are using four cores to get a speedup of 1.34, which is not close to the theoretically best speedup of 4.
Why Cannot We Get a Better Speedup?
One reason we cannot get a better speedup is that there is always overhead when running a program in parallel. The operating system has extra work to do when managing multiple processes running on separate cores. Another reason is that while our parallel MapReduce implementation executes the Map and Reduce steps in parallel, the Partition step is still sequential. On problems that produce very large intermediate lists to be processed in the Partition step, the Partition step will take just as long as in the sequential implementation. This effectively reduces the benefit of parallel Map and Reduce steps.
It is possible to do the Partition step in parallel, but to do so you would need access to an appropriately configured distributed file system of the kind Google uses. In fact, this distributed file system is the real contribution made by Google in developing the MapReduce framework. To learn more about it, you can read the original Google paper that describes the framework:
http://labs.google.com/papers/mapreduce.html
In Practice Problem 12.8, you will develop a program that has a more time-intensive Map step and a less intensive Partition step; you should see a more impressive speedup.
You are given a list of positive integers, and you need to compute a mapping that maps a prime number to those integers in the list that the prime number divides. For example, if the list is [24,15,35,60], then the mapping is
[(2, [24, 60]), (3, [15, 60]), (5, [15, 35, 60]), (7, [35])]
(Prime number 2 divides 24 and 60, prime number 3 divides 15 and 60, etc.)
You are told that your application will get very large lists of integers as input. Therefore, you must use the MapReduce framework to solve this problem. In order to do so, you will need to develop a map and a reduce function for this particular problem. If named mapper() and reducer(), you would use them in this way to get the mapping described:
>>> SeqMapReduce(mapper, reducer).process([24,15,35,60])
After implementing the map and reduce functions, compare the running times of your sequential and parallel MapReduce implementations, and compute the speedup, by developing a test program that uses a random sample of 64 integers between 10,000,000 and 20,000,000. You may use the sample() function defined in the module random.
This chapter focuses on modern approaches to processing data. Behind almost every modern “real” computer application, there is a database. Database files are often more suitable than general-purpose files for storing data. This is why it is important to get an early exposure to databases, understand their benefits, and know how to use them.
This chapter introduces a small subset of SQL, the language used to access database files. We also introduce the Python Standard Library module sqlite3, which is an API for working with database files. We demonstrate the usage of SQL and the sqlite3 module in the context of storing the results of a web crawl in a database file and then making search engine-type queries.
Scalability is an important issue with regard to data processing. The amount of data generated and processed by many current computer applications is huge. Not all programs can scale and handle large amounts of data, however. We are thus particularly interested in programming approaches that can scale (i.e., that can be run in parallel on multiple processors or cores).
We introduce in this chapter several scalable programming techniques that have their roots in functional languages. We introduce first list comprehensions, a Python construct that enables, using a succinct description, the execution of a function on every item of a list. We then introduce the method map() of the Pool class, defined in the Standard Library module multiprocessing, which essentially enables the execution of list comprehensions in parallel using the available cores of a microprocessor. We then build on this to describe and develop a basic version of Google's MapReduce framework. This framework is used by Google and other companies to process very large data sets.
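The correspondence between a list comprehension and Pool.map() can be sketched in a few lines (the function square() is our own illustration, not from the chapter):

```python
from multiprocessing import Pool

def square(x):
    'return the square of x'
    return x * x

if __name__ == '__main__':
    numbers = [1, 2, 3, 4]

    # sequential: a list comprehension applies square() to every item
    seq = [square(x) for x in numbers]

    # parallel: Pool.map() computes the same list using a pool of
    # worker processes, one task per list item
    with Pool() as pool:
        par = pool.map(square, numbers)

    print(seq == par)   # True
```

Pool.map() preserves the order of the input list, so the two results are identical; only the execution strategy differs.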
While our programs are implemented to run on a single computer, the concepts and techniques introduced in this chapter apply to distributed computing in general and especially to modern cloud computing systems.
12.3 Make sure you use parameter substitution correctly, and do not forget to commit and close:
import sqlite3

def webData(db, url, links, freq):
    '''db is the name of a database file containing tables Hyperlinks
       and Keywords; url is the URL of a web page; links is a list of
       hyperlink URLs in the web page; freq is a dictionary that maps
       each word in the web page to its frequency;

       webData inserts row (url, word, freq[word]) into Keywords for
       every word in freq, and row (url, link) into Hyperlinks for
       every link in links'''
    con = sqlite3.connect(db)
    cur = con.cursor()
    for word in freq:
        record = (url, word, freq[word])
        cur.execute("INSERT INTO Keywords VALUES (?,?,?)", record)
    for link in links:
        record = (url, link)
        cur.execute("INSERT INTO Hyperlinks VALUES (?,?)", record)
    con.commit()
    con.close()
12.4 The search engine is a simple server program that iterates indefinitely and serves a user search request in every iteration:
import sqlite3

def freqSearch(webdb):
    '''webdb is a database file containing table Keywords;
       freqSearch is a simple search engine that takes a keyword from
       the user and prints URLs of web pages containing it in
       decreasing order of frequency of the word'''
    con = sqlite3.connect(webdb)
    cur = con.cursor()
    while True:   # serve forever
        keyword = input('Enter keyword: ')
        # select web pages containing keyword in
        # decreasing order of keyword frequency
        cur.execute("""SELECT Url, Freq FROM Keywords
                       WHERE Word = ?
                       ORDER BY Freq DESC""", (keyword,))
        print('{:15}{:4}'.format('URL', 'FREQ'))
        for url, freq in cur:
            print('{:15}{:4}'.format(url, freq))
12.5 The list comprehension constructs are:
12.6 The map function should map a word (string) to a list of tuples (c, word) for every character c of word.
def getChars(word):
    '''word is a string; the function returns a list of tuples
       (c, word) for every character c of word'''
    return [(c, word) for c in word]
The input to the reduce function is a tuple (c, lst) where lst contains words containing c; the reduce function should simply eliminate duplicates from lst:
def getCharIndex(keyVal):
    '''keyVal is a 2-tuple (c, lst) where lst is a list of words
       (strings); the function returns (c, lst') where lst' is lst
       with duplicates removed'''
    return (keyVal[0], list(set(keyVal[1])))
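Assuming getChars() and getCharIndex() as defined above, the three MapReduce steps can be simulated by hand; this plain-Python sketch stands in for the book's SeqMapReduce class (the word list and variable names are ours):

```python
def getChars(word):
    'return list of (c, word) tuples for every character c of word'
    return [(c, word) for c in word]

def getCharIndex(keyVal):
    'return (c, lst) with duplicate words removed from lst'
    return (keyVal[0], list(set(keyVal[1])))

words = ['ant', 'tan']

# Map: apply getChars() to every word and flatten the results
pairs = [pair for word in words for pair in getChars(word)]

# Partition: group the values by key
groups = {}
for c, word in pairs:
    groups.setdefault(c, []).append(word)

# Reduce: apply getCharIndex() to every (key, values) pair
index = [getCharIndex(item) for item in sorted(groups.items())]
print(index)
```

Each character maps to the de-duplicated list of words containing it; since set() does not preserve order, the order of words within each list may vary from run to run.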
12.7 Module: notParallel.py
from os import getpid

def length(word):
    'returns length of string word'
    print('Process {} handling {}'.format(getpid(), word))
    return len(word)

animals = ['hawk', 'hen', 'hog', 'hyena']
print([length(x) for x in animals])
It will, of course, use only one process when executed.
12.8 The map function, which we name divisors(), takes number and returns a list of pairs (i, number) for every prime i dividing number:
def divisors(number):
    '''returns list of (i, number) tuples for every
       prime i dividing number'''
    res = []         # accumulator of prime factors of number
    n = number
    i = 2
    while n > 1:
        if n % i == 0:   # if i is a factor of n
            # collect i and repeatedly divide n by i
            # while i is a factor of n
            res.append((i, number))
            while n % i == 0:
                n //= i
        i += 1           # go to next i
    return res
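As a quick sanity check (with a condensed copy of divisors() so the snippet runs on its own):

```python
def divisors(number):
    'condensed copy of divisors() above'
    res, n, i = [], number, 2
    while n > 1:
        if n % i == 0:          # i is a prime factor of n
            res.append((i, number))
            while n % i == 0:   # strip out every power of i
                n //= i
        i += 1
    return res

print(divisors(60))   # [(2, 60), (3, 60), (5, 60)]
print(divisors(35))   # [(5, 35), (7, 35)]
```

Only primes appear as keys: by the time a composite i is reached, all of its prime factors have already been divided out of n.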
The Partition step will pull together all pairs (i, number) that have the same key i. The list it constructs is actually the desired final list, so the Reduce step should simply copy the (key, value) pairs:
def identity(keyVal):
    return keyVal
Here is a test program:
from random import sample
from time import time

if __name__ == '__main__':
    # create list of 64 large random integers
    numbers = sample(range(10000000, 20000000), 64)
    t1 = time()
    SeqMapReduce(divisors, identity).process(numbers)
    t2 = time()
    MapReduce(divisors, identity).process(numbers)
    t3 = time()
    print('Sequential: {:5.2f} seconds.'.format(t2-t1))
    print('Parallel:   {:5.2f} seconds.'.format(t3-t2))
When you run this test on a computer with a multicore microprocessor, you should see the parallel MapReduce implementation run faster. Here is the result for a sample run using four cores:
Sequential: 26.77 seconds.
Parallel:   11.18 seconds.
The speedup is 2.39.
12.9 Write SQL queries on tables Hyperlinks and Keywords from Figure 12.2 that return these results:
12.10 Write SQL queries on table WeatherData in Figure 12.16 that return:
12.11 Using module sqlite3, create a database file weather.db and table WeatherData in it. Define the column names and types to match those in the table in Figure 12.16, then enter all the rows shown into the table.
12.12 Using sqlite3 and within the interactive shell, open the database file weather.db you created in Problem 12.11 and execute the queries from Problem 12.10 by running appropriate Python statements.
12.13 Let list lst be defined as
>>> lst = [23, 12, 3, 17, 21, 14, 6, 4, 9, 20, 19]
Write list comprehension expressions based on list lst that produce these lists:
12.14 Run program primeDensity.py with one, two, three, and four cores, or up to as many cores as you have on your computer, and record the running times. Then write a sequential version of the primeDensity.py program (using list comprehension, say) and record its running time. Compute the speedup for each execution of primeDensity.py with two or more cores.
12.15 Fine-tune the run time analysis of program properNames.py by recording the execution time of each step—Map, Partition, and Reduce—of MapReduce. (You will have to modify the class MapReduce to do this.) Which steps have better speedup than others?
12.16 Write function ranking() that takes as input the name of a database file containing a table named Hyperlinks of the same format as the table in Figure 12.2(a). The function should add to the database a new table that contains the number of incoming hyperlinks for every URL listed in the Link column of Hyperlinks. Name the new table and its columns Ranks, Url, and Rank, respectively. When executed against database file links.db, the wildcard query on the Ranks table should produce this output:
>>> cur.execute('SELECT * FROM Ranks')
<sqlite3.Cursor object at 0x15d2560>
>>> for record in cur:
...     print(record)
...
('five.html', 1)
('four.html', 3)
('one.html', 1)
('three.html', 1)
('two.html', 2)
12.17 Develop an application that takes the name of a text file as input, computes the frequency of every word in the file, and stores the resulting (word, frequency) pairs in a new table named Wordcounts of a new database file. The table should have columns Word and Freq for storing the (word, frequency) pairs.
12.18 Develop an application that displays, using Turtle graphics, the n most frequently occurring words in a text file. Assume that the word frequencies of the file have already been computed and stored in a database file such as the one created in Problem 12.17. Your application takes as input the name of this database file and the number n. It should then display the n most frequent words at random positions of a turtle screen. Try using different font sizes for the words: a very large font for the most frequently occurring word, a smaller font for the next two words, an even smaller font for the next four words, and so on.
12.19 In Practice Problem 12.4, we developed a simple search engine that ranks web pages based on word frequency. There are several reasons why that is a poor way to rank web pages, including the fact that it can be easily manipulated.
Modern search engines such as Google's use hyperlink information (among other things) to rank web pages. For example, if a web page has few incoming links, it probably does not contain useful information. If, however, a web page has many incoming hyperlinks, then it likely contains useful information and should be ranked high.
Using the links.db database file obtained by crawling through the pages in Figure 12.1, and also the Ranks table computed in Problem 12.16, redevelop the search engine from Practice Problem 12.4 so it ranks web pages by number of incoming links.
>>> search2('links.db')
Enter keyword: Paris
URL            RANK
four.html         3
two.html          2
one.html          1
Enter keyword:
12.20 The UNIX text search utility grep takes a text file and a regular expression and returns a list of lines in the text that contain a string that matches the pattern. Develop a parallel version of grep that takes from the user the name of a text file and the regular expression and then uses a pool of processes to search the lines of the file.
12.21 We used the program primeDensity.py to compare the densities of prime numbers in several large ranges of very large integers. In this problem, you will compare the densities of twin primes. Twin primes are pairs of primes whose difference is 2. The first few twin primes are 3 and 5, 5 and 7, 11 and 13, 17 and 19, and 29 and 31. Write an application that uses all the cores on your computer to compare the number of twin primes across the same ranges of integers we used in primeDensity.py.
12.22 Problem 10.25 asks you to develop function anagram() that uses a dictionary (i.e., a list of words) to compute all the anagrams of a given string. Develop panagram(), a parallel version of this function, that takes a list of words and computes a list of anagrams for each word.
12.23 At the end of this book there is an index, which maps words to page numbers of pages containing the words. A line index is similar: It maps words to line numbers of text lines in which they appear. Develop, using the MapReduce framework, an application that takes as input the name of a text file and also a set of words, and creates a line index. Your application should output the index to a file so words appear in alphabetical order, one word per line; the line numbers, for each word, should follow the word and be output in increasing order.
12.24 Redo Problem 12.16 using MapReduce to compute the number of incoming links for every web page.
12.25 A web-link graph is a description of the hyperlink structure of a set of linked web pages. One way to represent the web-link graph is with a list of (url, linksList) pairs with each pair corresponding to a web page; url refers to the URL of the page, and linksList is a list of URLs of hyperlinks contained in the page. Note that this information is easily collected by a web crawler.
The reverse web-link graph is another representation of the hyperlink structure of the set of web pages. It can be represented as a list of (url, incomingList) pairs with url referring to the URL of a web page and incomingList referring to a list of URLs of incoming hyperlinks. So the reverse web-link graph makes explicit incoming links rather than outgoing links. It is very useful for efficiently computing the Google PageRank of web pages.
Develop a function that takes a web-link graph, represented as described, and returns the reverse web-link graph.
12.26 A web server usually creates a log for every HTTP request it handles and appends the log string to a log file. Keeping a log file is useful for a variety of reasons. One particular reason is that it can be used to learn what resources—identified by URLs—managed by the server have been accessed and how often—referred to as the URL access frequency. In this problem, you will develop a program that computes the URL access frequency from a given log file.
Web server log entries are written in a standard format known as the Common Log Format. This format is used by the Apache httpd web server as well as other servers. A standard format makes it possible to develop log analysis programs that mine the access log file. A log file entry produced in the Common Log Format looks like this:
127.0.0.1 - - [16/Mar/2010:11:52:54 -0600] "GET /index.html HTTP/1.0" 200 1929
This log entry contains a lot of information. The key information, for our purposes, is the requested resource, index.html.
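One way to pull the requested resource out of such a line is with a regular expression; this sketch is one possible starting point, and the pattern and helper name are ours, not part of the problem statement:

```python
import re

def resource(logLine):
    '''return the requested resource in a Common Log Format line,
       or None if the line does not match'''
    # the request is the quoted field: "METHOD resource PROTOCOL"
    match = re.search(r'"[A-Z]+ (\S+) [^"]*"', logLine)
    return match.group(1) if match else None

line = ('127.0.0.1 - - [16/Mar/2010:11:52:54 -0600] '
        '"GET /index.html HTTP/1.0" 200 1929')
print(resource(line))   # /index.html
```

Counting how often each extracted resource appears, and writing the counts to a database table, is the remaining work of the problem.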
Write a program that computes the access frequency for each resource appearing in the log file and writes the information into a database table with columns for the resource URL and the access frequency. Writing the access frequency into a database makes the URL access frequency amenable to queries and analysis.
12.27 Write an application that computes a concordance of a set of novels using MapReduce. A concordance is a mapping that maps each word in a set of words to a list of sentences from the novels that contain the word. The input for the application is the set of names of text files containing the novels and the set of words to be mapped. You should output the concordance to a file.