Q&A

Q. When should I call the _Node constructor?

A. Just as with any other class, you should call the _Node constructor when you want to create a new _Node object (a new node in the linked list). You should not use it to create a new reference to an existing _Node object. For example, the code

oldfirst = _Node(item, next)
oldfirst = first

creates a new _Node object, then immediately loses track of the only reference to it. This code does not result in an error, but it is a bit untidy to create orphans for no reason.

Q. Why not define Node as a stand-alone class in a separate file named node.py?

A. By defining _Node in the same file as the Stack or Queue class (linkedstack.py or linkedqueue.py) and giving it a name that begins with an underscore, we encourage clients of the Stack and Queue classes not to use the _Node class directly. Our intention is that _Node objects be used only in the linkedstack.py and linkedqueue.py implementations, not in other clients.

Q. Should a client be allowed to insert the item None onto a stack or queue?

A. This question arises frequently when implementing collections in Python. Our implementations do permit the insertion of any object, including None.

Q. Are there standard Python modules for stacks and queues?

A. Not really. As noted earlier in this section, Python’s built-in list data type has operations that make it easy to implement a stack efficiently using a list. But the list data type also supports many additional operations that are not normally associated with a stack, such as indexed access and deletion of an arbitrary item. The prime advantage of restricting ourselves to the set of operations that we need (and only those operations) is that it makes it easier to develop an implementation that provides the best possible performance guarantees for those operations. Python also includes a data type collections.deque that implements a mutable sequence of items with efficient insertion and deletion at either the front or the back.
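For example, here is a quick illustration of both points, separate from the book's Stack and Queue classes (a sketch using only built-in types):

from collections import deque

stack = []                   # a Python list used as a pushdown stack
stack.append('to')           # push
stack.append('be')           # push
item = stack.pop()           # pop returns 'be' (LIFO order)

queue = deque()              # deque supports efficient operations at both ends
queue.append('to')           # insert at the back
queue.append('be')
item = queue.popleft()       # remove from the front, returns 'to' (FIFO order)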

Q. Why not have a single data type that implements methods to insert an item, remove the most recently inserted item, remove the least recently inserted item, remove a random item, iterate over the items, return the number of items in the collection, and whatever other operations we might desire? Then we could get them all implemented in a single class that could be used by many clients.

A. This is an example of a wide interface, which, as we pointed out in SECTION 3.3, should be avoided. As just mentioned, one reason to avoid wide interfaces is that it is difficult to construct implementations that are efficient for all operations. A more important reason is that narrow interfaces enforce a certain discipline on your programs, which makes client code much easier to understand. If one client uses Stack and another uses Queue, we have a good idea that the LIFO discipline is important to the first and the FIFO discipline is important to the second. Another approach is to use inheritance to try to encapsulate operations that are common to all collections. However, such implementations are best left for experts, whereas any programmer can learn to build implementations such as Stack and Queue.

Q. Is there any way that I can compose a client that uses both arraystack.py and linkedstack.py in the same program?

A. Yes, the easiest way is to add an as clause to the import statement, as below. In effect, this kind of import statement creates an alias for the name of the class and your code can then use that alias instead of the name of the class.

from arraystack  import Stack as ArrayStack
from linkedstack import Stack as LinkedStack
...
stack1 = ArrayStack()
stack2 = LinkedStack()

Exercises

4.3.1 Give the output written by python arraystack.py for the following input:

it was - the best - of times - - - it was - the - -

4.3.2 Give the contents and length of the array for arraystack.py after each operation for the following input:

it was - the best - of times - - - it was - the - -

4.3.3 Suppose that a client performs an intermixed sequence of push and pop operations on a Stack. The push operations put the integers 0 through 9 in order onto the stack; the pop operations write the return value. Which of the following sequence(s) could not occur?

a. 4 3 2 1 0 9 8 7 6 5
b. 4 6 8 7 5 3 2 9 0 1
c. 2 5 6 7 4 8 9 3 1 0
d. 4 3 2 1 0 5 6 7 8 9
e. 1 2 3 4 5 6 9 8 7 0
f. 0 4 6 5 3 8 1 7 2 9
g. 1 4 7 9 8 6 5 3 0 2
h. 2 1 4 3 6 5 8 7 9 0

4.3.4 Compose a stack client reverse.py that reads in strings from standard input and writes them in reverse order to standard output.

4.3.5 Compose a stack client parentheses.py that reads in a text stream from standard input and uses a stack to determine whether its parentheses are properly balanced. For example, your program should write True for [()]{}{[()()]()} and False for [(]).

4.3.6 Add the method __len__() to the Stack class in linkedstack.py (PROGRAM 4.3.2).

4.3.7 Add a method peek() to the Stack class in arraystack.py that returns the most recently inserted item on the stack (without popping it).

4.3.8 What does the following code fragment write when n is 50? Give a high-level description of what the code fragment does for a given positive integer n.

stack = Stack()
while n > 0:
    stack.push(n % 2)
    n //= 2
while not stack.isEmpty():
    stdio.write(stack.pop())
stdio.writeln()

Solution: It writes the binary representation of n (110010 when n is 50).

4.3.9 What does the following code fragment do to the queue queue?

stack = Stack()
while not queue.isEmpty(): stack.push(queue.dequeue())
while not stack.isEmpty(): queue.enqueue(stack.pop())

4.3.10 Draw an object-level trace diagram for the three-node example used to introduce linked lists in this section.

4.3.11 Compose a program that takes from standard input an expression without left parentheses and writes the equivalent infix expression with the parentheses inserted. For example, given the input

1 + 2 ) * 3 - 4 ) * 5 - 6 ) ) )

your program should write

( ( 1 + 2 ) * ( ( 3 - 4 ) * ( 5 - 6 ) ) )

4.3.12 Compose a filter infixtopostfix.py that converts a fully parenthesized arithmetic expression from infix to postfix.

4.3.13 Compose a program evaluatepostfix.py that reads a postfix expression from standard input, evaluates it, and writes the value to standard output. (Piping the output of your program from the previous exercise to this program gives equivalent behavior to evaluate.py.)

4.3.14 Suppose that a client performs an intermixed sequence of enqueue and dequeue operations on a Queue. The enqueue operations put the integers 0 through 9 in order onto the queue; the dequeue operations write the return value. Which of the following sequence(s) could not occur?

a. 0 1 2 3 4 5 6 7 8 9
b. 4 6 8 7 5 3 2 9 0 1
c. 2 5 6 7 4 8 9 3 1 0
d. 4 3 2 1 0 5 6 7 8 9

4.3.15 Compose a Queue client that takes an integer command-line argument k and writes the kth-from-the-last string found on standard input.

4.3.16 Give the running time of each operation in the following Queue class, where the item least recently inserted is at _a[0].

class Queue:
    def __init__(self):      self._a = []
    def isEmpty(self):       return len(self._a) == 0
    def __len__(self):       return len(self._a)
    def enqueue(self, item): self._a += [item]
    def dequeue(self):       return self._a.pop(0)

4.3.17 Give the running time of each operation in the following Queue class, where the item most recently inserted is at _a[0].

class Queue:
    def __init__(self):      self._a = []
    def isEmpty(self):       return len(self._a) == 0
    def __len__(self):       return len(self._a)
    def enqueue(self, item): self._a.insert(0, item)
    def dequeue(self):       return self._a.pop()

4.3.18 Modify mm1queue.py to make a program md1queue.py that simulates a queue for which the service times are fixed (deterministic) at rate μ. Verify Little’s law empirically for this model.

Linked-List Exercises

The following exercises are intended to give you experience in working with linked lists. The easiest way to work them is to make drawings using the visual representation described in the text.

4.3.19 Suppose x is a linked-list node. What is the effect of the following code fragment?

x.next = x.next.next

Solution: Deletes from the list the node immediately following x.

4.3.20 Compose a function find() that takes the first node in a linked list and an object key as arguments and returns True if some node in the list has key as its item field, and False otherwise.

4.3.21 Compose a function delete() that takes the first node in a linked list and an integer k as arguments and deletes the kth element in a linked list, if it exists.

4.3.22 Suppose that x is a linked-list node. What is the effect of the following code fragment?

t.next = x.next
x.next = t

Solution: Inserts node t immediately after node x.

4.3.23 Why does the following code fragment not have the same effect as the code fragment in the previous question?

x.next = t
t.next = x.next

Solution: When it comes time to update t.next, x.next is no longer the original node following x, but is instead t itself!

4.3.24 Compose a function removeAfter() that takes a linked-list node as an argument and removes the node following the given one (and does nothing if the argument or the next field in the argument node is None).

4.3.25 Compose a function copy() that takes a linked-list node as an argument and creates a new linked list with the same sequence of items, without destroying the original linked list.

4.3.26 Compose a function remove() that takes a linked-list node and an object item as arguments and removes every node in the list whose item is item.

4.3.27 Compose a function listmax() that takes the first node in a linked list as an argument and returns the value of the maximum item in the list. Assume that the items are comparable, and return None if the list is empty.

4.3.28 Develop a recursive solution to the previous question.

4.3.29 Compose a function that takes the first node in a linked list as an argument and reverses the list, returning the first node in the result.

Iterative solution: To accomplish this task, we maintain references to three consecutive nodes in the linked list: reverse, first, and second. At each iteration, we extract the node first from the original linked list and insert it at the beginning of the reversed list. We maintain the invariant that first is the first node of what’s left of the original list, second is the second node of what’s left of the original list, and reverse is the first node of the resulting reversed list.

def reverse(first):
    reverse = None
    while first is not None:
        second = first.next
        first.next = reverse
        reverse = first
        first = second
    return reverse

When composing code involving linked lists, we must always be careful to properly handle the exceptional cases (when the linked list is empty, when the list has only one or two nodes) and the boundary cases (dealing with the first or last items). This is usually much trickier than handling the normal cases.

4.3.30 Compose a recursive function to write the elements of a linked list in reverse order. Do not modify any of the links. Easy: Use quadratic time, constant extra space. Also easy: Use linear time, linear extra space. Not so easy: Develop a divide-and-conquer algorithm that uses linearithmic time and logarithmic extra space.

Recursive solution to the previous exercise (reversing the list by modifying links, in linear time): We recursively reverse the part of the list starting at the second node, and then carefully append the first node to the end.

def reverse(first):
    if first is None:
        return None
    if first.next is None:
        return first
    second = first.next
    rest = reverse(second)
    second.next = first
    first.next = None
    return rest

4.3.31 Compose a recursive function to randomly shuffle the elements of a linked list by modifying the links. Easy: Use quadratic time, constant extra space. Not so easy: Develop a divide-and-conquer algorithm that takes linearithmic time and uses logarithmic extra memory. For the “merging” step, see EXERCISE 1.4.38.

Creative Exercises

4.3.32 Deque. A double-ended queue or deque (pronounced “deck”) is a combination of a stack and a queue. Compose a class Deque that uses a linked list to implement this API:

[Image: API table for Deque]

4.3.33 Josephus problem. In the Josephus problem from antiquity, n people are in dire straits and agree to the following strategy to reduce the population. They arrange themselves in a circle (at positions numbered from 0 to n-1) and proceed around the circle, eliminating every mth person until only one person is left. Legend has it that Josephus figured out where to sit to avoid being eliminated. Compose a Queue client josephus.py that takes n and m from the command line and writes the order in which people are eliminated (and thus would show Josephus where to sit in the circle).

% python josephus.py 7 2
1 3 5 0 4 2 6

4.3.34 Merging two sorted queues. Given two queues with strings in ascending order, move all of the strings to a third queue so that the third queue ends up with the strings in ascending order.

4.3.35 Nonrecursive mergesort. Given n strings, create n queues, each containing one of the strings. Create a queue of the n queues. Then, repeatedly apply the sorted merging operation to the first two queues and reinsert the merged queue at the end. Repeat until the queue of queues contains only one queue.

4.3.36 Delete ith element. Implement a class that supports the following API:

[Image: API table for the delete-ith-element class]

First, develop an implementation that uses a Python list (resizing array), and then develop one that uses a linked list. (See EXERCISE 4.4.69 for a more efficient implementation that uses a binary search tree.)

4.3.37 Queue with two stacks. Show how to implement a queue using two stacks (and only a constant amount of extra memory) so that each queue operation uses a constant amortized number of stack operations.
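One standard approach is sketched below (it is not necessarily the only solution, and it uses Python lists as the two stacks): push incoming items onto an "in" stack and pop outgoing items from an "out" stack, moving items from one stack to the other only when the "out" stack is empty. Each item is pushed and popped at most twice, which gives the constant amortized bound.

class TwoStackQueue:
    def __init__(self):
        self._in = []     # receives enqueued items
        self._out = []    # serves dequeued items in FIFO order

    def enqueue(self, item):
        self._in.append(item)

    def dequeue(self):
        # Refill the out stack only when it is empty; popping everything
        # from the in stack reverses the order, restoring FIFO order.
        if not self._out:
            while self._in:
                self._out.append(self._in.pop())
        return self._out.pop()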

4.3.38 Ring buffer. A ring buffer, or circular queue, is a FIFO data structure of a fixed capacity n. It is useful for transferring data between asynchronous processes or for storing log files. When the buffer is empty, the consumer waits until data is deposited; when the buffer is full, the producer waits to deposit data. Develop an API for a ring buffer and an implementation that uses an array representation (with circular wrap-around).

4.3.39 Move-to-front. Read in a sequence of characters from standard input and maintain the characters in a linked list with no duplicates. When you read in a previously unseen character, insert it at the front of the list. When you read in a duplicate character, delete it from the list and reinsert it at the beginning. Name your program movetofront.py: it implements the well-known move-to-front strategy, which is useful for caching, data compression, and many other applications where items that have been recently accessed are more likely to be reaccessed.

4.3.40 Random queue. A random queue stores a collection of items as per this API:

[Image: API table for RandomQueue]

Compose a class RandomQueue that implements this API. Hint: Use a Python list (resizing array) representation, as in PROGRAM 4.3.1. To remove an item, swap one at a random position (indexed 0 through n-1) with the one at the last position (index n-1). Then delete and return the last object. Compose a client that writes a deck of cards in random order using RandomQueue.
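The removal idea in the hint can be sketched as follows (illustrative names, not the book's RandomQueue class), where a is the Python list holding the items:

import random

def _random_dequeue(a):
    r = random.randrange(len(a))    # random index between 0 and n-1
    a[r], a[-1] = a[-1], a[r]       # swap that item with the last item
    return a.pop()                  # delete and return the last item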

4.3.41 Topological sort. You have to sequence the order of n jobs that are numbered from 0 to n-1 on a server. Some of the jobs must complete before others can begin. Compose a program topologicalsorter.py that takes n as a command-line argument and a sequence on standard input of ordered pairs of jobs i j, and then writes a sequence of integers such that for each pair i j in the input, job i appears before job j. Use the following algorithm: First, from the input, build, for each job, (1) a queue of the jobs that must follow it and (2) its indegree (the number of jobs that must come before it). Then, build a queue of all jobs whose indegree is 0 and repeatedly delete some job with zero indegree, maintaining all the data structures. This process has many applications; for example, you can use it to model course prerequisites for your major and find a sequence of courses to take that allows you to graduate.

4.3.42 Text editor buffer. Develop a data type for a buffer in a text editor that implements the following API:

[Image: API table for the text-editor buffer]

Hint: Use two stacks.

4.3.43 Copy a stack. Create a copy() method for the linked-list implementation of Stack so that

stack2 = stack1.copy()

makes stack2 a reference to a new and independent copy of the stack stack1. You should be able to push and pop from either stack1 or stack2 without influencing the other.

4.3.44 Copy a queue. Create a copy() method for the linked-list implementation of Queue so that

queue2 = queue1.copy()

makes queue2 a reference to a new and independent copy of the queue queue1. Hint: Delete all of the items from queue1 and add these items to both queue1 and queue2.

4.3.45 Stack with explicit resizing array. Implement a stack using an explicit resizing array: initialize an empty stack by using an array of length 1 as an instance variable; double the length of the array when it becomes full and halve the length of the array when it becomes one-fourth full.

Solution:

import stdarray

class Stack:

    def __init__(self):
        self._a = [None]
        self._n = 0

    def isEmpty(self):
        return self._n == 0

    def __len__(self):
        return self._n

    def _resize(self, capacity):
        temp = stdarray.create1D(capacity)
        for i in range(self._n):
            temp[i] = self._a[i]
        self._a = temp

    def push(self, item):
        if self._n == len(self._a):
            self._resize(2 * self._n)
        self._a[self._n] = item
        self._n += 1

    def pop(self):
        self._n -= 1
        item = self._a[self._n]
        self._a[self._n] = None
        if (self._n > 0) and (self._n == len(self._a) // 4):
            self._resize(len(self._a) // 2)
        return item

4.3.46 Queue with explicit resizing array. Implement a queue using an explicit resizing array so that all operations take constant amortized time. Hint: The challenge is that the items will “crawl across” the array as items are added to and removed from the queue. Use modular arithmetic to maintain the array indices of the items at the front and back of the queue.

Image
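A sketch of the circular-index idea from the hint (illustrative, not necessarily the intended solution): _first is the index of the least recently inserted item, _n is the number of items, and all indices wrap around modulo the array length.

class Queue:
    def __init__(self):
        self._a = [None]       # resizing array
        self._first = 0        # index of the front item
        self._n = 0            # number of items

    def _resize(self, capacity):
        temp = [None] * capacity
        for i in range(self._n):
            temp[i] = self._a[(self._first + i) % len(self._a)]
        self._a = temp
        self._first = 0

    def enqueue(self, item):
        if self._n == len(self._a):
            self._resize(2 * self._n)
        self._a[(self._first + self._n) % len(self._a)] = item
        self._n += 1

    def dequeue(self):
        item = self._a[self._first]
        self._a[self._first] = None          # avoid loitering
        self._first = (self._first + 1) % len(self._a)
        self._n -= 1
        if (self._n > 0) and (self._n == len(self._a) // 4):
            self._resize(len(self._a) // 2)
        return item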

4.3.47 Queue simulations. Study what happens when you modify mm1queue.py to use a stack instead of a queue. Does Little’s law hold? Answer the same question for a random queue. Plot histograms and compare the standard deviations of the waiting times.

4.3.48 Load-balancing simulations. Revise loadbalance.py to write the average queue length and the maximum queue length instead of plotting the histogram, and use it to run simulations for 1 million items on 100,000 queues. Write the average value of the maximum queue length for 100 trials each with sample sizes 1, 2, 3, and 4. Do your experiments validate the conclusion drawn in the text about using a sample of size 2?

4.3.49 Listing files. A folder is a list of files and subfolders. Compose a program that takes the name of a folder as a command-line argument and writes all of the file names contained in that folder, with the contents of each folder recursively listed (indented) under that folder’s name. Hint: Use a queue, and see the listdir() function defined in Python’s os module.

4.4 Symbol Tables

A symbol table is a data type that we use to associate values with keys. Clients can store (put) an entry into the symbol table by specifying a key–value pair and then can retrieve (get) the value corresponding to a particular key from the symbol table. For example, a university might associate information such as a student’s name, home address, and grades (the value) with that student’s Social Security number (the key), so that each student’s record can be accessed by specifying a Social Security number. The same approach might be appropriate for a scientist who needs to organize data, a business that needs to keep track of customer transactions, an Internet search engine that has to associate keywords with web pages, or countless other applications.

Because of their fundamental importance, symbol tables have been heavily used and studied since the early days of computing. The development of implementations that guarantee good performance continues to be an active area of research. Several other operations on symbol tables beyond the characteristic put and get operations naturally arise, and guaranteeing good performance for richer sets of operations can be quite challenging.

In this chapter we consider a basic API for the symbol-table data type. Our API adds to the put and get operations the abilities to test whether any value has been associated with a given key (contains) and to iterate over the keys. We also consider an extension to the API for the case where keys are comparable, which admits a number of useful operations.

We also consider two classic implementations. The first uses an operation known as hashing, which transforms keys into array indices that we can use to access values. The second is based on a data structure known as the binary search tree (BST). Both are remarkably simple solutions that perform well in many practical situations and also serve as the basis for the industrial-strength symbol-table implementations that are found in modern programming environments. The code that we consider for symbol tables is only slightly more complicated than the resizing array and linked-list code that we considered for stacks and queues, but it will introduce you to new dimensions in structuring data that have far-reaching impact.

API

A symbol table is a collection of key–value pairs—every symbol-table entry associates a value with a key, as follows:

[Image: API table for the symbol-table data type]

In this section, we discuss this API, clients, implementations, and extensions. The API is consistent with the API for Python’s built-in dict data type, which we consider later in this section. The API already reflects several design decisions, which we now enumerate.

Associative arrays

We overload the [] operator for the two basic operations put and get. In client code this means that we can think of a symbol table as an associative array, where we can use standard indexing syntax with any type of data inside the square brackets, instead of an integer between 0 and one less than the length, as for an array. Thus, we can associate an amino acid name with a codon using client code like

amino['TTA'] = 'Leucine'

and we can later access the name associated with a given codon with client code like

stdio.writeln(amino['TTA'])

That is, an associative array reference is a get operation, unless it is on the left side of an assignment statement, when it is a put operation. We can support these operations by implementing the special methods __getitem__() and __setitem__(). Thinking in terms of associative arrays is a good way to understand the basic purpose of symbol tables.
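Concretely, Python translates the associative-array syntax into calls on these special methods. A sketch (with amino standing for any object whose class defines them):

amino['TTA'] = 'Leucine'    # calls __setitem__('TTA', 'Leucine'), a put operation
name = amino['TTA']         # calls __getitem__('TTA'), a get operation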

Replace-the-old-value policy

If a value is to be associated with a key that already has an associated value, we adopt the convention that the new value replaces the old one (just as with an array assignment statement). Again, this is what one would expect from the associative-array abstraction. The key in st operation, supported by the special method __contains__(), gives the client the flexibility to check whether a key is already present and thereby avoid replacing an existing value, if desired.
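For example, for any symbol table st that follows this convention (a fragment in the style of the examples above):

st['TTA'] = 'Leu'
st['TTA'] = 'Leucine'       # replaces the old value
stdio.writeln(st['TTA'])    # writes Leucine, not Leu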

Not found

The call st[key] raises a KeyError if no value has been associated with key in the table. An alternative design would be to return None in such cases.
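A client that does not want to rely on exception handling can guard the access with the key in st test just described; alternatively, it can catch the error (a sketch):

try:
    val = st[key]
except KeyError:
    val = None              # no value associated with key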

None keys and values

Clients may use None as a key or value, though they typically do not do so. An alternative design would be to disallow None keys, None values, or both.

Iterable

To support the for key in st: construct, Python’s convention is that we need to implement a special method __iter__() that returns an iterator, a special data type that includes methods that are called at the beginning and for each iteration of the for loop. We consider Python’s mechanism for iteration at the end of this section.

Remove

Our basic API does not include a method for removing keys from the symbol table. Some applications do require such a method, and Python provides the special syntax del st[key] that can be supported by implementing the special method __delitem__(). We leave implementations as exercises or for a more advanced course in algorithms.

Immutable keys

We assume the keys do not change their value while in the symbol table. The simplest and most commonly used types of keys (integers, floats, and strings) are immutable. If you think about it, you will see that this is a very reasonable assumption! If a client changes a key, how could the symbol-table implementation keep track of that fact?

Variations

Computer scientists have identified numerous other useful operations on symbol tables, and APIs based on various subsets of them have been widely studied. We consider several of these operations throughout this section, and particularly in the exercises at the end.

Comparable keys

In many applications, the keys may be integers, floats, strings, or other types of data that have a natural order. In Python, as discussed in SECTION 3.3, we expect such keys to be comparable. Symbol tables with comparable keys are important for two reasons. First, we can take advantage of key ordering to develop implementations of put and get that can guarantee the performance specifications in the API. Second, a whole host of new operations come to mind (and can be supported) with comparable keys. A client might want the smallest key, or the largest, or the median, or to iterate over the keys in sorted order. Full coverage of this topic is more appropriate for a book on algorithms and data structures, but we examine a typical client and an implementation of such a data type later in this section. A partial API is shown at the bottom of this page.

Symbol tables are among the most widely studied data structures in computer science, so the impact of these and many alternative design decisions has been carefully studied, as you will learn if you take later courses in computer science. In this section, our approach is to introduce you to the most important properties of symbol tables by considering two prototypical client programs, developing two classic and efficient implementations, and studying the performance characteristics of those implementations, to show you that they can meet the needs of typical clients, even when the symbol tables are huge.

[Image: partial API for symbol tables with comparable keys]

Symbol table clients

Once you gain some experience with the idea, you will realize that symbol tables are broadly useful. To convince you of this fact, we start with two prototypical examples, each of which arises in a large number of important and familiar practical applications.

Dictionary lookup

The most basic kind of symbol-table client builds a symbol table with successive put operations to support get requests. We maintain a collection of data so that we can quickly access the data we need. Most applications also take advantage of the idea that a symbol table is a dynamic dictionary, where it is easy both to look up information and to update the information in the table. The following list of familiar examples illustrates the utility of this approach.

Phone book. When keys are people’s names and values are their phone numbers, a symbol table models a phone book. A very significant difference from a printed phone book is that we can add new names or change existing phone numbers. We could also use the phone number as the key and the name as the value. If you have never done so, try typing your phone number (with area code) into the search field in your browser.

Dictionary. Associating a word with its definition is a familiar concept that gives us the name “dictionary.” For centuries people kept printed dictionaries in their homes and offices so that they could check the definitions and spellings (values) of words (keys). Now, because of good symbol-table implementations, people expect built-in spell checkers and immediate access to word definitions on their computers.

Account information. People who own stock now regularly check the current price on the web. Several services on the web associate a ticker symbol (key) with the current price (value), usually along with a great deal of other information. Commercial applications of this sort abound, including financial institutions associating account information with a name or account number or educational institutions associating grades with a student name or identification number.

Image

Genomics. Symbols play a central role in modern genomics, as we have already seen (see PROGRAM 3.1.1). The simplest example is the use of the letters A, C, T, and G to represent the nucleotides found in the DNA of living organisms. The next simplest is the correspondence between codons (nucleotide triplets) and amino acids (TTA corresponds to leucine, TCT to serine, and so forth), then the correspondence between sequences of amino acids and proteins, and so forth. Researchers in genomics routinely use various types of symbol tables to organize this knowledge.

Experimental data. From astrophysics to zoology, modern scientists are awash in experimental data, and organizing and efficiently accessing this data is vital to understanding what it means. Symbol tables are a critical starting point, and advanced data structures and algorithms that are based on symbol tables are now an important part of scientific research.

Programming languages. One of the earliest uses of symbol tables was to organize information for programming. At first, programs were simply sequences of numbers, but programmers very quickly found that using symbolic names for operations and memory locations (variable names) was far more convenient. Associating the names with the numbers requires a symbol table. As the size of programs grew, the cost of the symbol-table operations became a bottleneck in program development time, which led to the development of data structures and algorithms like the ones we consider in this section.

Files. We use symbol tables regularly to organize data on computer systems. Perhaps the most prominent example is the file system, where we associate a file name (key) with the location of its contents (value). Your music player uses the same system to associate song titles (keys) with the location of the music itself (value).

Internet DNS. The domain name system (DNS) that is the basis for organizing information on the Internet associates URLs (keys) that humans understand (such as www.princeton.edu or www.wikipedia.org) with IP addresses (values) that computer network routers understand (such as 208.216.181.15 or 207.142.131.206). This system is the next-generation “phone book.” Thus, humans can use names that are easy to remember and machines can efficiently process the numbers. The number of symbol-table lookups done each second for this purpose on Internet routers around the world is huge, so performance is of obvious importance. Millions of new computers and other devices are put onto the Internet each year, so these symbol tables on Internet routers need to be dynamic.

% more amino.csv
TTT,Phe,F,Phenylalanine
TTC,Phe,F,Phenylalanine
TTA,Leu,L,Leucine
TTG,Leu,L,Leucine
TCT,Ser,S,Serine
TCC,Ser,S,Serine
TCA,Ser,S,Serine
TCG,Ser,S,Serine
TAT,Tyr,Y,Tyrosine
TAC,Tyr,Y,Tyrosine
TAA,Stop,Stop,Stop
...
GCA,Ala,A,Alanine
GCG,Ala,A,Alanine
GAT,Asp,D,Aspartic Acid
GAC,Asp,D,Aspartic Acid
GAA,Glu,E,Glutamic Acid
GAG,Glu,E,Glutamic Acid
GGT,Gly,G,Glycine
GGC,Gly,G,Glycine
GGA,Gly,G,Glycine
GGG,Gly,G,Glycine
% more djia.csv
...
20-Oct-87,1738.74,608099968,1841.01
19-Oct-87,2164.16,604300032,1738.74
16-Oct-87,2355.09,338500000,2246.73
15-Oct-87,2412.70,263200000,2355.09
...
30-Oct-29,230.98,10730000,258.47
29-Oct-29,252.38,16410000,230.07
28-Oct-29,295.18,9210000,260.64
25-Oct-29,299.47,5920000,301.22
...
% more ip.csv
...
www.ebay.com,66.135.192.87
www.princeton.edu,128.112.128.15
www.cs.princeton.edu,128.112.136.35
www.harvard.edu,128.103.60.24
www.yale.edu,130.132.51.8
www.cnn.com,64.236.16.20
www.google.com,216.239.41.99
www.nytimes.com,199.239.136.200
www.apple.com,17.112.152.32
www.slashdot.org,66.35.250.151
www.espn.com,199.181.135.201
www.weather.com,63.111.66.11
www.yahoo.com,216.109.118.65
...

Typical comma-separated-value (CSV) files

Despite its length, this list is still just a representative sample, intended to give you a flavor of the scope of applicability of the symbol-table abstraction. Whenever you specify something by name, there is a symbol table at work. Your computer’s file system or the web might do the work for you, but there is still a symbol table there somewhere.

PROGRAM 4.4.1 (lookup.py) builds a set of key–value pairs from a file of comma-separated values (see SECTION 3.1) as specified on the command line and then writes values corresponding to the keys read from standard input. The command-line arguments are the file name and two integers, one specifying the field to serve as the key and the other specifying the field to serve as the value. Examples of similar but slightly more sophisticated test clients are described in the exercises. For instance, we could make the dictionary dynamic by also allowing standard-input commands to change the value associated with a key (see EXERCISE 4.4.1).

Your first step in understanding symbol tables is to download lookup.py and hashst.py (the symbol-table implementation that we consider next) from the booksite to do some symbol-table searches. You can find numerous comma-separated-value (.csv) files that are related to various applications that we have described, including amino.csv (codon-to-amino-acid encodings), djia.csv (opening price, volume, and closing price of the stock market average, for every day in its history), and ip.csv (a selection of entries from the DNS database). When choosing which field to use as the key, remember that each key must uniquely determine a value. If there are multiple put operations to associate values with a key, the table will remember only the most recent one (think about associative arrays). We will consider next the case where we want to associate multiple values with a key.


Program 4.4.1 Dictionary lookup (lookup.py)

import sys
import stdio
from instream import InStream
from hashst import SymbolTable

instream = InStream(sys.argv[1])
keyField = int(sys.argv[2])
valField = int(sys.argv[3])

database = instream.readAllLines()

st = SymbolTable()
for line in database:
    tokens = line.split(',')
    key = tokens[keyField]
    val = tokens[valField]
    st[key] = val

while not stdio.isEmpty():
    query = stdio.readString()
    if query in st: stdio.writeln(st[query])
    else:           stdio.writeln('Not found')


 instream  | input stream (.csv)
 keyField  | key position
 valField  | value position
database[] | lines in input
    st     | symbol table
  tokens   | values on a line
   key     | key
   val     | value
  query    | query string

This data-driven symbol table client reads key–value pairs from a comma-separated file, then writes values corresponding to keys on standard input. Both keys and values are strings.


% python lookup.py amino.csv 0 3
TTA
Leucine
ABC
Not found
TCT
Serine


% python lookup.py amino.csv 3 0
Glycine
GGG


% python lookup.py ip.csv 0 1
www.google.com
216.239.41.99

% python lookup.py ip.csv 1 0
216.239.41.99
www.google.com

% python lookup.py djia.csv 0 1
29-Oct-29
252.38


Later in this chapter, we will see that the cost of associative array references in lookup.py can be constant or logarithmic. This fact implies that you may experience a small delay getting the answer to your first request (for all the put operations to build the table), but you get immediate response for all the others.

Indexing

PROGRAM 4.4.2 (index.py) is a prototypical example of a symbol-table client for comparable keys. It reads in a list of strings from standard input and writes a sorted table of all the different strings along with a list of integers for each string specifying the positions where it appears in the input. In this case, we seem to be associating multiple values with each key, but we are actually associating just one: a Python list. Again, this problem has familiar applications:

Image

Book index. Every textbook has an index where you look up a word and get the page numbers containing that word. While no reader wants to see every word in the book in an index, a program like index.py can provide a starting point for creating a good index.

Programming languages. In a large program that uses a large number of symbols, it is useful to know where each name is used. A program like index.py can be a valuable tool to help programmers keep track of where symbols are used in their programs. Historically, an explicit printed symbol table was one of the most important tools used by programmers to manage large programs. In modern systems, symbol tables are the basis of software tools that programmers use to manage names of modules in systems.

Genomics. In a typical (if oversimplified) scenario in genomics research, a scientist wants to know the positions of a given genetic sequence in an existing genome or set of genomes. Existence or proximity of certain sequences may be of scientific significance. The starting point for such research is an index like the one produced by index.py, modified to take into account the fact that genomes are not separated into words.


Program 4.4.2 Indexing (index.py)

import sys
import stdio
from bst import OrderedSymbolTable

minLength = int(sys.argv[1])
minCount = int(sys.argv[2])

words = stdio.readAllStrings()

bst = OrderedSymbolTable()
for i in range(len(words)):
    word = words[i]
    if len(word) >= minLength:
        if word not in bst: bst[word] = []
        bst[word] += [i]

for word in bst:
    if len(bst[word]) >= minCount:
        stdio.write(word + ': ')
        for i in bst[word]:
            stdio.write(str(i) + ' ')
        stdio.writeln()


minLength | minimum length
minCount  | count threshold
  bst     | ordered symbol table
  word    | current word
bst[word] | array of positions for current word

This program accepts integers minLength and minCount as command-line arguments, reads all of the words from standard input and creates a sorted index indicating where each word appears within standard input. It considers only words that have at least minLength characters and it writes only words that occur at least minCount times. The computation is based on a symbol table of comparables, where each key is a word and each corresponding value is an array of positions where the word occurs in standard input.


% python index.py 9 30 < tale.txt
confidence: 2794 23064 25031 34249 47907 48268 48577 ...
courtyard: 11885 12062 17303 17451 32404 32522 38663 ...
evremonde: 86211 90791 90798 90802 90814 90822 90856 ...
expression: 3777 5575 6574 7116 7195 8509 8928 15015 ...
gentleman: 2521 5290 5337 5698 6235 6301 6326 6338 ...
influence: 27809 36881 43141 43150 48308 54049 54067 ...
monseigneur: 85 90 36587 36590 36611 36636 36643 ...
...


Web search. When you type a keyword and get a list of websites containing that keyword, you are using an index created by your web search engine. There is one value (the list of pages) associated with each key (the query), although the reality is a bit more dynamic and complicated because we often specify multiple keys and the pages are spread through the web, not kept in a table on a single computer.

Account information. One way for a company that maintains customer accounts to keep track of a day’s transactions is to keep an index of the list of the transactions. The key is the account number; the value is the list of occurrences of that account number in the transaction list.

To cut down on the amount of output, index.py takes two integer command-line arguments. The first is the minimum string length to include in the symbol table, and the second is the minimum number of occurrences (among the words that appear in the text) to include in the printed index. Again, several similar clients for various useful tasks are discussed in the exercises. For example, one common scenario is to build multiple indices on the same data, using different keys. In our account example, one index might use customer account numbers for keys and another might use vendor account numbers.

As with lookup.py, you are certainly encouraged to download index.py and bst.py (the symbol-table implementation for comparable keys that we consider later) from the booksite and run them on various input files to gain further appreciation for the utility of symbol tables for comparable keys. If you do so, you will find that the programs can build large indices for huge files with little delay, because each symbol-table operation is taken care of immediately.

One reason for the proliferation of algorithms and implementations is that the needs of symbol-table clients can vary widely. On the one hand, when the symbol table or the number of operations to be performed is small, any implementation will do. On the other hand, symbol tables for some applications are so huge that they are organized as databases that reside on external storage or the web. In this section, we focus on the huge class of clients like index.py and lookup.py whose needs fall between these extremes, where we need to be able to use associative assignments to build and maintain large tables dynamically while also responding immediately to a large number of associative lookups. Providing this immediate response for huge dynamic tables is one of the classic contributions of algorithmic technology.

Elementary symbol-table implementations

All of these examples are persuasive evidence of the importance of symbol tables. Symbol-table implementations have been heavily studied, many different algorithms and data structures have been invented for this purpose, and modern programming environments (including Python) provide direct support. As usual, knowing how a basic implementation works will help you appreciate, choose among, and more effectively use the advanced ones, or help implement your own version for some specialized situation that you might encounter.

To begin, we briefly consider four different elementary implementations, based on the two basic data structures that we have encountered: resizing arrays and linked lists. Our purpose in doing so is to establish that we need a more sophisticated data structure, as none of these achieve the performance requirements in our API. Each implementation uses linear time for one of the operations, which makes each of them unsuitable for large practical applications.

Perhaps the simplest implementation is to use sequential search with two parallel (resizing) arrays, one for keys and one for values, as follows (see EXERCISE 4.4.10 and EXERCISE 4.4.33 for implementations of __contains__() and __iter__(), respectively):

# sequential search implementation of a symbol table
class SymbolTable:

    def __init__(self):
        self._keys = []
        self._vals = []

    def __getitem__(self, key):
        for i in range(len(self._keys)):
            if self._keys[i] == key:
                return self._vals[i]
        raise KeyError

    def __setitem__(self, key, val):
        for i in range(len(self._keys)):
            if self._keys[i] == key:
                self._vals[i] = val
                return
        self._keys += [key]
        self._vals += [val]

Image

The key array is unordered—we just append new keys at the end. As usual, this is an (amortized) constant-time operation for Python lists. But a search for a key in the symbol table is typically a linear-time operation: For example, if we are looking for a key that is not in the table, we have to examine all of the keys.
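For reference, a minimal sketch of the two special methods deferred to the exercises, assuming the same parallel-list representation (these would be added to the class above):

    def __contains__(self, key):       # supports: key in st
        for k in self._keys:
            if k == key:
                return True
        return False

    def __iter__(self):                # supports: for key in st
        return iter(self._keys[:])     # iterate over a copy of the keys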

To respond to the linear-time search cost, we might use a sorted (resizing) array for the keys. Actually, we already explored this idea when we considered binary search in SECTION 4.2. It is not difficult to build a symbol-table implementation that is based on binary search (see EXERCISE 4.4.5), but such an implementation is not feasible for use with a client like index.py, because it depends on maintaining a resizing array sorted in order of the keys. Each time a new key is added, larger keys have to be shifted one position higher in the array, which implies that the total time required to build the table is quadratic in the size of the table.

Alternatively, we might consider basing an implementation on an unordered linked list, because we can quickly add new key–value pairs at the beginning. But such an implementation is also not feasible for use by typical clients, because the only way to search for a key in a linked list is to traverse its links, so the total time required for searches is the product of the number of searches and the size of the table, which again is prohibitive. Associating a key with a value is also slow because we have to search first to avoid putting duplicate keys in the symbol table. Even keeping the linked list in sorted order does not help much—for example, traversing all the links is still required to add a new node to the end of the linked list.

To implement a symbol table that is feasible for use with clients such as lookup.py and index.py, we need data structures that are more flexible and efficient than these elementary examples. Next, we consider two examples of such data structures: the hash table and the binary search tree.

Hash tables

A hash table is a data structure in which we divide the keys into small groups that can be quickly searched. The basic idea is simple. We choose a parameter m and divide the keys into m groups, which we expect to be about equal in size. For each group, we keep the keys in a list and use sequential search, as in the elementary implementation we just considered.

To divide the keys into small groups, we use a function called a hash function that maps every possible key into a hash value—an integer between 0 and m – 1. This enables us to model the symbol table as a fixed-length array of lists and use the hash value as an array index to access the desired list. In Python, we can implement both the fixed-length array and the lists using the built-in list data type.

Hashing is widely useful, so many programming languages include direct support for it. As we saw in SECTION 3.3, Python provides the built-in hash() function for this purpose, which takes a hashable object as an argument and returns an integer hash code. To convert that hash code to a hash value between 0 and m – 1, we use the expression

hash(x) % m

Recall that an object is hashable if it satisfies the following three properties:

• The object can be compared for equality with other objects.

• Whenever two objects compare as equal, they have the same hash code.

• The object’s hash code does not change during its lifetime.

Objects that are not equal may have the same hash code. However, for good performance, we expect the hash function to divide our keys into m groups of roughly equal size.

At right are hash codes and hash values for 12 representative string keys, with m = 5. The hash codes and hash values on your system may differ, due to a different or randomized hash() implementation.
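You can inspect hash codes and hash values for keys of your own with a fragment like the following (the specific numbers will differ from the table, since Python randomizes string hash codes between runs):

import stdio

m = 5
for key in ['TTA', 'TCT', 'GGT', 'GCC']:
    stdio.writeln(key + ' ' + str(hash(key)) + ' ' + str(hash(key) % m))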

With this preparation, implementing an efficient symbol table with hashing is a straightforward extension of the sequential search code that we have already considered. For the keys, we maintain an array of m lists, with element i containing a Python list of keys whose hash value is i. For the values, we maintain a parallel array of m lists, so that when we have located a key, we can access the corresponding value using the same indices. PROGRAM 4.4.3 (hashst.py) is a full implementation, using a fixed number m of lists (1,024 by default).

The efficiency of hashst.py depends on the value of m and the quality of the hash function. Assuming the hash function reasonably distributes the keys, performance is about m times faster than that for sequential search, at the cost of m extra references and lists. This is a classic space–time tradeoff: the higher the value of m, the more space we use, but the less time we spend.

[Image: hash codes and hash values for 12 representative string keys, with m = 5]

Program 4.4.3 Hash table (hashst.py)

import stdio
import stdarray

class SymbolTable:

    def __init__(self, m=1024):
        self._m = m
        self._keys = stdarray.create2D(m, 0)
        self._vals = stdarray.create2D(m, 0)

    def __getitem__(self, key):
        i = hash(key) % self._m
        for j in range(len(self._keys[i])):
            if self._keys[i][j] == key:
                return self._vals[i][j]
        raise KeyError

    def __setitem__(self, key, val):
        i = hash(key) % self._m
        for j in range(len(self._keys[i])):
            if self._keys[i][j] == key:
                self._vals[i][j] = val
                return
        self._keys[i] += [key]
        self._vals[i] += [val]


instance variables
  _m  | number of lists
_keys | array of lists (keys)
_vals | array of lists (values)

This program uses two parallel arrays of lists to implement a hash table. Each list is represented as a Python list of varying length. The hash function selects one of the m lists. When there are n keys in the table, the average cost of a put or get operation is n/m, for suitable hash() functions. This cost per operation is constant if we use a resizing array to ensure that the average number of keys per list is between 1 and 8 (see EXERCISE 4.4.12). We defer implementations of __contains__() to EXERCISE 4.4.11 and __iter__() to EXERCISE 4.4.34.


The figure at left shows the symbol table built for our sample keys, inserted in the order given on page 648. (To save space, we omit the resizing array capacities in the Python lists in this diagram.) First, GGT is inserted in list 0, then TTA is inserted in list 3, then GCC is inserted in list 1, and so forth. After the table is built, a search for TTT starts by computing its hash value and then searching through _keys[2]. After finding the key TTT at _keys[2][1], __getitem__() returns the entry _vals[2][1], or Phenylalanine.

[Image: hash table built from the sample keys]

Typically, programmers choose a large fixed value of m (like the 1,024 default we have chosen) based on a rough estimate of the number of keys to be handled. With more care, we can arrange things so that the average number of keys per list is a constant, using resizing arrays for _keys[] and _vals[]. For example, EXERCISE 4.4.12 shows how to ensure that the average number of keys per list is always between 1 and 8, which leads to constant (amortized) time performance for both put and get. There is certainly opportunity to adjust these parameters to best fit a given practical situation.
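A short client fragment that exercises this class (a sketch; it assumes hashst.py from the booksite is available on your path):

import stdio
from hashst import SymbolTable

st = SymbolTable()
st['TTA'] = 'Leucine'
st['TCT'] = 'Serine'
stdio.writeln(st['TTA'])     # writes Leucine
st['TTA'] = 'Leu'            # replaces the old value
stdio.writeln(st['TTA'])     # writes Leu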

The primary disadvantage of hash tables is that they do not take advantage of order in the keys and therefore cannot provide the keys in sorted order or support efficient implementations of operations like finding the minimum or maximum. For example, the keys will come out in arbitrary order in index.py, not the sorted order that is called for. Next, we consider a symbol table implementation that can support such operations when keys are comparable, without sacrificing much performance.

Binary search trees

The binary tree is a mathematical abstraction that plays a central role in the efficient organization of information. As with arrays, linked lists, and hash tables, we use binary trees to store collections of data. Binary trees play an important role in computer programming because they strike an efficient balance between flexibility and ease of implementation.

For symbol-table implementations, we use a special type of binary tree to organize the data and to provide a basis for efficient implementations of the symbol-table put operations and get requests. A binary search tree (BST) associates comparable keys with values, in a structure defined recursively. A BST is one of the following:

• Empty (None)

• A node having a key–value pair and two references to BSTs, a left BST with smaller keys and a right BST with larger keys

Image

The keys must be comparable via the < operator. The type of the value is not specified, so a BST node can hold any kind of data in addition to the key and the (characteristic) references to BSTs. As with our linked-list definition in SECTION 4.3, the idea of a recursive data structure can be a bit mind bending, but all we are doing is adding a second link to our linked-list definition (and imposing an ordering restriction).

To implement BSTs, we start with a class for the node abstraction, which has references to a key, a value, and left and right BSTs:

class Node:
    def __init__(self, key, val):
        self.key   = key
        self.val   = val
        self.left  = None
        self.right = None

This definition is like our definition of nodes for linked lists, except that it has two links, not just one. From the recursive definition of BSTs, we can represent a BST with a variable of type Node by ensuring that its value is either None or a reference to a Node whose left and right instance variables are references to BSTs, and by ensuring that the ordering condition is satisfied (keys in the left BST are smaller than key and keys in the right BST are larger than key).

The result of Node(key, val) is a reference to a Node object whose key and val instance variables are set to the given values and whose left and right instance variables are both initialized to None.

As with linked lists, when tracing code that uses BSTs, we can use a visual representation of the changes:

• We draw a rectangle to represent each object.

• We put the values of instance variables within the rectangle.

• We depict references as arrows that point to the referenced object.

Most often, we use an even simpler abstract representation where we draw rectangles containing keys to represent nodes (suppressing the values) and connect the nodes with arrows that represent links. This abstract representation allows us to focus on the structure.

Image

For example, to build a one-node BST that associates the string key 'it' with the integer value 0, we just create a Node:

first = Node('it', 0)

Since the left and right links are both None, and None is by definition an (empty) BST, this node is a BST. To add a node that associates the key 'was' with the value 1, we create another Node:

second = Node('was', 1)

(which itself is a BST) and link to it from the right field of the first Node

first.right = second

The second node has to go to the right of the first because 'it' compares as less than 'was' (or we could have chosen to set second.left to first). Now we can add a third node that associates the key 'the' with the value 2 with the code:

third = Node('the', 2)
second.left = third

and a fourth node that associates the key 'best' with the value 3 with the code

fourth = Node('best', 3)
first.left = fourth

Note that each of the nodes referenced by first, second, third, and fourth is, by definition, the root of a BST (each of its links is either None or refers to a BST, and the ordering condition is satisfied at each node).

We often use tree-based terminology when discussing BSTs. We refer to the node at the top as the root of the tree, the BST referenced by its left link as the left subtree, and the BST referenced by its right link as the right subtree. Traditionally, computer scientists draw trees upside down, with the root at the top. Nodes whose links are both None are called leaf nodes. The height of a tree is the maximum number of links on any path from the root node to a leaf node. Trees arise throughout science, mathematics, and computing, so you are certain to encounter this model on many occasions.

Image

In the present context, we take care to ensure that we always link together nodes such that every Node that we create is the root of a BST (it has a key, a value, a link to a left BST with smaller keys, and a link to a right BST with larger keys). From the standpoint of the BST data structure, the value is immaterial, so we often ignore it in our diagrams. We also intentionally confuse our nomenclature, using st to signify both “symbol table” and “search tree,” because search trees play such a central role in symbol-table implementations.

A BST represents an ordered sequence of items. In the example considered on the previous page, first represents the sequence best it the was. As we have seen, we might also use an array to represent an ordered sequence. For example, we could use

Image

a = ['best', 'it', 'the', 'was']

to represent the same ordered sequence of strings. Given a set of distinct keys, there is only one way to represent the set as an ordered array, but there are many ways to represent the set as a BST (see EXERCISE 4.4.14). This flexibility allows us to develop efficient symbol-table implementations. For instance, in our example we were able to insert each new item by creating a new node and changing just one link. As it turns out, it is always possible to do so. Equally important, we can easily find a given key in the tree and find the place where we need to add a link to a new node with a given key. Next, we consider symbol-table code that accomplishes these tasks.

Suppose that you want to search for a node with a given key in a BST (or to get a value with a given key in a symbol table). There are two possible outcomes: the search might be successful (we find the key in the BST; in a symbol-table implementation, we return the associated value) or it might be unsuccessful (there is no node in the BST with the given key; in a symbol-table implementation, we raise a run-time error).

A recursive searching algorithm is immediate: Given a BST (a reference to a Node), first check whether the tree is empty (the reference is None). If so, then terminate the search as unsuccessful (in a symbol-table implementation, raise a run-time error). If the tree is nonempty, check whether the key in the node is equal to the search key. If so, then terminate the search as successful (in a symbol-table implementation, return the value associated with the key). If not, compare the search key with the key in the node. If it is smaller, search (recursively) in the left subtree; if it is greater, search (recursively) in the right subtree.

Image

Thinking recursively, it is not difficult to become convinced that this method behaves as intended, based on the invariant that the key is in the BST if and only if it is in the current subtree. The key property of the recursive method is that we always have just one node to examine to decide what to do next. Moreover, we typically examine only a small number of the nodes in the tree: whenever we go to one of the subtrees at a node, we will never examine any of the nodes in the other subtree.

Suppose that you want to insert a new node into a BST (in a symbol-table implementation, put a new key–value pair into the data structure). The logic is similar to searching for a key, but the implementation is trickier. The key to understanding it is to realize that only one link must be changed to point to the new node, and that link is precisely the link that would be found to be None in an unsuccessful search for the key in that node.

Image

If the BST is empty, we create and return a new Node containing the key–value pair; if the search key is less than the key at the root, we set the left link to the result of inserting the key–value pair into the left subtree; if the search key is greater, we set the right link to the result of inserting the key–value pair into the right subtree; otherwise, if the search key is equal, we overwrite the existing value with the new value. Resetting the left or right link after the recursive call in this way is usually unnecessary, because the link changes only if the subtree is empty, but it is as easy to set the link as it is to test to avoid setting it.


Program 4.4.4 Binary search tree (bst.py)

class OrderedSymbolTable:
    def __init__(self):
        self._root = None

    def _get(self, x, key):
        if x is None:     raise KeyError
        if key < x.key:   return self._get(x.left, key)
        elif x.key < key: return self._get(x.right, key)
        else:             return x.val

    def __getitem__(self, key):
        return self._get(self._root, key)

    def _set(self, x, key, val):
        if x is None: return _Node(key, val)
        if   key < x.key: x.left  = self._set(x.left,  key, val)
        elif x.key < key: x.right = self._set(x.right, key, val)
        else:             x.val   = val
        return x

    def __setitem__(self, key, val):
        self._root = self._set(self._root, key, val)

class _Node:
    def __init__(self, key, val):
        self.key   = key
        self.val   = val
        self.left  = None
        self.right = None


instance variable for BST
_root | BST root


instance variables for Node
  key | key
  val | value
 left | left subtree
right | right subtree

This implementation of the symbol-table data type is centered on the recursive BST data structure and recursive methods for traversing it. We defer implementations of __contains__() to EXERCISE 4.4.13 and __iter__() to page 661.


PROGRAM 4.4.4 (bst.py) is a symbol-table implementation based on these two recursive algorithms. As with linkedstack.py and linkedqueue.py, we use a private _Node class to emphasize that clients of OrderedSymbolTable do not need to know any of the details of the binary search tree representation.
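
The following minimal client sketch (assuming bst.py is available for import and using the booksite stdio module, as elsewhere in this section) exercises the two special methods:

import stdio
from bst import OrderedSymbolTable

st = OrderedSymbolTable()
st['it']  = 0              # __setitem__() calls _set(), which creates a new _Node
st['was'] = 1
st['the'] = 2
st['it']  = 99             # key already present, so _set() overwrites its value
stdio.writeln(st['the'])   # __getitem__() calls _get(); writes 2
stdio.writeln(st['it'])    # writes 99
# st['best'] would raise KeyError, since 'best' is not in the table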

If you compare bst.py with our binary search implementation binarysearch.py (PROGRAM 4.2.3) and our stack and queue implementations linkedstack.py (PROGRAM 4.3.2) and linkedqueue.py (PROGRAM 4.3.4), you will appreciate the elegance and simplicity of this code. Take the time to think recursively and convince yourself that this code behaves as intended. Perhaps the simplest way to do so is to trace the construction of an initially empty BST from a sample sequence of keys (see the diagram at right). Your ability to do so is a sure test of your understanding of this fundamental data structure.

Image

Moreover, the put and get implementations in bst.py are remarkably efficient; typically, each accesses a small number of the nodes in the BST (those on the path from the root to the node sought or to the null link that is replaced by a link to the new node). Next, we show that put operations and get requests take logarithmic time (under certain assumptions). Also, to add a new key–value pair to the symbol table, only one new node is created and linked onto the bottom of the tree. If you make a drawing of a BST built by inserting some keys into an initially empty tree, you certainly will be convinced of this fact—you can just draw each new node at its unique position at the bottom of the tree.

Performance characteristics of BSTs

The running times of BST algorithms are ultimately dependent on the shape of the trees, and the shape of the trees is dependent on the order in which the keys are inserted. Understanding this dependence is a critical factor in being able to use BSTs effectively in practical situations.

Best case

In the best case, the tree is perfectly balanced (every Node has two children that are not None, except the nodes at the bottom, whose two children are both None), with about lg n nodes on any path from the root node to a leaf node. In such a tree, it is easy to see that the cost of an unsuccessful search is logarithmic,

Image

because that cost satisfies the same recurrence relation as the cost of binary search (see SECTION 4.2), so the cost of every put operation and get request is proportional to lg n or less. You would have to be quite lucky to get a perfectly balanced tree like this by inserting keys one by one in practice, but it is worthwhile to know the best possible performance characteristics.
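
In symbols (a rough sketch, writing C(n) for the number of compares used by an unsuccessful search in a perfectly balanced tree on n nodes, a notation of ours), the recurrence is the familiar one from binary search:

C(n) = C(n/2) + 1 for n > 1, with C(1) = 1, so that C(n) ~ lg n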

Image
Average case

If we insert random keys, we might expect the search times to be logarithmic as well, because the first key becomes the root of the tree and should divide the keys roughly in half. Applying the same argument to the subtrees, we expect to get about the same result as for the best case. This intuition is, indeed, validated by careful analysis: a classic mathematical derivation shows that the time required for put and get in a tree constructed from randomly ordered keys is logarithmic (see the booksite for references). More precisely, the expected number of key comparisons is ~2 ln n for a random put or get in a tree built from n randomly ordered keys. In a practical application such as lookup.py, when we can explicitly randomize the order of the keys, this result suffices to (probabilistically) guarantee logarithmic performance. Indeed, since 2 ln n is about 1.39 lg n, the average case is only about 39% greater than the best case. In an application like index.py, where we have no control over the order of insertion, there is no guarantee, but typical data gives logarithmic performance (see EXERCISE 4.4.26). As with binary search, this fact is very significant because of the enormousness of the logarithmic–linear chasm: with a BST-based symbol table implementation, we can perform millions of operations per second (or more), even in a huge symbol table.
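
For the record, the constant follows from the change-of-base identity ln n = (ln 2)(lg n):

2 ln n = (2 ln 2) lg n ≈ 1.386 lg n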

Worst case

In the worst case, each node has exactly one null link, so the BST is like a linked list with an extra wasted link, where put operations and get requests take linear time. Unfortunately, this worst case is not rare in practice—it arises, for example, when we insert the keys in order.

Thus, good performance of the basic BST implementation is dependent on the keys being sufficiently similar to random keys that the tree is not likely to contain many long paths. If you are not sure that assumption is justified, do not use a simple BST. Your only clue that something is amiss will be slow response time as the problem size increases. (Note: It is not unusual to encounter software of this sort!) Remarkably, there are BST variants that eliminate this worst case and guarantee logarithmic performance per operation, by making all trees nearly perfectly balanced. One popular variant is known as a red-black tree.
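
To see the effect directly, here is a small self-contained experiment (a sketch, not booksite code, though it uses the booksite stdio module) that builds one BST from keys inserted in increasing order and another from the same keys in shuffled order, then compares the resulting heights; the Node, put(), and height() helpers are defined just for this sketch:

import random
import stdio

class Node:
    def __init__(self, key):
        self.key   = key
        self.left  = None
        self.right = None

def put(x, key):
    # Standard (unbalanced) BST insertion, ignoring values.
    if x is None: return Node(key)
    if   key < x.key: x.left  = put(x.left,  key)
    elif x.key < key: x.right = put(x.right, key)
    return x

def height(x):
    # Number of nodes on the longest path from x down to a leaf.
    if x is None: return 0
    return 1 + max(height(x.left), height(x.right))

n = 500
inorder_root = None
for key in range(n):                  # keys arrive in increasing order
    inorder_root = put(inorder_root, key)

keys = list(range(n))
random.shuffle(keys)                  # keys arrive in random order
shuffled_root = None
for key in keys:
    shuffled_root = put(shuffled_root, key)

stdio.writeln(height(inorder_root))   # 500: the tree is a linked list
stdio.writeln(height(shuffled_root))  # typically a few dozen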

Image

Traversing a BST

Perhaps the most basic tree-processing function is known as tree traversal: given a reference to a tree, we want to systematically process every key–value pair in the tree. For linked lists, we accomplish this task by following the single link to move from one node to the next. For binary trees, however, we have decisions to make, because there are generally two links to follow. Recursion comes immediately to the rescue. To process every key in a BST:

• Process every key in the left subtree.

• Process the key at the root.

• Process every key in the right subtree.

This approach is known as inorder tree traversal, to distinguish it from preorder (do the root first) and postorder (do the root last), which arise in other applications. Given a BST, it is easy to convince yourself with mathematical induction that not only does this approach process every key in the BST, but that it processes them in key-sorted order. For example, the following method writes the keys in the BST rooted at its argument in key-sorted order:

Image

def inorder(x):
    if x is None: return
    inorder(x.left)
    stdio.writeln(x.key)
    inorder(x.right)

This remarkably simple method is worthy of careful study. It can be used as a basis for a str() implementation for BSTs (see EXERCISE 4.4.23) and as a starting point for developing the iterator that we need to provide clients with the ability to use a for loop to process the keys in key-sorted order. Indeed, one of the fundamental operations on a collection is to iterate over its items. The paradigm that we consider next leads to clear and compact code that is largely separate from the details of the collection’s implementation. In the case of BSTs, it takes advantage of this natural procedure for traversing a tree.
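
For contrast, here are sketches of the preorder and postorder variants mentioned above; each differs from inorder() only in where the node itself is processed:

def preorder(x):
    if x is None: return
    stdio.writeln(x.key)     # process the root before its subtrees
    preorder(x.left)
    preorder(x.right)

def postorder(x):
    if x is None: return
    postorder(x.left)
    postorder(x.right)
    stdio.writeln(x.key)     # process the root after its subtrees

Unlike inorder traversal, neither variant visits the keys of a BST in sorted order.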

Iterables

As you learned in SECTION 1.3 and SECTION 1.4, you can use a for loop to iterate over either integers in a range or elements in an array a[].

for i in range(n):
    stdio.writeln(i)

for v in a:
    stdio.writeln(v)

The for loop is not just for integer ranges and arrays—you can use it with any iterable object. An iterable object is an object that is capable of returning its items one at a time. All of Python’s built-in container types—including list, tuple, dict, set, and str—are iterable, as is the object returned by the built-in range() function.

Now, our goal is to make SymbolTable iterable, so that we can use a for loop to iterate over its keys (and use indexing to get the corresponding values):

st = SymbolTable()
...
for key in st:
    stdio.writeln(str(key) + ' ' + str(st[key]))

To make a user-defined data type iterable, you must implement the special method __iter__(), in support of the built-in function iter(). The iter() function creates and returns an iterator, which is an object with a special method __next__() that Python calls at the beginning of each iteration of a for loop.
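
As a small illustration of the protocol (using a hypothetical Countdown class, not part of any booksite module), an object can serve as its own iterator by implementing both special methods:

class Countdown:
    def __init__(self, n):
        self._n = n
    def __iter__(self):
        return self              # this object is its own iterator
    def __next__(self):
        if self._n <= 0:
            raise StopIteration  # tells the for loop to stop
        self._n -= 1
        return self._n + 1

for i in Countdown(3):
    stdio.writeln(i)             # writes 3, then 2, then 1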

While this appears complicated, we can use a shortcut based on the fact that Python lists are iterable: if a is a Python list, then iter(a) returns an iterator over its items. With this in mind, making our sequential-search symbol table (from page 645) iterable is a one-liner:

def __iter__(self):
    return iter(self._keys)

Similarly, we can make our hash table and binary search tree implementations iterable by collecting the keys in a Python list and returning an iterator for that list. For example, to make bst.py iterable, we modify the recursive inorder() method from the previous page to collect the keys in a Python list instead of writing them. Then we can return an iterator for that list, as follows:

def __iter__(self):
    a = []
    self._inorder(self._root, a)
    return iter(a)

def _inorder(self, x, a):
    if x is None: return
    self._inorder(x.left, a)
    a += [x.key]
    self._inorder(x.right, a)

Python includes several built-in functions that either take iterables as arguments or return iterables, as documented in the API table below.


Python 2 alert

In Python 3, range() returns an iterable of integers; in Python 2, range() returns an array of integers.


Most of these built-in operations take linear time because the default implementation scans through all of the items in the iterable to compute the result. Of course, for performance-critical applications, we would use a symbol table (or set) rather than applying the in operator to a list or other iterable to test whether an item is in a collection.
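
For example (a rough sketch of the performance difference), testing membership with in is a linear-time scan for a list but an expected constant-time hash lookup for a set:

a = list(range(2000000))
s = set(a)
1999999 in a    # linear time: scans through (nearly) all 2 million items
1999999 in s    # constant time on average: one hash computation and lookup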

Next, we will see that it is also possible to develop implementations of minimum and maximum that are much more efficient than min() and max() when the underlying data structure is a BST, and we can support a broad range of other useful operations as well.

Image

Ordered symbol table operations

The flexibility of BSTs and the ability to compare keys enables the implementation of many useful operations beyond those that can be supported efficiently in hash tables. We leave implementations of these operations for exercises and leave further study of their performance characteristics and applications for a course in algorithms and data structures.

Minimum and maximum

To find the smallest key in a BST, follow left links from the root until reaching None. The last key encountered is the smallest in the BST. The same procedure following right links leads to the largest key in the BST (see EXERCISE 4.4.27).
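
A sketch of this procedure as a pair of helper functions (each taking a non-None root node as its argument; the names are ours, not from bst.py):

def _min_key(x):
    while x.left is not None:    # keep following left links
        x = x.left
    return x.key                 # the node with no left child holds the smallest key

def _max_key(x):
    while x.right is not None:   # keep following right links
        x = x.right
    return x.key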

Size and subtree sizes

To keep track of the number of nodes in a BST, keep an extra instance variable n in OrderedSymbolTable that counts the number of nodes in the tree. Initialize it to 0 and increment it whenever creating a new _Node. Alternatively, keep an extra instance variable n in each _Node that counts the number of nodes in the subtree rooted at that node (see EXERCISE 4.4.29).
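
The first alternative amounts to only a few extra lines in OrderedSymbolTable (a sketch; the _n name is ours):

def __init__(self):
    self._root = None
    self._n = 0                  # number of key-value pairs in the table

def _set(self, x, key, val):
    if x is None:
        self._n += 1             # about to create a new node
        return _Node(key, val)
    ...                          # rest as in PROGRAM 4.4.4

def __len__(self):
    return self._n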

Range search and range count

With a recursive method like _inorder(), we can return an iterator for keys falling between two given values in time proportional to the height of the BST plus the number of keys in the range (see EXERCISE 4.4.30). If we maintain an instance variable in each node having the size of the subtree rooted at each node, we can count the number of keys falling between two given values in time proportional to the height of the BST (see EXERCISE 4.4.31).

Order statistics and ranks

If we maintain an instance variable in each node having the size of the subtree rooted at each node, we can implement a recursive method that returns the kth smallest key in time proportional to the height of the BST (see EXERCISE 4.4.64). Similarly, we can compute the rank of a key, which is the number of keys in the BST that are strictly smaller than the key (see EXERCISE 4.4.65).

Remove

Many applications demand the ability to remove a key–value pair with a given key. You can find code for removing a node from a BST on the booksite or in a book on algorithms and data structures (see EXERCISE 4.4.32).

This list is representative; numerous other important operations have been invented for BSTs that are broadly useful in applications.

Dictionary data type

Now that you understand how a symbol table works, you are ready to use Python’s industrial-strength version. The built-in dict data type follows the same basic API as SymbolTable, but with a richer set of operations, including deletion; a version of get that returns a default value if the key is not in the dictionary; and iteration over the key–value pairs. The underlying implementation is a hash table, so ordered operations are not supported. As usual, since the built-in implementation is written in a lower-level language and does not incur the overhead that Python imposes on user code, it will be more efficient and is the preferred choice when ordered operations are not important.

As a simple example, the following dict client reads a sequence of strings from standard input, counts the number of times each string appears, and writes the strings and their frequencies. The strings do not come out in sorted order.

import stdio

st = dict()
while not stdio.isEmpty():
    word = stdio.readString()
    st[word] = 1 + st.get(word, 0)
for word, frequency in st.items():
    stdio.writef('%4d %s\n', frequency, word)

Several examples of dict clients appear in the exercises at the end of this section.

Image

Set data type

As a final example, we consider a data type that is simpler than a symbol table, still broadly useful, and easy to implement with hashing or with BSTs. A set is a collection of distinct keys, like a symbol table with no values. For example, we could implement a set by deleting references to values in hashst.py or bst.py (see EXERCISES 4.4.20 and 4.4.21). Again, Python provides a set data type that is implemented in a lower-level language. A partial API is given at the bottom of this page.

For example, consider the task of reading a sequence of strings from standard input and writing the first occurrence of each string (thereby removing duplicates). We might use a set, as in the following client code:

import stdio
distinct = set()
while not stdio.isEmpty():
    key = stdio.readString()
    if key not in distinct:
        distinct.add(key)
        stdio.writeln(key)

You can find several other examples of set clients in the exercises at the end of this section.

Image

Perspective

Symbol-table implementations are a prime topic of further study in algorithms and data structures. Different APIs and different assumptions about keys call for different implementations. Methods are available that have even better performance than hashing and BSTs under various circumstances. Examples include balanced BSTs and tries. Implementations of many of these algorithms and data structures are found in Python and most other computational environments. Researchers in algorithms and data structures still study symbol-table implementations of all sorts.

Which symbol-table implementation is better, hashing or BSTs? The first point to consider is whether the client has comparable keys and needs symbol-table operations that involve ordered operations such as selection and rank. If so, then you need to use BSTs. If not, most programmers are likely to use hashing, though that choice requires caution in two respects: (1) you need to check that you have a good hash function for the key data type, and (2) you need to make sure that the hash table is appropriately sized, either through resizing arrays or an appropriate application-dependent choice.

Should you use Python’s built-in dict and set data types? Of course, use them if they support the operations that you need: they are written in a lower-level language, are not subject to the overhead that Python imposes on user code, and therefore are likely to be faster than anything that you could implement yourself. But if your application needs order-based operations like finding the minimum or maximum, you may wish to consider BSTs.

People use dictionaries, indexes, and other kinds of symbol tables every day. Applications based on symbol tables have completely replaced phone books, encyclopedias, and all sorts of physical artifacts that served us well in the last millennium. Without symbol-table implementations based on data structures such as hash tables and BSTs, such applications would not be feasible; with them, we have the feeling that anything that we need is instantly accessible online.

Q&A

Q. Can I use an array (or Python list) as a key in a dict or set?

A. No, the built-in list data type is mutable, so you should not use arrays as keys in a symbol table or set. In fact, Python lists are not hashable, so you cannot use them as keys in a dict or set. The built-in tuple data type is immutable (and hashable), so you can use that instead.
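
A quick illustration (with hypothetical names):

point = (1, 2)           # a tuple is immutable and hashable
d = dict()
d[point] = 'ok'          # fine: tuples can be dict keys
# d[[1, 2]] = 'oops'     # would raise TypeError: unhashable type: 'list'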

Q. Why doesn’t my user-defined data type work with dict or set?

A. By default, user-defined types are hashable, with hash(x) derived from id(x) and == testing reference equality. While these default implementations satisfy the hashable requirements, they rarely provide the behavior you want. To get the behavior you want, implement both __eq__() and __hash__() in your class, making sure that equal objects have equal hash values.
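
For example, a hypothetical Point class can be made to behave sensibly in a dict or set by defining __eq__() and __hash__() over the same data (a sketch):

class Point:
    def __init__(self, x, y):
        self._x = x
        self._y = y
    def __eq__(self, other):
        return isinstance(other, Point) and \
               (self._x, self._y) == (other._x, other._y)
    def __hash__(self):
        return hash((self._x, self._y))  # hash the same data that __eq__() compares

p = Point(1, 2)
q = Point(1, 2)
s = {p, q}   # a set containing just one element, since p == q and hash(p) == hash(q)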

Q. Why can’t I return a Python list directly in the special method __iter__()? Why must I instead call the built-in iter() function with the Python list as an argument?

A. A Python list is an iterable object (because it has an __iter__() method that returns an iterator), but it is not itself an iterator. The special method __iter__() must return an iterator, which is exactly what the call iter() on the list provides.

Q. Which data structure does Python use to implement dict and set?

A. Python uses an open-addressing hash table, which is a cousin of the separate-chaining hash table we considered in this section. Python’s implementation is highly optimized and written in a low-level programming language.

Q. Does Python provide language support for specifying set and dict objects?

A. Yes, you can specify a set by enclosing in curly braces a comma-separated list of its items. You can specify a dict by enclosing in curly braces a comma-separated list of its key–value pairs, with a colon between each key and its associated value.

stopwords = {'and', 'at', 'of', 'or', 'on', 'the', 'to'}
grades = {'A+':4.33, 'A':4.0, 'A-':3.67, 'B+':3.33, 'B':3.0}

Q. Does Python provide a built-in data type for an ordered symbol table (or ordered set) that supports ordered iteration, order statistics, and range search?

A. No. If you need only ordered iteration (with comparable keys), you could use Python’s dict data type and sort the keys (and pay a performance hit for sorting). For example, if you use a dict instead of a binary search tree in index.py, you can arrange to write the keys in sorted order by using code like

for word in sorted(st):

If you need other ordered symbol table operations (such as range search or order statistics), you can use our binary search tree implementation (and pay a performance hit for using a data type that is implemented in Python).
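
A sketch of the idea, writing each key in sorted order along with its value (assuming st is a dict):

for word in sorted(st):
    stdio.writeln(word + ' ' + str(st[word]))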

Exercises

4.4.1 Modify lookup.py to make a program lookupandput.py that allows put operations to be specified on standard input. Use the convention that a plus sign indicates that the next two strings typed are the key–value pair to be inserted.

4.4.2 Modify lookup.py to make a program lookupmultiple.py that handles multiple values having the same key by collecting the values in an array, as in index.py, and then writing them all out on a get request, as follows:

% python lookupmultiple.py amino.csv 3 0
Leucine
TTA TTG CTT CTC CTA CTG

4.4.3 Modify index.py to make a program indexbykeyword.py that takes a file name from the command line and makes an index from standard input using only the keywords in that file. Note: Using the same file for indexing and keywords should give the same result as index.py.

4.4.4 Modify index.py to make a program indexlines.py that considers only consecutive sequences of letters as keys (no punctuation or numbers) and uses line numbers instead of word position as the value. This functionality is useful for programs: When given a Python program as input, indexlines.py should write an index showing each keyword or identifier in the program (in sorted order), along with the line numbers on which it occurs.

4.4.5 Develop an implementation OrderedSymbolTable of the symbol-table API that maintains parallel arrays of keys and values, keeping them in key-sorted order. Use binary search for get, and move larger elements to the right by one position for put (using resizing arrays to keep the array length linear in the number of key–value pairs in the table). Test your implementation with index.py, and validate the hypothesis that using such an implementation for index.py takes time proportional to the product of the number of strings and the number of distinct strings in the input.

4.4.6 Develop an implementation LinkedSymbolTable of the symbol-table API that maintains a linked list of nodes containing keys and values, keeping them in arbitrary order. Test your implementation with index.py, and validate the hypothesis that using such an implementation for index.py takes time proportional to the product of the number of strings and the number of distinct strings in the input.

4.4.7 Compute hash(x) % 5 for the single-character keys

E A S Y Q U E S T I O N

In the style of the drawing in the text, draw the hash table created when the ith key in this sequence is associated with the value i, for i from 0 to 11.

4.4.8 What is wrong with the following __hash__() implementation?

def __hash__(self):
    return -17

Solution: While technically it satisfies the conditions needed for a data type to be hashable (if two objects are equal, they have the same hash value), it will lead to poor performance because we expect hash(x) % m to divide keys into m groups of roughly equal size.

4.4.9 Extend Complex (PROGRAM 3.2.6) and Vector (PROGRAM 3.3.3) to make them hashable by implementing the special methods __hash__() and __eq__().

4.4.10 Implement the special method __contains__() for the sequential-search symbol table on page 645.

4.4.11 Implement the special method __contains__() for hashst.py.

4.4.12 Modify hashst.py to use a resizing array so that the average length of the list associated with each hash value is between 1 and 8.

4.4.13 Implement the special method __contains__() for bst.py.

4.4.14 Draw all the different BSTs that can represent the key sequence

best of it the time was

4.4.15 Draw the BST that results when you insert items with keys

E A S Y Q U E S T I O N

in that order into an initially empty tree. What is the height of the resulting BST?

4.4.16 Suppose we have integer keys between 1 and 1000 in a BST and search for 363. Which of the following cannot be the sequence of keys examined?

a. 2 252 401 398 330 363

b. 399 387 219 266 382 381 278 363

c. 3 923 220 911 244 898 258 362 363

d. 4 924 278 347 621 299 392 358 363

e. 5 925 202 910 245 363

4.4.17 Suppose that the following 31 keys appear (in some order) in a BST of height 5:

10 15 18 21 23 24 30 30 38 41 42 45 50 55 59
60 61 63 71 77 78 83 84 85 86 88 91 92 93 94 98

Draw the top three nodes of the tree (the root and its two children).

4.4.18 Describe the effect on performance if you replaced hashst.py with bst.py in lookup.py. To protect against the worst case, call stdrandom.shuffle(database) before processing client queries.

4.4.19 True or false: Given a BST, let x be a leaf node, and let p be its parent. Then either (i) the key of p is the smallest key in the BST larger than the key of x or (ii) the key of p is the largest key in the BST smaller than the key of x.

4.4.20 Modify the class SymbolTable in hashst.py to make a class Set that implements the constant-time operations in the partial API given in the text for Python’s built-in set data type.

4.4.21 Modify the class OrderedSymbolTable in bst.py to make a class OrderedSet that implements the constant-time operations in the partial API given in the text for Python’s built-in set data type, assuming that the keys are comparable.

4.4.22 Modify hashst.py to support the client code del st[key] by adding a method __delitem__() that takes a key argument and removes that key (and the corresponding value) from the symbol table, if it exists. As in EXERCISE 4.4.12, use a resizing array to ensure that the average length of the list associated with each hash value is between 1 and 8.

4.4.23 Implement __str__() for bst.py, using a recursive helper method like traverse(). As usual, you can accept quadratic performance because of the cost of string concatenation. Extra credit: Compose a linear-time __str__() method for bst.py that uses an array and the join() method of Python’s built-in str data type.

4.4.24 A concordance is an alphabetical list of the words in a text that gives all word positions where each word appears. Thus, python index.py 0 0 produces a concordance. In a famous incident, one group of researchers tried to establish credibility while keeping details of the Dead Sea Scrolls secret from others by making public a concordance. Compose a program invertconcordance.py that takes a command-line argument n, reads a concordance from standard input, and writes the first n words of the corresponding text on standard output.

4.4.25 Run experiments to validate the claims in the text that the put operations and get requests for lookup.py are constant-time operations when using hashst.py with resizing arrays, as described in EXERCISE 4.4.12. Develop test clients that generate random keys and also run tests for various data sets, either from the booksite or of your own choosing.

4.4.26 Run experiments to validate the claims in the text that the put operations and get requests for index.py are logarithmic in the size of the symbol table when using bst.py. Develop test clients that generate random keys and also run tests for various data sets, either from the booksite or of your own choosing.

4.4.27 Modify bst.py to add methods min() and max() that return the smallest (or largest) key in the table (or None if the table is empty).

4.4.28 Modify bst.py to add methods floor() and ceiling() that take as an argument a key and return the largest (smallest) key in the set that is no larger (no smaller) than the given key.

4.4.29 Modify bst.py to support the special len() function by implementing a special method __len__() that returns the number of key–value pairs in the symbol table. Use the approach of storing within each _Node the number of nodes in the subtree rooted there.

4.4.30 Modify bst.py to add a method rangeSearch() that takes two keys lo and hi as arguments and returns an iterator over all keys that are between lo and hi. The running time should be proportional to the height plus the number of keys in the range.

4.4.31 Modify bst.py to add a method rangeCount() that takes two keys as arguments and returns the number of keys in the BST between the two given keys. Your method should take time proportional to the height of the tree. Hint: First complete the previous exercise.

4.4.32 Modify bst.py to support the client code del st[key] by adding a method __delitem__() that takes a key argument and removes that key (and the corresponding value) from the symbol table, if it exists. Hint: This operation is more difficult than it might seem. Replace the key and its associated value with the next largest key in the BST and its associated value; then remove from the BST the node that contained the next largest key.

4.4.33 Implement the special method __iter__() for the sequential-search symbol table on page 645.

4.4.34 Implement the special method __iter__() for hashst.py to support iteration.

Solution: Collect all the keys in a list.

def __iter__(self):
    a = []
    for i in range(self._m):
        a += self._keys[i]
    return iter(a)

4.4.35 Modify the symbol-table API to handle values with duplicate keys by having get() return an iterator for the values having a given key. Reimplement hashst.py and bst.py as dictated by this API. Discuss the pros and cons of this approach versus the one given in the text.

4.4.36 Suppose that a[] is an array of hashable objects. What is the effect of the following statement?

a = list(set(a))

4.4.37 Recompose lookup.py and index.py using a dict instead of using hashst.py and bst.py, respectively. Compare performance.

4.4.38 Compose a dict client that creates a symbol table mapping letter grades to numerical scores, as in the table below, and then reads from standard input a list of letter grades and computes their average (GPA).

 A+   A    A-   B+   B    B-   C+   C    C-   D    F
4.33 4.00 3.67 3.33 3.00 2.67 2.33 2.00 1.67 1.00 0.00

4.4.39 Implement the buy() and sell() methods in stockaccount.py (PROGRAM 3.2.8). Use a dict to store the number of shares of each stock.

Binary Tree Exercises

The following exercises are intended to give you experience in working with binary trees that are not necessarily BSTs. They all assume a Node class with three instance variables: a positive float value and two Node references. As with linked lists, you will find it helpful to make drawings using the visual representation shown in the text.

4.4.40 Implement the following functions, each of which takes as an argument a Node that is the root of a binary tree.

size(node)   number of nodes in the tree rooted at node
leaves(node) number of nodes in the tree rooted at node whose links are both None
total(node)  sum of the key values in all nodes in the tree rooted at node

Your methods should all run in linear time.

4.4.41 Implement a linear-time function height() that returns the maximum number of nodes on any path from the root to a leaf node (the height of the empty tree is 0; the height of a one-node tree is 1).

4.4.42 A binary tree is heap-ordered if the key in each node is larger than the keys in all of its descendants. Implement a linear-time function heapOrdered() that returns True if the tree is heap-ordered, and False otherwise.

4.4.43 Given a binary tree, a single-value subtree is a maximal subtree that contains the same value. Design a linear-time algorithm that counts the number of single-value subtrees in a binary tree.

4.4.44 A binary tree is balanced if both its subtrees are balanced and the heights of its two subtrees differ by at most 1. Implement a linear-time method balanced() that returns True if the tree is balanced, and False otherwise.

4.4.45 Two binary trees are isomorphic if only their key values differ (that is, they have the same shape). Implement a linear-time function isomorphic() that takes two tree references as arguments and returns True if they refer to isomorphic trees, and False otherwise. Then, implement a linear-time function eq() that takes two tree references as arguments and returns True if they refer to identical trees (isomorphic with the same key values), and False otherwise.

4.4.46 Compose a function levelOrder() that writes BST keys in level order: first write the root; then the nodes one level below the root, from left to right; then the nodes two levels below the root, from left to right; and so forth. Hint: Use a Queue.

4.4.47 Implement a linear-time function isBST() that returns True if the binary tree is a BST, and False otherwise.

Solution: This task is a bit more difficult than it might seem. Use a recursive helper function _inRange() that takes two additional arguments lo and hi and returns True if the binary tree is a BST and all its values are between lo and hi, using None to indicate the absence of a lower or upper bound.

def _inRange(node, lo, hi):
    if node is None:  return True
    if (lo is not None) and (node.item <= lo):  return False
    if (hi is not None) and (hi <= node.item):  return False
    if not _inRange(node.left, lo, node.item):  return False
    if not _inRange(node.right, node.item, hi): return False
    return True

def isBST(node):
    return _inRange(node, None, None)

We note that this implementation uses both the < and <= operators, whereas our binary search tree code uses only the < operator.

4.4.48 Compute the value returned by mystery() on some sample binary trees, and then formulate a hypothesis about the value and prove it.

def mystery(node):
    if node is None: return 1
    return mystery(node.left) + mystery(node.right)

Creative Exercises

4.4.49 Spell checking. Compose a set client spellchecker.py that takes as a command-line argument the name of a file containing a dictionary of words, and then reads strings from standard input and writes any string that is not in the dictionary. You can find a dictionary file on the booksite. Extra credit: Augment your program to handle common suffixes such as -ing or -ed.

4.4.50 Spell correction. Compose a dict client spellcorrector.py that serves as a filter that replaces commonly misspelled words on standard input with a suggested replacement, writing the result to standard output. Take as a command-line argument a file that contains common misspellings and corrections. You can find an example on the booksite.

4.4.51 Web filter. Compose a set client webblocker.py that takes as a command-line argument the name of a file containing a list of objectionable websites, and then reads strings from standard input and writes only those websites not on the list.

4.4.52 Set operations. Add the methods union() and intersection() to OrderedSet (EXERCISE 4.4.21), each of which takes two sets as arguments and returns the union or intersection, respectively, of those two sets.

4.4.53 Frequency symbol table. Develop a data type FrequencyTable that supports the following operations: click() and count(), both of which take string arguments. The data-type value is an integer that keeps track of the number of times the click() operation has been called with the given string as an argument. The click() operation increments the count by 1, and the count() operation returns the value, possibly 0. Clients of this data type might include a web traffic analyzer, a music player that counts the number of times each song has been played, phone software for counting calls, and so forth.

4.4.54 1D range searching. Develop a data type that supports the following operations: insert a date, search for a date, and count the number of dates in the data structure that lie in a particular interval. Use Python’s datetime.date data type.

4.4.55 Non-overlapping interval search. Given a list of non-overlapping intervals of integers, compose a function that takes an integer argument and determines in which, if any, interval that value lies. For example, if the intervals are 1643–2033, 5532–7643, 8999–10332, and 5666653–5669321, then the query point 9122 lies in the third interval and 8122 lies in no interval.

4.4.56 IP lookup by country. Compose a dict client that uses the data file ip-to-country.csv found on the booksite to determine from which country a given IP address is coming. The data file has five fields: beginning of IP address range, end of IP address range, two-character country code, three-character country code, and country name. The IP addresses are non-overlapping. Such a database tool can be used for credit card fraud detection, spam filtering, auto-selection of language on a website, and web server log analysis.

4.4.57 Inverted index of web pages with single-word queries. Given a list of web pages, create a symbol table of words contained in the web pages. Associate with each word a list of web pages in which that word appears. Compose a program that reads in a list of web pages, creates the symbol table, and supports single-word queries by returning the list of web pages in which that query word appears.

4.4.58 Inverted index of web pages with multi-word queries. Extend the previous exercise so that it supports multi-word queries. In this case, output the list of web pages that contain at least one occurrence of each of the query words.

4.4.59 Multiple-word search (unordered). Compose a program that takes k keywords from the command line, reads in a sequence of words from standard input, and identifies the smallest interval of text that contains all of the k keywords (not necessarily in the same order). You do not need to consider partial words.

4.4.60 Multiple-word search (ordered). Repeat the previous exercise, but now assume the keywords must appear in the same order as specified.

4.4.61 Repetition draw in chess. In the game of chess, if a board position is repeated three times with the same side to move, the side to move can declare a draw. Describe how you could test this condition using a computer program.

4.4.62 Registrar scheduling. The registrar at a prominent Northeastern university recently scheduled an instructor to teach two different classes at the same exact time. Help the registrar prevent future mistakes by describing a method to check for such conflicts. For simplicity, assume all classes run for 50 minutes and start at 9, 10, 11, 1, 2, or 3.

4.4.63 Entropy. We define the relative entropy of a text corpus with n words, k of which are distinct, as

E = 1 / (n lg n) (p0 lg(k/p0) + p1 lg(k/p1) + ... + pk–1 lg(k/pk–1))

where pi is the fraction of times that word i appears. Compose a program that reads in a text corpus and writes the relative entropy. Convert all letters to lowercase and treat punctuation marks as whitespace.

4.4.64 Order statistics. Add to bst.py a method select() that takes an integer argument k and returns the kth smallest key in the BST. Maintain subtree sizes in each node (see EXERCISE 4.4.29). The running time should be proportional to the height of the tree.

4.4.65 Rank query. Add to bst.py a method rank() that takes a key as an argument and returns the number of keys in the BST that are strictly smaller than key. Maintain subtree sizes in each node (see EXERCISE 4.4.29). The running time should be proportional to the height of the tree.

4.4.66 Random element. Add to bst.py a method random() that returns a random key. Maintain subtree sizes in each node (see EXERCISE 4.4.29). The running time should be proportional to the height of the tree.

4.4.67 Queue with no duplicates. Create a data type that is a queue, except that an element may appear on the queue at most once at any given time. Ignore requests to insert an item if it is already on the queue.

4.4.68 Unique substrings of a given length. Compose a program that reads in text from standard input and calculates the number of unique substrings of a given length k that it contains. For example, if the input is CGCGGGCGCG, then there are five unique substrings of length 3: CGC, CGG, GCG, GGC, and GGG. This calculation is useful in data compression. Hint: Use the string slice s[i:i+k] to extract the ith substring and insert into a symbol table. Test your program on a large genome from the booksite and on the first 10 million digits of π.

4.4.69 Generalized queue. Implement a class that supports the following API:

Image

Use a BST that associates the kth element inserted with the key k and maintains in each node the total number of nodes in the subtree rooted at that node. To find the ith least recently added item, search for the ith smallest element in the BST.

4.4.70 Dynamic discrete distribution. Create a data type that supports the following two operations: add() and random(). The add() method should insert a new item into the data structure if it has not been seen before; otherwise, it should increase its frequency count by 1. The random() method should return an element at random, where the probabilities are weighted by the frequency of each element. Use space proportional to the number of items.

4.4.71 Password checker. Compose a program that takes a string as a command-line argument and a dictionary of words from standard input, and checks whether the string is a “good” password. Here, assume “good” means that it (1) is at least eight characters long, (2) is not a word in the dictionary, (3) is not a word in the dictionary followed by a digit 0-9 (e.g., hello5), (4) is not two words in the dictionary concatenated together (e.g., helloworld), and (5) none of (2) through (4) hold for reverses of words in the dictionary.

4.4.72 Random phone numbers. Compose a program that takes a command-line argument n and writes n random phone numbers of the form (xxx) xxx-xxxx. Use a set to avoid choosing the same number more than once. Use only legal area codes (you can find a file of such codes on the booksite).

4.4.73 Sparse vectors. An n-dimensional vector is sparse if its number of nonzero values is small. Your goal is to represent a vector with space proportional to its number of nonzeros, and to be able to add two sparse vectors in time proportional to the total number of nonzeros. Implement a class that supports the following API:

Image

4.4.74 Sparse matrices. An n-by-n matrix is sparse if its number of nonzeros is proportional to n (or less). Your goal is to represent a matrix with space proportional to n, and to be able to add and multiply two sparse matrices in time proportional to the total number of nonzeros (perhaps with an extra log n factor). Implement a class that supports the following API:

Image

4.4.75 Mutable string. Create a data type that supports the following API on a string. Use a BST to implement all operations in logarithmic time.

4.4.76 Assignment statements. Compose a program to parse and evaluate programs consisting of assignment and write statements with fully parenthesized arithmetic expressions (see PROGRAM 4.3.3). For example, given the input

A = 5
B = 10
C = A + B
D = C * C
write(D)

your program should write the value 225. Assume that all variables and values are floats. Use a symbol table to keep track of variable names.

4.4.77 Codon usage table. Compose a program that uses a symbol table to write summary statistics for each codon in a genome taken from standard input (frequency per thousand), like the following:

UUU 13.2 UCU 19.6 UAU 16.5 UGU 12.4
UUC 23.5 UCC 10.6 UAC 14.7 UGC  8.0
UUA  5.8 UCA 16.1 UAA  0.7 UGA  0.3
UUG 17.6 UCG 11.8 UAG  0.2 UGG  9.5
CUU 21.2 CCU 10.4 CAU 13.3 CGU 10.5
CUC 13.5 CCC  4.9 CAC  8.2 CGC  4.2
CUA  6.5 CCA 41.0 CAA 24.9 CGA 10.7
CUG 10.7 CCG 10.1 CAG 11.4 CGG  3.7
AUU 27.1 ACU 25.6 AAU 27.2 AGU 11.9
AUC 23.3 ACC 13.3 AAC 21.0 AGC  6.8
AUA  5.9 ACA 17.1 AAA 32.7 AGA 14.2
AUG 22.3 ACG  9.2 AAG 23.9 AGG  2.8
GUU 25.7 GCU 24.2 GAU 49.4 GGU 11.8
GUC 15.3 GCC 12.6 GAC 22.1 GGC  7.0
GUA  8.7 GCA 16.8 GAA 39.8 GGA 47.2

4.5 Case Study: Small-World Phenomenon

The mathematical model that we use for studying the nature of pairwise connections among entities is known as the graph. Graphs are important for studying the natural world and for helping us to better understand and refine the networks that we create. From models of the nervous system in neurobiology, to the study of the spread of infectious diseases in medical science, to the development of the telephone system, graphs have played a critical role in science and engineering over the past century, including the development of the Internet itself.

Some graphs exhibit a specific property known as the small-world phenomenon. You may be familiar with this property, which is sometimes known as six degrees of separation. It is the basic idea that, even though each of us has relatively few acquaintances, there is a relatively short chain of acquaintances (the six degrees of separation) separating us from one another. This hypothesis was validated experimentally by Stanley Milgram in the 1960s and modeled mathematically by Duncan Watts and Steven Strogatz in the 1990s. In recent years, the principle has proved important in a remarkable variety of applications. Scientists are interested in small-world graphs because they model natural phenomena, and engineers are interested in building networks that take advantage of the natural properties of small-world graphs.

In this section, we address basic computational questions surrounding the study of small-world graphs. Indeed, the simple question

Does a given graph exhibit the small-world phenomenon?

can present a significant computational burden. To address this question, we will consider a graph-processing data type and several useful graph-processing clients. In particular, we will examine a client for computing shortest paths, a computation that has a vast number of important applications in its own right.

A persistent theme of this section is that the algorithms and data structures that we have been studying play a central role in graph processing. Indeed, you will see that several of the fundamental data types introduced earlier in this chapter help us to develop elegant and efficient code for studying the properties of graphs.

Graphs

We begin with some basic definitions. A graph is composed of a set of vertices and a set of edges. Each edge represents a connection between two vertices. Two vertices are neighbors if they are connected by an edge, and the degree of a vertex is its number of neighbors. Note that there is no relationship between a graph and the idea of a function graph (a plot of a function’s values) or the idea of graphics (drawings). We often visualize graphs by drawing labeled geometric shapes (vertices) connected by lines (edges), but it is always important to remember that it is the connections that are essential, not the way we depict them.

Image

The following list suggests the diverse range of systems where graphs are appropriate starting points for understanding structure.

Transportation systems

Train tracks connect stations, roads connect intersections, and airline routes connect airports, so all of these systems naturally admit a simple graph model. No doubt you have used applications that are based on such models when getting directions from an interactive mapping program or a GPS device, or using an online service to make travel reservations. What is the best way to get from here to there?

Image
Human biology

Arteries and veins connect organs, synapses connect neurons, and joints connect bones, so an understanding of human biology depends on understanding appropriate graph models. Perhaps the largest and most important such modeling challenge in this arena is the human brain. How do local connections among neurons translate to consciousness, memory, and intelligence?

Social networks

People have relationships with other people. From the study of infectious diseases to the study of political trends, graph models of these relationships are critical to our understanding of their implications. How does information propagate in online networks?

Physical systems

Atoms connect to form molecules; molecules connect to form a material or a crystal; and particles are connected by mutual forces such as gravity or magnetism. Graph models are appropriate for studying the percolation problem that we considered in SECTION 2.4, the interacting charges that we considered in SECTION 3.1, and the n-body problem that we considered in SECTION 3.4. How do local interactions propagate through such systems as they evolve?

Image
Communications systems

From electric circuits, to the telephone system, to the Internet, to wireless services, communications systems are all based on the idea of connecting devices. For at least the past century, graph models have played a critical role in the development of such systems. What is the best way to connect the devices?

Resource distribution

Power lines connect power stations and home electrical systems, pipes connect reservoirs and home plumbing, and truck routes connect warehouses and retail outlets. The study of effective and reliable means of distributing resources depends on accurate graph models. Where are the bottlenecks in a distribution system?

Mechanical systems

Trusses or steel beams connect joints in a bridge or a building. Graph models help us to design these systems and to understand their properties. Which forces must a joint or a beam withstand?

Software systems

Methods in one program module invoke methods in other modules. As we have seen throughout this book, understanding relationships of this sort is a key to success in software design. Which modules will be affected by a change in an API?

Financial systems

Transactions connect accounts, and accounts connect customers to financial institutions. These are but a few of the graph models that people use to study complex financial transactions, and to profit from better understanding them. Which transactions are routine and which are indicative of a significant event that might translate into profits?

Image
Image

Some of these graphs are models of natural phenomena, where our goal is to gain a better understanding of the natural world by developing simple models and then using them to formulate hypotheses that we can test. Other graph models are of networks that we engineer, where our goal is to build a better network or to better maintain a network by understanding its basic characteristics.

Graphs are useful models whether they are small or massive. A graph having just dozens of vertices and edges (for example, one modeling a chemical compound, where vertices are atoms and edges are bonds) is already a complicated combinatorial object because there are a huge number of possible graphs, so understanding the structures of the particular ones at hand is important. A graph having billions or trillions of vertices and edges (for example, a government database containing all phone calls or a graph model of the human nervous system) is vastly more complex, and presents significant computational challenges.

Processing graphs typically involves building a graph from information in a database and then answering questions about the graph. Beyond the application-specific questions in the examples just cited, we often need to ask basic questions about graphs. How many vertices and edges does the graph have? What are the neighbors of a given vertex? Some questions depend on an understanding of the structure of a graph. For example, a path in a graph is a sequence of vertices connected by edges. Is there a path connecting two given vertices? What is the shortest such path? What is the maximum length of a shortest path in the graph (the graph’s diameter)? We have already seen in this book several examples of questions from scientific applications that are much more complicated than these. What is the probability that a random surfer will land on each vertex? What is the probability that a system represented by a certain graph percolates?

Image

As you encounter complex systems in later courses, you are certain to encounter graphs in many different contexts. You may also study their properties in detail in later courses in mathematics, operations research, or computer science. Some graph-processing problems present insurmountable computational challenges; others can be solved with relative ease with data-type implementations of the sort we have been considering.

Graph data type

Graph-processing algorithms generally first build an internal representation of a graph by adding edges, then process it by iterating through the vertices and through the edges that are adjacent to a vertex. The API at the bottom of this page supports such processing. As usual, this API reflects several design choices, each made from among various alternatives, some of which we now briefly discuss.

Undirected graph

Edges are undirected: an edge that connects v to w is the same as one that connects w to v. Our interest is in the connection, not the direction. Directed edges (for example, one-way streets in road maps) require a slightly different data type (see EXERCISE 4.5.38).

String vertex type

We assume that vertices are strings. We might use a more general vertex type, to allow clients to build graphs with objects of any comparable or hashable type. We leave such implementations to EXERCISE 4.5.10 because string vertices suffice for the applications that we consider here.

Image
Implicit vertex creation

When an object is used as an argument to addEdge(), we assume that it is a (string) vertex name. If no vertex with that name has yet been created, our implementation creates one. The alternative design of having an addVertex() method requires more client code (to create the vertices) and more cumbersome implementation code (to check that edges connect vertices that have previously been created).

Self-loops and parallel edges

Although the API does not explicitly address the issue, we assume that implementations do allow self-loops (edges connecting a vertex to itself) but do not allow parallel edges (two copies of the same edge). Checking for self-loops and parallel edges is easy; our choice is to omit both checks.

Client query methods

We also include the methods countV() and countE() in our API to provide to the client the number of vertices and edges in the graph. Similarly, the methods degree(), hasVertex(), and hasEdge() are useful in client code. All of these implementations are constant-time one-liners.

None of these design decisions is sacrosanct; they are simply the choices that we have made for the code in this book. Some other choices might be appropriate in various situations, and some decisions are still left to implementations. It is wise to carefully consider the choices that you make for design decisions like this and to be prepared to defend them.

PROGRAM 4.5.1 (graph.py) implements this API. Its internal representation is a symbol table of sets: the keys are vertices and the values are the sets of neighbors—the vertices adjacent to the key. A small example is illustrated at right. To implement this representation, we use the two built-in data types dict and set that we introduced in SECTION 4.4. This choice leads to three important properties:

• Clients can efficiently iterate over the graph vertices.

• Clients can efficiently iterate over a vertex’s neighbors.

• Space used is proportional to the number of vertices plus the number of edges.

These properties follow from basic properties of dict and set. As you will see, the two iterators are at the heart of graph processing.

Image

Program 4.5.1 Graph data type (graph.py)

import sys
import stdio
from instream import InStream

class Graph:
    # See text for __str__() and __init__()

    def addEdge(self, v, w):
        if not self.hasVertex(v): self._adj[v] = set()
        if not self.hasVertex(w): self._adj[w] = set()
        if not self.hasEdge(v, w):
            self._e += 1
            self._adj[v].add(w)
            self._adj[w].add(v)
    def adjacentTo(self, v): return iter(self._adj[v])
    def vertices(self):      return iter(self._adj)
    def hasVertex(self, v):  return v in self._adj
    def hasEdge(self, v, w): return w in self._adj[v]
    def countV(self):        return len(self._adj)
    def countE(self):        return self._e
    def degree(self, v):     return len(self._adj[v])

def main():
    file = sys.argv[1]
    graph = Graph(file)
    stdio.writeln(graph)

if __name__ == '__main__': main()


instance variables
 _e  | # of edges
_adj | adjacency lists

This implementation uses the built-in types dict and set (see SECTION 4.4) to implement the graph data type. Clients can build graphs by adding edges one at a time or by reading from a file; they can process graphs by iterating over the set of all vertices or over the set of vertices adjacent to a given vertex. The test client builds a graph from the file specified on the command line.


% more tinygraph.txt
A B
A C
C G
A G
H A
B C
B H


% python graph.py tinygraph.txt
A B C G H
B A C H
C A B G
G A C
H A B


A natural way to write a Graph is to put the vertices one per line, each followed by a list of its immediate neighbors. Accordingly, we support the built-in function str() by implementing __str__() as follows:

def __str__(self):
    s = ''
    for v in self.vertices():
        s += v + ' '
        for w in self.adjacentTo(v):
            s += w + ' '
        s += '\n'
    return s

The resulting string includes two representations of each edge, one for the case in which we discover that w is a neighbor of v and one for the case in which we discover that v is a neighbor of w. Many graph algorithms are based on this basic paradigm of processing each edge in the graph (twice). This implementation is intended for use only for small graphs, as the running time is quadratic in the string length on some systems (see EXERCISE 4.5.3).
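For instance, the following minimal sketch (assuming graph refers to a Graph object, as in the test client) counts each edge twice by iterating over every vertex and each of its neighbors, which is the same paradigm in miniature:

count = 0
for v in graph.vertices():
    for w in graph.adjacentTo(v):
        count += 1
# Each edge v-w is visited once from v and once from w, so (in a graph
# with no self-loops) the count is exactly twice the number of edges.
assert count == 2 * graph.countE()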

The output format for str() also defines a reasonable input file format. The __init__() method supports creating a graph from a file in this format (each line is a vertex name followed by the names of neighbors of that vertex, separated by whitespace). For flexibility, we allow for the use of delimiters besides whitespace (so that, for example, vertex names may contain spaces), as follows:

def __init__(self, filename=None, delimiter=None):
    self._e = 0
    self._adj = dict()
    if filename is not None:
        instream = InStream(filename)
        while instream.hasNextLine():
            line = instream.readLine()
            names = line.split(delimiter)
            for i in range(1, len(names)):
                self.addEdge(names[0], names[i])

Note that the constructor (with the default whitespace delimiter) works properly even when the input is a list of edges, one per line, as in the test client for PROGRAM 4.5.1. Also, the constructor with default arguments creates an empty graph. Adding __init__() and __str__() to Graph provides a complete data type suitable for a broad variety of applications, as we will now see.
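As a quick illustration (a minimal sketch, not one of the book's programs; the vertex names are arbitrary), the completed data type can be used like this:

from graph import Graph

g = Graph()                    # default arguments: an empty graph
g.addEdge('A', 'B')            # vertices are created implicitly
g.addEdge('A', 'C')
print(g.countV(), g.countE())  # 3 2
print(g.hasEdge('B', 'A'))     # True: edges are undirected
print(g)                       # one line per vertex, followed by its neighbors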

Graph client example

As a first graph-processing client, we consider an example of social relationships—one that is certainly familiar to you and for which extensive data is readily available.

On the booksite you can find the file movies.txt (and many similar movie-cast files), which contains a list of movies and the performers who appeared in them. Each line gives the name of a movie followed by the cast (a list of the names of the performers who appeared in that movie). Since names have spaces and commas in them, we use the '/' character as a delimiter. Now you can see why we provided graph clients with a way to specify the delimiter.

If you study movies.txt, you will notice a number of characteristics that, though minor, need attention when working with the database:

• Cast lists are not in alphabetical order.

• Movie titles and performer names are Unicode strings.

• Movies have the year in parentheses after the title.

• Multiple performers with the same name are differentiated by Roman numerals within parentheses.

Depending on your terminal and operating system settings, special characters may be replaced by blanks or question marks. These types of anomalies are common when working with large amounts of real-world data. If you experience problems, consult the booksite for details on how to configure your environment to work with Unicode characters properly.

% more movies.txt
...
Tin Men (1987)/DeBoy, David/Blumenfeld, Alan/... /Geppi, Cindy/Hershey, Barbara
Tirez sur le pianiste (1960)/Heymann, Claude/.../Berger, Nicole (I)
Titanic (1997)/Mazin, Stan/...DiCaprio, Leonardo/.../Winslet, Kate/...
Titus (1999)/Weisskopf, Hermann/Rhys, Matthew/.../McEwan, Geraldine
To Be or Not to Be (1942)/Verebes, Ernö (I)/.../Lombard, Carole (I)
To Be or Not to Be (1983)/.../Brooks, Mel (I)/.../Bancroft, Anne/...
To Catch a Thief (1955)/París, Manuel/.../Grant, Cary/.../Kelly, Grace/...
To Die For (1995)/Smith, Kurtwood/.../Kidman, Nicole/.../ Tucci, Maria
...

Movie database example

Using Graph, we can compose a simple and convenient client for extracting information from movies.txt. We begin by building a Graph to better structure the information. What should the vertices and edges model? Should the vertices be movies, with edges connecting two movies if a performer has appeared in both? Should the vertices be performers, with edges connecting two performers if both have appeared in the same movie? These choices are both plausible, but which should we use? This decision affects both the client and the implementation code. Another way to proceed (which we choose because it leads to simple implementation code) is to have vertices for both the movies and the performers, with an edge connecting each movie to each performer in that movie. As you will see, programs that process this graph can answer a great variety of interesting questions for us.

PROGRAM 4.5.2 (invert.py) is a first example. It is a Graph client that takes a query, such as the name of a movie, and writes the list of performers who appear in that movie.

Image

Program 4.5.2 Using a graph to invert an index (invert.py)

import sys
import stdio
from graph import Graph

file = sys.argv[1]
delimiter = sys.argv[2]
graph = Graph(file, delimiter)

while stdio.hasNextLine():
    v = stdio.readLine()
    if graph.hasVertex(v):
        for w in graph.adjacentTo(v):
            stdio.writeln(' ' + w)


   file   | input file
delimiter | vertex delimiter
  graph   | graph
    v     | query
    w     | neighbor of v

This Graph client takes the name of a graph file and a delimiter as command-line arguments, builds a graph from the file, and then repeatedly reads a vertex name from standard input and writes the neighbors of that vertex. When the file is a movie-cast file (such as movies.txt), the graph is bipartite and the program amounts to an interactive inverted index.


% python invert.py tinygraph.txt " "
C
  A
  B
  G
A
  B
  C
  G
  H


% python invert.py movies.txt "/"
Da Vinci Code, The (2006)
  Fosh, Christopher
  Sciarappa, Fausto Maria
  Zaza, Shane
  L'Abidine, Dhaffer
  Bettany, Paul
  ...
Bacon, Kevin
  Murder in the First (1995)
  JFK (1991)
  Novocaine (2001)
  In the Cut (2003)
  Where the Truth Lies (2005)
  ...


Typing a movie name and getting its cast does not entail much more than regurgitating the corresponding line in movies.txt. (Note that the cast list appears in arbitrary order because we use a set to represent each adjacency list.) A more interesting feature of invert.py is that you can type the name of a performer and get the list of movies in which that performer has appeared. Why does this work? Even though the database seems to connect movies to performers and not the other way around, the edges in the graph are connections that also connect performers to movies.

A graph in which connections all connect one kind of vertex to another kind of vertex is known as a bipartite graph. As this example illustrates, bipartite graphs have many natural properties that we can often exploit in interesting ways.

As we saw at the beginning of SECTION 4.4, the indexing paradigm is general and very familiar. It is worth reflecting on the fact that building a bipartite graph provides a simple way to automatically invert any index! The movies.txt database is indexed by movie, but we can query it by performer. You could use invert.py in precisely the same way to write the index words appearing on a given page or the codons corresponding to a given amino acid (see the example shown at right), or to invert any of the other indices discussed at the beginning of SECTION 4.2. Since invert.py takes the delimiter as a command-line argument, you can use it to create an interactive inverted index for a .csv file or a text file with space delimiters.

% more amino.csv
TTT,Phe,F,Phenylalanine
TTC,Phe,F,Phenylalanine
TTA,Leu,L,Leucine
TTG,Leu,L,Leucine
TCT,Ser,S,Serine
TCC,Ser,S,Serine
TCA,Ser,S,Serine
TCG,Ser,S,Serine
TAT,Tyr,Y,Tyrosine
...
GGA,Gly,G,Glycine
GGG,Gly,G,Glycine

% python invert.py amino.csv ","
TTA
  Leu
  L
  Leucine
Serine
  TCT
  TCC
  TCA
  TCG

Inverting an index

This inverted-index functionality is a direct benefit of the graph data structure. Next, we examine some of the added benefits to be derived from algorithms that process the data structure.

Shortest paths in graphs

Given two vertices in a graph, a path is a sequence of vertices connected by edges. A shortest path is one with the minimal number of edges over all such paths (there may be multiple shortest paths). Finding a shortest path connecting two vertices in a graph is a fundamental problem in computer science. Shortest paths have been famously and successfully applied to solve large-scale problems in a broad variety of applications, from routing the Internet to financial arbitrage transactions to studying the dynamics of neurons in the brain.

As an example, imagine that you are a customer of an imaginary no-frills airline that serves a limited number of cities with a limited number of routes. Assume that the best way to get from one place to another is to minimize your number of flight segments, because delays in transferring from one flight to another are likely to be lengthy (or perhaps the best thing to do is to consider paying more for a direct flight on another airline!). A shortest-path algorithm is just what you need to plan a trip. Such an application appeals to intuition in understanding the basic problem and our approach to solving it. After covering these topics in the context of this example, we will consider an application where the graph model is more abstract.

Image

Depending on the application, clients have various needs with regard to shortest paths. Do we want the shortest path connecting two given vertices? Just the length of such a path? Will we have a large number of such queries? Is one particular vertex of interest? Our choice is to start with the following API:

Image

In huge graphs or for huge numbers of queries, we have to pay particular attention to API design because the cost of computing paths might prove to be prohibitive. With this design, clients can create a PathFinder object for a given graph and a given vertex, and then use that object either to find the length of the shortest path or to iterate over the vertices on a shortest path to any other vertex in the graph. An implementation of these methods is known as a single-source shortest-path algorithm. A classic algorithm known as breadth-first search provides a direct and elegant solution where the constructor takes linear time, distanceTo() takes constant time, and pathTo() takes time proportional to the length of the path. Before examining our implementation, we will consider some clients.
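For example, a client might use the API as in the following minimal sketch (assuming the route-map file routes.txt used in the examples below is in the working directory; JFK and PHX are illustrative airport codes from that file):

from graph import Graph
from pathfinder import PathFinder

graph = Graph('routes.txt')      # whitespace-delimited connections
pf = PathFinder(graph, 'JFK')    # source vertex JFK
if pf.hasPathTo('PHX'):
    print(pf.distanceTo('PHX'))  # length of a shortest path
    for v in pf.pathTo('PHX'):   # the vertices on that path, in order
        print(v)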

Single-source client

Suppose that you have available to you the graph of vertices and connections for your no-frills airline’s route map. Then, using your home city as a source, you can compose a client that writes your route anytime you want to go on a trip. PROGRAM 4.5.3 (separation.py) is a PathFinder client that provides this functionality for any graph. This sort of client is particularly useful in applications where we anticipate numerous queries from the same source. In this situation, the cost of building a PathFinder is amortized over the cost of all the queries. You are encouraged to explore the properties of shortest paths by running PathFinder on our sample input routes.txt or any input model that you choose. In fact, many people frequently use an algorithm like this, in the map application on their phone.

Degrees of separation

One of the classic applications of shortest-paths algorithms is to find the degrees of separation between individuals in a social network. To fix ideas, we discuss this application in terms of a recently popularized pastime known as the Kevin Bacon game, which uses the movie–performer graph that we just considered. Kevin Bacon is a prolific actor who has appeared in many movies. We assign every performer who has appeared in a movie a Kevin Bacon number: Bacon himself is 0, any performer who has been in the same cast as Bacon has a Kevin Bacon number of 1, any other performer (except Bacon) who has been in the same cast as a performer whose number is 1 has a Kevin Bacon number of 2, and so forth. For example, Meryl Streep has a Kevin Bacon number of 1 because she appeared in The River Wild with Kevin Bacon. Nicole Kidman’s number is 2: although she did not appear in any movie with Kevin Bacon, she was in Cold Mountain with Donald Sutherland, and Sutherland appeared in Animal House with Kevin Bacon. Given the name of a performer, the simplest version of the game is to find some alternating sequence of movies and performers that leads back to Kevin Bacon. For example, a movie buff might know that Tom Hanks was in Joe Versus the Volcano with Lloyd Bridges, who was in High Noon with Grace Kelly, who was in Dial M for Murder with Patrick Allen, who was in The Eagle Has Landed with Donald Sutherland, who we know was in Animal House with Kevin Bacon. But this knowledge does not suffice to establish Tom Hanks’s Bacon number (it is actually 1 because he was in Apollo 13 with Kevin Bacon). You can see that the Kevin Bacon number is defined by counting the movies in a shortest such sequence, so it is hard to be sure whether someone wins the game without using a computer. Remarkably, separation.py (PROGRAM 4.5.3) is just the program you need to find a shortest path that establishes the Kevin Bacon number of any performer in movies.txt—the number is precisely half the distance. You might enjoy using this program, or extending it to answer some entertaining questions about the movie business or in one of many other domains. For example, mathematicians play this same game with the graph defined by paper co-authorship and their connection to Paul Erdös, a prolific 20th-century mathematician. Similarly, everyone in New Jersey seems to have a Bruce Springsteen number of 2, because everyone in the state seems to know someone who claims to know Bruce.
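The following sketch (not the book's separation.py) makes the half-the-distance observation concrete, assuming movies.txt is available in the working directory:

from graph import Graph
from pathfinder import PathFinder

graph = Graph('movies.txt', '/')
pf = PathFinder(graph, 'Bacon, Kevin')
if pf.hasPathTo('Kidman, Nicole'):
    # The path alternates performer, movie, performer, ..., so the Kevin
    # Bacon number is half the graph distance (4 // 2 == 2 in this case).
    print(pf.distanceTo('Kidman, Nicole') // 2)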


Program 4.5.3 Shortest-paths client (separation.py)

import sys
import stdio
from graph import Graph
from pathfinder import PathFinder

file = sys.argv[1]
delimiter = sys.argv[2]
graph = Graph(file, delimiter)

s = sys.argv[3]
pf = PathFinder(graph, s)

while stdio.hasNextLine():
    t = stdio.readLine()
    if pf.hasPathTo(t):
        distance = pf.distanceTo(t)
        for v in pf.pathTo(t):
            stdio.writeln('   ' + v)
        stdio.writeln('distance ' + str(distance))


  file    | name of graph file
delimiter | delimiter of vertex names
  graph   | graph
   pf     | PathFinder from s
    s     | source vertex
    t     | destination vertex
    v     | vertex on path from s to t

This PathFinder test client takes a file name, a delimiter, and a source vertex as command-line arguments. It builds a Graph from the file, assuming that each line of the file specifies a vertex and a list of vertices connected to that vertex, separated by the delimiter. When you type a destination vertex t on standard input, you get the shortest path from the source to that destination.


% more routes.txt
JFK MCO
ORD DEN
ORD HOU
DFW PHX
JFK ATL
ORD DFW
ORD PHX
ATL HOU
DEN PHX
...


Image

% python separation.py movies.txt "/" "Bacon, Kevin"
Kidman, Nicole
   Bacon, Kevin
   Animal House (1978)
   Sutherland, Donald (I)
   Cold Mountain (2003)
   Kidman, Nicole
distance 4
Hanks, Tom
   Bacon, Kevin
   Apollo 13 (1995)
   Hanks, Tom
distance 2

Degrees of separation from Kevin Bacon


Other clients

PathFinder is a versatile data type that can be put to many practical uses. For example, it is easy to develop a client that handles pairwise source–destination requests on standard input, by building a PathFinder for each vertex (see EXERCISE 4.5.11). Travel services use precisely this approach to handle requests at a very high service rate. Since this client builds a PathFinder for each vertex (each of which might consume space proportional to the number of vertices), space usage might be a limiting factor in using it for huge graphs. For an even more performance-critical application that is conceptually the same, consider an Internet router that has a graph of connections among machines available and must decide the best next stop for packets heading to a given destination. To do so, it can build a PathFinder with itself as the source, then send a packet heading to the first vertex in pf.pathTo(w)—that is, the next stop on the shortest path to w. Or a central authority might build a PathFinder for each of several dependent routers and use them to issue routing instructions. The ability to handle such requests at a high service rate is one of the prime responsibilities of Internet routers, and shortest-paths algorithms are a critical part of the process.
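A minimal sketch of the precomputation idea (assuming the routes.txt file used earlier; a full interactive client is the subject of EXERCISE 4.5.11):

from graph import Graph
from pathfinder import PathFinder

graph = Graph('routes.txt')
# Build one PathFinder per vertex; space grows with V * V in the worst case.
finders = {v: PathFinder(graph, v) for v in graph.vertices()}
# Any source-destination query is then answered without further search.
s, t = 'JFK', 'PHX'
print(finders[s].distanceTo(t))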

Shortest-path distances

We define the distance between two vertices to be the length of the shortest path between them. The first step in understanding breadth-first search is to consider the problem of computing distances between the source and each vertex (the implementation of distanceTo() in PathFinder). Our approach is to compute and store away all the distances in the constructor, and then just return the requested value when a client invokes distanceTo(). To associate an integer distance with each vertex name, we use a symbol table:

_distTo = dict()

The purpose of this symbol table is to associate with each vertex the length of the shortest path (the distance) between that vertex and s. We begin by giving s the distance 0 with _distTo[s] = 0, and we assign to s’s neighbors the distance 1 with the following code:

for v in g.adjacentTo(s):
    self._distTo[v] = 1

But then what do we do? If we blindly set the distances to all the neighbors of each of those neighbors to 2, then not only would we face the prospect of unnecessarily setting many values twice (neighbors may have many common neighbors), but we would also set s’s distance to 2 (it is a neighbor of each of its neighbors)—and we clearly do not want that outcome. The solution to these difficulties is simple:

• Consider the vertices in order of their distance from s.

• Ignore vertices whose distance to s is already known.

To organize the computation, we use a FIFO queue. Starting with s on the queue, we perform the following operations until the queue is empty:

• Dequeue a vertex v.

• Assign all of v’s unknown neighbors a distance 1 greater than v’s distance.

• Enqueue all of the unknown neighbors.

This method dequeues the vertices in nondecreasing order of their distance from the source s. Following a trace of this method on a sample graph will help persuade you that it is correct. Showing that this method labels each vertex v with its distance to s is an exercise in mathematical induction (see EXERCISE 4.5.13).
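The following minimal sketch captures just the distance computation (no paths), assuming g is a Graph and s a vertex in it; it uses Python's collections.deque as the FIFO queue instead of the book's Queue class:

from collections import deque

def distances(g, s):
    dist = {s: 0}                      # distance from s to each known vertex
    queue = deque([s])
    while queue:
        v = queue.popleft()            # dequeue a vertex v
        for w in g.adjacentTo(v):
            if w not in dist:          # ignore vertices already labeled
                dist[w] = dist[v] + 1  # one greater than v's distance
                queue.append(w)        # enqueue the newly discovered neighbor
    return dist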

Shortest-paths tree

We need not only distances from the source, but also paths. To implement pathTo(), we use a subgraph known as the shortest-paths tree, defined as follows:

• Put the source vertex s at the root of the tree.

• Put vertex v’s neighbors in the tree if they are added to the queue, with an edge connecting each to v.

Since we enqueue each vertex only once, this structure is a proper tree: it consists of a root (the source) connected to one subtree for each neighbor of the source. Studying such a tree, you can see immediately that the distance from each vertex to the root in the tree is the same as the shortest-path distance to the source in the graph. More importantly, each path in the tree is a shortest path in the graph. This observation is important because it gives us an easy way to provide clients with the shortest paths themselves (implement pathTo() in PathFinder).

First, we maintain a symbol table associating each vertex with the vertex one step nearer to the source on the shortest path:

_edgeTo = dict()

Image

To each vertex w, we want to associate the previous stop on the shortest path from the source to w. Augmenting the shortest-distances method to also compute this information is easy: when we enqueue w because we first discover it as a neighbor of v, we do so precisely because v is the previous stop on the shortest path from the source to w, so we can assign _edgeTo[w] = v. The _edgeTo data structure is nothing more than a representation of the shortest-paths tree: it provides a link from each node to its parent in the tree. Then, to respond to a client request for the path from the source to v (a call to pathTo(v) in PathFinder), we follow these links up the tree from v, which traverses the path in reverse order. We collect the vertices in an array as we encounter them, then reverse the array, so the client gets the path from s to v when using the iterator returned from pathTo().

Breadth-first search

PROGRAM 4.5.4 (pathfinder.py) is an implementation of the single-source shortest-paths API, based on the ideas just discussed. It maintains two symbol tables. One symbol table stores the distance between the source vertex and each vertex; the other symbol table stores the previous stop on a shortest path from the source vertex to each vertex. The constructor uses a queue to keep track of vertices that have been encountered (neighbors of vertices for which the shortest paths have been found but whose neighbors have not yet been examined). This process is known as breadth-first search (BFS) because it searches broadly in the graph.

Image

By contrast, another important graph-search method known as depth-first search is based on a recursive method like the one we used for percolation in PROGRAM 2.4.6 and searches deeply into the graph. Depth-first search tends to find long paths; breadth-first search is guaranteed to find shortest paths.

Performance

The cost of graph-processing algorithms typically depends on two graph parameters: the number of vertices V and the number of edges E. For simplicity, we assume that there is a path between the source vertex and each other vertex. Then, as implemented in PathFinder, the time required for breadth-first search is linear in the size of the input—proportional to E + V—under suitable technical assumptions about the hash function. To convince yourself of this fact, first observe that the outer (while) loop iterates exactly V times, once for each vertex, because we are careful to ensure that a vertex is not enqueued more than once. Next, observe that the inner (for) loop iterates 2E times over all iterations, because it examines each edge exactly twice—once for each of the two vertices it connects. The body of the inner (for) loop requires at least one search operation (to determine if the vertex has been previously encountered) and perhaps one get and two put operations (to update the distance and path information) on symbol tables of size at most V. Thus, the overall running time is proportional to E + V (under suitable technical assumptions about the hash function). If we were to use binary search trees instead of hash tables, then the overall running time would be linearithmic—proportional to E log V.

Image

Program 4.5.4 Shortest-paths implementation (pathfinder.py)

import stdio
import graph
from linkedqueue import Queue

class PathFinder:
    def __init__(self, graph, s):
        self._distTo = dict()
        self._edgeTo = dict()

        queue = Queue()
        queue.enqueue(s)
        self._distTo[s] = 0
        self._edgeTo[s] = None
        while not queue.isEmpty():
            v = queue.dequeue()
            for w in graph.adjacentTo(v):
                if w not in self._distTo:
                    queue.enqueue(w)
                    self._distTo[w] = 1 + self._distTo[v]
                    self._edgeTo[w] = v

    def distanceTo(self, v):
        return self._distTo[v]

    def hasPathTo(self, v):
        return v in self._distTo

    def pathTo(self, v):
        path = []
        while v is not None:
            path += [v]
            v = self._edgeTo[v]
        return reversed(path)


instance variables
_distTo | distance to s
_edgeTo | previous vertex on shortest path from s


graph | graph
  s   | source
queue | queue of vertices to visit
  v   | current vertex
  w   | neighbors of v

This class allows clients to find (shortest) paths connecting a specified vertex to other vertices in a graph. See PROGRAM 4.5.3 and EXERCISE 4.5.17 for sample clients.


Adjacency-matrix representation

Without proper data structures, fast performance for graph-processing algorithms is sometimes not easy to achieve, and so should not be taken for granted. For example, another graph representation that programmers often implement, known as the adjacency-matrix representation, uses a symbol table to map vertex names to integers between 0 and V–1, then maintains a V-by-V array of booleans with True in the element in row i and column j (and, symmetrically, the element in row j and column i) if there is an edge connecting the vertex corresponding to i with the vertex corresponding to j, and False if there is no such edge. We have already used similar representations in this book, when studying the random-surfer model for ranking web pages in SECTION 1.6. The adjacency-matrix representation is simple, but infeasible for use with huge graphs—a graph with a million vertices would require an adjacency matrix with a trillion elements. Understanding this distinction for graph-processing problems makes the difference between solving a problem that arises in a practical situation and not being able to address it at all.
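For concreteness, here is a hedged sketch of the adjacency-matrix representation (assuming vertex names have already been mapped to integers 0 through V−1); note that the space is proportional to V² no matter how few edges the graph has:

V = 5
adj = [[False] * V for _ in range(V)]   # V-by-V matrix of booleans

def add_edge(i, j):
    adj[i][j] = True    # edge i-j, stored in row i ...
    adj[j][i] = True    # ... and symmetrically in row j

add_edge(0, 1)
print(adj[1][0])        # True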

Image

Breadth-first search is a fundamental algorithm that you use in applications on your mobile device to find your way around an airline route map or a city subway or in numerous similar situations. As indicated by our degrees-of-separation example, it is used for countless other applications as well, from crawling the web or routing packets on the Internet, to studying infectious disease, models of the brain, or relationships among genomic sequences. Many of these applications involve huge graphs, so an efficient algorithm is absolutely essential.

An important generalization of the shortest-paths model is to associate a positive weight (which may represent distance or time) with each edge and seek to find a path that minimizes the sum of the edge weights. If you take later courses in algorithms or in operations research, you will learn a generalization of breadth-first search known as Dijkstra’s algorithm that solves this problem in linearithmic time. When you get directions from a GPS device or a map application on your phone, Dijkstra’s algorithm is the basis for solving the associated shortest-path problems. These important and omnipresent applications are just the tip of an iceberg, because graph models are much more general than maps.

Small-world graphs

Scientists have identified a particularly interesting class of graphs that arise in numerous applications in the natural and social sciences. Small-world graphs are characterized by the following three properties:

• They are sparse: the number of edges is small relative to the maximum possible number of edges (not much larger than the number of vertices).

• They have short average path lengths: if you pick two random vertices, the length of the shortest path between them is short.

• They exhibit local clustering: if two vertices are neighbors of a third vertex, then the two vertices are likely to be neighbors of each other.

We refer to graphs having these three properties collectively as exhibiting the small-world phenomenon. The term small world refers to the idea that the preponderance of vertices have both local clustering and short paths to other vertices. The modifier phenomenon refers to the unexpected fact that so many graphs that arise in practice are sparse, exhibit local clustering, and have short paths. Beyond the social-relationships applications just considered, small-world graphs have been used to study the marketing of products or ideas, the formation and spread of fame and fads, the analysis of the Internet, the construction of secure peer-to-peer networks, the development of routing algorithms and wireless networks, the design of electrical power grids, modeling information processing in the human brain, the study of phase transitions in oscillators, the spread of infectious viruses (in both living organisms and computers), and many other applications. Starting with the seminal work of Watts and Strogatz in the 1990s, an intensive amount of research has gone into quantifying the small-world phenomenon.

A key question in such research is the following: Given a graph, how can we tell whether it is a small-world graph? To answer this question, we begin by imposing the conditions that the graph is not small (say, 1,000 vertices or more) and that it is connected (there exists some path connecting each pair of vertices). Then, we need to settle on specific thresholds for each of the small-world properties:

• By sparse, we mean the average vertex degree is less than 20 lg V.

• By short average path length, we mean the average length of the shortest path between two vertices is less than 10 lg V.

• By locally clustered, we mean that a certain quantity known as the clustering coefficient should be greater than 10%.

The definition of locally clustered is a bit more complicated than the definitions of sparsity and average path length. Intuitively, the clustering coefficient of a vertex represents the probability that if you pick two of its neighbors at random, they will also be connected by an edge. More precisely, if a vertex has t neighbors, then there are t(t−1)/2 possible edges that connect those neighbors; its local clustering coefficient is the fraction of those edges that are in the graph (or 0 if the vertex has degree 0 or 1). The clustering coefficient of a graph is the average of the local clustering coefficients of its vertices. If that average is greater than 10%, we say that the graph is locally clustered. The diagram below calculates these three quantities for a tiny graph.

Image
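To make the definition concrete, the following sketch (not part of smallworld.py) computes the local clustering coefficient of a single vertex using only the Graph API:

def local_clustering(graph, v):
    neighbors = list(graph.adjacentTo(v))
    t = len(neighbors)
    if t <= 1:
        return 0.0                   # degree 0 or 1: coefficient is 0
    possible = t * (t - 1) / 2       # t(t-1)/2 possible edges among neighbors
    actual = 0
    for i in range(t):
        for j in range(i + 1, t):
            if graph.hasEdge(neighbors[i], neighbors[j]):
                actual += 1
    return actual / possible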

To better familiarize you with these definitions, we next define some simple graph models, and consider whether they describe small-world graphs by checking whether they exhibit the three requisite properties.

Complete graphs

A complete graph with V vertices has V(V–1)/2 edges, one connecting each pair of vertices. Complete graphs are not small-world graphs. They have short average path length (every shortest path has length 1) and they exhibit local clustering (the cluster coefficient is 1), but they are not sparse (the average vertex degree is V–1, which is much greater than 20 lg V for large V).

Ring graphs

A ring graph is a set of V vertices equally spaced on the circumference of a circle, with each vertex connected to its neighbor on either side. In a k-ring graph, each vertex is connected to its k nearest neighbors on either side. The diagram at right illustrates a 2-ring graph with 16 vertices. Ring graphs are also not small-world graphs. For example, 2-ring graphs are sparse (every vertex has degree 4) and exhibit local clustering (the cluster coefficient is 1/2), but their average path length is not short (see EXERCISE 4.5.17).

Random graphs

The Erdös-Renyi model is a well-studied model for generating random graphs. In this model, we build a random graph on V vertices by including each possible edge with probability p. Random graphs with a sufficient number of edges are very likely to be connected and have short average path lengths, but they are not small-world graphs, because they are not locally clustered (see EXERCISE 4.5.43).
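A hedged sketch of a generator for this model, using our Graph data type (vertex names are the integers 0 through V−1, as strings; a vertex that happens to get no edges is never created, since Graph has no isolated vertices):

import random
from graph import Graph

def random_graph(V, p):
    g = Graph()
    for i in range(V):
        for j in range(i + 1, V):
            if random.random() < p:   # include each possible edge with probability p
                g.addEdge(str(i), str(j))
    return g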

Image

These examples illustrate that developing a graph model that satisfies all three properties simultaneously is a puzzling challenge. Take a moment to try to design a graph model that you think might do so. After you have thought about this problem, you will realize that you are likely to need a program to help with calculations. Also, you may agree that it is quite surprising that small-world graphs are found so often in practice. Indeed, you might be wondering if any graph is a small-world graph!

Choosing 10% for the clustering threshold instead of some other fixed percentage is somewhat arbitrary, as is the choice of 20 lg V for the sparsity threshold and 10 lg V for the short paths threshold, but we often do not come close to these borderline values. For example, consider the web graph, which has a vertex for each web page and an edge connecting two web pages if they are connected by a link. Scientists estimate that the number of clicks to get from one web page to another is rarely more than about 30. Since there are billions of web pages, this estimate implies that the average length of a path between two vertices is very short, much lower than our 10 lg V threshold (which would be about 300 for 1 billion vertices).

Image

Having settled on the definitions, testing whether a graph is a small-world graph can still be a significant computational burden. As you probably have suspected, the graph-processing data types that we have been considering provide precisely the tools that we need. PROGRAM 4.5.5 (smallworld.py) is a Graph and PathFinder client that implements these tests. Without the efficient data structures and algorithms that we have been considering, the cost of this computation would be prohibitive. Even so, for large graphs (such as movies.txt), we must resort to statistical sampling to estimate the average path length and the cluster coefficient in a reasonable amount of time (see EXERCISE 4.5.41) because the functions averagePathLength() and clusteringCoefficient() take quadratic time.


Program 4.5.5 Small-world test (smallworld.py)

from pathfinder import PathFinder

def averageDegree(graph):
    return 2.0 * graph.countE() / graph.countV()

def averagePathLength(graph):
    total = 0
    for v in graph.vertices():
        pf = PathFinder(graph, v)
        for w in graph.vertices():
            total += pf.distanceTo(w)
    return 1.0 * total / (graph.countV() * (graph.countV() - 1))

def clusteringCoefficient(graph):
    total = 0
    for v in graph.vertices():
        possible = graph.degree(v) * (graph.degree(v) - 1)
        actual = 0
        for u in graph.adjacentTo(v):
            for w in graph.adjacentTo(v):
                if graph.hasEdge(u, w):
                    actual += 1
        if possible > 0:
            total += 1.0 * actual / possible
    return total / graph.countV()

# See Exercise 4.5.21 for a test client.

This Graph and PathFinder client computes the values of various graph parameters to test whether the graph exhibits the small-world phenomenon.


% python smallworld.py tinygraph.txt " "
5 vertices, 7 edges
average degree         = 2.800
average path length    = 1.300
clustering coefficient = 0.767


A classic small-world graph

Our movie–performer graph is not a small-world graph, because it is bipartite and therefore has a clustering coefficient of 0. Also, some pairs of performers are not connected to each other by any paths. However, the simpler performer–performer graph defined by connecting two performers by an edge if they appeared in the same movie is a classic example of a small-world graph (after discarding performers not connected to Kevin Bacon). The diagram below illustrates the movie–performer and performer–performer graphs associated with a tiny movie-cast file.

PROGRAM 4.5.6 (performer.py) is a script that creates a performer–performer graph from a file in our movie-cast input format. Recall that each line in a movie-cast file consists of a movie followed by all of the performers who appeared in that movie, delimited by slashes. The script connects all pairs of performers in that movie by adding an edge connecting each pair. Doing so for each movie in the input produces a graph that connects the performers, as desired.

Image

Program 4.5.6 Performer–performer graph (performer.py)

import sys
import stdio
import smallworld
from graph import Graph
from instream import InStream

file      = sys.argv[1]
delimiter = sys.argv[2]
graph = Graph()
instream = InStream(file)
while instream.hasNextLine():
    line = instream.readLine()
    names = line.split(delimiter)
    for i in range(1, len(names)):
        for j in range(i+1, len(names)):
            graph.addEdge(names[i], names[j])

degree = smallworld.averageDegree(graph)
length = smallworld.averagePathLength(graph)
cluster = smallworld.clusteringCoefficient(graph)
stdio.writef('number of vertices     = %d\n', graph.countV())
stdio.writef('average degree         = %7.3f\n', degree)
stdio.writef('average path length    = %7.3f\n', length)
stdio.writef('clustering coefficient = %7.3f\n', cluster)

This script is a smallworld.py client that takes the name of a movie-cast file and a delimiter as command-line arguments and creates the associated performer–performer graph. It writes to standard output the number of vertices, the average degree, the average path length, and the clustering coefficient of this graph. It assumes that the performer–performer graph is connected (see EXERCISE 4.5.22) so that the average path length is defined.


% python performer.py tinymovies.txt "/"
number of vertices     = 5
average degree         =   2.800
average path length    =   1.300
clustering coefficient =   0.767


% python performer.py moviesg.txt "/"
number of vertices     = 19044
average degree         = 148.688
average path length    =   3.494
clustering coefficient =   0.911


Since a performer–performer graph typically has many more edges than the corresponding movie–performer graph, we will work for the moment with the smaller performer–performer graph derived from the file moviesg.txt, which contains 1,261 G-rated movies and 19,044 performers (all of which are connected to Kevin Bacon). Now, performer.py tells us that the performer–performer graph associated with moviesg.txt has 19,044 vertices and 1,415,808 edges, so the average vertex degree is 148.7 (about half of 20 lg V = 284.3), which means it is sparse; its average path length is 3.494 (much less than 10 lg V = 142.2), so it has short paths; and its clustering coefficient is 0.911, so it has local clustering. We have found a small-world graph! These calculations validate the hypothesis that social relationship graphs of this sort exhibit the small-world phenomenon. You are encouraged to find other real-world graphs and to test them with smallworld.py. You will find many suggestions in the exercises at the end of this section.

One approach to understanding something like the small-world phenomenon is to develop a mathematical model that we can use to test hypotheses and to make predictions. We conclude by returning to the problem of developing a graph model that can help us to better understand the small-world phenomenon. The trick to developing such a model is to combine two sparse graphs: a 2-ring graph (which has a high cluster coefficient) and a random graph (which has small average path length).

Ring graphs with random shortcuts

One of the most surprising facts to emerge from the work of Watts and Strogatz is that adding a relatively small number of random edges to a sparse graph with local clustering produces a small-world graph. To gain some insight into why this is the case, consider a 2-ring graph, where the diameter (length of path between farthest pair of vertices) is ~ V/4 (see the figure at right). Adding a single edge connecting antipodal vertices decreases the diameter to ~ V/8 (see EXERCISE 4.5.18). Adding V/2 random “shortcut” edges to a 2-ring graph is extremely likely to significantly lower the average path length, making it logarithmic (see EXERCISE 4.5.25). Moreover, it does so while increasing the average degree by only 1 and without lowering the cluster coefficient much below 1/2. That is, a 2-ring graph with V/2 random shortcut edges is extremely likely to be a small-world graph!
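A hedged sketch of a generator for this model (EXERCISE 4.5.22 asks you to develop and test such generators; vertex names are the integers 0 through V−1, as strings):

import random
from graph import Graph

def ring_with_shortcuts(V):
    g = Graph()
    for i in range(V):
        g.addEdge(str(i), str((i + 1) % V))   # neighbor one step away
        g.addEdge(str(i), str((i + 2) % V))   # neighbor two steps away
    for _ in range(V // 2):                   # about V/2 random shortcut edges
        v = random.randrange(V)
        w = random.randrange(V)
        if v != w:                            # skip self-loops
            g.addEdge(str(v), str(w))
    return g

Passing such a graph to the functions in smallworld.py is one way to check these claims empirically.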

Image

Generators that create graphs drawn from such models are simple to develop, and we can use smallworld.py to determine whether the graphs exhibit the small-world phenomenon (see EXERCISE 4.5.22). We also can verify the analytic results that we derived for simple graphs such as tinygraph.txt, complete graphs, and ring graphs. As with most scientific research, new questions arise as quickly as we answer old ones. How many random shortcuts do we need to add to get a short average path length? What is the average path length and the clustering coefficient in a random connected graph? Which other graph models might be appropriate for study? How many samples do we need to accurately estimate the clustering coefficient or the average path length in a huge graph? You can find in the exercises many suggestions for addressing such questions and for further investigations of the small-world phenomenon. With the basic tools and the approach to programming developed in this book, you are well equipped to address this and many other scientific questions.

Image

Lessons

This case study illustrates the importance of algorithms and data structures in scientific research. It also reinforces several of the lessons that we have learned throughout this book, which are worth repeating.

Carefully design your data type

One of our most persistent messages throughout this book is that effective programming is based on a precise understanding of the possible set of data-type values and the operations on them. Using a modern object-oriented programming language such as Python provides a path to this understanding because we design, build, and use our own data types. Our Graph data type is a fundamental one, the product of many iterations and experience with the design choices that we have discussed. The clarity and simplicity of our client code are testimony to the value of taking seriously the design and implementation of basic data types in any program.

Develop code incrementally

As with all of our other case studies, we build software one module at a time, testing and learning about each module before moving to the next.

Solve problems that you understand before addressing the unknown

Our shortest-paths example involving air routes between a few cities is a simple one that is easy to understand. It is just complicated enough to hold our interest while debugging and following through a trace, but not so complicated as to make these tasks unnecessarily laborious.

Keep testing and check results

When working with complex programs that process huge amounts of data, you cannot be too careful in checking your results. Use common sense to evaluate every bit of output that your program produces. Novice programmers have an optimistic mindset (“If the program produces an answer, it must be correct”); experienced programmers know that a pessimistic mindset (“There must be something wrong with this result”) is far better.

Use real-world data

The movies.txt file from the Internet Movie Database is just one example of the data files that are now omnipresent on the web. In past years, such data was often cloaked behind private or parochial formats, but most people are now realizing that simple text formats are much preferred. The various methods in Python’s str data type make it easy to work with real data, which is the best way to formulate hypotheses about real-world phenomena. Start working with small files in the real-world format, so that you can test and learn about performance before attacking huge files.

Reuse software

Another of our most persistent messages is that effective programming is based on an understanding of the fundamental data types available for our use, so that we do not have to rewrite code for basic functions. Our use of dict and set in Graph is a prime example—most programmers still use lower-level representations and implementations that use linked lists or arrays for graphs, which means, inevitably, that they are recomposing code for simple operations such as maintaining and traversing linked lists. Our shortest-paths class PathFinder uses dict, list, Graph, and Queue—an all-star lineup of fundamental data structures.

Performance matters

Without good algorithms and data structures, many of the problems that we have addressed in this chapter would go unsolved, because naïve methods require an impossible amount of time or space. Maintaining an awareness of the approximate resource needs of our programs is essential.

This case study is an appropriate place to end the book because the programs that we have considered are a starting point, not a complete study. This book is a starting point, too, for your further study in science, mathematics, or engineering. The approach to programming and the tools that you have learned here should prepare you well for addressing any computational problem whatsoever.

Q&A

Q. How many different graphs are there with V given vertices?

A. With no self-loops or parallel edges, there are V(V–1)/2 possible edges, each of which can be present or not present, so the grand total is 2^(V(V–1)/2). The number grows to be huge quite quickly, as shown in the following table:

Image

These huge numbers provide some insight into the complexities of social relationships. For example, if you just consider the next nine people that you see on the street, there are over 68 billion mutual-acquaintance possibilities!

Q. Can a graph have a vertex that is not connected to any other vertex by an edge?

A. Good question. Such vertices are known as isolated vertices. Our implementation disallows them. Another implementation might choose to allow isolated vertices by including an explicit addVertex() method for adding a vertex.

Q. Why do the countV() and countE() query methods need to have constant-time implementations? Won’t most clients call such methods only once?

A. It might seem so, but code like this

while i < g.countE():
    ...
    i += 1

would take quadratic time if you were to use a lazy implementation that counts the edges instead of maintaining an instance variable with the number of edges.

Q. Why are Graph and PathFinder in separate classes? Wouldn’t it make more sense to include the PathFinder methods in the Graph API?

A. Finding shortest paths is just one of many graph-processing algorithms. It would be poor software design to include all of them in a single interface. Please reread the discussion of wide interfaces in SECTION 3.3.

Exercises

4.5.1 Find the performer in movies.txt who has appeared in the most movies.

4.5.2 Modify the __str__() method in Graph so that it returns the vertices in sorted order (assuming the vertices are comparable). Hint: Use the built-in sorted() function.

4.5.3 Modify the __str__() method in Graph so that it runs in time linear in the number of vertices and the number of edges in the worst case. Hint: Use the join() method in the str data type. (See EXERCISE 4.1.13.)

4.5.4 Add to Graph a method copy() that creates and returns a new, independent copy of the graph. Any future changes to the original graph should not affect the newly created graph (and vice versa!).

4.5.5 Compose a version of Graph that supports explicit vertex creation and allows self-loops, parallel edges, and isolated vertices (vertices of degree 0).

4.5.6 Add to Graph a method removeEdge() that takes two string arguments and deletes the specified edge from the graph, if present.

4.5.7 Add to Graph a method subgraph() that takes a set of strings as an argument and returns the induced subgraph (the graph consisting of only those vertices and only those edges from the original graph that connect any two of them).

4.5.8 Describe the advantages and disadvantages of using an array or a linked list to represent the neighbors of a vertex instead of using a set.

4.5.9 Compose a Graph client that reads a Graph from a file, then writes the edges in the graph, one per line.

4.5.10 Modify Graph so that it supports vertices of any hashable type.

4.5.11 Implement a PathFinder client allshortestpaths.py that takes the name of a graph file and a delimiter as command-line arguments, builds a PathFinder for each vertex, and then repeatedly takes from standard input the names of two vertices (on one line, separated by the delimiter) and writes the shortest path connecting them. Note: For movies.txt, the vertex names may both be performers, both be movies, or be a performer and a movie.

4.5.12 True or false: At some point during breadth-first search, the queue can contain two vertices, one whose distance from the source is 7 and one whose distance from the source is 9.

Solution: False. The queue can contain vertices of at most two distinct distances d and d+1. Breadth-first search examines the vertices in increasing order of distance from the source. When examining a vertex at distance d, only vertices of distance d+1 can be enqueued.

4.5.13 Prove by induction on the set of vertices visited that PathFinder finds the shortest-path distances from the source to each vertex.

4.5.14 Suppose you use a stack instead of a queue for breadth-first search in PathFinder. Does it still find a path? Does it still correctly compute shortest paths? In each case, prove that it does or give a counterexample.

4.5.15 Compose a program that plots average path length versus the number of random edges as random shortcuts are added to a 2-ring graph on 1,000 vertices.

4.5.16 Add an optional argument k to clusteringCoefficient() in the module smallworld.py so that it computes a local cluster coefficient for the graph based on the total edges present and the total edges possible among the set of vertices within distance k of each vertex. Use the default value k=1 so that your function produces results identical to the function with the same name in smallworld.py.

4.5.17 Show that the cluster coefficient in a k-ring graph is 3(k−1) / (2(2k−1)). Derive a formula for the average path length in a k-ring graph on V vertices as a function of both V and k.

4.5.18 Show that the diameter in a 2-ring graph on V vertices is ~ V/4. Show that if you add one edge connecting two antipodal vertices, then the diameter decreases to ~V/8.

4.5.19 Perform computational experiments to verify that the average path length in a ring graph on V vertices is ~ 1/4 V. Repeat, but add one random edge to the ring graph and verify that the average path length decreases to ~3/16 V.

4.5.20 Add to smallworld.py a function isSmallWorld() that takes a graph as an argument and returns True if the graph exhibits the small-world phenomenon (as defined by the specific thresholds given in the text) and False otherwise.

4.5.21 Implement a test client main() for smallworld.py (PROGRAM 4.5.5) that produces the output given in the sample runs. Your program should take the name of a graph file and a delimiter as command-line arguments; write the number of vertices and edges, the average degree, the average path length, and the clustering coefficient for the graph; and indicate whether the values are too large or too small for the graph to exhibit the small-world phenomenon.

4.5.22 Compose a program to generate random connected graphs and 2-ring graphs with random shortcuts. Using smallworld.py, generate 500 random graphs from both models (with 1,000 vertices each) and compute their average degree, average path length, and clustering coefficient. Compare your results to the corresponding values in the table on page 714.

Image

4.5.23 Compose a smallworld.py and Graph client that generates k-ring graphs and tests whether they exhibit the small-world phenomenon (first do EXERCISE 4.5.20).

4.5.24 In a grid graph, vertices are arranged in an n-by-n grid, with edges connecting each vertex to its neighbors above, below, to the left, and to the right in the grid. Compose a smallworld.py and Graph client that generates grid graphs and tests whether they exhibit the small-world phenomenon (first do EXERCISE 4.5.20).

4.5.25 Extend your solutions to the previous two exercises to also take a command-line argument m and to add m random edges to the graph. Experiment with your programs for graphs with approximately 1,000 vertices to find small-world graphs with relatively few edges.

4.5.26 Compose a Graph and PathFinder client that takes the name of a movie-cast file and a delimiter as arguments and writes a new movie-cast file, but with all movies not connected to Kevin Bacon removed.

Creative Exercises

4.5.27 Large Bacon numbers. Find the performers in movies.txt with the largest, but finite, Kevin Bacon number.

4.5.28 Histogram. Compose a program baconhistogram.py that writes a histogram of Kevin Bacon numbers, indicating how many performers from movies.txt have a Bacon number of 0, 1, 2, 3, .... Include a category for those who have an infinite number (not connected at all to Kevin Bacon).

4.5.29 Performer–performer graph. As mentioned in the text, an alternative way to compute Kevin Bacon numbers is to build a graph where there is a vertex for each performer (but not for each movie), and where two performers are connected by an edge if they appear in a movie together. Calculate Kevin Bacon numbers by running breadth-first search on the performer–performer graph. Compare the running time with the running time on movies.txt. Explain why this approach is so much slower. Also explain what you would need to do to include the movies along the path, as happens automatically with our implementation.

4.5.30 Connected components. A connected component in an undirected graph is a maximal set of vertices that are mutually reachable. Compose a data type ConnectedComponents that computes the connected components of a graph. Include a constructor that takes a Graph as an argument and computes all of the connected components using breadth-first search. Include a method areConnected(v, w) that returns True if v and w are in the same connected component and False otherwise. Also add a method components() that returns the number of connected components.

4.5.31 Flood fill. A Picture is a two-dimensional array of Color values (see SECTION 3.1) that represent pixels. A blob is a collection of neighboring pixels of the same color. Compose a Graph client whose constructor builds a grid graph (see EXERCISE 4.5.24) from a given image and supports the flood fill operation. Given pixel coordinates col and row and a color c, change the color of that pixel and all the pixels in the same blob to c.

4.5.32 Word ladders. Compose a program wordladder.py that takes two 5-letter strings from the command line, reads in a list of 5-letter words from standard input, and writes a shortest word ladder using the words on standard input connecting the two strings (if it exists). Two words can be connected in a word ladder chain if they differ in exactly one letter. As an example, the following word ladder connects green and brown:

green greet great groat groan grown brown

Compose a simple filter that extracts the 5-letter words from a system dictionary to use as standard input, or download a list of 5-letter words from the booksite. (This game, originally known as Doublets, was invented by Lewis Carroll.)
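
One possible structure for wordladder.py, sketched under the assumption that both command-line strings are 5 letters long: it builds the word graph with a quadratic pairwise comparison (fine for dictionary-sized lists) and finds a shortest ladder with breadth-first search.

import sys
from collections import deque

def differ_in_one(a, b):
    return sum(1 for x, y in zip(a, b) if x != y) == 1

def main():
    s, t = sys.argv[1], sys.argv[2]
    words = [w for w in set(sys.stdin.read().split()) | {s, t} if len(w) == 5]
    adj = {w: [] for w in words}
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if differ_in_one(words[i], words[j]):
                adj[words[i]].append(words[j])
                adj[words[j]].append(words[i])
    prev = {s: None}                 # BFS tree: word -> previous word on a shortest ladder
    queue = deque([s])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in prev:
                prev[w] = v
                queue.append(w)
    if t not in prev:
        print('no word ladder connects', s, 'and', t)
    else:
        ladder = []
        while t is not None:
            ladder.append(t)
            t = prev[t]
        print(' '.join(reversed(ladder)))

if __name__ == '__main__':
    main()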

4.5.33 All paths. Compose a Graph client class AllPaths whose constructor takes a Graph as an argument and supports operations to count or write all simple paths between two given vertices s and t in the graph. A simple path does not revisit any vertex more than once. In two-dimensional grids, such paths are referred to as self-avoiding walks (see SECTION 1.4). It is a fundamental problem in statistical physics and theoretical chemistry, for example, to model the spatial arrangement of linear polymer molecules in a solution. Warning: There might be exponentially many paths.
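
A minimal sketch of AllPaths, assuming a dict-of-sets adjacency structure in place of a Graph object. The constructor enumerates every simple path from s to t by depth-first search with backtracking; as the warning says, there may be exponentially many.

class AllPaths:
    def __init__(self, adj, s, t):
        self._paths = []
        self._dfs(adj, s, t, [s], {s})

    def _dfs(self, adj, v, t, path, on_path):
        if v == t:
            self._paths.append(list(path))
            return
        for w in adj[v]:
            if w not in on_path:             # keep the path simple
                path.append(w)
                on_path.add(w)
                self._dfs(adj, w, t, path, on_path)
                path.pop()                   # backtrack
                on_path.remove(w)

    def count(self):
        return len(self._paths)

    def write(self):
        for path in self._paths:
            print('-'.join(str(v) for v in path))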

4.5.34 Percolation threshold. Develop a graph model for percolation, and compose a Graph client that performs the same computation as percolation.py (PROGRAM 2.4.6). Estimate the percolation threshold for triangular, square, and hexagonal grids.

4.5.35 Subway graphs. In the Tokyo subway system, routes are labeled by letters and stops by numbers, such as G-8 or A-3. Stations allowing transfers are sets of stops. Find a Tokyo subway map on the web, develop a simple database format, and compose a Graph client that reads a file and can answer shortest-path queries for the Tokyo subway system. If you prefer, do the Paris subway system, where routes are sequences of names and transfers are possible when two stations have the same name.

4.5.36 Center of the Hollywood universe. We can measure how good a center Kevin Bacon is by computing each performer’s Hollywood number, or average path length. The Hollywood number of Kevin Bacon is the average Bacon number of all the performers in his connected component. The Hollywood number of another performer is computed the same way, making that performer the source instead of Kevin Bacon. Compute Kevin Bacon’s Hollywood number and find a performer with a better Hollywood number than Kevin Bacon. Find the performers (in the same connected component as Kevin Bacon) with the best and worst Hollywood numbers.
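
A minimal sketch of the Hollywood-number computation for one source performer, assuming the performer–performer graph of EXERCISE 4.5.29 as a dict-of-sets adjacency structure (so graph distance from the source is that performer's number) and assuming the source has at least one other performer in its component:

from collections import deque

def hollywood_number(adj, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:                     # breadth-first search from the source
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    others = [d for v, d in dist.items() if v != source]
    return sum(others) / len(others)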

4.5.37 Diameter. The eccentricity of a vertex is the greatest distance between it and any other vertex. The diameter of a graph is the greatest distance between any two vertices (the maximum eccentricity of any vertex). Compose a Graph client diameter.py that can compute the eccentricity of a vertex and the diameter of a graph. Use it to find the diameter of the graph represented by movies.txt.
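
A minimal sketch of diameter.py, assuming a connected graph given as a dict-of-sets adjacency structure. Each eccentricity takes one breadth-first search, so the diameter costs one BFS per vertex.

from collections import deque

def eccentricity(adj, v):
    dist = {v: 0}
    queue = deque([v])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return max(dist.values())

def diameter(adj):
    return max(eccentricity(adj, v) for v in adj)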

4.5.38 Directed graphs. Implement a Digraph data type that represents directed graphs, where the direction of edges is significant: addEdge(v, w) means to add an edge from v to w but not from w to v. Replace adjacentTo() with two methods: adjacentFrom(), to give the set of vertices having edges directed to them from the argument vertex, and adjacentTo(), to give the set of vertices having edges directed from them to the argument vertex. Explain how to modify PathFinder to find shortest paths in directed graphs.
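
A minimal sketch of a Digraph data type with the two adjacency methods described above; the method names follow the exercise, while the rest of the API (addVertex(), vertices()) is a guess at what a client might need.

class Digraph:
    def __init__(self):
        self._out = {}       # v -> set of w such that v->w is an edge
        self._in = {}        # v -> set of w such that w->v is an edge

    def addVertex(self, v):
        self._out.setdefault(v, set())
        self._in.setdefault(v, set())

    def addEdge(self, v, w):
        self.addVertex(v)
        self.addVertex(w)
        self._out[v].add(w)
        self._in[w].add(v)

    def adjacentFrom(self, v):
        # Vertices having edges directed to them from v (v's out-neighbors).
        return iter(self._out[v])

    def adjacentTo(self, v):
        # Vertices having edges directed from them to v (v's in-neighbors).
        return iter(self._in[v])

    def vertices(self):
        return iter(self._out)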

4.5.39 Random surfer. Modify your Digraph class of the previous exercise to make a MultiDigraph class that allows parallel edges. For a test client, run a random surfer simulation that matches randomsurfer.py (PROGRAM 1.6.2).

4.5.40 Transitive closure. Compose a Digraph client class TransitiveClosure whose constructor takes a Digraph as an argument and whose method isReachable(v, w) returns True if w is reachable from v along a directed path in the digraph and False otherwise. Hint: Run breadth-first search from each vertex, as in allshortestpaths.py (EXERCISE 4.5.11).
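
A minimal sketch of TransitiveClosure, assuming a Digraph like the sketch after EXERCISE 4.5.38 (with vertices() and adjacentFrom() methods): run breadth-first search from each vertex and record the set of vertices it reaches.

from collections import deque

class TransitiveClosure:
    def __init__(self, digraph):
        self._reachable = {}            # v -> set of vertices reachable from v
        for s in digraph.vertices():
            marked = {s}
            queue = deque([s])
            while queue:                # breadth-first search from s
                v = queue.popleft()
                for w in digraph.adjacentFrom(v):
                    if w not in marked:
                        marked.add(w)
                        queue.append(w)
            self._reachable[s] = marked

    def isReachable(self, v, w):
        return w in self._reachable[v]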

4.5.41 Statistical sampling. Use statistical sampling to estimate the average path length and clustering coefficient of a graph. For example, to estimate the clustering coefficient, pick t random vertices and compute the average of the clustering coefficients of those vertices. Your functions should be orders of magnitude faster than the corresponding functions in smallworld.py.
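
A minimal sketch of the clustering-coefficient estimate, assuming a dict-of-sets adjacency structure and t no larger than the number of vertices; the average-path-length estimate is analogous (run breadth-first search from each of t random vertices and average the resulting distances).

import random

def local_clustering(adj, v):
    neighbors = list(adj[v])
    d = len(neighbors)
    if d < 2:
        return 0.0                    # convention: vertices of degree < 2 contribute 0
    # Count edges among v's neighbors.
    links = sum(1 for i in range(d) for j in range(i + 1, d)
                if neighbors[j] in adj[neighbors[i]])
    return 2.0 * links / (d * (d - 1))

def estimate_clustering(adj, t=100):
    sample = random.sample(list(adj), t)
    return sum(local_clustering(adj, v) for v in sample) / t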

4.5.42 Cover time. A random walk in a connected undirected graph moves from a vertex to one of its neighbors, each chosen with equal probability. (This process is the random surfer analog for undirected graphs.) Write programs to run experiments that support the development of hypotheses on the number of steps used to visit every vertex in the graph. What is the cover time for a complete graph with V vertices? A ring graph? A 2-ring graph? Can you find a family of graphs where the cover time grows proportionally to V^3 or to 2^V?
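
A minimal sketch of one cover-time trial, assuming a connected graph given as a dict-of-sets adjacency structure; averaging many trials supports (or refutes) a hypothesis about growth with V.

import random

def cover_time(adj, start):
    visited = {start}
    v = start
    steps = 0
    while len(visited) < len(adj):
        v = random.choice(list(adj[v]))   # move to a random neighbor
        visited.add(v)
        steps += 1
    return steps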

4.5.43 Erdős–Rényi random graph model. In the classical random graph model, we build a random graph on V vertices by including each possible edge with probability p, independently of the other edges. Compose a Graph client to verify the following properties (a sketch of a generator for this model follows the list):

Connectivity thresholds: If p < 1/V and V is large, then most of the connected components are small, with the largest being logarithmic in size. If p > 1/V, then there is almost surely a giant component containing almost all vertices. If p < ln V / V, the graph is disconnected with high probability; if p > ln V / V, the graph is connected with high probability.

Distribution of degrees: The distribution of degrees follows a binomial distribution, centered on the average, so most vertices have similar degrees. The probability that a vertex is connected to k other vertices decreases exponentially in k.

No hubs: If the average degree is a constant (that is, if p is proportional to 1/V), then the maximum vertex degree is at most logarithmic in V.

No local clustering: The clustering coefficient is close to 0 if the graph is sparse and connected. Random graphs are not small-world graphs.

Short path lengths: If p > ln V / V, then the diameter of the graph (see EXERCISE 4.5.37) is logarithmic.
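
A minimal sketch of the Erdős–Rényi generator, assuming a dict-of-sets adjacency structure: each of the V*(V−1)/2 possible edges is included independently with probability p.

import random

def random_graph(V, p):
    adj = {v: set() for v in range(V)}
    for v in range(V):
        for w in range(v + 1, V):
            if random.random() < p:     # include edge v-w with probability p
                adj[v].add(w)
                adj[w].add(v)
    return adj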

4.5.44 Power law of web links. The indegrees and outdegrees of pages in the web obey a power law that can be modeled by a preferential attachment process. Suppose that each web page has exactly one outgoing link. Pages are created one at a time, starting with a single page that points to itself. With probability p < 1, a new page links to one of the existing pages, chosen uniformly at random. With probability 1 − p, it links to an existing page with probability proportional to the number of incoming links of that page. This rule reflects the common tendency for new web pages to point to popular pages. Compose a program to simulate this process and plot a histogram of the number of incoming links.

Partial solution: The fraction of pages with indegree k is proportional to k^(−1/(1−p)).
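
A minimal sketch of the simulation (the histogram-plotting step is omitted). Choosing an existing page with probability proportional to its indegree can be done by picking the target of a uniformly random existing link, since page j is the target of exactly indegree(j) of the links made so far.

import random

def simulate(n, p):
    links = [0]                 # links[i] is the page that page i points to; page 0 points to itself
    indegree = [1]
    for i in range(1, n):
        if random.random() < p:
            target = random.randrange(i)            # uniformly random existing page
        else:
            target = links[random.randrange(i)]     # proportional to indegree
        links.append(target)
        indegree.append(0)
        indegree[target] += 1
    return indegree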

4.5.45 Global clustering coefficient. Add a function to smallworld.py that computes the global clustering coefficient of a graph. The global clustering coefficient is the conditional probability that two random vertices that are neighbors of a common vertex are neighbors of each other. Find graphs for which the local and global clustering coefficients are different.
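
A minimal sketch of a global clustering coefficient computation, assuming a dict-of-sets adjacency structure: it is the ratio of closed triples (pairs of a vertex's neighbors that are themselves adjacent) to all such triples.

def global_clustering(adj):
    closed = 0
    total = 0
    for v in adj:
        neighbors = list(adj[v])
        d = len(neighbors)
        total += d * (d - 1) // 2            # triples centered at v
        closed += sum(1 for i in range(d) for j in range(i + 1, d)
                      if neighbors[j] in adj[neighbors[i]])
    return closed / total if total > 0 else 0.0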

4.5.46 Watts–Strogatz graph model. (See EXERCISES 4.5.24 and 4.5.25.) Watts and Strogatz proposed a hybrid model that contains short-range links between vertices that are near each other (people know their geographic neighbors), plus a few random long-range links. Plot the effect of adding random edges to an n-by-n grid graph on the average path length and on the clustering coefficient, for n = 100. Do the same for k-ring graphs on V vertices, for V = 10,000 and various values of k up to 10 log V.

4.5.47 Bollobás–Chung graph model. Bollobás and Chung proposed a hybrid model that combines a 2-ring on V vertices (V even) with a random matching. A matching is a graph in which every vertex has degree 1. To generate a random matching, shuffle the V vertices and add an edge between the vertices in positions 2i and 2i+1 of the shuffled order, for each i from 0 to V/2 − 1. Determine the degree of each vertex for graphs in this model. Using smallworld.py, estimate the average path length and clustering coefficient for random graphs generated according to this model for V = 1,000.
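
A minimal sketch of a Bollobás–Chung generator, assuming a dict-of-sets adjacency structure, even V, and the interpretation that a 2-ring connects each vertex to its two nearest neighbors in each direction:

import random

def bollobas_chung(V):
    adj = {v: set() for v in range(V)}
    for v in range(V):                  # the 2-ring
        for k in (1, 2):
            w = (v + k) % V
            adj[v].add(w)
            adj[w].add(v)
    order = list(range(V))              # the random matching: shuffle and pair off
    random.shuffle(order)
    for i in range(0, V, 2):
        v, w = order[i], order[i + 1]
        adj[v].add(w)
        adj[w].add(v)
    return adj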

4.5.48 Kleinberg graph model. There is no way for participants in the Watts–Strogatz model to find short paths in a decentralized network. But Milgram’s experiment also had a striking algorithmic component: individuals can find short paths! Jon Kleinberg proposed making the distribution of shortcuts obey a power law, with probability inversely proportional to the dth power of the distance (in d dimensions). Each vertex has one long-range neighbor. Compose a program to generate graphs according to this model, with a test client that uses smallworld.py to test whether they exhibit the small-world phenomenon. Plot histograms to show that the long-range links are spread uniformly over all distance scales (the same number of links at distances 1–10 as at distances 10–100 or 100–1,000). Compose a program to compute the average length of the paths obtained by greedy routing (at each step, take the edge that brings the path as close to the target as possible in lattice distance), and test the hypothesis that this average is proportional to (log V)^2.
