1.4 Arrays

In this section, we introduce you to the idea of a data structure and to your first data structure, the array. The primary purpose of an array is to facilitate storing and manipulating large quantities of data. Arrays play an essential role in many data processing tasks. They also correspond to vectors and matrices, which are widely used in science and in scientific programming. We will consider basic properties of array processing in Python, with many examples illustrating why arrays are useful.

A data structure is a way to organize data that we wish to process with a computer program. Data structures play an essential role in computer programming—indeed, CHAPTER 4 of this book is devoted to the study of classic data structures of all sorts.

A one-dimensional array (or array) is a data structure that stores a sequence of (references to) objects. We refer to the objects within an array as its elements. The method that we use to refer to elements in an array is numbering and then indexing them. If we have n elements in the sequence, we think of them as being numbered from 0 to n – 1. Then, we can unambiguously specify one of them by referring to the i th element for any integer i in this range.

A two-dimensional array is an array of (references to) one-dimensional arrays. Whereas the elements of a one-dimensional array are indexed by a single integer, the elements of a two-dimensional array are indexed by a pair of integers: the first specifying a row, and the second specifying a column.

Often, when we have a large amount of data to process, we first put all of the data into one or more arrays. Then we use indexing to refer to individual elements and to process the data. We might have exam scores, stock prices, nucleotides in a DNA strand, or characters in a book. Each of these examples involves a large number of objects that are all of the same type. We consider such applications when we discuss data input in SECTION 1.5 and in the case study that is the subject of SECTION 1.6. In this section, we expose the basic properties of arrays by considering examples where our programs first populate arrays with objects having computed values from experimental studies and then process them.

Arrays in Python

The simplest way to create an array in Python is to place comma-separated literals between matching square brackets. For example, the code

suits = ['Clubs', 'Diamonds', 'Hearts', 'Spades']

creates an array suits[] with four strings, and the code

x = [0.30, 0.60, 0.10]
y = [0.40, 0.10, 0.50]

creates two arrays x[] and y[], each with three floats. Each array is an object that contains data (references to objects) structured for efficient access. While the truth is a bit complicated (and explained in detail in SECTION 4.1), it is useful to think of references to the elements in an array as stored contiguously, one after the other, in your computer’s memory, as shown in the diagram at right for the suits[] array defined above.

Image

After creating an array, you can refer to any individual object anywhere you would use a variable name in a program by specifying the array name followed by an integer index within square brackets. In the preceding examples, suits[1] refers to 'Diamonds', x[0] refers to .30, y[2] refers to .50, and so forth. Note that x is a reference to the whole array, as opposed to x[i], which is a reference to the ith element. In the text, we use the notation x[] to indicate that variable x is an array (but we do not use x[] in Python code).

The obvious advantage of using an array is to avoid explicitly naming each variable individually. Using an array index is virtually the same as appending the index to the array name. For example, if we wanted to process eight floats, we could refer to them each individually with variable names like a0, a1, a2, a3, a4, a5, a6, and a7. Naming dozens of individual variables in this way would be cumbersome, however, and naming millions is untenable.

As an example of code that uses arrays, consider using arrays to represent vectors. We consider vectors in detail in SECTION 3.3; for the moment, think of a vector as a sequence of real numbers. The dot product of two vectors (of the same length) is the sum of the products of their corresponding components. For example, if our two example arrays x[] and y[] represent vectors, their dot product is the expression x[0]*y[0] + x[1]*y[1] + x[2]*y[2]. More generally, if we have two one-dimensional arrays of floats x[] and y[] whose length is given by a variable n, we can use a for loop to computer their dot product:

total = 0.0
for i in range(n)
    total += x[i]*y[i]

Image

The simplicity of coding such computations makes the use of arrays the natural choice for all kinds of applications. Before considering more examples, we describe a number of important characteristics of programming with arrays.

Zero-based indexing

We always refer to the first element of an array a[] as a[0], the second as a[1], and so forth. It might seem more natural to refer to the first element as a[1], the second element as a[2], and so forth, but starting the indexing with 0 has some advantages and has emerged as the convention used in most modern programming languages. Misunderstanding this convention often leads to off-by one errors that are notoriously difficult to avoid and debug, so be careful!

Array length

You can access the length of an array using Python’s built-in len() function: len(a) is the number of elements in a[]. Note that the last element of an array a[] is always a[len(a)-1].

Increasing the length of an array at run time

In Python, we can use the += operator to append elements to an array. For example, if a[] is the array [1, 2, 3], then the statement a += [4] extends it to [1, 2, 3, 4]. More generally, we can make an array of n floats, with each element initialized to 0.0, with the code

a = []
for i in range(n)
    a += [0.0]

The statement a = [] creates an empty array (of length 0, with no elements) and the statement a += [0.0] appends one extra element to the end. We note that, in Python, the time required to create an array in this manner is proportional to its length (for details, see SECTION 4.1).

Memory representation

Arrays are fundamental data structures in that they have a direct correspondence with memory systems on virtually all computers. References to the elements of an array are stored contiguously in memory, so that accessing any array element is easy and efficient. Indeed, we can view memory itself as a giant array. On modern computers, memory is implemented in hardware as a sequence of indexed memory locations, each of which can be quickly accessed with an appropriate index. When referring to computer memory, we normally refer to a location’s index as its address. It is convenient to think of the name of the array—say, x—as storing the memory address of a contiguous block of memory containing the length of the array and references to its elements. For the purposes of illustration, suppose that the computer’s memory is organized as 1,000 values, with addresses from 000 to 999. Now, suppose that an array x[] of three elements is stored in memory locations 523 through 526, with the length stored in 523 and references to the array elements stored in 524 through 526. When we specify x[i], Python generates code that adds the index i to the memory address of the first element of the array. In the example pictured here, the Python code x[2] would generate machine code that finds the reference at memory location 524 + 2 = 526. The same simple method is effective even when the memory and the array (and i) are huge. Accessing a reference to element i of an array is an efficient operation because it simply requires adding two integers and then referencing memory—just two elementary operations.

Image
Bounds checking

As already indicated, you must be careful when programming with arrays. It is your responsibility to use legal indices when accessing an array element. If you have created an array of size n and use an index whose value is greater than n-1, your program will raise an IndexError at run time. (In many programming languages, such buffer overflow conditions are not checked by the system. These kinds of unchecked errors can and do lead to debugging nightmares, but it is also not uncommon for such an error to go unnoticed and remain in a finished program. You might be surprised to know that such a mistake can be exploited by a hacker to take control of a system—even your personal computer—to spread viruses, steal personal information, or wreak other malicious havoc.) The error messages provided by Python may seem annoying to you at first, but they are small price to pay to have a more secure program.

Mutability

An object is mutable if its value can change. Arrays are mutable objects because we can change their elements. For example, if we create an array with the code x = [.30, .60, .10], then the assignment statement x[1] = .99 changes it to the array [.30, .99, .10]. An object-level trace of this operation is shown at right.

Image

Often the point of a piece of code is to rearrange the elements in an array. For example, the following code reverses the order of the elements in an array a[]:

n = len(a)
for i in range(n // 2):
    temp = a[i]
    a[i] = a[n-1-i]
    a[n-1-i] = temp

The three statements in the for loop implement the exchange operation that we studied when first learning about assignment statements in SECTION 1.2. You might wish to check your understanding of arrays by studying the informal trace of this code for a seven-element array [3, 1, 4, 1, 5, 9, 2], shown at the right. This table keeps track of i and all seven array elements at the end of each iteration of the for loop.

It is natural to expect mutability in arrays, but, as you will see, mutability is a key issue in data type design and has a number of interesting implications. We will discuss some of these implications later in this section, but defer most of the discussion to SECTION 3.3.

Image
Iteration

One of the most basic operations on an array is to iterate over all its elements. For example, the following code computes the average of an array of floats:

total = 0.0
for i in range(len(a))
    total += a[i]
average = total / len(a)

Python also supports iterating over the elements in an array in a[] without referring to the indices explicitly. To do so, put the array name after the in keyword in a for statement, as follows:

total = 0.0
for v in a:
    total += v
average = total / len(a)

Python successively assigns each element in the array to the loop-control variable v, so this code is essentially equivalent to the code given earlier. In this book, we iterate over the indices when we need to refer to the array elements by their indices (as in the dot-product and array-reversal examples just considered). We iterate over the elements when only the sequence of elements matters, and not the indices (as in this compute-the-average example).

Built-in functions

Python has several built-in functions that can take arrays as arguments. We have already discussed the len() function. As another example, if the elements of a[] are numeric, then sum(a) computes their sum, so that we can compute their average with float(sum(a)) / len(a) instead of using either of the loops just described. Other useful built-in functions that can take arrays as arguments are min() for computing the minimum and max() for computing the maximum.

Writing an array

You can write an array by passing it as an argument to stdio.write() or stdio.writeln(). The array is written with a leading open square bracket, followed by the array’s objects separated with commas and spaces, followed by a trailing close square bracket, all on a single line. Each object in the array is converted to a string. If this format is not suited to your needs, you can use a for statement to write each array element individually.

Array aliases and copies

Before looking at programs that use arrays, it is worthwhile to examine two fundamental array-processing operations in more detail. The points that we discuss are subtle, but they are more important than they might seem at first glance. If you find yourself becoming a bit overwhelmed by technical details, you might prefer to return to these two pages after studying the applications programs that use arrays later in this section. Rereading this material to cement your understanding of these concepts is certain to pay dividends.

Aliasing

We can use variables that reference arrays in much the same way in our code that we can use variables that reference other types of objects, but it is very important to take the time to understand the effects of such usage. If you reflect on the situation, perhaps the first question that arises is this: What happens when you use an array variable name on the left side of an assignment statement? In other words, if x[] and y[] are arrays, what is the effect of the statement x = y? The answer is simple and consistent with Python usage for other data types: x and y reference the same array. But this result has an effect that is perhaps unexpected, at first, because it is natural to think of x and y as references to two independent arrays, which is not the case. For example, after the assignment statements

x = [.30, .60, .10]
y = x
x[1] = .99

Image

y[1] is also .99, even though the code does not refer directly to y[1]. This situation—whenever two variables refer to the same object—is known as aliasing, and is illustrated in the object-level trace at right. We avoid aliasing arrays (and other mutable objects) because it makes it more difficult to find errors in our programs. Still, as you will see in CHAPTER 2, there are natural situations where aliasing two arrays is helpful, so it is worthwhile for you to be aware of this concept at the outset.

Copying and slicing

The next natural question you might ask is the following: How do we arrange to make a copy y[] of a given array x[]? One answer to this question is to iterate through x[] to build y[], as in the following code:

y = []
for v in x:
    y += [v]

After the copy operation, x and y refer to two different arrays. If we change the object that x[1] references, that change has no effect on y[1]. This situation is illustrated in the object-level trace at right.

Actually, copying an array is such a useful operation that Python provides language support for a more general operation known as slicing, where we can copy any contiguous sequence of elements within an array to another array. The expression a[i:j] evaluates to a new array whose elements are a[i], ..., a[j-1]. Moreover, the default value for i is 0 and the default value for j is len(a), so

y = x[:]

is equivalent to the code given earlier. This compact notation is convenient, but it is important for you to remember that it masks a potentially expensive operation—it takes time proportional to the length of x[].

System support for arrays

Python code for processing arrays can take many forms. We describe each briefly for context, but our emphasis for most of this book is on the few operations that you need to know to write meaningful and effective code. After you have some experience reading and composing code using these basic operations, you will better be able to understand the differences among the various approaches that are available for processing arrays in Python, so we revisit this topic in CHAPTER 4.

Image
Python’s built-in list data type

In its most basic form, an array supports four core operations: creation, indexed access, indexed assignment, and iteration. In this book, we use Python’s built-in list data type for arrays because it supports these basic operations. We consider more elaborate operations supported by Python’s list data type in CHAPTER 4, including array-resizing operations that change the length of the array. Arrays, as defined here, are sufficiently powerful that you will be able to learn a great deal about programming and approach a number of interesting applications in the meantime. Python programmers typically do not make a distinction between arrays and Python lists. We make this distinction here because many other programming languages (such as Java and C) provide built-in support for fixed-length arrays and the four core operations (but not for the more elaborate array-resizing operations).

Python’s numpy module

The essence of programming language design is a tradeoff between simplicity and efficiency. Even though computers might seem to have the capability to perform huge numbers of elementary operations per second, you know that they can seem slow sometimes. In the case of Python, its built-in list data type can have severe performance problems, even for simple programs that use arrays to solve real-world problems. For that reason, scientists and engineers often use a Python extension module called numpy for processing huge arrays of numbers, because that module uses a lower-level representation that avoids many of the inefficiencies in the standard Python representation. Again we will delve deeply into understanding performance in CHAPTER 4, so that you may have a better understanding of how to approach such situations. In that context, we describe the use of numpy on the booksite.

Our stdarray module

Earlier in this chapter, we introduced our stdio module, which defines functions related to writing and reading integers, floats, strings, and booleans, and you have used the stdio.write() and stdio.writeln() functions extensively. The stdio module is a booksite module; that is, it is a nonstandard module that we composed specifically to support this book and its booksite. Now, we introduce another booksite module: the stdarray module. Its primary purpose is to define functions for processing arrays that we use throughout the book.

A fundamental operation that is found in nearly every array-processing program is to create an array of n elements, each initialized to a given value. As we have seen, you can do this in Python with code like the following:

a = []
for i in range(n):
    a += [0.0]

Creating an array of a given length and initializing all of its elements to a given value is so common that Python even has a special shorthand notation for it: the code a = [0.0]*n is equivalent to the code just given. Rather than repeat such code throughout the book, we will use code like this:

a = stdarray.create1D(n, 0.0)

For consistency, stdarray also includes a create2D() function, which we will examine later in this section. We use these functions because they make our code “self-documenting” in the sense that the code itself precisely indicates our intent and does not depend on idioms specific to Python. We will delve deeply into issues surrounding library design in SECTION 2.2 (and show you how to create modules like stdarray yourself), again so that you may have a better understanding of how to approach such situations.

Image

You now know how to create arrays in Python and access individual elements, and you have a basic understanding of Python’s built-in list data type. The table at the top of the next page describes the array operations that we have discussed. With this conceptual foundation complete, we can now move to applications. As you will see, this foundation leads to applications code that is not only easy to compose and understand, but also makes efficient use of computational resources.


Abbreviation alert

Throughout this book, we refer to Python lists as arrays because Python lists support the fundamental operations that characterize arrays—creation, indexed access, indexed assignment, and iteration.


Image

Sample applications of arrays

Next, we consider a number of applications that illustrate the utility of arrays and also are interesting in their own right.

Representing playing cards

Suppose that we want to compose programs that process playing cards. We might start with the following code:

SUITS = ['Clubs', 'Diamonds', 'Hearts', 'Spades']
RANKS = ['2', '3', '4', '5', '6', '7', '8', '9', '10',
         'Jack', 'Queen', 'King', 'Ace']

For example, we might use these two arrays to write a random card name, such as Queen of Clubs, as follows:

rank = random.randrange(0, len(RANKS))
suit = random.randrange(0, len(SUITS))
stdio.writeln(RANKS[rank] + ' of ' + SUITS[suit])

A more typical situation is when we compute the values to be stored in an array. For example, we might use the following code to initialize an array of length 52 that represents a deck of playing cards, using the two arrays just defined:

deck = []
for rank in RANKS:
    for suit in SUITS:
        card = rank + ' of ' + suit
        deck += [card]

After this code has been executed, writing the elements of deck[] in order from deck[0] through deck[51] with one element per line gives the sequence

2 of Clubs
2 of Diamonds
2 of Hearts
2 of Spades
3 of Clubs
3 of Diamonds
...
Ace of Hearts
Ace of Spades

Exchange

Frequently, we wish to exchange two elements in an array. Continuing our example with playing cards, the following code exchanges the cards at indices i and j using the same idiom that we used previously in this section to reverse an array:

temp = deck[i]
deck[i] = deck[j]
deck[j] = temp

When we use this code, we are assured that we are perhaps changing the order of the elements in the array but not the set of elements in the array. When i and j are equal, the array is unchanged. When i and j are not equal, the values a[i] and a[j] are found in different places in the array. For example, if we were to use this code with i equal to 1 and j equal to 4 in the deck[] of the previous example, it would leave the string '3 of Clubs' in deck[1] and the string '2 of Diamonds' in deck[4].

Shuffle

The following code shuffles our deck of cards:

n = len(deck)
for i in range(n):
    r = random.randrange(i, n)
    temp = deck[r]
    deck[r] = deck[i]
    deck[i] = temp

Proceeding from left to right, we pick a random card from deck[i] through deck[n-1] (each card equally likely) and exchange it with deck[i]. This code is more sophisticated than it might seem. First, we ensure that the cards in the deck after the shuffle are the same as the cards in the deck before the shuffle by using the exchange idiom. Second, we ensure that the shuffle is random by choosing uniformly from the cards not yet chosen. Incidentally, Python includes a standard function named shuffle() in the random module that uniformly shuffles a given array, so the function call random.shuffle(deck) does the same job as this code.

Sampling without replacement

In many situations, we want to draw a random sample from a set such that each element in set appears at most once in the sample. Drawing numbered ping-pong balls from a basket for a lottery is an example of this kind of sample, as is dealing a hand from a deck of cards. PROGRAM 1.4.1 (sample.py) illustrates how to sample, using the basic operation underlying shuffling. It takes command-line arguments m and n and creates a permutation of size n (a rearrangement of the integers from 0 to n-1) whose first m elements constitute a random sample.

The accompanying trace of the contents of the perm[] array at the end of each iteration of the main loop (for a run where the values of m and n are 6 and 16, respectively) illustrates the process. If the values of r are chosen such that each value in the given range is equally likely, then perm[0] through perm[m-1] are a random sample at the end of the process (even though some elements might move multiple times) because each element in the sample is chosen by taking each item not yet sampled, with each choice having equal probability of being selected.

One important reason to explicitly compute the permutation is that we can use it to write a random sample of any array by using the elements of the permutation as indices into the array. Doing so is often an attractive alternative to actually rearranging the array, because it may need to be in order for some other reason (for instance, a company might wish to draw a random sample from a list of customers that is kept in alphabetical order). To see how this trick works, suppose that we wish to draw a random poker hand from our deck[] array, constructed as just described. We use the code in sample.py with m = 5 and n = 52 and replace perm[i] with deck[perm[i]] in the stdio.write() statement (and change it to writeln()), resulting in output such as the following:

3 of Clubs
Jack of Hearts
6 of Spades
Ace of Clubs
10 of Diamonds


Program 1.4.1 Sampling without replacement (sample.py)

import random
import sys
import stdarray
import stdio

m = int(sys.argv[1])   # Choose this many elements
n = int(sys.argv[2])   # from 0, 1, ..., n-1.

# Initialize array perm = [0, 1, ..., n-1].
perm = stdarray.create1D(n, 0)
for i in range(n):
    perm[i] = i

# Create a random sample of size m in perm[0..m).
for i in range(m):
    r = random.randrange(i, n)

    # Exchange perm[i] and perm[r].
    temp    = perm[r]
    perm[r] = perm[i]
    perm[i] = temp

# Write the results.
for i in range(m):
    stdio.write(str(perm[i]) + ' ')
stdio.writeln()


  m    | sample size
  n    | range
perm[] | permutation of 0 to n-1

This program accepts integers m and n as command-line arguments, and writes a random sample of m integers in the range 0 to n-1 (no duplicates). This process is useful not just in state and local lotteries, but in scientific applications of all sorts. If the first argument is less than or equal to the second, the result is a random permutation of the integers from 0 to n-1. If the first argument is greater than the second, the program raises a ValueError at run time.


% python sample.py 6 16
9 5 13 1 11 8

% python sample.py 10 1000
656 488 298 534 811 97 813 156 424 109

% python sample.py 20 20
6 12 9 8 13 19 0 2 4 5 18 1 14 16 17 3 7 11 10 15


Sampling like this is widely used as the basis for statistical studies in polling, scientific research, and many other applications, whenever we want to draw conclusions about a large population by analyzing a small random sample. Python includes a standard function in the random module for sampling: given an array a[] and an integer k, the function call random.sample(a, k) returns a new array containing a sample of size k, chosen uniformly at random among the elements in a[].

Image
Precomputed values

Another application of arrays is to save values that you have computed for later use. As an example, suppose that you are composing a program that performs calculations using small values of the harmonic numbers (see PROGRAM 1.3.5). An efficient approach is to save the values in an array, as follows:

harmonic = stdarray.create1D(n+1, 0.0)
for i in range(1, n+1):
    harmonic[i] = harmonic[i-1] + 1.0/i

Note that we waste one slot in the array (element 0) to make harmonic[1] correspond to the first harmonic number 1.0 and harmonic[i] correspond to the ith harmonic number. Precomputing values in this way is an example of a spacetime tradeoff: by investing in space (to save the values), we save time (since we do not need to recompute them). This method is not effective if we need values for huge n, but it is very effective if we need values for small n many different times.

Simplifying repetitive code

As an example of another simple application of arrays, consider the following code fragment, which writes the name of a month given its number (1 for January, 2 for February, and so forth):

 if   m ==  1: stdio.writeln('Jan')
 elif m ==  2: stdio.writeln('Feb')
 elif m ==  3: stdio.writeln('Mar')
 elif m ==  4: stdio.writeln('Apr')
...
 elif m == 11: stdio.writeln('Nov')
 elif m == 12: stdio.writeln('Dec')

A more compact alternative is to use an array of strings holding the month names:

MONTHS = ['', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
...
stdio.writeln(MONTHS[m])

This technique would be especially useful if you needed to access the name of a month by its number in several different places in your program. Note that, again, we intentionally waste one slot in the array (element 0) to make MONTHS[1] correspond to January, as required.

With these basic definitions and examples out of the way, we can now consider two applications that both address interesting classical problems and illustrate the fundamental importance of arrays in efficient computation. In both cases, the idea of using data to efficiently index into an array plays a central role and enables a computation that would not otherwise be feasible.

Coupon collector

Suppose that you have a deck of cards and you pick cards at random (with replacement) one by one. How many cards do you need to turn up before you have seen one of each suit? How many cards do you need to turn up before seeing one of each rank? These are examples of the famous coupon collector problem. In general, suppose that a trading card company issues trading cards with n different possible cards: how many do you have to collect before you have all n possibilities, assuming that each possibility is equally likely for each card that you collect?

Image

PROGRAM 1.4.2 (couponcollector.py) is an example program that simulates this process and illustrates the utility of arrays. It takes the integer n from the command line and generates a sequence of random integer values between 0 and n-1 using the function call random.randrange(0, n) (see PROGRAM 1.3.8). Each value represents a card; for each card, we want to know if we have seen that value before. To maintain that knowledge, we use an array isCollected[], which uses the card value as an index: isCollected[i] is True if we have seen a card with value i and False if we have not. When we get a new card that is represented by the integer value, we check whether we have seen its value before simply by accessing isCollected[value]. The computation consists of keeping count of the number of distinct values seen and the number of cards generated and writing the latter when the former reaches n.

As usual, the best way to understand a program is to consider a trace of the values of its variables for a typical run. It is easy to add code to couponcollector.py that produces a trace that gives the values of the variables at the end of the while loop. In the table at right, we use F for False and T for True to make the trace easier to follow. Tracing programs that use large arrays can be a challenge: when you have an array of length n in your program, it represents n variables, so you have to list them all. Tracing programs that use random.randrange() also can be a challenge because you get a different trace every time you run the program. Accordingly, we check relationships among variables carefully. Here, note that collectedCount always is equal to the number of True values in isCollected[].

Image

Program 1.4.2 Coupon collector simulation (couponcollector.py)

import random
import sys
import stdarray
import stdio

n = int(sys.argv[1])

count = 0
collectedCount = 0
isCollected = stdarray.create1D(n, False)

while collectedCount < n:
    # Generate another coupon.
    value = random.randrange(0, n)
    count += 1
    if not isCollected[value]:
        collectedCount += 1
        isCollected[value] = True

stdio.writeln(count)


      n        | # of coupon values (0 to n-1)
    count      | # of coupons collected
collectedCount | # of distinct coupons collected
isCollected[i] | has coupon i been collected?
     value     | value of current coupon

This program accepts an integer n as a command-line argument, and writes the number of coupons collected before obtaining one of each of n types. Thus it simulates a coupon collector.


% python couponcollector.py 1000
6583

% python couponcollector.py 1000
6477

% python couponcollector.py 1000000
12782673


Coupon collecting is no trivial problem. For example, it is very often the case that scientists want to know whether a sequence that arises in nature has the same characteristics as a random sequence. If so, that fact might be of interest; if not, further investigation may be warranted to look for patterns that might be of importance. For example, such tests are used by scientists to decide which parts of a genome are worth studying. One effective test of whether a sequence is truly random is the coupon collector test: compare the number of elements that need to be examined before all values are found against the corresponding number for a uniformly random sequence.

Without arrays, we could not contemplate simulating the coupon collector process for huge n; with arrays, it is easy to do so. We will see many examples of such processes throughout the book.

Sieve of Eratosthenes

Prime numbers play an important role in mathematics and computation, including cryptography. A prime number is an integer greater than 1 whose only positive divisors are 1 and itself. The prime counting function π(n) is the number of primes less than or equal to n. For example, π(25) = 9 because the first nine primes are 2, 3, 5, 7, 11, 13, 17, 19, and 23. This function plays a central role in number theory.

We might use a program like factors.py (PROGRAM 1.3.9) to count primes. Specifically, we could modify the code in factors.py to set a boolean variable to True if a given number is prime and False otherwise (instead of writing out factors), then enclose that code in a loop that increments a counter for each prime number. This approach is effective for small n, but becomes too slow as n grows.

PROGRAM 1.4.3 (primesieve.py) is an alternative that computes π(n) using a technique known as the sieve of Eratosthenes. The program uses an array isPrime[] of n booleans to record which of the integers less than or equal to n are prime. The goal is to set isPrime[i] to True if i is prime, and to False otherwise. The sieve works as follows: Initially, the program assigns True to all elements of the array, indicating that no factors of any integer have yet been found. Then, it repeats the following steps as long as i < n:

• Find the next smallest i for which no factors have been found.

• Leave isPrime[i] as True since i has no smaller factors.

• Assign False to all isPrime[] elements whose indices are multiples of i.


Program 1.4.3 Sieve of Eratosthenes (primesieve.py)

import sys
import stdarray
import stdio

n = int(sys.argv[1])

isPrime = stdarray.create1D(n+1, True)

for i in range(2, n):
    if (isPrime[i]):
        # Mark multiples of i as nonprime.
        for j in range(2, n//i + 1):
            isPrime[i*j] = False
# Count the primes.
count = 0
for i in range(2, n+1):
    if (isPrime[i]):
        count += 1
stdio.writeln(count)


    n      | argument
isPrime[i] | is i prime?
   count   | prime counter


This program takes a command-line argument n and computes the number of primes less than or equal to n. To do so, it computes an array of booleans with isPrime[i] set to True if i is prime, and to False otherwise.


% python primesieve.py 25
9

% python primesieve.py 100
25

% python primesieve.py 10000
1229

% python primesieve.py 1000000
78498

% python primesieve.py 100000000
5761455


Image

When the nested for loop ends, we have set the isPrime[] elements for all nonprimes to be False and have left the isPrime[] elements for all primes as True. Note that we might stop when i*i >= n, just as we did for factors.py, but the savings in this case would be marginal at best, since the inner for loop does not iterate at all for large i. With one more pass through the array, we can count the number of primes less than or equal to n.

As usual, it is easy to add code to write a trace. For programs such as primesieve.py, you have to be a bit careful—it contains a nested for-if-for, so you have to pay attention to the indentation to put the tracing code in the correct place.

With primesieve.py, we can compute π(n) for large n, limited primarily by the maximum array size allowed by Python. This is another example of a space–time tradeoff. Programs like primesieve.py play an important role in helping mathematicians to develop the theory of numbers, which has many important applications.

Two-dimensional arrays

In many applications, a convenient way to store information is to use a table of numbers organized in a rectangular table and refer to rows and columns in the table. For example, a teacher might need to maintain a table with a row corresponding to each student and a column corresponding to each assignment, a scientist might need to maintain a table of experimental data with rows corresponding to experiments and columns corresponding to various outcomes, or a programmer might want to prepare an image for display by setting a table of pixels to various grayscale values or colors.

Image

The mathematical abstraction corresponding to such tables is a matrix; the corresponding data structure is a two-dimensional array. You are likely to have already encountered many applications of matrices and two-dimensional arrays, and you will certainly encounter many others in science, in engineering, and in commercial applications, as we will demonstrate with examples throughout this book. As with vectors and one-dimensional arrays, many of the most important applications involve processing large amounts of data, and we defer considering those applications until we consider input and output, in SECTION 1.5.

Extending the one-dimensional array data structure that we have discussed to two-dimensional arrays is straightforward, once you realize that since an array can contain any type of data, its elements can also be arrays! That is, a two-dimensional array is implemented as an array of one-dimensional arrays, as detailed next.

Initialization

The simplest way to create a two-dimensional array in Python is to place comma-separated one-dimensional arrays between matching square brackets. For example, this matrix of integers having two rows and three columns

18 19 20
21 22 23

could be represented in Python using this array of arrays:

a = [[18, 19, 20], [21, 22, 23]]

We call such an array a 2-by-3 array. By convention, the first dimension is the number of rows and the second dimension is the number of columns. Python represents a 2-by-3 array as an array that contains two objects, each of which is an array that contains three objects.

Image

More generally, Python represents an m-by-n array as an array that contains m objects, each of which is an array that contains n objects. For example, this Python code creates an m-by-n array a[][] of floats, with all elements initialized to 0.0:

a = []
for i in range(m):
    row = [0.0] * n
    a += [row]

As for one-dimensional arrays, we use the self-descriptive alternative

stdarray.create2D(m, n, 0.0)

from our booksite module stdarray throughout this book.

Indexing

When a[][] is a two-dimensional array, the syntax a[i] denotes a reference to its ith row. For example, if a[][] is the array [[18, 19, 20], [21 22, 23]], then a[1] is the array [21, 22, 23]. It is more common to refer to a particular element in a two-dimensional array. The syntax a[i][j] refers to the object at row i and column j of the two-dimensional array a[][]. In our example, a[1][0] is 21. To access each of the elements in a two-dimensional array, we use two nested for loops. For example, this code writes each object of the m-by-n array a[][], one row per line.

for i in range(m):
    for j in range(n):
        stdio.write(a[i][j])
        stdio.write(' ')
    stdio.writeln()

This code achieves the same effect without using indices:

for row in a:
    for v in row:
        stdio.write(v)
        stdio.write(' ')
    stdio.writeln()

Spreadsheets

One familiar use of arrays is a spreadsheet for maintaining a table of numbers. For example, a teacher with m students and n test grades for each student might maintain an (m +1)-by-(n +1) array, reserving the last column for each student’s average grade and the last row for the average test grades. Even though we typically do such computations within specialized applications, it is worthwhile to study the underlying code as an introduction to array processing. To compute the average grade for each student (average values for each row), sum the elements for each row and divide by n. The row-by-row order in which this code processes the matrix elements is known as row-major order. Similarly, to compute the average test grade (average values for each column), sum the elements for each column and divide by n. The column-by-column order in which this code processes the matrix elements is known as column-major order. These operations are all illustrated in the figure at the top of the this page. To allow for half points, our teacher records all grades as floats.

Image
Matrix operations

Typical applications in science and engineering involve representing matrices as two-dimensional arrays and then implementing various mathematical operations with matrix operands. Again, even though such processing is often done within specialized applications and libraries, it is worthwhile for you to understand the underlying computation.

For example, we can add two n-by-n matrices a[][] and b[][] as follows:

c = stdarray.create2D(n, n, 0.0)
for i in range(n):
    for j in range(n):
        c[i][j] = a[i][j] + b[i][j]

Similarly, we can multiply two matrices. You may have learned matrix multiplication, but if you do not recall or are not familiar with it, the following Python code for square matrices is essentially the same as the mathematical definition. Each element c[i][j] in the product of a[][] and b[][] is computed by taking the dot product of row i of a[][] with column j of b[][].

Image

c = stdarray.create2D(n, n, 0.0)
for i in range(n):
    for j in range(n):
        # Compute the dot product of row i and column j
        for k in range(n):
            c[i][j] += a[i][k] * b[k][j]

The definition extends to matrices that may not be square (see EXERCISE 1.4.17).

Special cases of matrix multiplication

Two special cases of matrix multiplication are important. These special cases occur when one of the dimensions of one of the matrices is 1, so we can view it as a vector. In matrixvector multiplication, we multiply an m-by-n matrix by a column vector (an n-by-1 matrix) to get an m-by-1 column vector result (each element in the result is the dot product of the corresponding row in the matrix with the operand vector). In vectormatrix multiplication, we multiply a row vector (a 1-by-m matrix) by an m-by-n matrix to get a 1-by-n row vector result (each element in the result is the dot product of the operand vector with the corresponding column in the matrix).

Image

These operations provide a succinct way to express numerous matrix calculations. For example, the row-average computation for such a spreadsheet with m rows and n columns is equivalent to a matrix–vector multiplication where the row vector has n elements all equal to 1.0 / n. Similarly, the column-average computation in such a spreadsheet is equivalent to a vector-matrix multiplication where the column vector has m elements all equal to 1.0 / m. We return to vector–matrix multiplication in the context of an important application at the end of this chapter.

Ragged arrays

There is actually no requirement that all rows in a two-dimensional array have the same length. An array with rows of nonuniform length is known as a ragged array (see EXERCISE 1.4.32 for an example application). The possibility of ragged arrays creates the need for taking more care in crafting array-processing code. For example, this code writes the contents of a ragged array:

for i in range(len(a)):
    for j in range(len(a[i])):
        stdio.write(a[i][j])
        stdio.write(' ')
    stdio.writeln()

This code tests your understanding of Python arrays, so you should take the time to study it. In this book, we normally use square or rectangular arrays, whose dimensions are given by the variable m and n. Code that uses len(a[i]) in this way is a clear signal to you that an array is ragged.

Note that the equivalent code that does not use indices works equally well with both rectangular and ragged arrays:

for row in a:
    for v in row:
        stdio.write(v)
        stdio.write(' ')
    stdio.writeln()

Multidimensional arrays

The same notation extends to allow us to compose code using arrays that have any number of dimensions. Using arrays of arrays of arrays..., we can create three-dimensional arrays, four-dimensional arrays, and so forth, and then refer to an individual element with code like a[i][j][k].

Two-dimensional arrays provide a natural representation for matrices, which are omnipresent in science, mathematics, and engineering. They also provide a natural way to organize large amounts of data—a key factor in spreadsheets and many other computing applications. Through Cartesian coordinates, two- and three-dimensional arrays provide the basis for a models of the physical world. With all of these natural applications, arrays will provide fertile ground for interesting examples throughout this book as you learn to program.

Example: self-avoiding random walks

Suppose that you leave your dog in the middle of a large city whose streets form a familiar grid pattern. We assume that there are n north–south streets and n east–west streets, all regularly spaced and fully intersecting in a pattern known as a lattice. Trying to escape the city, the dog makes a random choice of which way to go at each intersection, but knows by scent to avoid visiting any place previously visited. But it is possible for the dog to get stuck in a dead end where there is no choice but to revisit some intersection. What is the chance that this will happen? This amusing problem is a simple example of a famous model known as the self-avoiding random walk, which has important scientific applications in the study of polymers and in statistical mechanics, among many others. For example, you can see that this process models a chain of material growing a bit at a time, until no growth is possible. To better understand such processes, scientists seek to understand the properties of self-avoiding walks.

The dog’s escape probability is certainly dependent on the size of the city. In a tiny 5-by-5 city, it is easy to convince yourself that the dog is certain to escape. But what are the chances of escape when the city is large? We are also interested in other parameters. For example, how long is the dog’s path, on average? How often does the dog come within one block of a previous position other than the one just left, on average? How often does the dog come within one block of escaping? These sorts of properties are important in the various applications just mentioned.

PROGRAM 1.4.4 (selfavoid.py) is a simulation of this situation that uses a two-dimensional boolean array, where each element represents an intersection. True indicates that the dog has visited the intersection; False indicates that the dog has not visited the intersection. The path starts in the center and takes random steps to places not yet visited until the dog gets stuck or escapes at a boundary. For simplicity, the code is written so that if a random choice is made to go to a spot that has already been visited, it takes no action, trusting that some subsequent random choice will find a new place (which is assured because the code explicitly tests for a dead end and exits the loop in that case).

Image

Note that the code exhibits an important programming technique in which we code the loop-continuation test in the while statement as a guard against an illegal statement in the body of the loop. In this case, the loop-continuation test serves as a guard against an out-of-bounds array access within the loop. This corresponds to checking whether the dog has escaped. Within the loop, a successful dead-end test results in a break out of the loop.

As you can see from the sample runs, the unfortunate truth is that your dog is nearly certain to get trapped in a dead end in a large city. If you are interested in learning more about self-avoiding walks, you can find several suggestions in the exercises. For example, the dog is almost certain to escape in the three-dimensional version of the problem. While this is an intuitive result that is confirmed by our simulations, the development of a mathematical model that explains the behavior of self-avoiding walks is a famous open problem: despite extensive research, no one knows a succinct mathematical expression for the escape probability, the average length of the path, or any other important parameter.


Program 1.4.4 Self-avoiding random walks (selfavoid.py)

import random
import sys
import stdarray
import stdio

n      = int(sys.argv[1])
trials = int(sys.argv[2])
deadEnds = 0
for t in range(trials):
    a = stdarray.create2D(n, n, False)
    x = n // 2
    y = n // 2
    while (x > 0) and (x < n-1) and (y > 0) and (y < n-1):
        # Check for dead end and make a random move.
        a[x][y] = True
        if a[x-1][y] and a[x+1][y] and a[x][y-1] and a[x][y+1]:
            deadEnds += 1
            break
        r = random.randrange(1, 5)
        if   (r == 1) and (not a[x+1][y]): x += 1
        elif (r == 2) and (not a[x-1][y]): x -= 1
        elif (r == 3) and (not a[x][y+1]): y += 1
        elif (r == 4) and (not a[x][y-1]): y -= 1

stdio.writeln(str(100*deadEnds//trials) + '% dead ends')


   n     | lattice size
 trials  | # of trials
deadEnds | # of trials with a dead end
 a[][]   | intersections
  x, y   | current position
   r     | random number in [1, 5)

This program accepts integers n and trials as command-line arguments. It performs trials experiments, each a random self-avoiding walk in an n-by-n lattice, and then writes the percentage of dead ends encountered. For each walk, it creates an array of booleans, starts the walk in the center, and continues until reaching either a dead end or a boundary.


% python selfavoid.py 5 100
0% dead ends
% python selfavoid.py 20 100
35% dead ends
% python selfavoid.py 40 100
80% dead ends
% python selfavoid.py 80 100
98% dead ends


% python selfavoid.py 5 1000
0% dead ends
% python selfavoid.py 20 1000
32% dead ends
% python selfavoid.py 40 1000
76% dead ends
% python selfavoid.py 80 1000
98% dead ends


Image

Summary

Arrays are the fourth basic construct (the first three are assignments, conditionals, and loops) found in almost every programming language. As you have seen with the sample programs presented in this section, you can compose programs that can solve all sorts of problems using just these constructs. In the next section, we consider mechanisms for communicating with our programs, which will complete our coverage of elementary programming mechanisms.

Arrays are prominent in many of the programs that we consider, and the basic operations that we have discussed here will serve you well in addressing many programming tasks. When you are not using arrays explicitly (and you are sure to be doing so frequently), you will be using them implicitly, because all computers have a memory that is conceptually equivalent to an array.

The fundamental ingredient that arrays add to our programs is a potentially huge increase in the size of a program’s state. The state of a program can be defined as the information you need to know to understand what a program is doing. In a program without arrays, if you know the values of the variables and which statement is the next to be executed, you can normally determine what the program will do next. When we trace a program, we are essentially tracking its state. When a program uses arrays, however, there can be too huge a number of values (each of which might be changed in each statement) for us to effectively track them all. This difference makes composing programs with arrays more of a challenge than composing programs without them.

Arrays directly represent vectors and matrices, so they are of direct use in computations associated with many basic problems in science and engineering. Arrays also provide a succinct notation for manipulating a potentially huge amount of data in a uniform way, so they play a critical role in any application that involves processing large amounts of data, as you will see throughout this book.

More importantly, arrays represent the tip of an iceberg. They are your first example of a data structure, where we organize data to enable us to conveniently and efficiently process it. We will consider more examples of data structures in CHAPTER 4, including resizing arrays, linked lists, and binary search trees, and hash tables. Arrays are also your first example of a collection, which groups many elements into a single unit. We will consider many more example of collections in CHAPTER 4, including Python lists, tuples, dictionaries, and sets, along with two classic examples, stacks and queues.

Q&A

Q. Why do Python array indices start at 0 instead of 1?

A. That convention originated with machine-language programming, where the address of an array element would be computed by adding the index to the address of the beginning of an array. Starting indices at 1 would entail either a waste of space at the beginning of the array or string, or a waste of time to subtract the 1.

Q. What happens if I use a negative integer to index an array?

A. The answer may surprise you. Given an array a[], you can use the index -i as shorthand for len(a) - i. For example, you can refer to the last element in the array with a[-1] or a[len(a) - 1] and the first element with a[-len(a)] or a[0]. Python raises an IndexError at run time if you use an index outside of the range -len(a) through len(a) - 1.

Q. Why does the slice a[i:j] include a[i] but exclude a[j]?

A. The notation is consistent with ranges defined with range(), which includes the left endpoint but excludes the right endpoint. It leads to some appealing properties: j - i is the length of the subarray (assuming no truncation); a[0:len(a)] is the entire array; a[i:i] is the empty array; and a[i:j] + a[j:k] is the subarray a[i:k].

Q. What happens when I compare two arrays a[] and b[] with (a == b)?

A. It depends. For arrays (or multidimensional arrays) of numbers, it works as you might expect: the arrays are equal if each has the same length and the corresponding elements are equal.

Q. What happens when a random walk does not avoid itself?

A. This case is well understood. It is a two-dimensional version of the gambler’s ruin problem, as described in SECTION 1.3.

Q. Which pitfalls should I watch out for when using arrays?

A. Remember that creating an array takes time proportional to the length of the array. You need to be particularly careful about creating arrays within loops.

Exercises

1.4.1 Compose a program that creates a one-dimensional array a containing exactly 1,000 integers and then attempts to access a[1000]. What happens when you run the program?

1.4.2 Given two vectors of length n that are represented with one-dimensional arrays, compose a code fragment that computes the Euclidean distance between them (the square root of the sums of the squares of the differences between corresponding elements).

1.4.3 Compose a code fragment that reverses the order of the elements in a one-dimensional array of floats. Do not create another array to hold the result. Hint: Use the code in the text for exchanging two elements.

1.4.4 What is wrong with the following code fragment?

a = []
for i in range(10):
    a[i] = i * i

Solution: Initially a is the empty array. Subsequently no elements are appended to the array. Thus a[0], a[1], and so forth do not exist. The attempts to use them in an assignment statement will raise an IndexError at run time.

1.4.5 Compose a code fragment that writes the contents of a two-dimensional array of booleans, using * to represent True and a space to represent False. Include row and column numbers.

1.4.6 What does the following code fragment write?

a = stdarray.create1D(10, 0)
for i in range(10):
    a[i] = 9 - i
for i in range(10):
    a[i] = a[a[i]]
for v in a:
    stdio.writeln(v)

1.4.7 What is a[] after executing the following code fragment?

n = 10
a = [0, 1]
for i in range(2, n):
    a += [a[i-1] + a[i-2]]

1.4.8 Compose a program deal.py that takes a command-line argument n and writes n poker hands (five cards each) from a shuffled deck, separated by blank lines.

1.4.9 Compose code fragments to create a two-dimensional array b[][] that is a copy of an existing two-dimensional array a[][], under each of the following assumptions:

a. a is square

b. a is rectangular

c. a may be ragged

Your solution to b should work for a, and your solution to c should work for both b and a.

1.4.10 Compose a code fragment to write the transposition (rows and columns changed) of a two-dimensional array. For the example spreadsheet array in the text, your code would write the following:

99 98 92 94 99 90 76 92 97 89
85 57 77 32 34 46 59 66 71 29
98 78 76 11 22 54 88 89 24 38

1.4.11 Compose a code fragment to transpose a square two-dimensional array in place without creating a second array.

1.4.12 Compose a code fragment to create a two-dimensional array b[][] that is the transpose of an existing m-by-n array a[][].

1.4.13 Compose a program that computes the product of two square matrices of booleans, using the or operator instead of + and the and operator instead of *.

1.4.14 Compose a program that accepts an integer n from the command line and creates an n-by-n boolean array a such that a[i][j] is True if i and j are relatively prime (have no common factors), and False otherwise. Then write the array (see EXERCISE 1.4.5) using * to represent True and a space to represent False. Include row and column numbers. Hint: Use sieving.

1.4.15 Modify the spreadsheet code fragment in the text to compute a weighted average of the rows, where the weights of each test score are in a one-dimensional array weights[]. For example, to assign the last of the three tests in our example to be twice the weight of the others, you would use

weights = [.25, .25, .50]

Note that the weights should sum to 1.0.

1.4.16 Compose a code fragment to multiply two rectangular matrices that are not necessarily square. Note: For the dot product to be well defined, the number of columns in the first matrix must be equal to the number of rows in the second matrix. Write an error message if the dimensions do not satisfy this condition.

1.4.17 Modify selfavoid.py (PROGRAM 1.4.4) to calculate and write the average length of the paths as well as the dead-end probability. Keep separate the average lengths of escape paths and dead-end paths.

1.4.18 Modify selfavoid.py to calculate and write the average area of the smallest axis-oriented rectangle that encloses the path. Keep separate statistics for escape paths and dead-end paths.

1.4.19 Compose a code fragment that creates a three-dimensional n-by-n-by-n array of booleans, with each element initialized to False.

Creative Exercises

1.4.20 Dice simulation. The following code computes the exact probability distribution for the sum of two dice:

probabilities = stdarray.create1D(13, 0.0)
for i in range(1, 7):
    for j in range(1, 7):
        probabilities[i+j] += 1.0
for k in range(2, 13):
    probabilities[k] /= 36.0

After this code completes, probabilities[k] is the probability that the dice sum to k. Run experiments to validate this calculation simulating n dice throws, keeping track of the frequencies of occurrence of each value when you compute the sum of two random integers between 1 and 6. How large does n have to be before your empirical results match the exact results to three decimal places?

1.4.21 Longest plateau. Given an array of integers, compose a program that finds the length and location of the longest contiguous sequence of equal values where the values of the elements just before and just after this sequence are smaller.

1.4.22 Empirical shuffle check. Run computational experiments to check that our shuffling code works as advertised. Compose a program shuffletest.py that takes command-line arguments m and n, does n shuffles of an array of size m that is initialized with a[i] = i before each shuffle, and writes an m-by-m table such that row i gives the number of times i wound up in position j for all j. All elements in the array should be close to n/m.

1.4.23 Bad shuffling. Suppose that you choose a random integer between 0 and n-1 in our shuffling code instead of one between i and n-1. Show that the resulting order is not equally likely to be one of the n! possibilities. Run the test of the previous exercise for this version.

1.4.24 Music shuffling. You set your music player to shuffle mode. It plays each of the m songs before repeating any. Compose a program to estimate the likelihood that you will not hear any sequential pair of songs (that is, song 3 does not follow song 2, song 10 does not follow song 9, and so on).

1.4.25 Minima in permutations. Compose a program that takes an integer n from the command line, generates a random permutation, writes the permutation, and writes the number of left-to-right minima in the permutation (the number of times an element is the smallest seen so far). Then compose a program that takes integers m and n from the command line, generates m random permutations of size n, and writes the average number of left-to-right minima in the permutations generated. Extra credit: Formulate a hypothesis about the number of left-to-right minima in a permutation of size n, as a function of n.

1.4.26 Inverse permutation. Compose a program that reads in a permutation of the integers 0 to n-1 from n command-line arguments and writes its inverse. (If the permutation is an array a[], its inverse is the array b[] such that a[b[i]] = b[a[i]] = i.) Be sure to check that the input is a valid permutation.

1.4.27 Hadamard matrix. The n-by-n Hadamard matrix Hn is a boolean matrix with the remarkable property that any two rows differ in exactly n/2 elements. (This property makes it useful for designing error-correcting codes.) H1 is a 1-by-1 matrix with the single element True, and for n > 1, H2n is obtained by aligning four copies of Hn in a large square, and then inverting all of the elements in the lower right n-by-n copy, as shown in the following examples (with T representing True and F representing False, as usual).

Image

Compose a program that takes one command-line argument n and writes Hn. Assume that n is a power of 2.

1.4.28 Rumors. Alice is throwing a party with n other guests, including Bob. Bob starts a rumor about Alice by telling it to one of the other guests. A person hearing this rumor for the first time will immediately tell it to one other guest, chosen at random from all the people at the party except Alice and the person from whom they heard it. If a person (including Bob) hears the rumor for a second time, he or she will not propagate it further. Compose a program to estimate the probability that everyone at the party (except Alice) will hear the rumor before it stops propagating. Also calculate an estimate of the expected number of people to hear the rumor.

1.4.29 Find a duplicate. Given an array of n elements with each element between 1 and n, compose a code fragment to determine whether there are any duplicates. You do not need to preserve the contents of the given array, but do not use an extra array.

1.4.30 Counting primes. Compare primesieve.py with the alternative approach described in the text. This is an example of a space–time tradeoff: primesieve.py is fast, but requires a boolean array of length n; the alternative approach uses only two integer variables, but is substantially slower. Estimate the magnitude of this difference by finding the value of n for which this second approach can complete the computation in about the same time as python primesieve.py 1000000.

1.4.31 Minesweeper. Compose a program that takes three command-line arguments m, n, and p and produces an m-by-n boolean array where each element is occupied with probability p. In the minesweeper game, occupied cells represent bombs and empty cells represent safe cells. Write the array using an asterisk for bombs and a period for safe cells. Then, replace each safe square with the number of neighboring bombs (above, below, left, right, or diagonal) and write the result, as in this example.

* * . . .        * * 1 0 0
. . . . .        3 3 2 0 0
. * . . .        1 * 1 0 0

Try to express your code so that you have as few special cases as possible to deal with, by using an (m+2)-by-(n+2) boolean array.

1.4.32 Self-avoiding walk length. Suppose that there is no limit on the size of the grid. Run experiments to estimate the average walk length.

1.4.33 Three-dimensional self-avoiding walks. Run experiments to verify that the dead-end probability is 0 for a three-dimensional self-avoiding walk and to compute the average walk length for various values of n.

1.4.34 Random walkers. Suppose that n random walkers, starting in the center of an n-by-n grid, move one step at a time, choosing to go left, right, up, or down with equal probability at each step. Compose a program to help formulate and test a hypothesis about the number of steps taken before all cells are touched.

1.4.35 Bridge hands. In the game of bridge, four players are dealt hands of 13 cards each. An important statistic is the distribution of the number of cards in each suit in a hand. Which is the most likely, 5-3-3-2, 4-4-3-2, or 4-3-3-3? Compose a program to help you answer this question.

1.4.36 Birthday problem. Suppose that people continue to enter an empty room until a pair of people share a birthday. On average, how many people will have to enter before there is a match? Run experiments to estimate the value of this quantity. Assume birthdays to be uniform random integers between 0 and 364.

1.4.37 Coupon collector. Run experiments to validate the classical mathematical result that the expected number of coupons needed to collect n values is about nHn. For example, if you are observing the cards carefully at the blackjack table (and the dealer has enough decks randomly shuffled together), you will wait until about 235 cards are dealt, on average, before seeing every card value.

1.4.38 Riffle shuffle. Compose a program to rearrange a deck of n cards using the Gilbert–Shannon–Reeds model of a riffle shuffle. First, generate a random integer r according to a binomial distribution: flip a fair coin n times and let r be the number of heads. Now, divide the deck into two piles: the first r cards and the remaining nr cards. To complete the shuffle, repeatedly take the top card from one of the two piles and put it on the bottom of a new pile. If there are n1 cards remaining in the first pile and n2 cards remaining in the second pile, choose the next card from the first pile with probability n1 / (n1 + n2) and from the second pile with probability n2 / (n1 + n2). Investigate how many riffle shuffles you need to apply to a deck of 52 cards to produce a (nearly) uniformly shuffled deck.

1.4.39 Binomial coefficients. Compose a program that builds and writes a two-dimensional ragged array a such that a[n][k] contains the probability that you get exactly k heads when you toss a fair coin n times. Take a command-line argument to specify the maximum value of n. These numbers are known as the binomial distribution: if you multiply each element in row k by 2n, you get the binomial coefficients (the coefficients of xk in (x+1)n) arranged in Pascal’s triangle. To compute them, start with a[n][0] = 0.0 for all n and a[1][1] = 1.0, then compute values in successive rows, left to right, with a[n][k] = (a[n-1][k] + a[n-1][k-1])/2.0.

Image

1.5 Input and Output

In this section we extend the set of simple abstractions (command-line arguments and standard output) that we have been using as the interface between our Python programs and the outside world to include standard input, standard drawing, and standard audio. Standard input makes it convenient for us to compose programs that process arbitrary amounts of input and to interact with our programs; standard drawing makes it possible for us to work with graphical representations of images, freeing us from having to encode everything as text; and standard audio adds sound. These extensions are easy to use, and will bring you to yet another new world of programming.

The abbreviation I/O is universally understood to mean input/output, a collective term that refers to the mechanisms by which programs communicate with the outside world. Your computer’s operating system controls the physical devices that are connected to your computer. To implement our “standard I/O” abstractions, we use modules containing functions that interface to the operating system.

You have already been accepting arguments from the command line and writing strings in a terminal window; the purpose of this section is to provide you with a much richer set of tools for processing and presenting data. Like the stdio.write() and stdio.writeln() functions that you have been using, these functions do not implement pure mathematical functions—their purpose is to produce some side effect, either on an input device or on an output device. Our prime concern is using such devices to get information into and out of our programs.

An essential feature of standard I/O mechanisms is that there is no limit on the amount of input or output data, from the point of view of the program. Your programs can consume input or produce output indefinitely.

One use of standard I/O mechanisms is to connect your programs to files on your computer’s external storage. It is easy to connect standard input, standard output, standard drawing, and standard audio to files. Such connections make it easy to have your Python programs save or load results to files for archival purposes or for later reference by other programs or other applications.

Bird’s-eye view

The conventional model that we have been using for Python programming has served us since SECTION 1.1. To build context, we begin by briefly reviewing the model.

A Python program takes input values from the command line and writes a string of characters as output. By default, both command-line arguments and standard output are associated with the application that takes commands (the one in which you have been typing the python command). We use the generic term terminal window to refer to this application. This model has proved to be a convenient and direct way for us to interact with our programs and data.

Command-line arguments

This mechanism, which we have been using to provide input to our programs, is a standard part of Python programming. The operating system presents the command-line arguments that we type to Python programs in an array named sys.argv[]. By convention, both Python and the operating system process the arguments as strings, so if we intend for an argument to be a number, we use a conversion function such as int() or float() to convert it from a string to the appropriate type.

Standard output

To write output values in our programs, we have been using the booksite functions stdio.write() and stdio.writeln(). Python puts the results of a program’s sequence of calls on these functions into the form of an abstract stream of characters known as standard output. By default, the operating system connects standard output to the terminal window. All of the output in our programs so far has been appearing in the terminal window.

For reference, and as a starting point, randomseq.py (PROGRAM 1.5.1) is a program that uses this model. It takes a command-line argument n and produces an output sequence of n random numbers between 0 and 1.

Now we will complement command-line arguments and standard output with three additional mechanisms that address their limitations and provide us with a far more useful programming model. These mechanisms give us a new bird’s-eye view of a Python program in which the program converts a standard input stream and a sequence of command-line arguments into a standard output stream, a standard drawing, and a standard audio stream.


Program 1.5.1 Generating a random sequence (randomseq.py)

import random
import sys
import stdio

n = int(sys.argv[1])
for i in range(n):
    stdio.writeln(random.random())

This program accepts an integer command-line argument n, and writes to standard output a random sequence of n floats in the range [0, 1). The program illustrates the conventional model that we have been using so far for Python programming. From the program’s point of view, there is no limit on the length of the output sequence.


% python randomseq.py 1000000
0.879948024484513
0.8698170909139995
0.6358055797752076
0.9546013485661425
...


Standard input

The booksite stdio module defines several functions in addition to stdio.write() and stdio.writeln(). These additional functions implement a standard input abstraction to complement the standard output abstraction. That is, the stdio module contains functions that allow your programs to read from standard input. Just as you can write to standard output at any time during the execution of your program, so you can read from a standard input stream at any time.

Standard drawing

The booksite stddraw module allows you to create drawings with your programs. It uses a simple graphics model that allows you to create drawings consisting of points, lines, and geometric figures in a window on your computer. The stddraw module also includes facilities for text, color, and animation.

Standard audio

The booksite stdaudio module allows you to create and manipulate sound with your programs. It uses a standard format to convert arrays of floats into sound.

To use these modules, you must make the files stdio.py, stddraw.py, and stdaudio.py available to Python (see the Q&A at the end of this section).

Image

The standard input and standard output abstractions date back to the development of the Unix operating system in the 1970s and are found in some form on all modern systems. Although they are primitive by comparison to various mechanisms developed since, modern programmers still depend on them as a reliable way to connect data to programs. We have developed for this book stddraw and stdaudio in the same spirit as these earlier abstractions to provide you with an easy way to produce visual and aural output.

Standard output

As noted in SECTION 1.2, an application programming interface (API) is a description of the features that a module offers to its clients. At the bottom of this page is the API of the part of the stdio module that is relevant to standard output. The stdio.write() and stdio.writeln() functions are the ones that you have been using. The stdio.writef() function is a main topic of this section and will be of interest to you now because it gives you more control over the appearance of the output. It was a feature of the C language of the early 1970s and continues to survive in modern languages because it is so useful.

Image

Since the first time that we wrote float objects, we have been distracted by excessive precision in the output. For example, when we call stdio.write(math.pi), we get the output 3.141592653589793, even though we might prefer to see 3.14 or 3.14159. The stdio.write() and stdio.writeln() functions present each number to up to 16 decimal digits of precision even when we would be happy with just a few digits of precision. The stdio.writef() function is more flexible: it allows us to specify the number of digits and the precision when converting numeric objects to strings for output. With stdio.writef(), we can write stdio.writef('%7.5f', math.pi) to get 3.14159.

Next, we describe the meaning and operation of these statements, along with extensions to handle the other built-in types of data.

Formatted writing basics

The simplest kind of call of stdio.writef() passes only one argument, a string. In that case, stdio.writef() simply writes the string to standard output, so it is equivalent to stdio.write(). A more commonly used call of stdio.writef() passes two arguments. In that context the first argument is called the format string. It contains a conversion specification that describes how the second argument is to be converted to a string for output. A conversion specification has the form %w.pc, where w and p are small integers and c is a character, to be interpreted as follows:

w is the field width, the number of characters that should be written. If the number of characters to be written exceeds (or equals) the field width, then the field width is ignored; otherwise, the output is padded with spaces on the left. A negative field width indicates that the output instead should be padded with spaces on the right.

p is the precision. For floats, the precision is the number of digits that should be written after the decimal point; for strings, it is the number of characters of the string that should be written. The precision is not used with integers.

c is the conversion code. It should be d when writing an integer, f when writing a float, e when writing a float using scientific notation, and s when writing a string.

Image

The field width and precision can be omitted, but every specification must have a conversion code.

Python must be able to convert the second argument to the type required by the specification. For s, there is no restriction, because every type of data can be converted to a string (via a call to the str() function). In contrast, a statement such as stdio.writef('%12d', 'Hello')—which asks Python to convert a string to an integer—causes Python to raise a TypeError at run time. The table at the bottom of this page shows format strings containing some common conversion specifications.

Any part of the format string that is not a conversion specification is simply passed through to standard output. For example, the statement

stdio.writef('pi is approximately %.2f ', math.pi)

writes the line

pi is approximately 3.14

Note that we need to explicitly include the newline character in the argument to write a new line with stdio.writef().

Multiple arguments

The stdio.writef() function can take more than two arguments. In this case, the format string must have a format specifier for each additional argument, perhaps separated by other characters to pass through to the output. For example, we could replace stdio.write(t) in sqrt.py (PROGRAM 1.3.6) with

stdio.writef('The square root of %.1f is %.6f', c, t)

Image

to get output like

The square root of 2.0 is 1.414214

As a more detailed example, if you were making payments on a loan, you might use code whose inner loop contains the statements

format = '%3s  $%6.2f   $%7.2f   $%5.2f '
stdio.writef(format, month[i], pay, balance, interest)

to write the second and subsequent lines in a table like this (see EXERCISE 1.5.14):

    payment    balance  interest
Jan $299.00   $9742.67   $41.67
Feb $299.00   $9484.26   $40.59
Mar $299.00   $9224.78   $39.52
 ...

Formatted writing is convenient because this sort of code is much more compact than the string-concatenation approach that we have been using to create output strings. We have described only the basic options; see the booksite for many details.

Standard input

Several functions in the booksite stdio module take data from a standard input stream that may be empty or may contain a sequence of values separated by whitespace (spaces, tabs, newline characters, and the like). Each value represents an integer, a float, a boolean, or a string. One of the key features of the standard input stream is that your program consumes values when it reads them. Once your program has read a value, it cannot back up and read it again. This assumption is restrictive, but it reflects physical characteristics of some input devices and simplifies implementing the abstraction. The stdio module offers 13 functions that are related to reading from standard input, as shown in the API on the next page. These functions fall into one of three categories: those for reading individual tokens, one at a time, and converting each to an integer, float, boolean or string; those for reading lines from standard input, one at a time; and those for reading a sequence of values of the same type (returning the values in an array). Generally, it is best not to mix functions from the different categories in the same program. Within the input stream model, those functions are largely self-documenting (the names describe their effect), but their precise operation is worthy of careful consideration, so we will consider several examples in detail.

Image
Image
Typing input

When you use the python command to invoke a Python program from the command line, you actually are doing three things: (1) issuing a command to start executing your program, (2) specifying the values of the command line arguments, and (3) beginning to define the standard input stream. The string of characters that you type in the terminal window after the command line is the standard input stream. When you type characters, you are interacting with your program. The program waits for you to create the standard input stream. For example, consider the following program addints.py, which takes a integer n from the command line and then reads n integers from standard input, adds them, and writes the sum to standard output:

import sys
import stdio
n = int(sys.argv[1])
total = 0
for i in range(n):
    total += stdio.readInt()
stdio.writeln('Sum is ' + str(total))

When you type python addints.py 4, the program starts execution. It takes the command-line argument, initializes total to 0, enters the for loop, eventually calls stdio.readInt(), and waits for you to type an integer. Suppose that you want 144 to be the first value. As you type 1, then 4, and then 4, nothing happens, because stdio does not know that you are done typing the integer. But when you finally type <return> to signify the end of your integer, stdio.readInt() immediately returns the value 144, which your program adds to total and then calls stdio.readInt() again. Again, nothing happens until you type the second value: if you type 2, then 3, then 3, and then <return> to end the number, stdio.readInt() returns the value 233, which your program again adds to total. After you have typed four numbers in this way, the program expects no more input and writes the sum, as desired. In the command-line traces, we use boldface to highlight the text that you type and differentiate it from the output of the program.

Image
Input format

The stdio.readInt() function expects an integer. If you type abc or 12.2 or True, Python raises a ValueError at run time. The format for each type is the same as you have been using to specify literals within Python programs. For convenience, stdio treats strings of consecutive whitespace characters as identical to one space and allows you to delimit your numbers with such strings. It does not matter how many spaces you put between numbers, or whether you enter numbers on one line or separate them with tab characters or spread them out over several lines (except that your terminal application processes standard input one line at a time, so it will wait until you type <return> before sending all of the numbers on that line to standard input). You can mix values of different types in an input stream, but whenever the program expects a value of a particular type, the input stream must have a value of that type.

Interactive user input

PROGRAM 1.5.2 (twentyquestions.py) is a simple example of a program that interacts with its user. The program generates a random integer and then gives clues to a user trying to guess the number. (Note: By using binary search, you can always get to the answer in at most 20 questions. See SECTION 4.2.) The fundamental difference between this program and others that we have composed is that the user has the ability to change the control flow while the program is executing. This capability was very important in early applications of computing, but we rarely compose such programs nowadays because modern applications typically take such input through a graphical user interface, as discussed in CHAPTER 3. Even a simple program like twentyquestions.py illustrates that composing programs that support user interaction is potentially very difficult because you have to plan for all possible user inputs.

Processing an arbitrary-size input stream

Typically, input streams are finite: your program marches through the input stream, consuming values until the stream is empty. But there is no restriction on the size of the input stream, and some programs simply process all the input presented to them. Our next example, average.py (PROGRAM 1.5.3), reads in a sequence of real numbers from standard input and writes their average. It illustrates a key property of using an input stream: the length of the stream is not known to the program. We type all the numbers that we have, and then the program averages them. Before reading each number, the program calls the function stdio.isEmpty() to check whether there are any more numbers in the input stream.


Program 1.5.2 Interactive user input (twentyquestions.py)

import random
import stdio

RANGE = 1000000

secret = random.randrange(1, RANGE+1)
stdio.write('I am thinking of a secret number between 1 and ')
stdio.writeln(RANGE)

guess = 0
while guess != secret:
    # Solicit one guess and provide one answer.
    stdio.write('What is your guess? ')
    guess = stdio.readInt()

    if   (guess < secret): stdio.writeln('Too low')
    elif (guess > secret): stdio.writeln('Too high')
    else:                  stdio.writeln('You win!')


secret | secret value
 guess | user's guess

This program generates a random integer between 1 and 1 million. Then it repeatedly reads user guesses from standard input. It writes Too low or Too high to standard output, as appropriate, in response to each guess. It writes You win! to standard output and exits when the user’s guess is correct. You always can get the program to write You win! with fewer than 20 questions.


% python twentyquestions.py
I am thinking of a secret number between 1 and 1000000
What is your guess? 500000
Too high
What is your guess? 250000
Too low
What is your guess? 375000
Too high
What is your guess? 312500
Too high
What is your guess? 300500
Too low
...


How do we signal that we have no more data to type? By convention, we type a special sequence of characters known as the end-of-file sequence. Sadly, the terminal applications that we typically encounter on modern operating systems use different conventions for this critically important sequence. In this book, we use <Ctrl-d> (many systems require <Ctrl-d> to appear on a line by itself); the other widely used convention is <Ctrl-z> on a line by itself.

Actually, we rarely type numbers one by one on standard input. Instead, we keep our input data in files, as illustrated in the example accompanying PROGRAM 1.5.3 and explained in detail in the text.

Certainly average.py is a simple program, but it represents a profound new capability in programming: with standard input, we can compose programs that process an unlimited amount of data. As you will see, composing such programs is an effective approach for numerous data-processing applications.

Standard input is a substantial step up from the command-line argument model that we have been using, for two reasons, as illustrated by twentyquestions.py and average.py. First, we can interact with our program—with command-line arguments, we can provide data to the program only before it begins execution. Second, we can read in large amounts of data—with command-line arguments, we can enter only values that fit on the command line. Indeed, as illustrated by average.py, the amount of data processed by a program can be potentially unlimited, and many programs are made simpler by that assumption. A third reason for standard input is that your operating system makes it possible to change the source of standard input, so that you do not have to type all the input. Next, we consider the mechanisms that enable this possibility.

Redirection and piping

For many applications, typing input data as a standard input stream from the terminal window is untenable because our program’s processing power is then limited by the amount of data that we can type (and our typing speed). Similarly, we often want to save the information written on the standard output stream for later use. To address such limitations, we next focus on the idea that standard input is an abstraction—the program just expects its input and has no dependence on the source of the input stream. Standard output is a similar abstraction. The power of these abstractions derives from our ability (through the operating system) to specify various other sources for standard input and standard output, such as a file, the network, or another program. All modern operating systems implement these mechanisms.


Program 1.5.3 Averaging a stream of numbers (average.py)

import stdio

total = 0.0
count = 0
while not stdio.isEmpty():
    value = stdio.readFloat()
    total += value
    count += 1
avg = total / count

stdio.writeln('Average is ' + str(avg))


count | count of numbers read
total | cumulated sum

This program reads floats from the standard input stream until it reaches the end-of-file. Then it writes to standard output the average of those floats. From its point of view, there is no limit on the size of the input stream. The commands shown below after the first one use redirection and piping (discussed in the next subsection) to provide 100,000 numbers to average.py.


% python average.py
10.0 5.0 6.0
3.0
7.0 32.0
<Ctrl-d>
Average is 10.5


% python ramdomseq.py 1000 > data.txt
% python average.py < data.txt
Average is 0.510473676174824

% python randomseq.py 1000 | python average.py
Average is 0.50499417963857


Redirecting standard output to a file

By adding a simple directive to the command that executes a program, we can redirect its standard output to a file, either for permanent storage or for input to another program at a later time. For example,

% python randomseq.py 1000 > data.txt

specifies that the standard output stream is not to be written in the terminal window, but instead is to be written to a text file named data.txt. Each call to stdio.write(), stdio.writeln(), or stdio.writef() appends text at the end of that file. In this example, the end result is a file that contains 1,000 random values. No output appears in the terminal window: it goes directly into the file named after the > symbol. Thus, we can save information for later retrieval. Note that we do not have to change randomseq.py (PROGRAM 1.5.1) in any way for this mechanism to work—it relies on using the standard output abstraction and is unaffected by our use of a different implementation of that abstraction. You can use this mechanism to save output from any program that you compose. Once we have expended a significant amount of effort to obtain a result, we often want to save the result for later reference. In a modern system, you can save some information by using cut-and-paste or some similar mechanism that is provided by the operating system, but cut-and-paste is inconvenient for large amounts of data. By contrast, redirection is specifically designed to make it easy to handle large amounts of data.

Image
Redirecting from a file to standard input

We can redirect standard input so that our program reads data from a file instead of the terminal application:

% python average.py < data.txt

This command reads a sequence of numbers from the file data.txt and computes their average value. Specifically, the < symbol is a directive that tells the operating system to implement the standard input stream by reading from the text file data.txt instead of waiting for the user to type something into the terminal window. When the program calls stdio.readFloat(), the operating system reads the value from the file. The file data.txt could have been created by any application, not just a Python program—almost every application on your computer can create text files. This facility to redirect from a file to standard input enables us to create data-driven code, in which we can change the data processed by a program without having to change the program at all. Instead, we keep data in files and compose programs that read from standard input.

Image
Connecting two programs

The most flexible way to implement the standard input and standard output abstractions is to specify that they are implemented by our own programs! This mechanism is called piping. For example, the command

% python randomseq.py 1000 | python average.py

specifies that the standard output for randomseq.py and the standard input stream for average.py are the same stream. The effect is as if randomseq.py were typing the numbers it generates into the terminal window while average.py is running. This example also has the same effect as the following sequence of commands:

% python randomseq.py 1000 > data.txt
% python average.py < data.txt

With piping, however, the file data.txt is not created. This difference is profound, because it removes another limitation on the size of the input and output streams that we can process. For example, we could replace 1000 in our example with 1000000000, even though we might not have the space to save a billion numbers on our computer (we do need the time to process them, however). When randomseq.py calls stdio.writeln(), a string is added to the end of the stream; when average.py calls stdio.readFloat(), a string is removed from the beginning of the stream. The timing of precisely what happens is up to the operating system: it might run randomseq.py until it produces some output, and then run average.py to consume that output, or it might run average.py until it needs some output, and then run randomseq.py until it produces the needed output. The end result is the same, but our programs are freed from worrying about such details because they work solely with the standard input and standard output abstractions.

Image
Filters

Piping, a core feature of the original Unix system of the early 1970s, still survives in modern systems because it is a simple abstraction for communicating among disparate programs. Testimony to the power of this abstraction is the fact that many Unix programs are still being used today to process files that are thousands or millions of times larger than imagined by the programs’ authors. We can communicate with other Python programs via calls on functions, but standard input and standard output allow us to communicate with programs that were written at another time and, perhaps, in another language. When we use standard input and standard output, we are agreeing on a simple interface to the outside world.

For many common tasks, it is convenient to think of each program as a filter that converts a standard input stream to a standard output stream in some way, with piping as the command mechanism to connect programs together. For example, rangefilter.py (PROGRAM 1.5.4) takes two command-line arguments and writes to standard output those numbers from standard input that fall within the specified range. You might imagine standard input to be measurement data from some instrument, with the filter being used to throw away data outside the range of interest for the experiment at hand.

Several standard filters that were designed for Unix still survive (sometimes with different names) as commands in modern operating systems. For example, the sort filter reads the lines from standard input and writes them to standard output in sorted order:

% python randomseq.py 9 | sort
0.0472650078535
0.0681950168757
0.0967410236589
0.0974385525393
0.118855769243
0.46604926859
0.522853708616
0.599692836211
0.685576779833

We discuss sorting in SECTION 4.2. A second useful filter is grep, which writes the lines from standard input that match a given pattern. For example, if you type

% grep lo < rangefilter.py


Program 1.5.4 A simple filter (rangefilter.py)

import sys
import stdio

lo = int(sys.argv[1])
hi = int(sys.argv[2])

while not stdio.isEmpty():
    # Process one integer.
    value = stdio.readInt()
    if (value >= lo) and (value <= hi):
        stdio.write(str(value) + ' ')
stdio.writeln()


  lo  | lower bound of range
  hi  | upper bound of range
value | current number

This program accepts integer command-line arguments lo and hi and then reads integers from standard input until it reaches end-of-file, writing to standard output each of those integers that is in the range lo to hi, inclusive. Thus the program is a filter (see text). There is no limit on the length of the streams.


% more rangedata.txt
3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3 2 3 8 4 6 2 6 4 3 3 8 3 2 7 9


% python rangefilter.py 5 9 < rangedata.txt
5 9 6 5 5 8 9 7 9 8 6 6 8 7 9


% python rangefilter.py 100 400
358 1330 55 165 689 1014 3066 387 575 843 203 48 292 877 65 998
358 165 387 203 292
<Ctrl-d>


you get all the lines in rangefilter.py that contain 'lo':

lo = int(sys.argv[1])
    if (value >= lo) and (value <= hi):

Programmers often use tools such as grep to get a quick reminder of variable names or language usage details. A third useful filter is more, which reads data from standard input (or from a file specified as a command-line argument) and displays it in your terminal window one screenful at a time. For example, if you type

% python randomseq.py 1000 | more

you will see as many numbers as fit in your terminal window, but more will wait for you to hit the space bar before displaying each succeeding screenful. The term filter is perhaps misleading: it was meant to describe programs like rangefilter.py that write some subsequence of standard input to standard output, but it is now often used to describe any program that reads from standard input and writes to standard output.

Multiple streams

For many common tasks, we want to compose programs that take input from multiple sources and/or produce output intended for multiple destinations. In SECTION 3.1 we discuss our instream.py and outstream.py modules, which generalize stdio.py to allow for multiple input and output streams. These modules include provisions for redirecting these streams, not just to and from files, but also from arbitrary web pages.

Processing large amounts of information plays an essential role in many applications of computing. A scientist may need to analyze data collected from a series of experiments, a stock trader may wish to analyze information about recent financial transactions, or a student may wish to maintain collections of music and movies. In these and countless other applications, data-driven programs are the norm. Standard output, standard input, redirection, and piping provide us with the capability to address such applications with our Python programs. We can collect data into files on our computer through the web or any of the standard devices and use redirection and piping to connect data to our programs.

Standard drawing

Up to this point, our input/output abstractions have focused exclusively on text input and output. Now we introduce an abstraction for producing drawings as output. This module is easy to use and allows us to take advantage of a visual medium to cope with far more information than is possible with just text.

Standard drawing is very simple: we imagine an abstract drawing device capable of drawing lines and points on a two-dimensional canvas and then displaying that “canvas” on your screen in the standard drawing window. The device is capable of responding to the commands that our programs issue in the form of calls to functions in the stddraw module.

The module’s API consists of two kinds of functions: drawing functions that cause the device to take an action (such as drawing a line or drawing a point) and control functions that control how the drawing is shown and set parameters such as the pen size or the coordinate scales.

Creating drawings

The basic functions for drawing are described in the API at the bottom of this page. Like the functions for standard input and standard output, the drawing functions are nearly self-documenting: stddraw.line() draws a straight line segment connecting two points whose coordinates are given as arguments and stddraw.point() draws a dot centered at the given coordinates. The default coordinate scale is the unit square (all coordinates between 0 and 1). The point (0.0, 0.0) is at the lower left, and the point (1.0, 1.0) is at the upper right—thus corresponding to the first quadrant of the familiar Cartesian coordinate system. The default settings draw black lines and black points on a white background.

The control function stddraw.show() needs a bit more explanation. When your program calls any drawing function such as stddraw.line() or stddraw.point(), stddraw uses an abstraction known as the background canvas. The background canvas is not displayed; it exists only in computer memory. All points, lines, and so forth are drawn on the background canvas, not directly in the standard drawing window. Only when you call stddraw.show() does your drawing get copied from the background canvas to the standard drawing window, where it is displayed until the user closes the standard drawing window—typically by clicking on the Close button in the window’s title bar.

Image

Why does stddraw need to use a background canvas? The main reason is that use of two canvases instead of one makes the stddraw module more efficient. Incrementally displaying a complex drawing as it is being created can be intolerably inefficient on many computer systems. In computer graphics, this technique is known as double buffering.

To summarize the information that you need to know, a typical program using the stddraw module has this structure:

• Import the stddraw module.

• Call drawing functions such as stddraw.line() and stddraw.point() to create a drawing on the background canvas.

• Call stddraw.show() to show the background canvas in the standard drawing window and wait until the window is closed.

Image

It is important to remember that all drawing goes to the background canvas. Typically, a program that creates a drawing finishes with a call to stddraw.show(), because only when stddraw.show() is called will you see your picture.

Next, we will consider several examples that will open up a whole new world of programming by freeing you from the restriction of communicating with your program just through text.

Your first drawing

The “Hello, World” equivalent for graphics programming with stddraw is to draw a triangle with a point inside. To form the triangle, we draw three lines: one from the point (0, 0) at the lower-left corner to the point (1, 0), one from that point to the third point at (1/2, Image), and one from that point back to (0, 0). As a final flourish, we draw a spot in the middle of the triangle. Once you have successfully downloaded and run triangle.py, you are ready to compose your own programs that draw figures consisting of lines and points. This ability literally adds a new dimension to the output that you can produce.

Image

When you use a computer to create drawings, you get immediate feedback (the drawing) so that you can refine and improve your program quickly. With a computer program, you can create drawings that you could not contemplate making by hand. In particular, instead of viewing our data as just numbers, we can use pictures, which are far more expressive. We will consider other graphics examples after we discuss a few other drawing commands.

Saving a drawing

You can save the standard drawing window canvas to a file, thus giving you the ability to print a drawing or to share a drawing with other people. To do so, right-click anywhere on the window canvas (usually while stddraw is waiting forever because your program called its stddraw.show() function). After you do that, stddraw displays a file dialog box which allows you to specify a file name. Then, after you type a file name into the dialog box and click the Save button, stddraw saves the window canvas to a file with the specified name. The file name must end with either .jpg (to save the window canvas in JPEG format) or .png (to save the window canvas in “Portable Network Graphics” format). The drawings generated by the graphics programs shown in this chapter were saved to files using this mechanism.

Control commands

The default coordinate system for standard drawing is the unit square, but we often want to draw plots at different scales. For example, a typical situation is to use coordinates in some range for the x-coordinate, or the y-coordinate, or both. Also, we often want to draw lines of different thickness and points of different size from the standard. To accommodate these needs, stddraw has the functions shown at the top of the next page.

For example, when you call the function stddraw.setXscale(0, n), you are telling the drawing device that you will be using x-coordinates between 0 and n. Note that the two-call sequence

stddraw.setXscale(x0, x1)
stddraw.setYscale(y0, y1)

Image

sets the drawing coordinates to be within a bounding box whose lower-left corner is at (x0, y0) and whose upper-right corner is at (x1, y1). If you use integer coordinates, Python promotes them to floats, as expected. The figure at right is a simple example that demonstrates the utility of scaling. Scaling is the simplest of the transformations commonly used in graphics. Several of the applications that we consider in this chapter are typical—we use scaling in a straightforward way to match our drawings to our data.

The pen is circular, so that when you set the pen radius to r and draw a point, you get a circle of radius r. Also, lines are of thickness 2r and have rounded ends. The default pen radius is 0.005 and is not affected by coordinate scaling. This default pen radius is about 1/200 the width of the default window, so that if you draw 100 points equally spaced along a horizontal or vertical line, you will be able to see individual circles, but if you draw 200 such points, the result will look like a line. When you make the function call stddraw.setPenRadius(.025), you are saying that you want the thickness of the lines and the size of the points to be five times the 0.005 standard. To draw points with the minimum possible radius (one pixel on typical displays), set the pen radius to 0.0.

Image

Program 1.5.5 Standard input to drawing filter (plotfilter.py)

import stddraw
import stdio

# Read and set the x- and y-scales.
x0 = stdio.readFloat()
y0 = stdio.readFloat()
x1 = stdio.readFloat()
y1 = stdio.readFloat()
stddraw.setXscale(x0, x1)
stddraw.setYscale(y0, y1)

# Read and plot the points.
stddraw.setPenRadius(0.0)
while not stdio.isEmpty():
    x = stdio.readFloat()
    y = stdio.readFloat()
    stddraw.point(x, y)

stddraw.show()


 x0  | left bound
 y0  | bottom bound
 x1  | right bound
 y1  | top bound
x, y | current point

This program reads x- and y-scales from standard input, and configures the stddraw canvas accordingly. Then it reads points from standard input until it reaches end-of-file, and plots them to standard drawing. The file usa.txt on the booksite has the coordinates of the U.S. cities with populations over 500. Some data, such as the data in usa.txt, is inherently visual.


Image

Filtering data to a standard drawing

One of the simplest applications of standard drawing is to plot data, by filtering it from standard input to the drawing. PROGRAM 1.5.5 (plotfilter.py) is such a filter: it reads a sequence of points defined by (x, y) coordinates and draws a spot at each point. It adopts the convention that the first four numbers read from standard input specify the bounding box, so that it can scale the plot without having to make an extra pass through all the points to determine the scale (this kind of convention is typical with such data files).

The graphical representation of points plotted in this way is far more expressive (and far more compact) than the numbers themselves or anything that we could create with the standard output representation that our programs have been limited to until now. The plotted image that is produced by plotfilter.py makes it far easier for us to infer properties of the cities (such as, for example, clustering of population centers) than does a list of the coordinates. Whenever we are processing data that represents the physical world, a visual image is likely to be one of the most meaningful ways in which we can use to display output. PROGRAM 1.5.5 illustrates just how easily you can create such an image.

Plotting a function graph

Another important use of stddraw is to plot experimental data or the values of a mathematical function. For example, suppose that we want to plot values of the function y = sin(4x) + sin(20x) in the interval [0, π]. Accomplishing this task is a prototypical example of sampling: there are an infinite number of points in the interval, so we have to make do with evaluating the function at a finite number of points within the interval. We sample the function by choosing a set of x-values, then computing y-values by evaluating the function at each x-value. Plotting the function by connecting successive points with lines produces a piecewise linear approximation. The simplest way to proceed is to regularly space the x values: we decide ahead of time on a sample size, then space the x-coordinates by the interval size divided by the sample size. To make sure that the values we plot fall in the visible canvas, we scale the x-axis corresponding to the interval and the y-axis corresponding to the maximum and minimum values of the function within the interval. PROGRAM 1.5.6 (functiongraph.py) gives the Python code for this process.

The smoothness of the curve depends on properties of the function and the size of the sample. If the sample size is too small, the rendition of the function may not be at all accurate (it might not be very smooth, and it might miss major fluctuations, as shown in the example on the next page); if the sample is too large, producing the plot may be time-consuming, since some functions are time-consuming to compute. (In SECTION 2.4, we will look at an efficient method for accurately plotting a smooth curve.) You can use this same technique to plot the function graph of any function you choose: identify an x-interval where you want to plot the function, compute function values evenly spaced in that interval and store them in an array, determine and set the y-scale, and draw the line segments.


Program 1.5.6 Function graph (functiongraph.py)

import math
import sys
import stdarray
import stddraw

n = int(sys.argv[1])

x = stdarray.create1D(n+1, 0.0)
y = stdarray.create1D(n+1, 0.0)
for i in range(n+1):
    x[i] = math.pi * i / n
    y[i] = math.sin(4.0*x[i]) + math.sin(20.0*x[i])
stddraw.setXscale(0, math.pi)
stddraw.setYscale(-2.0, +2.0)
for i in range(n):
    stddraw.line(x[i], y[i], x[i+1], y[i+1])

stddraw.show()


 n  | # of samples
x[] | x-coordinates
y[] | y-coordinates

This program accepts integer command-line argument n and then plots a piecewise linear approximation to the function y = sin(4x) + sin(20x) by sampling the function at n+1 points between x = 0 and x = π and drawing n line segments. This example illustrates the need for choosing the number of samples carefully—on the left, with only 20 samples, we miss most of the fluctuations in the curve.


Image

Outline and filled shapes

The booskite’s stddraw module also includes functions to draw circles, rectangles, and arbitrary polygons. Each shape defines an outline. When the function name is just the shape name, that outline is traced by the drawing pen. When the name begins with filled, the named shape is instead filled solid, not traced. As usual, we summarize the available functions in an API, shown at the bottom of this page.

The arguments for stddraw.circle() and stddraw.filledCircle() define a circle of radius r centered at (x, y); the arguments for stddraw.square() and stddraw.filledSquare() define a square of side length 2r centered at (x, y); and the arguments for stddraw.polygon() and stddraw.filledPolygon() define a sequence of points that we connect by lines, including one from the last point to the first point. If you want to define shapes other than squares or circles, use one of these functions. To check your understanding, try to figure out what this code draws, before reading the answer:

xd = [x-r, x, x+r, x]
yd = [y, y+r, y, y-r]
stddraw.polygon(xd, yd)

Image

The answer is that you would never know, because it draws on the background canvas and there is no call to stddraw.show(). If there were such a call, it would draw a diamond (a rotated square) centered at the point (x, y). Several other examples of code that draws shapes and filled shapes are shown on the facing page.

Text and color

Occasionally, you may wish to annotate or highlight various elements in your drawings. The stddraw module has a function for drawing text, two other functions for setting parameters associated with text, and another for changing the color of the ink in the pen. We make scant use of these features in this book, but they can be very useful, particularly for drawings on your computer screen. You will find many examples of their use on the booksite.

In this code, color and fonts use types that you will learn about in SECTION 3.1. Until then, we leave the details to stddraw. The available pen colors are BLACK, BLUE, CYAN, DARK_GRAY, GRAY, GREEN, LIGHT_GRAY, MAGENTA, ORANGE, PINK, RED, WHITE, and YELLOW, defined as constants within stddraw. For example, the call stddraw.setPenColor(stddraw.GRAY) changes to gray ink. The default ink color is stddraw.BLACK. The default font in stddraw suffices for most of the drawings that you need (and you can find information on using other fonts on the booksite). For example, you might wish to call these functions to annotate function plots to highlight relevant values, and you might find it useful to compose similar functions to annotate other parts of your drawings.

Shapes, color, and text are basic tools that you can use to produce a dizzying variety of images, but you should use them judiciously. Use of such artifacts usually presents a design challenge, and our stddraw commands are crude by the standards of modern graphics libraries, so that you are likely to need more code than you might expect to produce the beautiful images that you may imagine.

Image
Image

Animation

If we provide an argument to stddraw.show(), then that call need not be the last action of a program: it will copy the background canvas to the standard drawing window and then wait for the specified number of milliseconds. As you will soon see, this capability (coupled with the ability to erase, or clear the background canvas) provides limitless opportunities for creating interesting effects involving dynamic changes in the images in the standard drawing window. Such effects can provide compelling visualizations. We give an example next that also works for the printed page. The booksite offers even more examples that are likely to capture your imagination.

Image
Bouncing ball

The “Hello, World” program for animation is to produce a black ball that appears to move around on the canvas. Suppose that the ball is at position (rx, ry) and we want to create the impression of moving it to a new position nearby, such as, for example, (rx + 0.01, ry + 0.02). We do so in three steps:

• Clear the background canvas.

• Draw a black ball at the new position.

• Show the drawing and wait for a short while.

To create the illusion of movement, we iterate these steps for a whole sequence of positions (one that will form a straight line, in this case). The argument to stddraw.show() quantifies “a short while” and controls the apparent speed.

Image

Program 1.5.7 Bouncing ball (bouncingball.py)

import stddraw

stddraw.setXscale(-1.0, 1.0)
stddraw.setYscale(-1.0, 1.0)

DT = 20.0
RADIUS = 0.05
rx = 0.480
ry = 0.860
vx = 0.015
vy = 0.023

while True:
    # Update ball position and draw it there.
    if abs(rx + vx) + RADIUS > 1.0: vx = -vx
    if abs(ry + vy) + RADIUS > 1.0: vy = -vy
    rx = rx + vx
    ry = ry + vy

    stddraw.clear(stddraw.GRAY)
    stddraw.filledCircle(rx, ry, RADIUS)
    stddraw.show(DT)


  DT   | wait time
RADIUS | ball radius
rx, ry | position
vx, vy | velocity

This program draws a bouncing ball to standard drawing. That is, it simulates the movement of a bouncing ball in the unit box. The ball bounces off the boundary according to the laws of elastic collision. The 20-millisecond wait period keeps the black image of the ball persistent on the screen, even though most of the ball’s pixels alternate between black and white. If you modify this code to take the wait time dt as a command-line argument, you can control the speed of the ball. The images below, which show the track of the ball, are produced by a modified version of this code where the call to stddraw.clear() is outside the loop (see Exercise 1.5.34).


Image

PROGRAM 1.5.7 (bouncingball.py) implements these steps to create the illusion of a ball moving in the 2-by-2 box centered at the origin. The current position of the ball is (rx, ry), and we compute the new position at each step by adding vx to rx and vy to ry. Since (vx, vy) is the fixed distance that the ball moves in each time unit, it represents the velocity. To keep the ball in the drawing, we simulate the effect of the ball bouncing off the walls according to the laws of elastic collision. This effect is easy to implement: when the ball hits a vertical wall, we just change the velocity in the x-direction from vx to –vx, and when the ball hits a horizontal wall, we change the velocity in the y-direction from vy to –vy. Of course, you have to download the code from the booksite and run it on your computer to see motion.

Since we cannot show a moving image on the printed page, we slightly modified bouncingball.py to show the track of the ball as it moves (see EXERCISE 1.5.34) in the examples shown below the code.

To familiarize yourself with animation on your computer, you are encouraged to modify the various parameters in bouncingball.py to draw a bigger ball, make it move faster or slower, and experiment with the distinction between the velocity in the simulation and the apparent speed on your display. For maximum flexibility, you might wish to modify bouncingball.py to take all these parameters as command-line arguments.

Standard drawing substantially improves our programming model by adding a “picture is worth a thousand words” component. It is a natural abstraction that you can use to better open up your programs to the outside world. With it, you can easily produce the function plots and visual representations of data that are commonly used in science and engineering. We will put it to such uses frequently throughout this book. Any time that you spend now working with the sample programs on the last few pages will be well worth the investment. You can find many useful examples on the booksite and in the exercises, and you are certain to find some outlet for your creativity by using stddraw to meet various challenges. Can you draw an n-pointed star? Can you make our bouncing ball actually bounce (add gravity)? You may be surprised at how easily you can accomplish these and other tasks.

Standard audio

As a final example of a basic abstraction for output, we consider stdaudio, a module that you can use to play, manipulate, and synthesize sound. You probably have used your computer to process music; now you can compose programs to do so. At the same time, you will learn some concepts behind a venerable and important area of computer science and scientific computing: digital signal processing. We will merely scratch the surface of this fascinating subject, but you may be surprised at the simplicity of the underlying concepts.

Concert A

Sound is the perception of the vibration of molecules—in particular, the vibration of our eardrums. Therefore, oscillation is the key to understanding sound. Perhaps the simplest place to start is to consider the musical note A above middle C, which is known as concert A. This note is nothing more than a sine wave, scaled to oscillate at a frequency of 440 times per second. The function sin(t) repeats itself once every 2π units, so if we measure t in seconds and plot the function sin(2πt ×440), we get a curve that oscillates 440 times per second. When you play an A by plucking a guitar string, pushing air through a trumpet, or causing a small cone to vibrate in a speaker, this sine wave is the prominent part of the sound that you hear and recognize as concert A. We measure frequency in hertz (cycles per second). When you double or halve the frequency, you move up or down one octave on the scale. For example, 880 hertz is one octave above concert A and 110 hertz is two octaves below concert A. For reference, the frequency range of human hearing is about 20 to 20,000 hertz. The amplitude (y-value) of a sound corresponds to the volume. We plot our curves between –1 and +1 and assume that any devices that record and play sound will scale as appropriate, with further scaling controlled by you when you turn the volume knob.

Image
Other notes

A simple mathematical formula characterizes the other notes on the chromatic scale. There are 12 notes on the chromatic scale, divided equally on a logarithmic (base 2) scale. We get the i th note above a given note by multiplying its frequency by the (i /12)th power of 2. In other words, the frequency of each note in the chromatic scale is precisely the frequency of the previous note in the scale multiplied by the twelfth root of 2 (about 1.06). This information suffices to create music! For example, to play the tune Frère Jacques, we just need to play each of the notes A B C# A by producing sine waves of the appropriate frequency for about half a second and then repeat the pattern.

Sampling

For digital sound, we represent a curve by sampling it at regular intervals, in precisely the same manner as when we plot function graphs. We sample sufficiently often that we have an accurate representation of the curve—a widely used sampling rate for digital sound is 44,100 samples per second. For concert A, that rate corresponds to plotting each cycle of the sine wave by sampling it at about 100 points. Since we sample at regular intervals, we need to compute only the y-coordinates of the sample points. It is that simple: we represent sound as an array of numbers (float values that are between –1 and +1). Our booksite sound module function stdaudio.playSamples() takes an array of floats as its argument and plays the sound represented by that array on your computer.

For example, suppose that you want to play concert A for 10 seconds. At 44,100 samples per second, you need an array of 441,001 float values. To fill in the array, use a for loop that samples the function sin(2πt × 440) at t = 0 / 44100, 1 / 44100, 2 / 44100, 3 / 44100, ..., 441000 / 44100. Once we fill the array with these values, we are ready for stdaudio.playSamples(), as in the following code:

SPS = 44100              # samples per second
hz = 440.0               # concert A
duration = 10.0          # ten seconds
n = int(SPS * duration)

a = stdarray.create1D(n+1)
for i in range(n+1):
    a[i] = math.sin(2.0 * math.pi * i * hz / SPS)
stdaudio.playSamples(a)
stdaudio.wait()

This code is the “Hello, World” of digital audio. Once you use it to get your computer to play this note, you can compose code to play other notes and make music! The difference between creating sound and plotting an oscillating curve is nothing more than the output device. Indeed, it is instructive and entertaining to send the same numbers to both standard drawing and standard audio (see EXERCISE 1.5.27).

Saving to a file

Music can take up a lot of space on your computer. At 44,100 samples per second, a four-minute song corresponds to 4 × 60 × 44100=10,584,000 numbers. Therefore, it is common to represent the numbers corresponding to a song in a binary format that uses less space than the string-of-digits representation that we use for standard input and output. Many such formats have been developed in recent years—stdaudio uses the .wav format. You can find some information about the .wav format on the booksite, but you do not need to know the details, because stdaudio takes care of the conversions for you. The stdaudio module allows you to play .wav files, to compose programs to create and manipulate arrays of floats, and to read and write them as .wav files.

Image

To see how easily we can create music with the functions in the stdaudio module (detailed at the top of page 175), consider PROGRAM 1.5.8 (playthattune.py). It takes notes from standard input, indexed on the chromatic scale from concert A, and plays them to standard audio. You can imagine all sorts of extensions on this basic scheme, some of which are addressed in the exercises.


Program 1.5.8 Digital signal processing (playthattune.py)

import math
import stdarray
import stdaudio
import stdio

SPS = 44100
CONCERT_A = 440.0

while not stdio.isEmpty():
    pitch = stdio.readInt()
    duration = stdio.readFloat()
    hz = CONCERT_A * (2 ** (pitch / 12.0))
    n = int(SPS * duration)
    samples = stdarray.create1D(n+1, 0.0)
    for i in range(n+1):
        samples[i] = math.sin(2.0 * math.pi * i * hz / SPS)
    stdaudio.playSamples(samples)
stdaudio.wait()


 pitch    | distance from A
duration  | note play time
   hz     | frequency
    n     | number of samples
samples[] | sampled sine wave

This program reads sound samples from standard input, and plays the sound to standard audio. This data-driven program plays pure tones from the notes on the chromatic scale, specified on standard input as a pitch (distance from concert A) and a duration (in seconds). The test client reads the notes from standard input, creates an array by sampling a sine wave of the specified frequency and duration at 44,100 samples per second, and then plays each note by calling stdaudio.playSamples().


% more elise.txt
7 .25
6 .25
7 .25
6 .25
7 .25
2 .25
5 .25
3 .25
0 .50


Image

We include stdaudio in our basic arsenal of programming tools because sound processing is one important application of scientific computing that is certainly familiar to you. Not only has the commercial application of digital signal processing had a phenomenal impact on modern society, but the science and engineering behind it combine physics and computer science in interesting ways. We will study more components of digital signal processing in some detail later in the book. (For example, you will learn in SECTION 2.1 how to create sounds that are more musical than the pure sounds produced by playthattune.py.)

Image

Summary

I/O is a particularly convincing example of the power of abstraction because standard input, standard output, standard drawing, and standard audio can be tied to different physical devices at different times without making any changes to programs. Although devices may differ dramatically, we can compose programs that can do I/O without depending on the properties of specific devices. From this point forward, we will call functions from stdio, stddraw, and/or stdaudio in nearly every program in this book, and you will use them in nearly all of your programs. One important advantage of using such modules is that you can switch to new devices that are faster or cheaper, or hold more data, without changing your program at all. In such a situation, the details of the connection are a matter to be resolved between your operating system and the booksite module implementations. On modern systems, new devices are typically supplied with software that resolves such details automatically for both the operating system and for Python.

Conceptually, one of the most significant features of the standard input, standard output, standard drawing, and standard audio data streams is that they are infinite: from the point of view of your program, there is no limit on their length. This point of view leads to more than just programs that have a long useful life (because they are less sensitive to changes in technology than programs with built-in limits). It also is related to the Turing machine, an abstract device used by theoretical computer scientists to help us understand fundamental limitations on the capabilities of real computers. One of the essential properties of the model is the idea of a finite discrete device that works with an unlimited amount of input and output.

Q&A

Q. How can I make the booksite modules stdio, stddraw, and stdaudio available to Python?

A. If you followed the step-by-step instructions on the booksite for installing Python, these module should already be available to Python. Note that copying the files stddraw.py and stdaudio.py from the booksite and putting them in the same directory as the programs that use them is insufficient because they rely upon a library (set of modules) named Pygame for graphics and audio.

Q. Are there standard Python modules for handling standard output?

A. Actually, such features are built into Python. In Python 2, you can use the print statement to write data to stdout. In Python 3, there is no print statement; instead, there is a print() function, which is similar.

Q. Why, then, are we using the booksite stdio module for writing to standard output instead of using the features already provided by Python?

A. Our intention is to compose code that works (as much as possible) with all versions of Python. For example, using the print statement in all our programs would mean they would work with Python 2, but not with Python 3. Since we use stdio functions, we just need to make sure that we have the proper library.

Q. How about standard input?

A. There are (different) capabilities in Python 2 and Python 3 that correspond to stdio.readLine(), but nothing corresponding to stdio.readInt() and similar functions. Again, by using stdio, we can compose programs that not just take advantage of these additional capabilities, but also work in both versions of Python.

Q. How about drawing and sound?

A. Python does not come with an audio library. Python comes with a graphics library named Tkinter for producing drawings, but it is too slow for some of the graphics applications in the book. Our stddraw and stdaudio modules provide easy-to-use APIs, based on the Pygame library.

Q. So, let me get this straight; if I use the format %2.4f with stdio.writef() to write a float, I get two digits before the decimal point and four digits after the decimal point, right?

A. No, that specifies just four digits after the decimal point. The number preceding the decimal point is the width of the whole field. You want to use the format %7.2f to specify seven characters in total—four before the decimal point, the decimal point itself, and two digits after the decimal point.

Q. Which other conversion codes are there for stdio.writef()?

A. For integers, there is o for octal and x for hexadecimal. There are also numerous formats for dates and times. See the booksite for more information.

Q. Can my program reread data from standard input?

A. No. You get only one shot at it, in the same way that you cannot undo a call of stdio.writeln().

Q. What happens if my program attempts to read data from standard input after it is exhausted?

A. Python will raise an EOFError at run time. The functions stdio.isEmpty() and stdio.hasNextLine() allow you to avoid such an error by checking whether more input is available.

Q. Why does stddraw.square(x, y, r) draw a square of width 2r instead of r?

A. This makes it consistent with the function stddraw.circle(x, y, r), where the third argument is the radius of the circle, not the diameter. In this context, r is the radius of the biggest circle that can fit inside the square.

Q. What happens if my program calls stddraw.show(0)?

A. That function call tells stddraw to copy the background canvas to the standard drawing window, and then wait 0 milliseconds (that is, do not wait at all) before proceeding. That function call is appropriate if, for example, you want to run an animation at the fastest rate supported by your computer.

Q. Can I draw curves other than circles with stddraw?

A. We had to draw the line somewhere (pun intended), so we support only the basic shapes discussed in the text. You can draw other shapes one point at a time, as explored in several exercises in the text, but filling them is not directly supported.

Q. So I use negative integers to go below concert A when making input files for playthattune.py?

A. Right. Actually, our choice to put concert A at 0 is arbitrary. A popular standard, known as the MIDI Tuning Standard, starts numbering at the C five octaves below concert A. By that convention, concert A is 69 and you do not need to use negative numbers.

Q. Why do I hear weird results from standard audio when I try to sonify a sine wave with a frequency of 30,000 hertz (or more)?

A. The Nyquist frequency, defined as one-half the sampling frequency, represents the highest frequency that can be reproduced. For standard audio, the sampling frequency is 44,100, so the Nyquist frequency is 22,050.

Exercises

1.5.1 Compose a program that reads in integers (as many as the user enters) from standard input and writes the maximum and minimum values to standard output.

1.5.2 Modify your program from the previous exercise to insist that the integers must be positive (by prompting the user to enter positive integers whenever the value entered is not positive).

1.5.3 Compose a program that accepts an integer n from the command line, reads n floats from standard input, and writes their mean (average value) and standard deviation (square root of the sum of the squares of their differences from the average, divided by n) to standard output.

1.5.4 Extend your program from the previous exercise to create a filter that writes all the floats in standard input that are more than 1.5 standard deviations from the mean.

1.5.5 Compose a program that reads in a sequence of integers and writes both the integer that appears in a longest consecutive run and the length of the run. For example, if the input is 1 2 2 1 5 1 1 7 7 7 7 1 1, then your program should write Longest run: 4 consecutive 7s.

1.5.6 Compose a filter that reads in a sequence of integers and writes the integers, removing repeated values that appear consecutively. For example, if the input is 1 2 2 1 5 1 1 7 7 7 7 1 1 1 1 1 1 1 1 1, your program should write 1 2 1 5 1 7 1.

1.5.7 Compose a program that takes a command-line argument n, reads from standard input N-1 distinct integers between 1 and n, and determines the missing integer.

1.5.8 Compose a program that reads in positive real numbers from standard input and writes their geometric and harmonic means. The geometric mean of n positive numbers x1, x2, ..., xn is (x1 × x2 × ... × xn)1/n. The harmonic mean is (1/x1 + 1/x2 + ... + 1/xn) / (1/n). Hint: For the geometric mean, consider taking logarithms to avoid overflow.

1.5.9 Suppose that the file in.txt contains the two strings F and F, and consider the following program (dragon.py):

import stdio
dragon = stdio.readString()
nogard = stdio.readString()
stdio.write(dragon + 'L' + nogard)
stdio.write(' ')
stdio.write(dragon + 'R' + nogard)
stdio.writeln()

What does the following command do (see EXERCISE 1.2.35)?

python dragon.py < in.txt | python dragon.py | python dragon.py

1.5.10 Compose a filter tenperline.py that reads a sequence of integers between 0 and 99 and writes 10 integers per line, with columns aligned. Then compose a program randomintseq.py that takes two command-line arguments m and n and outputs n random integers between 0 and m-1. Test your programs with the command python randomintseq.py 200 100 | python tenperline.py.

1.5.11 Compose a program that reads in text from standard input and writes the number of words in the text. For the purpose of this exercise, a word is a sequence of non-whitespace characters that is surrounded by whitespace.

1.5.12 Compose a program that reads in lines from standard input with each line containing a name and two integers and then uses writef() to write a table with a column of the names, the integers, and the result of dividing the first by the second, accurate to three decimal places. You could use a program like this to tabulate batting averages for baseball players or grades for students.

1.5.13 Which of the following require saving all the values from standard input (in an array, say), and which could be implemented as a filter using only a fixed number of variables? For each, the input comes from standard input and consists of n floats between 0 and 1.

• Write the maximum and minimum float.

• Write the k th smallest float.

• Write the sum of the squares of the floats.

• Write the average of the n floats.

• Write the percentage of floats greater than the average.

• Write the n floats in increasing order.

• Write the n floats in random order.

1.5.14 Compose a program that writes a table of the monthly payments, remaining principal, and interest paid for a loan, taking three numbers as command-line arguments: the number of years, the principal, and the interest rate (see EXERCISE 1.2.24).

1.5.15 Compose a program that takes three command-line arguments x, y, and z, reads from standard input a sequence of point coordinates (xi, yi, zi), and writes the coordinates of the point closest to (x, y, z). Recall that the square of the distance between (x, y, z) and (xi, yi, zi) is (xxi)2 + (yyi)2 + (zzi)2. For efficiency, do not use either math.sqrt() or the ** operator.

1.5.16 Compose a program that, given the positions and masses of a sequence of objects, computes their center-of-mass, or centroid. The centroid is the average position of the n objects, weighted by mass. If the positions and masses are given by (xi, yi, mi), then the centroid (x, y, m) is given by

m = m1 + m2 + ... + mn

x = (m1 x1 + ... + mn xn) / m

y = (m1 y1 + ... + mn yn) / m

1.5.17 Compose a program that reads in a sequence of real numbers between –1 and +1 and writes their average magnitude, average power, and the number of zero crossings. The average magnitude is the average of the absolute values of the data values. The average power is the average of the squares of the data values. The number of zero crossings is the number of times a data value transitions from a strictly negative number to a strictly positive number, or vice versa. These three statistics are widely used to analyze digital signals.

1.5.18 Compose a program that takes a command-line argument n and plots an n-by-n checkerboard with red and black squares. Color the lower-left square red.

1.5.19 Compose a program that takes as command-line arguments an integer n and a float p (between 0 and 1), plots n equally spaced points of size on the circumference of a circle, and then, with probability p for each pair of points, draws a gray line connecting them.

Image

1.5.20 Compose code to draw hearts, spades, clubs, and diamonds. To draw a heart, draw a diamond, then attach two semicircles to the upper-left and upper-right sides.

1.5.21 Compose a program that takes a command-line argument n and plots a “flower” with n petals (if n is odd) or 2n petals (if n is even), by plotting the polar coordinates (r, θ) of the function r = sin(n θ) for θ ranging from 0 to 2π radians.

Image

1.5.22 Compose a program that takes a string s from the command line and displays it in banner style on the screen, moving from left to right and wrapping back to the beginning of the string as the end is reached. Add a second command-line argument to control the speed.

1.5.23 Modify playthattune.py to take additional command-line arguments that control the volume (multiply each sample value by the volume) and the tempo (multiply each note’s duration by the tempo).

1.5.24 Compose a program that takes the name of a .wav file and a playback rate r as command-line arguments and plays the file at the given rate. First, use stdaudio.read() to read the file into an array a[]. If r = 1, just play a[]; otherwise, create a new array b[] of approximate size r times a.length. If r < 1, populate b[] by sampling from the original; if r > 1, populate b[] by interpolating from the original. Then play b[].

1.5.25 Compose programs that use stddraw to create each of these designs.

Image

1.5.26 Compose a program circles.py that draws filled circles of random size at random positions in the unit square, producing images like those below. Your program should take four command-line arguments: the number of circles, the probability that each circle is black, the minimum radius, and the maximum radius.

Image

Creative Exercises

1.5.27 Visualizing audio. Modify playthattune.py (PROGRAM 1.5.8) to send the values played to standard drawing, so that you can watch the sound waves as they are played. You will have to experiment with plotting multiple curves in the drawing canvas to synchronize the sound and the picture.

1.5.28 Statistical polling. When collecting statistical data for certain political polls, it is very important to obtain an unbiased sample of registered voters. Assume that you have a file with n registered voters, one per line. Compose a filter that writes a random sample of size m (see sample.py, PROGRAM 1.4.1).

1.5.29 Terrain analysis. Suppose that a terrain is represented by a two-dimensional grid of elevation values (in meters). A peak is a grid point whose four neighboring cells (left, right, up, and down) have strictly lower elevation values. Compose a program peaks.py that reads a terrain from standard input and then computes and writes the number of peaks in the terrain.

1.5.30 Histogram. Suppose that the standard input stream is a sequence of floats. Compose a program that takes an integer n and two floats lo and hi from the command line and uses stddraw to plot a histogram of the count of the numbers in the standard input stream that fall into each of the n intervals defined by dividing the interval (lo, hi) into n equal-sized intervals.

1.5.31 Spirographs. Compose a program that takes three parameters R, r, and a from the command line and draws the resulting spirograph. A spirograph (technically, an epicycloid) is a curve formed by rolling a circle of radius r around a larger fixed circle of radius R. If the pen offset from the center of the rolling circle is (r+a), then the equation of the resulting curve at time t is given by

x(t) = (R + r) cos (t) – (r + a) cos ((R + r)t /r)

y(t) = (R + r) sin (t) – (r + a) sin ((R + r)t /r)

Such curves were popularized by a best-selling toy that contains discs with gear teeth on the edges and small holes that you could put a pen in to trace spirographs.

1.5.32 Clock. Compose a program that displays an animation of the second, minute, and hour hands of an analog clock. Use the call stddraw.show(1000) to update the display roughly once per second.

1.5.33 Oscilloscope. Compose a program to simulate the output of an oscilloscope and produce Lissajous patterns. These patterns are named after the French physicist, Jules A. Lissajous, who studied the patterns that arise when two mutually perpendicular periodic disturbances occur simultaneously. Assume that the inputs are sinusoidal, so that the following parametric equations describe the curve:

x(t) = Ax sin (wx t + θx)

y(t) = A y sin (wy t + θy)

Take the six parameters Ax and Ay (amplitudes); wx and wy (angular velocity); and θx and θy (phase factors) from the command line.

1.5.34 Bouncing ball with tracks. Modify bouncingball.py to produce images like the ones shown in the text, which show the track of the ball on a gray background.

1.5.35 Bouncing ball with gravity. Modify bouncingball.py to incorporate gravity in the vertical direction. Add calls to stdaudio.playFile() to add one sound effect when the ball hits a wall and a different one when it hits the floor.

1.5.36 Random tunes. Compose a program that uses stdaudio to play random tunes. Experiment with keeping in key, assigning high probabilities to whole steps, repetition, and other rules to produce reasonable melodies.

1.5.37 Tile patterns. Using your solution to EXERCISE 1.5.25, compose a program tilepattern.py that takes a command-line argument n and draws an n-by-n pattern, using the tile of your choice. Add a second command-line argument that adds a checkerboard option. Add a third command-line argument for color selection. Using the patterns on the facing page as a starting point, design a tile floor. Be creative! Note: These are all designs from antiquity that you can find in many ancient (and modern) buildings.

Image

1.6 Case Study: Random Web Surfer

Communicating across the web has become an integral part of everyday life. This communication is enabled in part by scientific studies of the structure of the web, a subject of active research since its inception. We next consider a simple model of the web that has proved to be a particularly successful approach to understanding some of its properties. Variants of this model are widely used and have been a key factor in the explosive growth of search applications on the web.

The model, which is known as the random surfer model, is simple to describe. We consider the web to be a fixed set of pages, with each page containing a fixed set of links, and each link a reference to some other page. We study what happens to a person (the random surfer) who randomly moves from page to page, either by typing a page name into the address bar or by clicking a link on the current page.

The underlying mathematical model behind the web model is known as the graph, which we will consider in detail at the end of the book (in SECTION 4.5). We defer discussion of the details of processing graphs until then. Instead, we concentrate on calculations associated with a natural and well-studied probabilistic model that accurately describes the behavior of the random surfer.

The first step in studying the random surfer model is to formulate it more precisely. The crux of the matter is to specify what it means to randomly move from page to page. The following intuitive 90-10 rule captures both methods of moving to a new page: Assume that 90 percent of the time the random surfer clicks a random link on the current page (each link chosen with equal probability) and that 10 percent of the time the random surfer goes directly to a random page (all pages on the web chosen with equal probability).

Image

You can immediately see that this model has flaws, because you know from your own experience that the behavior of a real web surfer is not quite so simple:

• No one chooses links or pages with equal probability.

• There is no real potential to surf directly to each page on the web.

• The 90-10 (or any fixed) breakdown is just a guess.

• It does not take the “back” button or bookmarks into account.

• We can afford to work with only a small sample of the web.

Despite these flaws, the model is sufficiently rich that computer scientists have learned a great deal about properties of the web by studying it. To appreciate the model, consider the small example on the previous page. Which page do you think the random surfer is most likely to visit?

Each person using the web behaves a bit like the random surfer, so understanding the fate of the random surfer is of intense interest to people building web infrastructure and web applications. The model is a tool for understanding the experience of each of the billions of web users. In this section, you will use the basic programming tools from this chapter to study the model and its implications.

Input format

We want to be able to study the behavior of the random surfer on various web models, not just one example. Consequently, we want to compose data-driven code, where we keep data in files and compose programs that read the data from standard input. The first step in this approach is to define an input format that we can use to structure the information in the input files. We are free to define any convenient input format.

Image

Later in the book, you will learn how to read web pages in Python programs (SECTION 3.1) and to convert from names to numbers (SECTION 4.4) as well as other techniques for efficient graph processing. For now, we assume that there are n web pages, numbered from 0 to n – 1, and we represent links with ordered pairs of such numbers, the first specifying the page containing the link and the second specifying the page to which it refers. Given these conventions, a straightforward input format for the random surfer problem is an input stream consisting of an integer (the value of n) followed by a sequence of pairs of integers (the representations of all the links). Because of the way stdio reading functions treat whitespace characters, we are free to either put one link per line or arrange them several to a line.

Transition matrix

We use a two-dimensional matrix, which we refer to as the transition matrix, to completely specify the behavior of the random surfer. With n web pages, we define an n-by-n matrix such that the element in row i and column j is the probability that the random surfer moves to page j when on page i. Our first task is to compose code that can create such a matrix for any given input. When we apply the 90-10 rule, this computation is not difficult. We do so in three steps:

• Read n, and then create arrays linkCounts[][] and outDegrees[].

• Read the links and accumulate counts so that linkCounts[i][j] counts links from i to j and outDegrees[i] counts links from i to anywhere.

• Use the 90-10 rule to compute the probabilities.

The first two steps are elementary, and the third is not much more difficult: multiply linkCounts[i][j] by 0.90/outDegrees[i] if there is a link from i to j (take a random link with probability 0.9), and then add 0.10/n to each element (go to a random page with probability 0.1). PROGRAM 1.6.1 (transition.py) performs this calculation: it is a filter that converts the list-of-links representation of a web model into a transition-matrix representation.

The transition matrix is significant because each row represents a discrete probability distribution—the elements fully specify the behavior of the random surfer’s next move, giving the probability of surfing to each page. Note in particular that the elements sum to 1 (the surfer always goes somewhere).

The output of transition.py defines another file format, one for matrices of float values: the numbers of rows and columns followed by the values for matrix elements. Now, we can compose code that reads and processes transition matrices.

Image

Program 1.6.1 Computing the transition matrix (transition.py)

import stdarray
import stdio

n = stdio.readInt()

linkCounts = stdarray.create2D(n, n, 0)
outDegrees = stdarray.create1D(n, 0)

while not stdio.isEmpty():
    # Accumulate link counts.
    i = stdio.readInt()
    j = stdio.readInt()
    outDegrees[i] += 1
    linkCounts[i][j] += 1

stdio.writeln(str(n) + ' ' + str(n))

for i in range(n):
    # Write probability distribution for row i.
    for j in range(n):
        # Write probability for column j.
        p = (0.90 * linkCounts[i][j] / outDegrees[i]) + (0.10 / n)
        stdio.writef('%8.5f', p)
    stdio.writeln()


        n        | # pages
linkCounts[i][j] | # links from page i to page j
  outDegrees[i]  | # links out of page i
        p        | transition probability

This program is a filter that reads links from standard input and writes the corresponding transition matrix to standard output. First, it processes the input to count the links from each page. Then it applies the 90-10 rule to compute the transition matrix (see text). It assumes that there are no pages in the input that have zero outdegrees (see EXERCISE 1.6.3).


% more tiny.txt
5
0 1
1 2  1 2
1 3  1 3 1 4
2 3
3 0
4 0  4 2


% python transition.py < tiny.txt
5 5
 0.02000 0.92000 0.02000 0.02000 0.02000
 0.02000 0.02000 0.38000 0.38000 0.20000
 0.02000 0.02000 0.02000 0.92000 0.02000
 0.92000 0.02000 0.02000 0.02000 0.02000
 0.47000 0.02000 0.47000 0.02000 0.02000


Simulation

Given the transition matrix, simulating the behavior of the random surfer involves surprisingly little code, as you can see in randomsurfer.py (PROGRAM 1.6.2). This program reads a transition matrix and surfs according to the rules, starting at page 0 and taking the number of moves as a command-line argument. It counts the number of times that the surfer visits each page. Dividing that count by the number of moves yields an estimate of the probability that a random surfer winds up on the page. This probability is known as the page’s rank. In other words, randomsurfer.py computes an estimate of all page ranks.

One random move

The key to the computation is the random move, which is specified by the transition matrix. We maintain a variable page whose value is the current location of the surfer. Row page of the matrix p[][] gives, for each j, the probability that the surfer next goes to j. In other words, when the surfer is at page, our task is to generate a random integer between 0 and n-1 according to the distribution given by row page in the transition matrix (the one-dimensional array p[page]). How can we accomplish this task? We can use random.random() to generate a random float r between 0 and 1, but how does that help us get to a random page? One way to answer this question is to think of the probabilities in row page as defining a set of n intervals in (0, 1), with each probability corresponding to an interval length. Then our random variable r falls into one of the intervals, with probability precisely specified by the interval length. This reasoning leads to the following code:

total = 0.0
for j in range(0, n)
    total += p[page][j]
    if r < total:
        page = j
        break

Image

The variable total tracks the endpoints of the intervals defined in row p[page], and the for loop finds the interval containing the random value r. For example, suppose that the surfer is at page 4 in our example. The transition probabilities are 0.47, 0.02, 0.47, 0.02, and 0.02, and total takes on the values 0.0, 0.47, 0.49, 0.96, 0.98, and 1.0. These values indicate that the probabilities define the five intervals (0, 0.47), (0.47, 0.49), (0.49, 0.96), (0.96, 0.98), and (0.98, 1), one for each page. Now, suppose that random.random() returns the value 0.71. We increment j from 0 to 1 to 2 and stop there, which indicates that 0.71 is in the interval (0.49, 0.96), so we send the surfer to the third page (page 2). Then, we perform the same computation for p[2], and the random surfer is off and surfing. For large n, we can use binary search to substantially speed up this computation (see EXERCISE 4.2.35). Typically, we are interested in speeding up the search in this situation because we are likely to need a huge number of random moves, as you will see.


Program 1.6.2 Simulating a random surfer (randomsurfer.py)

import random
import sys
import stdarray
import stdio

moves = int(sys.argv[1])

n = stdio.readInt()
stdio.readInt()  # Not needed (another n).
p = stdarray.create2D(n, n, 0.0)
for i in range(n):
    for j in range(n):
        p[i][j] = stdio.readFloat()

hits = stdarray.create1D(n, 0)
page = 0 # Start at page 0.
for i in range(moves):
    r = random.random()     # Compute a random page
    total = 0.0             #   according to distribution
    for j in range(0, n):   #   in row p[page] (see text).
        total += p[page][j] #
        if r < total:       #
            page = j        #
            break           #
    hits[page] += 1

for v in hits:
    stdio.writef('%8.5f', 1.0 * v / moves)
stdio.writeln()


 moves  | # moves
  n     | # pages
 page   | current page
p[i][j] | probability that the surfer moves from page i to page j
hits[i] | # times the surfer hits page i

This program uses a transition matrix to simulate the behavior of a random surfer. It accepts the number of moves as a command-line argument, reads the transition matrix, performs the indicated number of moves as prescribed by the matrix, and writes the relative frequency of hitting each page. The key to the computation is the random move to the next page (see text).


% python transition.py < tiny.txt | python randomsurfer.py 100
 0.24000 0.23000 0.16000 0.25000 0.12000
% python transition.py < tiny.txt | python randomsurfer.py 10000
 0.27280 0.26530 0.14820 0.24830 0.06540
% python transition.py < tiny.txt | python randomsurfer.py 1000000
 0.27324 0.26568 0.14581 0.24737 0.06790


Markov chains

The random process that describes the surfer’s behavior is known as a Markov chain, named after the Russian mathematician Andrey Markov, who developed the concept in the early 20th century. Markov chains are widely applicable, are well studied, and have many remarkable and useful properties. For example, you might have wondered why randomsurfer.py starts the random surfer at page 0, whereas you might have expected a random choice. A basic limit theorem for Markov chains says that the surfer could start anywhere, because the probability that a random surfer eventually winds up on any particular page is the same for all starting pages! No matter where the surfer starts, the process eventually stabilizes to a point where further surfing provides no further information. This phenomenon is known as mixing. Although this phenomenon might seem counterintuitive at first, it explains coherent behavior in a situation that might seem chaotic. In the present context, it captures the idea that the web looks pretty much the same to everyone after surfing for a sufficiently long time.

Not all Markov chains have this mixing property. For example, if we eliminate the random leap from our model, certain configurations of web pages can present problems for the surfer. Indeed, there exist on the web sets of pages known as spider traps, which are designed to attract incoming links but have no outgoing links. Without the random leap, the surfer could get stuck in a spider trap. The primary purpose of the 90-10 rule is to guarantee mixing and eliminate such anomalies.

Page ranks

The randomsurfer.py simulation is straightforward: it loops for the indicated number of moves, randomly surfing through the graph. Because of the mixing phenomenon, increasing the number of iterations gives increasingly accurate estimates of the probability that the surfer lands on each page (the page ranks). How do the results compare with your intuition when you first thought about the question? You might have guessed that page 4 was the lowest-ranked page, but did you think that pages 0 and 1 would rank higher than page 3? If we want to know which page is the highest rank, we need more precision and more accuracy. The randomsurfer.py program needs 10d moves to get answers precise to d decimal places and many more moves for those answers to stabilize to an accurate value. For our example, it takes tens of thousands of iterations to get answers accurate to two decimal places and millions of iterations to get answers accurate to three places (see EXERCISE 1.6.5). The end result is that page 0 beats page 1 by 27.3 percent to 26.6 percent. That such a tiny difference would appear in such a small problem is quite surprising: if you guessed that page 0 is the most likely spot for the surfer to end up, you were lucky!

Accurate page rank estimates for the web are valuable in practice for many reasons. First, using them to put in order the pages that match the search criteria for web searches proved to be vastly more in line with people’s expectations than previous methods. Next, this measure of confidence and reliability led to the investment of huge amounts of money in web advertising based on page ranks. Even in our tiny example, page ranks might be used to convince advertisers to pay up to four times as much to place an ad on page 0 as on page 4. Computing page ranks is mathematically sound, an interesting computer science problem, and big business, all rolled into one.

Image
Visualizing the histogram

With stddraw, it is easy to create a visual representation that can give you a feeling for how the random surfer visit frequencies converge to the page ranks. If you scale the x- and y-coordinates in the standard drawing window appropriately and add this code

if i % 1000 == 0:
    stddraw.clear()
    for k in range(n):
        stddraw.filledRectangle(k - 0.25, 0.0, 0.5, hits[k])
stddraw.show(10)

to the random move loop in randomsurfer.py and run the code for, say, millions of moves, you will see a drawing of the frequency histogram that eventually stabilizes to the page ranks. (The constants 1000 and 10 in this code are a bit arbitrary; you might wish to change them when you run the code.) After you have used this tool once, you are likely to find yourself using it every time you want to study a new model (perhaps with some minor adjustments to handle larger models).

Studying other models

Our programs randomsurfer.py and transition.py are excellent examples of data-driven programs. You can easily create a data model just by creating a file like tiny.txt that starts with an integer n and then specifies pairs of integers between 0 and n-1 that represent links connecting pages. You are encouraged to run it for various data models as suggested in the exercises, or make up some models of your own to study. If you have ever wondered how web page ranking works, this calculation is your chance to develop better intuition about what causes one page to be ranked more highly than another. Which kind of page is likely to be rated highly? One that has many links to other pages, or one that has just a few links to other pages? The exercises in this section present many opportunities to study the behavior of the random surfer. Since randomsurfer.py uses standard input, you can compose simple programs that generate large input models, pipe their output to randomsurfer.py, and thereby study the random surfer on large models. Such flexibility is an important reason to use standard input and standard output.

Directly simulating the behavior of a random surfer to understand the structure of the web is appealing, but it has limitations. Think about the following question: Could you use it to compute page ranks for a web model with millions (or billions!) of web pages and links? The quick answer to this question is no, because you cannot even afford to store the transition matrix for such a large number of pages. A matrix for millions of pages would have trillions of elements. Do you have that much space on your computer? Could you use randomsurfer.py to find page ranks for a smaller model with, say, thousands of pages? To answer this question, you might run multiple simulations, record the results for a large number of trials, and then interpret those experimental results. We do use this approach for many scientific problems (the gambler’s ruin problem is one example; SECTION 2.4 is devoted to another), but it can be very time-consuming, as a huge number of trials may be necessary to get the desired accuracy. Even for our tiny example, we saw that it takes millions of iterations to get the page ranks accurate to three or four decimal places. For larger models, the required number of iterations to obtain accurate estimates becomes truly huge.

Mixing a Markov chain

It is important to remember that the page ranks are a property of the web model, not any particular approach for computing it. That is, randomsurfer.py is just one way to compute page ranks. Fortunately, a simple computational model based on a well-studied area of mathematics provides a far more efficient approach than simulation to the problem of computing page ranks. That model makes use of the basic arithmetic operations on two-dimensional matrices that we considered in SECTION 1.4.

Squaring a Markov chain

What is the probability that the random surfer will move from page i to page j in two moves? The first move goes to an intermediate page k, so we calculate the probability of moving from i to k and then from k to j for all possible k and add up the results. For our example, the probability of moving from 1 to 2 in two moves is the probability of moving from 1 to 0 to 2 (0.02 × 0.02), plus the probability of moving from 1 to 1 to 2 (0.02 × 0.38), plus the probability of moving from 1 to 2 to 2 (0.38 × 0.02), plus the probability of moving from 1 to 3 to 2 (0.38 × 0.02), plus the probability of moving from 1 to 4 to 2 (0.20 × 0.47), which adds up to a grand total of 0.1172. The same process works for each pair of pages. This calculation is one that we have seen before, in the definition of matrix multiplication: the element in row i and column j in the result is the dot product of row i and column j in the original. In other words, the result of multiplying p[][] by itself is a matrix where the element in row i and column j is the probability that the random surfer moves from page i to page j in two moves. Studying the elements of the two-move transition matrix for our example is well worth your time and will help you better understand the movement of the random surfer. For instance, the largest element in the square is the one in row 2 and column 0, reflecting the fact that a surfer starting on page 2 has only one link out, to page 3, where there is also only one link out, to page 0. Therefore, by far the most likely outcome for a surfer starting on page 2 is to end up in page 0 after two moves. All of the other two-move routes involve more choices and are less probable. It is important to note that this is an exact computation (up to the limitations of Python’s floating-point precision), in contrast to randomsurfer.py, which produces an estimate and needs more iterations to get a more accurate estimate.

Image
The power method

We might then calculate the probabilities for three moves by multiplying by p[][] again, and for four moves by multiplying by p[][] yet again, and so forth. However, matrix–matrix multiplication is expensive, and we are actually interested in a vector–matrix calculation. For our example, we start with the vector

[1.0 0.0 0.0 0.0 0.0]

which specifies that the random surfer starts on page 0. Multiplying this vector by the transition matrix gives the vector

[.02 .92 .02 .02 .02]

which is the probabilities that the surfer winds up on each of the pages after one step. Now, multiplying this vector by the transition matrix gives the vector

[.05 .04 .36 .37 .19]

which contains the probabilities that the surfer winds up on each of the pages after two steps. For example, the probability of moving from 0 to 2 in two moves is the probability of moving from 0 to 0 to 2 (0.02 × 0.02), plus the probability of moving from 0 to 1 to 2 (0.92 × 0.38), plus the probability of moving from 0 to 2 to 2 (0.02 × 0.02), plus the probability of moving from 0 to 3 to 2 (0.02 × 0.02), plus the probability of moving from 0 to 4 to 2 (0.02 × 0.47), which adds up to a grand total of 0.36. From these initial calculations, the pattern is clear: the vector giving the probabilities that the random surfer is at each page after t steps is precisely the product of the corresponding vector for t – 1 steps and the transition matrix. By the basic limit theorem for Markov chains, this process converges to the same vector no matter where we start; in other words, after a sufficient number of moves, the probability that the surfer ends up on any given page is independent of the starting point.

PROGRAM 1.6.3 (markov.py) is an implementation that you can use to check convergence for our example. For instance, it gets the same results (the page ranks accurate to two decimal places) as randomsurfer.py, but with just 20 matrix–vector multiplications instead of the tens of thousands of iterations needed by randomsurfer.py. Another 20 multiplications gives the results accurate to three decimal places, as compared with millions of iterations for randomsurfer.py, and just a few more multiplications give the results to full precision (see EXERCISE 1.6.6).

Image

Program 1.6.3 Mixing a Markov chain (markov.py)

import sys
import stdarray
import stdio

moves = int(sys.argv[1])
n = stdio.readInt()
stdio.readInt()

p = stdarray.create2D(n, n, 0.0)
for i in range(n):
    for j in range(n):
        p[i][j] = stdio.readFloat()

ranks = stdarray.create1D(n, 0.0)
ranks[0] = 1.0
for i in range(moves):
    newRanks = stdarray.create1D(n, 0.0)
    for j in range(n):
        for k in range(n):
            newRanks[j] += ranks[k] * p[k][j]
    ranks = newRanks

for i in range(n):
    stdio.writef('%8.5f', ranks[i])
stdio.writeln()


  moves    | number of iterations
    n      | number of pages
  p[][]    | transition matrix
 ranks[]   | page ranks
newRanks[] | new page ranks

This program takes a command-line integer moves, reads a transition matrix from standard input, computes the probabilities that a random surfer lands on each page (page ranks) after moves matrix–vector multiplications, and writes the page ranks to standard output.


% python transition.py < tiny.txt | python markov.py 20
 0.27245 0.26515 0.14669 0.24764 0.06806

% python transition.py < tiny.txt | python markov.py 40
 0.27303 0.26573 0.14618 0.24723 0.06783


Image

Markov chains are well studied, but their impact on the web was not truly felt until 1998, when two graduate students, Sergey Brin and Lawrence Page, had the audacity to build a Markov chain and compute the probabilities that a random surfer hits each page for the whole web. Their work revolutionized web search and is the basis for the page ranking method used by Google, the highly successful web search company that they founded. Specifically, the company periodically recomputes the random surfer’s probability for each page. Then, when you do a search, it lists the pages related to your search keywords in order of these ranks. Such page ranks now predominate because they somehow correspond to the expectations of typical web users, reliably providing them with relevant web pages for typical searches. The computation that is involved is enormously time-consuming, due to the huge number of pages on the web, but the result has turned out to be enormously profitable and well worth the expense. The method used in markov.py is far more efficient than simulating the behavior of a random surfer, but it is still too slow to actually compute the probabilities for a huge matrix corresponding to all the pages on the web. That computation is enabled by better data structures for graphs (see CHAPTER 4).

Lessons

Developing a full understanding of the random surfer model is beyond the scope of this book. Instead, our purpose is to show you an application that involves composing a bit more code than the short programs that we have been using to teach specific concepts. Which specific lessons can we learn from this case study?

We already have a full computational model

Built-in data types, combined with conditionals and loops, arrays, and standard input/output, enable you to address interesting problems of all sorts. Indeed, it is a basic precept of theoretical computer science that this model suffices to specify any computation that can be performed on any reasonable computing device. In the next two chapters, we discuss two critical ways in which the model has been extended to drastically reduce the amount of time and effort required to develop large and complex programs.

Data-driven code is prevalent

The concept of using standard input and output streams and saving data in files is a powerful one. We compose filters to convert from one kind of input to another, generators that can produce huge input files for study, and programs that can handle a wide variety of models. We can save data for archiving or later use. We can also process data derived from some other source and then save it in a file, whether it is from a scientific instrument or a distant website. The concept of data-driven code is an easy and flexible way to support this suite of activities.

Accuracy can be elusive

It is a mistake to assume that a program produces accurate answers simply because it can write numbers to many decimal places of precision. Often, the most difficult challenge that we face is ensuring that we have accurate answers.

Uniform random numbers are just a start

When we speak informally about random behavior, we often are thinking of something more complicated than the “every value equally likely” model that random.random() gives us. Many of the problems that we consider involve working with random numbers from other distributions, such as randomsurfer.py.

Efficiency matters

It is also a mistake to assume that your computer is so fast that it can do any computation. Some problems require much more computational effort than others. CHAPTER 4 is devoted to a thorough discussion of evaluating the performance of the programs that you compose. We defer detailed consideration of such issues until then, but remember that you always need to have some general idea of the performance requirements of your programs.

Perhaps the most important lesson to learn from composing programs for complicated problems like the example in this section is that debugging is difficult. The polished programs in the book mask that lesson, but you can rest assured that each one is the product of a long bout of testing, fixing bugs, and running the program on numerous inputs. Generally we avoid describing bugs and the process of fixing them in the text because that makes for a boring account and overly focuses attention on bad code, but you can find some examples and descriptions in the exercises and on the booksite.

Exercises

1.6.1 Modify transition.py to take the leap probability from the command line and use your modified version to examine the effect on page ranks of switching to an 80-20 rule or a 95-5 rule.

1.6.2 Modify transition.py to ignore the effect of multiple links. That is, if there are multiple links from one page to another, count them as one link. Create an example that shows how this modification can change the order of page ranks.

1.6.3 Modify transition.py to handle pages with no outgoing links, by filling rows corresponding to such pages with the value 1/n.

1.6.4 The code fragment in randomsurfer.py that generates the random move fails if the probabilities in the row p[page] do not add up to 1. Explain what happens in that case, and suggest a way to fix the problem.

1.6.5 Determine, to within a factor of 10, the number of iterations required by randomsurfer.py to compute page ranks to four decimal places and to five decimal places for tiny.txt.

1.6.6 Determine the number of iterations required by markov.py to compute page ranks to three decimal places, to four decimal places, and to ten decimal places for tiny.txt.

1.6.7 Download the file medium.txt from the booksite (which reflects the 50-page example depicted in this section) and add to it links from page 23 to every other page. Observe the effect on the page ranks, and discuss the result.

1.6.8 Add to medium.txt (see the previous exercise) links to page 23 from every other page, observe the effect on the page ranks, and discuss the result.

1.6.9 Suppose that your page is page 23 in medium.txt. Is there a link that you could add from your page to some other page that would raise the rank of your page?

1.6.10 Suppose that your page is page 23 in medium.txt. Is there a link that you could add from your page to some other page that would lower the rank of that page?

1.6.11 Use transition.py and randomsurfer.py to determine the transition probabilities for the eight-page example shown below.

1.6.12 Use transition.py and markov.py to determine the transition probabilities for the eight-page example shown below.

Image

Creative Exercises

1.6.13 Matrix squaring. Compose a program like markov.py that computes page ranks by repeatedly squaring the matrix, thus computing the sequence p, p2, p4, p8, p16, and so forth. Verify that all of the rows in the matrix converge to the same values.

1.6.14 Random web. Compose a generator for transition.py that takes as input a page count n and a link count m and writes to standard output n followed by m random pairs of integers between 0 and n-1. (See SECTION 4.5 for a discussion of more realistic web models.)

1.6.15 Hubs and authorities. Add to your generator from the previous exercise a fixed number of hubs, which have links pointing to them from 10 percent of the pages, chosen at random, and authorities, which have links pointing from them to 10 percent of the pages. Compute page ranks. Which rank higher, hubs or authorities?

1.6.16 Page ranks. Design an array of pages and links where the highest-ranking page has fewer links pointing to it than some other page.

1.6.17 Hitting time. The hitting time for a page is the expected number of moves between times the random surfer visits the page. Run experiments to estimate page hitting times for tiny.txt, compare the results with page ranks, formulate a hypothesis about the relationship, and test your hypothesis on medium.txt.

1.6.18 Cover time. Compose a program that estimates the time required for the random surfer to visit every page at least once, starting from a random page.

1.6.19 Graphical simulation. Create a graphical simulation where the size of the dot representing each page is proportional to its rank. To make your program data-driven, design a file format that includes coordinates specifying where each page should be drawn. Test your program on medium.txt.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset