4.5 Case Study: Logging File Access
Solutions to Practice Problems
IN THIS CHAPTER, we focus on the Python tools and problem-solving patterns for processing text and files.
We take a running start by continuing the discussion of the string class we began in Chapter 2. We discuss, in particular, the extensive set of string methods that give Python powerful text-processing capabilities. We then go over the text-processing tools Python provides to control the format of output text.
After having mastered text processing, we cover files and file input/output (I/O) (i.e., how to read from and write to files from within a Python program).
Much of today's computing involves the processing of text content stored in files. We define several patterns for reading files that prepare the file content for processing.
Working with data coming interactively from the user or from a file introduces a source of errors for our program that we cannot really control. We go over the common errors that can occur. Finally, in this chapter's case study, we showcase the text-processing and I/O concepts introduced in the chapter in the context of an application that logs accesses to files.
In Chapter 2 we introduced the string class str. Our goal then was to show that Python supported values other than numbers. We showed how string operators make it possible to write string expressions and process strings in a way that is as familiar as writing algebraic expressions. We also used strings to introduce the indexing operator [].
In this section we cover strings and what can be done with them in more depth. We show, in particular, a more general version of the indexing operator and many of the commonly used string methods that make Python a strong text-processing tool.
We already know that a string value is represented as a sequence of characters that is enclosed within quotes, whether single or double quotes:
>>> "Hello, World!"
'Hello, World!'
>>> 'hello'
'hello'
Forgetting Quote Delimiters
A common mistake when writing a string value is to forget the quotes. If the quotes are omitted, the text will be treated as a name (e.g., a variable name) not a string value. Since, typically, there will be no value assigned to the variable, an error will result. Here is an example:
>>> hello
Traceback (most recent call last):
  File "<pyshell#35>", line 1, in <module>
    hello
NameError: name 'hello' is not defined
The error message reported that name hello is not defined. In other words, the expression hello was treated as a variable, and the error was the result of trying to evaluate it.
If quotes delimit a string value, how do we construct strings that contain quotes? If the text contains a single quote, we can use double quote delimiters, and vice versa:
>>> excuse = 'I am "sick"'
>>> fact = "I'm sick"
If the text contains both types of quotes, then the escape sequence \' or \" is used to indicate that a quote is not the string delimiter but is part of the string value. So, if we want to create the string value
I'm "sick".
we would write:
>>> excuse = 'I\'m "sick"'
Let's check whether this worked:
>>> excuse
'I\'m "sick"'
Well, this doesn't seem to work. We would like to see: I'm "sick". Instead we still see the escape sequence \'. To have Python print the string nicely, with the escape sequence \' properly interpreted as an apostrophe, we need to use the print() function. The print() function takes as input an expression and prints it on the screen; in the case of a string expression, the print() function will interpret any escape sequence in the string and omit the string delimiters:
>>> print(excuse)
I'm "sick"
In general, an escape sequence in a string is a sequence of characters starting with a \ that defines a special character and that is interpreted by function print().
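The difference between the echoed form of a string and its printed form can be checked directly. The following is a minimal sketch; it relies on the built-in function repr(), which is not introduced in this section, to obtain the echoed form as a string.

```python
# The same string value, written with an escaped apostrophe.
excuse = 'I\'m "sick"'

# repr() returns the form the interactive shell echoes:
# delimiters included, apostrophe shown as an escape sequence.
echoed = repr(excuse)
print(echoed)

# print() displays the string value itself, without delimiters.
print(excuse)

# The backslash is not stored in the string value:
# I ' m (space) " s i c k " is 10 characters.
print(len(excuse))
```

The length of 10 confirms that the escape sequence contributes a single character, the apostrophe, to the string value.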
String values defined with the single- or double-quote delimiters must be defined in a single line. If the string is to represent multiline text, we have two choices. One is to use triple quotes, as we do in this poem by Emily Dickinson:
>>> poem = '''
To make a prairie it takes a clover and one bee, -
One clover, and a bee,
And revery.
The revery alone will do
If bees are few.
'''
Let's see what the variable poem evaluates to:
>>> poem
'\nTo make a prairie it takes a clover and one bee, -\nOne clover, and a bee,\nAnd revery.\nThe revery alone will do\nIf bees are few.\n'
We have here another example of a string containing an escape sequence. The escape sequence \n stands in for a new line character. When it appears in a string argument of the print() function, the new line escape sequence \n starts a new line:
>>> print(poem)

To make a prairie it takes a clover and one bee, -
One clover, and a bee,
And revery.
The revery alone will do
If bees are few.
Another way to create a multiline string is to encode the new line characters explicitly:
>>> poem = '\nTo make a prairie it takes a clover and one bee, -\nOne clover, and a bee,\nAnd revery.\nThe revery alone will do\nIf bees are few.\n'
In Chapter 2, we introduced the indexing operator []:
>>> s = 'hello'
>>> s[0]
'h'
The indexing operator takes an index i and returns the single-character string consisting of the character at index i.
The indexing operator can also be used to obtain a slice of a string. For example:
>>> s[0:2]
'he'
The expression s[0:2] evaluates to the slice of string s starting at index 0 and ending before index 2. In general, s[i:j] is the substring of string s that starts at index i and ends at index j-1. Here are more examples, also illustrated in Figure 4.1:
>>> s[3:4]
'l'
>>> s[-3:-1]
'll'
The last example shows how to get a slice using negative indexes: The substring obtained starts at index -3 and ends before index -1 (i.e., at index -2). If the slice we want starts at the first character of a string, we can drop the first index:
>>> s[:2]
'he'
In order to obtain a slice that ends at the last character of a string, we must drop the second index:
>>> s[-3:]
'llo'
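The slicing rules just described can be verified with a few checks. This is only a sketch: it uses the assert statement, which is not covered in this section, to state what each expression should evaluate to.

```python
s = 'hello'

# s[i:j] starts at index i and stops just before index j,
# so it holds j - i characters when both indexes are in range.
assert s[1:4] == 'ell'
assert len(s[1:4]) == 4 - 1

# Negative indexes count from the end: -1 is the last character.
assert s[-3:-1] == s[2:4] == 'll'

# Dropping an index extends the slice to that end of the string,
# so the two halves below always rebuild the whole string.
assert s[:2] + s[2:] == s

print(s[1:4], s[-3:-1], s[:2], s[2:])
```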
Start by executing the assignment:
s = '0123456789'
Now write expressions using string s and the indexing operator that evaluate to:
Slicing Lists
The indexing operator is one of many operators that are shared between the string and the list classes. The indexing operator can also be used to obtain a slice of a list. For example, if pets is defined as
>>> pets = ['goldfish', 'cat', 'dog']
we can get slices of pets with the indexing operator:
>>> pets[:2]
['goldfish', 'cat']
>>> pets[-3:-1]
['goldfish', 'cat']
>>> pets[1:]
['cat', 'dog']
A slice of a list is a list. In other words, when the indexing operator is applied to a list with two arguments, it will return a list. Note that this is unlike the case when the indexing operator is applied to a list with only one argument, say an index i; in that case, the item of the list at index i is returned.
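The contrast between indexing and slicing a list is easy to see when the slice contains a single item. A brief sketch, using the built-in type() function to name the class of each result:

```python
pets = ['goldfish', 'cat', 'dog']

item = pets[1]     # one index: the item itself, here a string
piece = pets[1:2]  # two indexes: a list, even though it holds one item

print(type(item).__name__, item)    # str cat
print(type(piece).__name__, piece)  # list ['cat']
```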
The string class supports a large number of methods. These methods provide the developer with a text-processing toolkit that simplifies the development of text-processing applications. Here we cover some of the more commonly used methods.
We start with the string method find(). When it is invoked on string s with one string input argument target, it checks whether target is a substring of s. If so, it returns the index (of the first character) of the first occurrence of string target; otherwise, it returns -1. For example, here is how method find() is invoked on string message using target string ‘top secret’:
>>> message = '''This message is top secret and should not be divulged to anyone without top secret clearance'''
>>> message.find('top secret')
16
Index 16 is output by method find() since string ‘top secret’ appears in string message starting at index 16.
The method count(), when called by string s with string input argument target, returns the number of times target appears as a substring of s. For example:
>>> message.count('top secret')
2
The value 2 is returned because string ‘top secret’ appears twice in message.
The method replace(), when invoked on string s, takes two string inputs, old and new, and outputs a copy of string s with every occurrence of substring old replaced by string new. For example:
>>> message.replace('top', 'no')
'This message is no secret and should not be divulged to anyone without no secret clearance'
Has this changed the string message? Let's check:
>>> print(message)
This message is top secret and should not be divulged to anyone without top secret clearance
So string message was not changed by the replace() method. Instead, a copy of message, with appropriate substring replacements, got returned. This string cannot be used later on because we have not assigned it a variable name. Typically, the replace() method would be used in an assignment statement like this:
>>> public = message.replace('top', 'no')
>>> print(public)
This message is no secret and should not be divulged to anyone without no secret clearance
Recall that strings are immutable (i.e., they cannot be modified). This is the reason why string method replace() returns a (modified) copy of the string invoking the method rather than changing the string. In the next example, we showcase a few other methods that return a modified copy of the string:
>>> message = 'top secret'
>>> message.capitalize()
'Top secret'
>>> message.upper()
'TOP SECRET'
Method capitalize(), when called by string s, makes the first character of s uppercase; method upper() makes all the characters uppercase.
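The following sketch pulls together the points about immutability: a string method returns a new string, the original is left alone, and an attempt to change a character in place fails. The try/except construct used to catch the error is an assumption of this example; it is not covered in this section.

```python
message = 'top secret'

# Each method call builds and returns a new string object.
shout = message.upper()
print(shout)    # TOP SECRET
print(message)  # top secret (unchanged)

# Trying to change a character in place raises an error.
try:
    message[0] = 'T'
except TypeError as err:
    print('TypeError:', err)

# To "update" a variable, assign the returned copy back to it.
message = message.capitalize()
print(message)  # Top secret
```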
The very useful string method split() can be called on a string in order to obtain a list of words in the string:
>>> 'this is the text'.split()
['this', 'is', 'the', 'text']
In this statement, the method split() uses the blank spaces in string ‘this is the text’ to create word substrings that are put into a list and returned. The method split() can also be called with a delimiter string as input: The delimiter string is used in place of the blank space to break up the string. For example, to break up the string
>>> x = '2;3;5;7;11;13'
into a list of numbers (each represented as a string), you would use ';' as the delimiter:
>>> x.split(';')
['2', '3', '5', '7', '11', '13']
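Because split() always returns a list of strings, one more step is needed if actual numbers are wanted. The sketch below converts each piece with int(); the list-building syntax on the third line is just one possible way to do this and is an assumption of the example.

```python
x = '2;3;5;7;11;13'

pieces = x.split(';')               # a list of strings
numbers = [int(p) for p in pieces]  # a list of integers

print(pieces)
print(numbers)
print(sum(numbers))  # 41

# With no argument, split() treats any run of blank characters
# (spaces, tabs, new lines) as a single separator.
words = 'this   is\tthe text'.split()
print(words)
```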
Finally, another useful string method is translate(). It is used to replace certain characters in a string with others based on a mapping of characters to characters. Such a mapping is constructed using a special type of string method that is called not by a string object but by the string class str itself:
>>> table = str.maketrans('abcdef', 'uvwxyz')
The variable table refers to a “mapping” of characters a, b, c, d, e, f to characters u, v, w, x, y, z, respectively. We discuss this mapping more thoroughly in Chapter 6. For our purposes here, it is enough to understand its use as an argument to the method translate():
>>> 'fad'.translate(table)
'zux'
>>> 'desktop'.translate(table)
'xysktop'
The string returned by translate() is obtained by replacing characters according to the mapping described by table. In the last example, d and e are replaced by x and y, but the other characters remain the same because mapping table does not include them.
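One plausible use of translate() in text processing is to turn punctuation into blank spaces before splitting text into words. The sketch below assumes a small, hand-picked set of punctuation characters and a made-up sample sentence.

```python
# Map each punctuation character below to a blank space.
punctuation = '.,;:!?'
table = str.maketrans(punctuation, ' ' * len(punctuation))

text = 'Stop! Who goes there? Friend, or foe.'

# Replace punctuation, convert to lowercase, then split into words.
words = text.translate(table).lower().split()
print(words)
```

Both arguments of str.maketrans() must have the same length, which is why the second argument is built by repeating the blank space.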
A partial list of string methods is shown in Table 4.1. Many more are available, and to view them all, use the help() tool:
>>> help(str)
…
Usage | Returned Value |
s.capitalize() | A copy of string s with the first character capitalized if it is a letter in the alphabet |
s.count(target) | The number of occurrences of substring target in string s |
s.find(target) | The index of the first occurrence of substring target in string s, or -1 if target is not a substring of s |
s.lower() | A copy of string s converted to lowercase |
s.replace(old, new) | A copy of string s in which every occurrence of substring old, when string s is scanned from left to right, is replaced by substring new |
s.translate(table) | A copy of string s in which characters have been replaced using the mapping described by table |
s.split(sep) | A list of substrings of string s, obtained using delimiter string sep; the default delimiter is the blank space |
s.strip() | A copy of string s with leading and trailing blank spaces removed |
s.upper() | A copy of string s converted to uppercase |
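Several of the methods in the table are often chained when cleaning up a line of text. A short sketch, using a made-up line with surrounding blanks and a trailing new line character:

```python
line = '   Top Secret: do not share   \n'

# strip() removes leading and trailing blank characters,
# including the new line character at the end.
clean = line.strip()
print(repr(clean))

lowered = clean.lower()
print(lowered.find('secret'))  # 4
print(clean.find('public'))    # -1, since 'public' is not a substring
print(lowered.count('o'))      # 3
```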
Assuming that variable forecast has been assigned string
'It will be a sunny day today'
write Python statements corresponding to these assignments:
The results of running a program are typically shown on the screen or written to a file. Either way, the results should be presented in a way that is visually effective. The Python output formatting tools help achieve that. In this section we learn how to format output using features of the print() function and the string format() method. The techniques we learn here will transfer to formatting output to files, which we discuss in the next section.
The print() function is used to print values onto the screen. Its input is an object and it prints a string representation of the object's value. (We explain where this string representation comes from in Chapter 8.)
>>> n = 5
>>> print(n)
5
Function print() can take an arbitrary number of input objects, not necessarily of the same type. The values of the objects will be printed in the same line, and blank spaces (i.e., characters ‘ ’) will be inserted between them:
>>> r = 5/3
>>> print(n, r)
5 1.66666666667
>>> name = 'Ida'
>>> print(n, r, name)
5 1.66666666667 Ida
The blank space inserted between the values is just the default separator. If we want to insert semicolons between values instead of blank spaces, we can do that too. The print() function takes an optional separation argument sep, in addition to the objects to be printed:
>>> print(n, r, name, sep=';')
5;1.66666666667;Ida
The argument sep=‘;’ specifies that semicolons should be inserted to separate the printed values of n, r, and name.
In general, when the argument sep=<some string> is added to the arguments of the print() function, the string <some string> will be inserted between the values. Here are some common uses of the separator. If we want to print each value separated by the string ‘, ’ (comma and blank space) we would write:
>>> print(n, r, name, sep=', ')
5, 1.66666666667, Ida
If we want to print the values in separate lines, the separator should be the new line character, '\n':
>>> print(n, r, name, sep='\n')
5
1.66666666667
Ida
Write a statement that prints the values of variables last, first, and middle in one line, separated by a horizontal tab character. (The Python escape sequence for the horizontal tab character is \t.) If the variables are assigned like this:
>>> last = 'Smith'
>>> first = 'John'
>>> middle = 'Paul'
the output should be:
Smith John Paul
The print() function supports another formatting argument, end, in addition to sep. Normally, each successive print() function call will print in a separate line:
>>> for name in ['Joe', 'Sam', 'Tim', 'Ann']:
        print(name)

Joe
Sam
Tim
Ann
The reason for this behavior is that, by default, the print() function appends a new line character (\n) to the arguments to be printed. Suppose that the output we really want is:
Joe! Sam! Tim! Ann!
(We just saw our good friends, and we are in an exclamatory kind of mood.) When the argument end=<some string> is added to the arguments to be printed, the string <some string> is printed after all the arguments have been printed. If the argument end=<some string> is missing, then the default string '\n', the new line character, is printed instead; this causes the current line to end. So, to get the screen output in the format we want, we need to add the argument end='! ' to our print() function call:
>>> for name in ['Joe', 'Sam', 'Tim', 'Ann']:
        print(name, end='! ')

Joe! Sam! Tim! Ann!
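To see exactly which characters sep and end contribute, it can help to capture the printed text in a string instead of looking at the screen. The sketch below does this with io.StringIO and the file argument of print(); both are assumptions of this example and are not covered in this section.

```python
import io

names = ['Joe', 'Sam', 'Tim', 'Ann']

# Collect the printed text in memory so it can be inspected afterwards.
buffer = io.StringIO()

for name in names:
    print(name, end='! ', file=buffer)  # '! ' replaces the default '\n'
print(file=buffer)                      # no values: only the default end
print(5, 'Ida', sep=';', file=buffer)   # ';' replaces the default ' '

captured = buffer.getvalue()
print(repr(captured))
```

The captured string shows a new line character only where the default end was left in place.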
Write function even() that takes a positive integer n as input and prints on the screen all numbers between, and including, 2 and n divisible by 2 or by 3, using this output format:
>>> even(17)
2, 3, 4, 6, 8, 9, 10, 12, 14, 15, 16,
The sep argument can be added to the arguments of a print() function call to insert the same string between the values printed. Inserting the same separator string is not always what we want. Consider the problem of printing the day and time in the way we expect to see time, given these variables:
>>> weekday = 'Wednesday'
>>> month = 'March'
>>> day = 10
>>> year = 2010
>>> hour = 11
>>> minute = 45
>>> second = 33
What we want is to call the print() function with the preceding variables as input arguments and obtain something like:
Wednesday, March 10, 2010 at 11:45:33
It is clear that we cannot use a separator argument to obtain such an output. One way to achieve this output would be to use string concatenation to construct a string in the right format:
>>> print(weekday+', '+month+' '+str(day)+', '+str(year)+' at '+str(hour)+':'+str(minute)+':'str(second))
SyntaxError: invalid syntax (<pyshell#36>, line 1)
Oops, I made a mistake: I forgot a + before str(second). Adding it fixes the problem (check it!), but we should not be satisfied. The reason I messed up is that the approach I used is tedious and error prone. There is an easier, and far more flexible, way to format the output: The string (str) class provides a powerful method, format(), for this purpose.
The format() string method is invoked on a string that represents the format of the output. The arguments of the format() function are the objects to be printed. To explain the use of the format() function, we start with a small version of our date and time example, in which we only want to print the time:
>>> '{0}:{1}:{2}'.format(hour, minute, second)
'11:45:33'
The objects to be printed (hour, minute, and second) are arguments of the format() method. The string invoking the format() method (that is, the string '{0}:{1}:{2}') is the format string: It describes the output format. All the characters outside the curly braces (that is, the two colons ':') are going to be printed as is. The curly braces {0}, {1}, and {2} are placeholders where the objects will be printed. The numbers 0, 1, and 2 explicitly indicate that the placeholders are for the first, second, and third arguments of the format() method call, respectively. See Figure 4.2 for an illustration.
Figure 4.3 shows what happens when we move the indexes 0, 1, and 2 in the previous example:
>>> '{2}:{0}:{1}'.format(hour, minute, second)
'33:11:45'
The default, when no explicit number is given inside the curly braces, is to assign the first placeholder (from left to right) to the first argument of the format() function, the second placeholder to the second argument, and so on, as shown in Figure 4.4:
>>> '{}:{}:{}'.format(hour, minute, second)
'11:45:33'
Let's go back to our original goal of printing the date and time. The format string we need is '{}, {} {}, {} at {}:{}:{}' assuming that the format() method is called on variables weekday, month, day, year, hour, minute, second in that order.
We check this (see also Figure 4.5 for the illustration of the mapping of variables to placeholders):
>>> print('{}, {} {}, {} at {}:{}:{}'.format(weekday, month, day, year, hour, minute, second))
Wednesday, March 10, 2010 at 11:45:33
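The placeholder rules can be exercised in a few lines. This sketch also shows two cases not discussed above, stated here as observations about the format() method rather than as part of the text: an index may be repeated, and supplying fewer arguments than placeholders raises an IndexError (caught below with try/except, which is an assumption of this example).

```python
hour, minute, second = 11, 45, 33

in_order = '{0}:{1}:{2}'.format(hour, minute, second)
shuffled = '{2}:{0}:{1}'.format(hour, minute, second)
implicit = '{}:{}:{}'.format(hour, minute, second)
print(in_order, shuffled, implicit)

# The same index may appear more than once in a format string.
repeated = '{0}-{0}-{1}'.format('a', 'b')
print(repeated)

# Three placeholders but only two arguments is an error.
try:
    '{}:{}:{}'.format(hour, minute)
except IndexError as err:
    print('IndexError:', err)
```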
Assume variables first, last, street, number, city, state, zipcode have already been assigned. Write a print statement that creates a mailing label:
John Doe
123 Main Street
AnyCity, AS 09876
assuming that:
>>> first = 'John'
>>> last = 'Doe'
>>> street = 'Main Street'
>>> number = 123
>>> city = 'AnyCity'
>>> state = 'AS'
>>> zipcode = '09876'
We now consider the problem of presenting data “nicely” lined up in columns. To motivate the problem, just think about how the From, Subject, and Date fields in your email client are organized, or how the train or airline departure and arrival information is shown on screens. As we start dealing with larger amounts of data, we too will sometimes need to present results in column format.
To illustrate the issues, let's consider the problem of properly lining up values of functions i**2, i**3, and 2**i for i = 1, 2, 3, … Lining up the values properly is useful because it illustrates the very different growth rates of these functions:
 i   i**2   i**3   2**i
 1      1      1      2
 2      4      8      4
 3      9     27      8
 4     16     64     16
 5     25    125     32
 6     36    216     64
 7     49    343    128
 8     64    512    256
 9     81    729    512
10    100   1000   1024
11    121   1331   2048
12    144   1728   4096
Now, how can we obtain this output? In our first attempt, we add a sep argument to the print() function to insert an appropriate number of spaces between the values printed in each row:
>>> print('i    i**2    i**3    2**i')
>>> for i in range(1,13):
        print(i, i**2, i**3, 2**i, sep='    ')

i    i**2    i**3    2**i
1    1    1    2
2    4    8    4
3    9    27    8
4    16    64    16
5    25    125    32
6    36    216    64
7    49    343    128
8    64    512    256
9    81    729    512
10    100    1000    1024
11    121    1331    2048
12    144    1728    4096
While the first few rows look OK, we can see that the entries in the same column are not properly lined up. The problem is that a fixed-size separator pushes entries farther to the right as the number of digits in the entry increases. A fixed-size separator is not the right tool for the job. The proper way to represent a column of numbers is to have all the unit digits line up. What we need is a way to fix the width of each column of numbers and print the values right-justified within these fixed-width columns. We can do that with format strings.
Inside the curly braces of a format string, we can specify how the value mapped to the curly brace placeholder should be presented; we can specify its field width, alignment, decimal precision, type, and so on.
We can specify the (minimum) field width with a decimal integer defining the number of character positions reserved for the value. If not specified or if the specified field width is insufficient, then the field width will be determined by the number of digits/characters in the displayed value. Here is an example:
>>> '{0:3},{1:5}'.format(12, 354)
' 12,  354'
In this example, we are printing integer values 12 and 354. The format string has a placeholder for 12 with ‘0:3’ inside the braces. The 0 refers to the first argument of the format() function (12), as we've seen before. Everything after the ‘:’ specifies the formatting of the value. In this case, 3 indicates that the width of the placeholder should be 3. Since 12 is a two-digit number, an extra blank space is added in front. The placeholder for 354 contains ‘1:5’, so an extra two blank spaces are added in front.
When the field width is larger than the number of digits, the default is to right-justify—that is, push the number value to the right. Strings are left-justified. In the next example, a field of width 10 characters is reserved for each argument first and last. Note that extra blanks are added after the string value:
>>> first = 'Bill'
>>> last = 'Gates'
>>> '{:10}{:10}'.format(first, last)
'Bill      Gates     '
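The default justification can be overridden. The sketch below uses the alignment characters <, >, and ^ of the format specification (left, right, and centered); these are not discussed in the text above and are shown here only as an illustration. The vertical bars are ordinary characters added to make the field boundaries visible.

```python
# Default alignment: strings to the left, numbers to the right.
default = '{:10}|{:5}|'.format('Bill', 42)
print(default)   # Bill      |   42|

# Explicit alignment: > right, < left, ^ centered.
explicit = '{:>10}|{:<5}|{:^7}|'.format('Bill', 42, 'mid')
print(explicit)  #       Bill|42   |  mid  |

# The width is only a minimum: a longer value is never cut off.
wide = '{:3}'.format(123456)
print(wide)      # 123456
```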
The precision is a decimal number that specifies how many digits should be displayed for a floating-point value; when no presentation type is given, it counts the digits before and after the decimal point together. It follows the field width, and a period separates them. In the next example, the field width is 8 but only four digits of the floating-point value are displayed:
>>> '{:8.4}'.format(1000 / 3)
'   333.3'
Compare this with the unformatted output:
>>> 1000 / 3
333.3333333333333
Type | Explanation |
b | Outputs the number in binary |
c | Outputs the Unicode character corresponding to the integer value |
d | Outputs the number in decimal notation (default) |
o | Outputs the number in base 8 |
x | Outputs the number in base 16, using lowercase letters for the digits above 9 |
X | Outputs the number in base 16, using uppercase letters for the digits above 9 |
The type determines how the value should be presented. The available integer presentation types are listed in Table 4.2. We illustrate the different integer type options on integer value 10:
>>> n = 10
>>> '{:b}'.format(n)
'1010'
>>> '{:c}'.format(n)
'\n'
>>> '{:d}'.format(n)
'10'
>>> '{:x}'.format(n)
'a'
>>> '{:X}'.format(n)
'A'
Two of the presentation-type options for floating-point values are f and e. The type option f displays the value as a fixed-point number (i.e., with a decimal point and fractional part).
>>> '{:6.2f}'.format(5 / 3)
'  1.67'
In this example, the format specification ‘:6.2f’ reserves a minimum width of 6 with exactly two digits past the decimal point for a floating-point value represented as a fixed-point number. The type option e represents the value in scientific notation in which the exponent is shown after the character e:
>>> '{:e}'.format(5 / 3)
'1.666667e+00'
This represents 1.666667 · 10^0 (i.e., 1.666667).
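The following sketch places the three floating-point formats side by side on the same value, so that the different roles of the precision can be compared. The particular widths and precisions are arbitrary choices made for the illustration.

```python
value = 1000 / 3

# No type given: the precision 4 counts all displayed digits.
general = '{:8.4}'.format(value)
print(repr(general))     # '   333.3'

# Type f: the precision 2 counts digits after the decimal point.
fixed = '{:8.2f}'.format(value)
print(repr(fixed))       # '  333.33'

# Type e: scientific notation, here with 3 digits after the point.
scientific = '{:.3e}'.format(value)
print(repr(scientific))  # '3.333e+02'
```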
Now let's go back to our original problem of presenting the values of functions i**2, i**3, and 2**i for i = 1, 2, 3, … up to at most 12. We specify a minimum width of 2 for the values of i and 6 for the values of i**2, i**3, and 2**i to obtain the output in the desired format.
Module: text.py
def growthrates(n):
    'prints values of below 3 functions for i = 1,..,n'
    print(' i   i**2   i**3   2**i')
    format_str = '{0:2d} {1:6d} {2:6d} {3:6d}'
    for i in range(1, n+1):
        print(format_str.format(i, i**2, i**3, 2**i))
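One way to convince ourselves that fixed field widths really line the columns up is to build the rows as strings and compare their lengths. This is only a sketch of such a check; the names rows and widths are introduced here for illustration.

```python
format_str = '{0:2d} {1:6d} {2:6d} {3:6d}'

# Build the twelve rows as strings instead of printing them directly.
rows = [format_str.format(i, i**2, i**3, 2**i) for i in range(1, 13)]
for row in rows:
    print(row)

# Every row has the same length: 2 + 6 + 6 + 6 characters for the
# fields plus 3 blank spaces between them, 23 in total.
widths = {len(row) for row in rows}
print(widths)  # {23}
```

Since every value fits within its minimum field width, all rows have the same length and the unit digits of each column end up in the same position.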
Implement function roster() that takes a list containing student information and prints out a roster, as shown below. The student information, consisting of the student's last name, first name, class, and average course grade, will be stored in that order in a list. Therefore, the input list is a list of lists. Make sure the roster printed out has 10 slots for every string value and 8 for the grade, including 2 slots for the decimal part.
>>> students = []
>>> students.append(['DeMoines', 'Jim', 'Sophomore', 3.45])
>>> students.append(['Pierre', 'Sophie', 'Sophomore', 4.0])
>>> students.append(['Columbus', 'Maria', 'Senior', 2.5])
>>> students.append(['Phoenix', 'River', 'Junior', 2.45])
>>> students.append(['Olympia', 'Edgar', 'Junior', 3.99])
>>> roster(students)
Last      First     Class     Average Grade
DeMoines  Jim       Sophomore     3.45
Pierre    Sophie    Sophomore     4.00
Columbus  Maria     Senior        2.50
Phoenix   River     Junior        2.45
Olympia   Edgar     Junior        3.99
A file is a sequence of bytes stored on a secondary memory device, such as a disk drive. A file could be a text document or spreadsheet, an HTML file, or a Python module. Such files are referred to as text files. Text files contain a sequence of characters that are encoded using some encoding (ASCII, UTF-8, etc.). A file can also be an executable application (like python.exe), an image, or an audio file. These files are referred to as binary files because they are just a sequence of bytes and there is no encoding.
All files are managed by the file system, which we introduce next.
The file system is the component of a computer system that organizes files and provides ways to create, access, and modify files. While files may be physically stored on various secondary (hardware) memory devices, the file system provides a uniform view of the files that hides the differences between how files are stored on the different hardware devices. The effect is that reading or writing files is the same, whether the file is on a hard drive, flash memory stick, or DVD-RW.
Files are grouped together into directories or folders. A folder may contain other folders in addition to (regular) files. The file system organizes files and folders into a tree structure. The Mac OS X file system organization is illustrated in Figure 4.6. It is a convention in computer science to draw hierarchical tree structures upside down with the root of the tree on top.
The folder on top of the hierarchy is called the root directory. In UNIX, Mac OS X, and Linux file systems, the root folder is named /; in the MS Windows OS, every hardware device will have its own root directory (e.g., C:). Every folder and file in a file system has a name. However, a name alone is not sufficient to locate a file, since files in different folders can share the same name. Every file can be specified using a pathname that identifies its location in the file system. The file pathname can be specified in two ways.
The absolute pathname of a file consists of the sequence of folders, starting from the root directory, that must be traversed to get to the file. The absolute pathname is represented as a string in which the sequence of folders is separated by forward (/) or backward (\) slashes, depending on the operating system.
For example, the absolute pathname of folder Python 3.1 is
/Applications/Python 3.1
while the absolute pathname of file example.txt is
/Users/lperkovic/example.txt
This is the case on UNIX, Mac OS X, and Linux boxes. On a Windows machine, the slashes are backward and the “first slash,” the name of the root folder, is instead C:\.
Every command or program executed by the computer system has associated with it a current working directory. When using the command shell, the current working directory is typically listed at the shell prompt. When executing a Python module, the current working directory is typically the folder containing the module. After running a Python module from within the interactive shell (e.g., by running it from IDLE), the folder containing the module becomes the current working directory for the interactive shell commands that follow.
The relative pathname of a file is the sequence of directories that must be traversed, starting from the current working directory, to get to the file. If the current working directory is Users, the relative pathname of file example.txt in Figure 4.6 is
lperkovic/example.txt
If the current working directory is lperkovic, the relative pathname of executable file date is
../../bin/date
The double-period notation (..) is used to refer to the parent folder, which is the folder containing the current working directory.
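The relationship between relative and absolute pathnames can be explored with the Standard Library. The sketch below uses the posixpath module, which applies UNIX-style path rules on any operating system, and treats the folder names of Figure 4.6 as plain strings; no such folders need to exist on the machine running it. Neither posixpath nor os is covered in this section, so their use here is an assumption of the example.

```python
import os
import posixpath

# Pretend current working directory, taken from Figure 4.6.
cwd = '/Users/lperkovic'
relative = '../../bin/date'

# Joining the two and resolving each '..' gives the absolute pathname.
absolute = posixpath.normpath(posixpath.join(cwd, relative))
print(absolute)  # /bin/date

# An absolute pathname starts at the root folder /.
print(posixpath.isabs(relative), posixpath.isabs(absolute))  # False True

# Going the other way: a pathname relative to folder /Users.
rel = posixpath.relpath('/Users/lperkovic/example.txt', '/Users')
print(rel)  # lperkovic/example.txt

# The actual current working directory of this program
# (the value depends on where the program is run).
print(os.getcwd())
```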
Processing a file consists of these three steps:
The built-in function open() is used to open a file, whether the file is a text file or a binary file. In order to read file example.txt, we must first open it:
infile = open('example.txt', 'r')
The function open() takes up to three string arguments: a file name and, optionally, a mode and an encoding; we will not discuss the encoding argument until Chapter 6. The file name is really the pathname (absolute or relative) of the file to be opened. In the last example, the relative pathname of the file is example.txt. Python will look for a file named example.txt in the current working directory (recall that this will be the folder containing the module that was last imported); if no such file exists, an exception occurs. For example:
>>> infile = open('sample.txt')
Traceback (most recent call last):
  File "<pyshell#339>", line 1, in <module>
    infile = open('sample.txt')
IOError: [Errno 2] No such file or directory: 'sample.txt'
The file name could also be the absolute path of the file such as, for example
/Users/lperkovic/example.txt
on a UNIX box or
C:/Users/lperkovic/example.txt
on a Windows machine.
Backslashes or Forward Slashes in File System Paths?
In UNIX, Linux, and Mac OS X systems, the forward slash / is used as the delimiter in a path. In Microsoft Windows systems, the backslash \ is used:
C:\Users\lperkovic\example.txt
That said, Python will accept the forward slash / in paths on a Windows system. This is a nice feature because the backslash inside a string is interpreted as the start of an escape sequence.
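When a backslash path is needed inside a Python string, there are two common ways to write it: doubling each backslash, or using a raw string (prefix r), in which backslashes are not treated as the start of an escape sequence. Raw strings are not covered in this section; the sketch below is only an illustration, and the paths are just strings that are never opened.

```python
# Two spellings of the same Windows-style path.
escaped = 'C:\\Users\\lperkovic\\example.txt'  # each \\ is one backslash
raw = r'C:\Users\lperkovic\example.txt'        # raw string: \ kept as is
print(escaped == raw)  # True
print(raw)

# The forward-slash form avoids the issue altogether.
forward = 'C:/Users/lperkovic/example.txt'
print(forward)

# Why this matters: in an ordinary string, \n is a single
# new line character, not a backslash followed by n.
print(len('C:\new'), len(r'C:\new'))  # 5 6
```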
The mode is a string that specifies how we will interact with the opened file. In function call open(‘example.txt’, ‘r’), the mode ‘r’ indicates that the opened file will be read from; it also specifies that the file will be read from as a text file.
In general, the mode string may contain one of r, w, a, or r+, to indicate whether the file should be opened for reading, writing, appending, or reading and writing, respectively. If missing, the default is r. In addition, t or b could also appear in the mode string: t indicates that the file is a text file while b indicates it is a binary file. If neither is present, the file will be opened as a text file. So open(‘example.txt’, ‘r’) is equivalent to open(‘example.txt’, ‘rt’), which is equivalent to open(‘example.txt’). This is all summarized in Table 4.3.
Mode | Description |
r | Reading mode (default) |
w | Writing mode; if the file already exists, its content is wiped |
a | Append mode; writes are appended to the end of the file |
r+ | Reading and writing mode (beyond the scope of this book) |
t | Text mode (default) |
b | Binary mode |
The difference between opening a file as a text or binary file is that binary files are treated as a sequence of bytes and are not decoded when read or encoded when written to. Text files, however, are treated as encoded files using some encoding.
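To make the difference concrete, here is a small sketch that writes a four-character string to a scratch file and reads it back in both ways. The file name demo.txt, the use of the tempfile module, and the explicit encoding argument 'utf-8' are assumptions made for this illustration only:

```python
import os
import tempfile

# Scratch file in a temporary folder (the name demo.txt is only an illustration)
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

outfile = open(path, 'w', encoding='utf-8')
outfile.write('caf\u00e9')          # the word "cafe" with an accented e
outfile.close()

# Text mode: the bytes in the file are decoded into a string of characters
infile = open(path, 'r', encoding='utf-8')
text = infile.read()
infile.close()

# Binary mode: the bytes are returned exactly as stored, undecoded
infile = open(path, 'rb')
data = infile.read()
infile.close()

print(type(text), len(text))   # a str with 4 characters
print(type(data), len(data))   # a bytes object with 5 bytes
```

The accented character takes two bytes in the UTF-8 encoding, which is why the binary read sees one more byte than the text read sees characters.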
The open() function returns an object of an Input or Output Stream type that supports methods to read and/or write characters. We refer to this object as a file object. Different modes will give us file objects of different file types. Depending on the mode, the file type will support all or some of the methods described in Table 4.4.
The separate read methods are used to read the content of the file in different ways. We show the difference between the three on file example.txt whose content is:
File: example.txt
The 3 lines in this file end with the new line character.

There is a blank line above this line.
We start by opening the file for reading as a text input stream:
>>> infile = open(‘example.txt’)
Method Usage | Explanation |
infile.read(n) | Read n characters from the file infile or until the end of the file is reached, and return characters read as a string |
infile.read() | Read characters from file infile until the end of the file and return characters read as a string |
infile.readline() | Read file infile until (and including) the new line character or until end of file, whichever is first, and return characters read as a string |
infile.readlines() | Read file infile until the end of the file and return the characters read as a list of lines |
outfile.write(s) | Write string s to file outfile |
file.close() | Close the file |
With every opened file, the file system will associate a cursor that points to a character in the file. When the file is first opened, the cursor typically points to the beginning of the file (i.e., the first character of the file), as shown in Figure 4.7. When reading the file, the characters that are read are the characters that start at the cursor; if we are writing to the file, then anything we write will be written starting at the cursor position.
We now use the read() function to read just one character. The read() function will return the first character in the file as a (one character) string.
>>> infile.read(1)
'T'
After the character ‘T’ is read, the cursor will move and point to the next character, which is ‘h’ (i.e., the first unread character); see Figure 4.7. Let's use the read() function again, but now to read five characters at a time. What is returned is a string of the five characters following the character ‘T’ we initially read:
>>> infile.read(5)
'he 3 '
The function readline() will read characters from the file up to the end of the line (i.e., the new line character \n) or until the end of the file, whichever happens first. Note that in our case the last character of the string returned by readline() is the new line character:
>>> infile.readline()
'lines in this file end with the new line character.\n'
The cursor now points to the beginning of the second line, as shown in Figure 4.7. Finally, we use the read() function without arguments to read the remainder of the file:
>>> infile.read()
'\nThere is a blank line above this line.\n'
The cursor now points at the “End-Of-File” (EOF) character, which indicates the end of the file.
To close the opened file that infile refers to, you just do:
infile.close()
Closing a file releases the file system resources that keep track of information about the opened file (i.e., the cursor position information).
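Because it is easy to forget to close a file, Python also offers the with statement, which closes the file automatically when the indented block ends. This statement is not used elsewhere in this chapter; the sketch below, which works on a scratch file created with the tempfile module, is only meant to show the idea:

```python
import os
import tempfile

# Scratch file so the example does not depend on example.txt being present
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

with open(path, 'w') as outfile:
    outfile.write('first line\n')
# leaving the with block closes outfile

with open(path) as infile:
    first = infile.readline()
# leaving the with block closes infile

print(first, end='')
print(infile.closed)    # True: the file was closed when the block ended
```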
Line Endings
If a file is read from or written to as a binary file, the file is just a sequence of bytes and there are no lines. An encoding must exist to have a code for a new line (i.e., a new line character). In Python, the new line character is represented by the escape sequence \n. However, text file formats are platform dependent, and different operating systems use a different byte sequence to encode a new line:
MS Windows uses \r\n (a carriage return followed by a line feed).
UNIX, Linux, and Mac OS X use \n (a line feed).
Mac OS up to version 9 uses \r (a carriage return).
Python translates platform-dependent line-ends into \n when reading and translates \n back to platform-dependent line-ends when writing. By doing this, Python becomes platform independent.
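The translation done when reading can be observed with a short experiment. In this sketch (the scratch file name endings.txt is an assumption), Windows-style line endings are written as raw bytes and the file is then read back once as a text file and once as a binary file:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'endings.txt')

# Write Windows-style line endings as raw bytes, bypassing any translation
outfile = open(path, 'wb')
outfile.write(b'one\r\ntwo\r\n')
outfile.close()

# Text mode (the default) translates each line ending to '\n'
infile = open(path, 'r')
as_text = infile.read()
infile.close()

# Binary mode returns the bytes exactly as stored
infile = open(path, 'rb')
as_bytes = infile.read()
infile.close()

print(repr(as_text))
print(as_bytes)
```

The text read contains only \n characters, while the binary read still shows the \r\n pairs.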
Depending on what you need to do with a file, there are several ways to access the file content and prepare it for processing. We describe several patterns to open a file for reading and read the content of the file. We will use the file example.txt again to illustrate the patterns:
The 3 lines in this file end with the new line character.

There is a blank line above this line.
One way to access the text file content is to read the content of the file into a string object. This pattern is useful when the file is not too large and string operations will be used to process the file content. For example, this pattern can be used to search the file content or to replace every occurrence of a substring with another.
We illustrate this pattern by implementing function numChars(), which takes the name of a file as input and returns the number of characters in the file. We use the read() function to read the file content into a string:
Module: text.py
def numChars(filename):
    'returns the number of characters in file filename'
    infile = open(filename, 'r')
    content = infile.read()
    infile.close()

    return len(content)
When we run this function on our example file, we obtain:
>>> numChars('example.txt')
98
Write function stringCount() that takes two string inputs—a file name and a target string—and returns the number of occurrences of the target string in the file.
>>> stringCount('example.txt', 'line')
4
The file reading pattern we discuss next is useful when we need to process the words of a file. To access the words of a file, we can read the file content into a string and use the string split() function, in its default form, to split the content into a list of words. (So, our definition of a word in this example is just a contiguous sequence of nonblank characters.) We illustrate this pattern on the next function, which returns the number of words in a file. It also prints the list of words, so we can see the list of words.
Module: text.py
def numWords(filename):
    'returns the number of words in file filename'
    infile = open(filename, 'r')
    content = infile.read()       # read the file into a string
    infile.close()

    wordList = content.split()    # split file into list of words
    print(wordList)               # print list of words too
    return len(wordList)
Shown is the output when the function is run on our example file:
>>> numWords('example.txt')
['The', '3', 'lines', 'in', 'this', 'file', 'end', 'with',
'the', 'new', 'line', 'character.', 'There', 'is', 'a',
'blank', 'line', 'above', 'this', 'line.']
20
In function numWords(), the words in the list may include punctuation symbols, such as the period in ‘line.’. It would be nice if we removed punctuation symbols before splitting the content into words. Doing so is the aim of the next problem.
Write function words() that takes one input argument—a file name—and returns the list of actual words (without punctuation symbols !,.:;?) in the file.
>>> words('example.txt')
['The', '3', 'lines', 'in', 'this', 'file', 'end', 'with',
'the', 'new', 'line', 'character', 'There', 'is', 'a',
'blank', 'line', 'above', 'this', 'line']
Sometimes a text file needs to be processed line by line. This is done, for example, when searching a web server log file for records containing a suspicious IP address. A log file is a file in which every line is a record of some transaction (e.g., the processing of a web page request by a web server). In this third pattern, the readlines() function is used to obtain the content of the file as a list of lines. We illustrate the pattern on a simple function that counts the number of lines in a file by returning the length of this list. It also will print the list of lines so we can see what the list looks like.
Module: text.py
def numLines(filename):
    'returns the number of lines in file filename'
    infile = open(filename, 'r')     # open the file and read it
    lineList = infile.readlines()    # into a list of lines
    infile.close()

    print(lineList)                  # print list of lines
    return len(lineList)
Let's test the function on our example file. Note that the new line character is included in each line:
>>> numLines('example.txt')
['The 3 lines in this file end with the new line character.\n', '\n', 'There is a blank line above this line.\n']
3
All file processing patterns we have seen so far read the whole file content into a string or a list of strings (lines). This approach is OK if the file is not too large. If the file is large, a better approach would be to process the file line by line; that way we avoid having the whole file in main memory. Python supports iteration over lines of a file object. We use this approach to print each line of the example file:
>>> infile = open('example.txt')
>>> for line in infile:
        print(line,end='')

The 3 lines in this file end with the new line character.

There is a blank line above this line.
In every iteration of the for loop, the variable line will refer to the next line of the file. In the first iteration, variable line refers to the line ‘The 3 lines in…’; in the second, it refers to ‘\n’; and in the final iteration, it refers to ‘There is a blank…’. Thus, at any point in time, only one line of the file needs to be kept in memory.
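As a further illustration of this pattern, the sketch below counts the lines of a file that are not blank while keeping only one line in memory at a time. The function name countNonBlank() is hypothetical, and the example first creates a scratch copy of the example file so that it can be run anywhere:

```python
import os
import tempfile

# Scratch copy of the example file content
path = os.path.join(tempfile.mkdtemp(), 'example.txt')
outfile = open(path, 'w')
outfile.write('The 3 lines in this file end with the new line character.\n')
outfile.write('\n')
outfile.write('There is a blank line above this line.\n')
outfile.close()

def countNonBlank(filename):
    'returns the number of lines in file filename that are not blank'
    count = 0
    infile = open(filename)
    for line in infile:            # one line in memory at a time
        if line.strip() != '':     # a blank line is just '\n'
            count += 1
    infile.close()
    return count

print(countNonBlank(path))
```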
Implement function myGrep() that takes as input two strings, a file name and a target string, and prints every line of the file that contains the target string as a substring.
>>> myGrep('example.txt', 'line')
The 3 lines in this file end with the new line character.
There is a blank line above this line.
In order to write to a text file, the file must be opened for writing:
>>> outfile = open(‘test.txt’, ‘w’)
If there is no file test.txt in the current working directory, the open() function will create it. If a file test.txt exists, its content will be erased. In both cases, the cursor will point to the beginning of the (empty) file. (If we wanted to add more content to the (existing) file, we would use the mode ‘a’ instead of ‘w’.)
Once a file is opened for writing, function write() is used to write strings to it. It will write the string starting at the cursor position. Let's start with a one-character string:
>>> outfile.write('T')
1
The value returned is the number of characters written to the file. The cursor now points to the position after T, and the next write will be done starting at that point.
>>> outfile.write('his is the first line.')
22
In this write, 22 characters are written to the first line of the file, right after T. The cursor will now point to the position after the period.
>>> outfile.write(' Still the first line...\n')
25
Everything written up until the new line character is written in the same line. With the ‘\n’ character written, what follows will go into the second line:
>>> outfile.write('Now we are in the second line.\n')
31
The escape sequence \n indicates that we are done with the second line and will write the third line next. To write something other than a string, it needs to be converted to a string first:
>>> outfile.write('Non string value like '+str(5)+' must be converted first.\n')
49
Here is where the string format() function is helpful. To illustrate the benefit of using string formatting, we write an exact copy of the previous line using string formatting:
>>> outfile.write('Non string value like {} must be converted first.\n'.format(5))
49
Just as for reading, we must close the file after we are done writing:
>>> outfile.close()
The file test.txt will be saved in the current working directory and will have this content:
This is the first line. Still the first line...
Now we are in the second line.
Non string value like 5 must be converted first.
Non string value like 5 must be converted first.
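The difference between the writing mode ‘w’ and the append mode ‘a’ can be checked with a short experiment. The sketch below uses a scratch file (the name modes.txt is an assumption) and reads the file back after an append and after a second open in writing mode:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

outfile = open(path, 'w')      # creates the file
outfile.write('first\n')
outfile.close()

outfile = open(path, 'a')      # keeps existing content; writes go at the end
outfile.write('second\n')
outfile.close()

infile = open(path)
after_append = infile.read()
infile.close()

outfile = open(path, 'w')      # wipes the existing content
outfile.write('third\n')
outfile.close()

infile = open(path)
after_write = infile.read()
infile.close()

print(repr(after_append))
print(repr(after_write))
```

After the append, both lines are in the file; after reopening in mode ‘w’, only the newly written line remains.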
When a file is opened for writing, a buffer is created in memory. All writes to the file are really writes to this buffer; nothing is written onto the disk, at least not just yet.
The reason for not writing to disk is that writing to secondary memory such as a disk takes a long time, and a program making many writes would be very slow if each write had to be done onto the secondary memory. What this means though is that what we write does not actually reach the file on disk until the writes are flushed. The close() function will flush writes from the buffer to the file on disk before closing, so it is critical not to forget to close the file. You can also flush the writes without closing the file using the flush() function:
>>> outfile.flush()
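The effect of the buffer can usually be observed by checking the size of the file on disk before and after a flush. The following is a sketch under the assumption that the function os.path.getsize() from the Standard Library module os is used to report the size; exactly when buffered writes reach the disk can vary between systems, so the size printed before the flush is typical rather than guaranteed:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'buffered.txt')

outfile = open(path, 'w')
outfile.write('hello')

# The five characters are typically still in the buffer at this point,
# so the size reported here is usually 0
size_before = os.path.getsize(path)

outfile.flush()                 # push the buffered characters to the file
size_after = os.path.getsize(path)

outfile.close()
print(size_before, size_after)
```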
We usually try to write programs that do not produce errors, but the unfortunate truth is that even programs written by the most experienced developers sometimes crash. And even if a program is perfect, it could still produce errors because the data coming from outside the program (interactively from the user or from a file) is malformed and causes errors in the program. This is a big problem with server programs, such as web, mail, and gaming servers: We definitely do not want an error caused by a bad user request to crash the server. Next we study some of the types of errors that can occur before and during program execution.
Two basic types of errors can occur when running a Python program. Syntax errors are errors that are due to the incorrect format of a Python statement. These errors occur while the statement or program is being translated to machine language and before it is executed. A component of Python's interpreter called a parser discovers these errors. For example, the expression
>>> (3+4]
SyntaxError: invalid syntax
is an invalid expression that the parser cannot process. Here are some more examples:
>>> if x == 5
SyntaxError: invalid syntax
>>> print 'hello'
SyntaxError: invalid syntax
>>> lst = [4;5;6]
SyntaxError: invalid syntax
>>> for i in range(10):
print(i)
SyntaxError: expected an indented block
In each of these statements, the error is due to an incorrect syntax (format) of a Python statement. So these errors occur before Python has even a chance of executing the statement on the given arguments, if any.
Explain what causes the syntax error in each statement just listed. Then write a correct version of each Python statement.
We now focus on errors that occur during the execution of the statement or program. They do not occur because of a malformed Python statement or program but rather because the program execution gets into an erroneous state. Here are some examples. Note that in each case, the syntax (i.e., the format of the Python statement) is correct.
An error caused by a division by 0:
>>> 4 / 0
Traceback (most recent call last):
  File "<pyshell#52>", line 1, in <module>
    4 / 0
ZeroDivisionError: division by zero
An error caused by an invalid list index:
>>> lst = [14, 15, 16]
>>> lst[3]
Traceback (most recent call last):
  File "<pyshell#84>", line 1, in <module>
    lst[3]
IndexError: list index out of range
An error caused by an unassigned variable name:
>>> x + 5
Traceback (most recent call last):
  File "<pyshell#53>", line 1, in <module>
    x + 5
NameError: name 'x' is not defined
An error caused by incorrect operand types:
>>> '2' * '3'
Traceback (most recent call last):
  File "<pyshell#54>", line 1, in <module>
    '2' * '3'
TypeError: can't multiply sequence by non-int of type 'str'
An error caused by an illegal value:
>>> int('4.5')
Traceback (most recent call last):
  File "<pyshell#80>", line 1, in <module>
    int('4.5')
ValueError: invalid literal for int() with base 10: '4.5'
In each case, an error occurs because the statement execution got into an invalid state. Dividing by 0 is invalid and so is using a list index that is outside of the range of valid indexes for the given list. When this happens, we say that the Python interpreter raises an exception. What this means is that an object gets created, and this object contains all the information relevant to the error. For example, it will contain the error message that indicates what happened and the program (module) line number at which the error occurred. (In the preceding examples, the line number is always 1 because there is only one statement in an interactive shell statement “program”.) When an error occurs, the default is for the statement or program to crash and for error information to be printed.
The object created when an error occurs is called an exception. Every exception has a type (a type as in int or list) that is related to the type of error. In the last examples, we saw these exception types: ZeroDivisionError, IndexError, NameError, TypeError, and ValueError. Table 4.5 describes these and a few other common errors.
Let's see a few more examples of exceptions. An OverflowError object is raised when a floating-point expression evaluates to a floating-point value outside the range of values representable using the floating-point type. In Chapter 3, we saw this example:
>>> 2.0**10000
Traceback (most recent call last):
  File "<pyshell#92>", line 1, in <module>
    2.0**10000
OverflowError: (34, 'Result too large')
Interestingly, overflow exceptions are not raised when evaluating integer expressions:
>>> 2**10000
1995063116880758384883742162683585083823496831886192454852008949852943
…  # many more lines of numbers
0455803416826949787141316063210686391511681774304792596709376
(You may recall that values of type int are, essentially, unbounded.)
The KeyboardInterrupt exception is somewhat different from other exceptions because it is interactively and explicitly raised by the program user. By hitting Ctrl-C during the execution of a program, the user can interrupt a running program. This will cause the program to get into an erroneous, interrupted, state. The exception raised by the Python interpreter is of type KeyboardInterrupt. Users typically hit Ctrl-C to interrupt a program (when, for example, it runs too long):
Exception | Explanation |
KeyboardInterrupt | Raised when user hits Ctrl-C, the interrupt key |
OverflowError | Raised when a floating-point expression evaluates to a value that is too large |
ZeroDivisionError | Raised when attempting to divide by 0 |
IOError | Raised when an I/O operation fails for an I/O-related reason |
IndexError | Raised when a sequence index is outside the range of valid indexes |
NameError | Raised when attempting to evaluate an unassigned identifier (name) |
TypeError | Raised when an operation or function is applied to an object of the wrong type |
ValueError | Raised when an operation or function has an argument of the right type but incorrect value |
>>> for i in range(2**100):
        pass
The Python statement pass does nothing (for real)! It is used wherever code is required to appear (as in the body of a for loop) but no action is to be done. By hitting Ctrl-C, we stop the program and get a KeyboardInterrupt error message:
>>> for i in range(2**100):
        pass

KeyboardInterrupt
An IOError exception is raised when an input/output operation fails. For example, we could be trying to open a file for reading but a file with the given name does not exist:
>>> infile = open('exaple.txt')
Traceback (most recent call last):
  File "<pyshell#55>", line 1, in <module>
    infile = open('exaple.txt')
IOError: [Errno 2] No such file or directory: 'exaple.txt'
An IOError exception is also raised when a user attempts to open a file she is not permitted to access.
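Since an exception is an object with a type, a program can look at that type. The sketch below does so using the try and except statements, which are not covered in this section and are used here only to inspect the exception object; the helper function describe() and the file name in missingFile() are hypothetical. Note also that in Python 3.3 and later, IOError is another name for the type OSError, and opening a nonexistent file raises FileNotFoundError, which is a more specific kind of OSError:

```python
def describe(action):
    'runs function action and returns the type name of the exception raised, if any'
    try:
        action()
    except Exception as error:      # error refers to the exception object
        return type(error).__name__
    return None

def divide():
    return 4 / 0

def badIndex():
    return [14, 15, 16][3]

def badValue():
    return int('4.5')

def missingFile():
    # assumes no file with this name exists in the current working directory
    return open('no-such-file-for-this-example.txt')

for action in (divide, badIndex, badValue, missingFile):
    print(action.__name__, describe(action))

# FileNotFoundError is a kind of IOError (that is, of OSError)
print(issubclass(FileNotFoundError, IOError))
```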
We showcase the material covered in this chapter by developing an application that records file accesses. Every time a user opens a file using this application, a record (referred to as a log) is created and then appended to a special text file (referred to as the log file). The log is a one-line string that includes the name of the opened file and the access time and date.
Let's illustrate what the application should be doing more precisely. Recall that to open the file example.txt for reading, we need to use the function open():
>>> infile = open(‘example.txt’, ‘r’)
What we want to develop is a similar function, called openLog(), that also opens a file. Just like function open(), it would take as input the (path)name of a file and return a reference to the opened file:
File: example.txt
>>> infile = openLog(‘example.txt’, ‘r’)
In addition, the function openLog() would create a log and append it to a log file called log.txt. This means that if we were to open and read the file log.txt, the last line would contain a log associated with the access to example.txt we just did:
>>> logfile = open('log.txt')
>>> for log in logfile:
        print(log, end = '')

Friday Aug/05/11 08:56 AM: File example.txt opened.
(We assume that file log.txt did not exist prior to opening example.txt.)
Any subsequent file accesses that use openLog() would also be logged. So if we were to open another file right after, say for writing:
File: example2.txt
>>> outfile = openLog(‘example2.txt’, ‘w’)
then the log recording this access would be appended to the existing log file. We would check that in this way:
>>> logfile = open('log.txt')
>>> for log in logfile:
        print(log, end = '')

Friday Aug/05/11 08:56 AM: File example.txt opened.
Friday Aug/05/11 08:57 AM: File example2.txt opened.
So, there would be a log in log.txt corresponding to every instance when a file was opened using function openLog().
The reason to log file accesses is that doing so enables us to obtain valuable statistics. For example, if the files were web pages hosted on a web server, the log file could be used to obtain statistics on
among others. This information could be used to fine-tune the web server performance.
Let's start the implementation of function openLog(). The function takes as input the (path)name of a file and a file mode and returns a reference to the opened file. If we ignore the need to record the file access, the implementation is simply:
def openLog(filename, mode):
    infile = open(filename, mode)
    return infile
The function openLog() uses the existing function open() to open the file and obtain the reference to the opened file, which it then returns. When the implementation of a function f() is essentially a single call to another function g(), we say that function f() is a wrapper around function g().
We now expand the implementation of openLog() to include the recording of the name of the opened file. What this means is that every time function openLog() is called, the following must be done:
This intermediate implementation of openLog() implements these steps:
def openLog(filename, mode):
    infile = open(filename, mode)

    # open file log.txt in append mode and append log
    outfile = open('log.txt', 'a')
    outfile.write('File {} opened.\n'.format(filename))
    outfile.close()

    return infile
What remains to be done is to log the access time. The current date and time are obtained by “asking” the underlying operating system. In Python, the time module is the application programming interface (API) through which a Python program obtains time information from the operating system.
The time module provides an API to the operating system time utilities as well as tools to format date and time values. We start by importing the time module:
>>> import time
Several functions in the time module return some version of the current time. The time() function returns the time in seconds since the epoch:
>>> time.time()
1268762993.335
Epoch, Time, and UTC Time
Computers keep track of time by keeping track of the number of seconds since a certain point in time, the epoch. On UNIX- and Linux-based computers (including Mac OS X), the epoch starts at 00:00:00 on January 1, 1970, Greenwich time.
In order to keep track of the correct number of seconds since the epoch, computers need to know how long a second takes. Every computer has in its central processing unit (CPU) a quartz clock for this purpose (and also to control the length of the “clock cycle”.) The problem with quartz clocks is that they are not “perfect” and will deviate from “real time” after a while. This is a problem with today's networked computers because many Internet applications require the computers to agree on time (at least within a small error).
Today's networked computers keep synchronizing their quartz clocks with time servers across the Internet whose job is to serve the “official time” called the Coordinated Universal Time, or UTC time. UTC is the average time of about a dozen atomic clocks and is supposed to track the mean solar time (based on Earth's rotation about its axis) at the Royal Observatory in Greenwich, England.
With time servers across the Internet serving this internationally agreed standard time, computers can agree on what time it is (within a small error).
You can check the epoch for your computer system using another function that returns the time in a format very different from time():
>>> time.gmtime(0)
time.struct_time(tm_year=1970, tm_mon=1, tm_mday=1, tm_hour=0,
tm_min=0, tm_sec=0, tm_wday=3, tm_yday=1, tm_isdst=0)
The value returned by the function is a bit complex; we discuss what type of object the function gmtime() returns in Chapter 6. But we do not need to know this to see that the epoch (i.e., the time and date 0 seconds since the epoch) is 00:00:00 on January 1, 1970 UTC. It is UTC time, because the function gmtime(), if given integer input s, returns the UTC time s seconds since the start of the epoch. If no argument is given to the function gmtime(), it will return the current UTC time. The related function localtime() returns the local time zone current time instead:
>>> time.localtime()
time.struct_time(tm_year=2010, tm_mon=3, tm_mday=16, tm_hour=13,
tm_min=50, tm_sec=46, tm_wday=1, tm_yday=75, tm_isdst=1)
The output format is not very readable (and is not designed to be). Module time provides a formatting function strftime() that outputs time in the desired format. This function takes a format string and the time returned by gmtime() or localtime() and outputs the time in a format described by the format string. Here is an example, illustrated in Figure 4.8:
>>> time.strftime('%A %b/%d/%y %I:%M %p', time.localtime())
'Tuesday Mar/16/10 02:06 PM'
In this example, strftime() prints the time returned by time.localtime() in the format specified by the format string ‘%A %b/%d/%y %I:%M %p’. The format string includes directives %A, %b, %d, %y, %I, %M, and %p that specify what date and time values to output at the directive's location, using the mapping shown in Table 4.6. All the other characters (/, :, and the blank spaces) of the format string are copied to the output as is.
Start by setting t to be the local time 1,500,000,000 seconds from the start of January 1, 1970 UTC:
>>> import time
>>> t = time.localtime(1500000000)
Construct the next strings by using the string time format function strftime():
Directive | Output |
%a | Abbreviated weekday name |
%A | Full weekday name |
%b | Abbreviated month name |
%B | Full month name |
%d | The day of the month as a decimal number between 01 and 31 |
%H | The hours as a number between 00 and 23 |
%I | The hours as a number between 01 and 12 |
%M | The minutes as a number between 00 and 59 |
%p | AM or PM |
%S | Seconds as a number between 00 and 61 |
%y | Year without century as a number between 00 and 99 |
%Y | Year as a decimal number |
%Z | Time zone name |
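The directives in Table 4.6 can be tried out on a fixed point in time so that the result does not depend on when or where the code is run. The sketch below formats the start of the epoch, time.gmtime(0), in three ways; the format strings are our own illustrations, and the weekday and month names are assumed to be in English, which is what Python produces by default:

```python
import time

# time.gmtime(0) is the start of the epoch in UTC, so the results do not
# depend on the current time or on the local time zone
epoch = time.gmtime(0)

logStyle = time.strftime('%A %b/%d/%y %I:%M %p', epoch)
longStyle = time.strftime('%H:%M:%S on %A, %B %d, %Y', epoch)
shortStyle = time.strftime('%a %d %b %Y', epoch)

print(logStyle)      # Thursday Jan/01/70 12:00 AM
print(longStyle)     # 00:00:00 on Thursday, January 01, 1970
print(shortStyle)    # Thu 01 Jan 1970
```

Note how %I shows midnight as 12 (with %p giving AM) while %H shows it as 00.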
We can now complete the implementation of function openLog().
Module: ch4.py
import time
def openLog(filename, mode = 'r'):
    '''open file filename in given mode and return reference to
       opened file; also log the file access in file log.txt'''

    infile = open(filename, mode)

    # obtain current time
    now = time.localtime()
    nowFormat = time.strftime('%A %b/%d/%y %I:%M %p', now)

    # open file log.txt in append mode and append log
    outfile = open('log.txt', 'a')
    log = '{}: File {} opened.\n'       # format string
    outfile.write(log.format(nowFormat, filename))
    outfile.close()

    return infile
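To try the function without touching any existing log.txt, one option is a variant that takes the name of the log file as an extra argument. The parameter logname below is an assumption added for this sketch and is not part of the function developed in this case study; the scratch folder is created with the tempfile module:

```python
import os
import tempfile
import time

def openLog(filename, mode='r', logname='log.txt'):
    '''open file filename in given mode and return reference to opened
       file; also log the access in file logname (an added parameter)'''
    infile = open(filename, mode)

    # obtain and format the current local time
    nowFormat = time.strftime('%A %b/%d/%y %I:%M %p', time.localtime())

    # open the log file in append mode and append the log
    outfile = open(logname, 'a')
    outfile.write('{}: File {} opened.\n'.format(nowFormat, filename))
    outfile.close()

    return infile

# Work in a scratch folder so no existing files are affected
folder = tempfile.mkdtemp()
target = os.path.join(folder, 'example2.txt')
logname = os.path.join(folder, 'log.txt')

openLog(target, 'w', logname).close()   # first access creates the file
openLog(target, 'r', logname).close()   # second access reads it

logfile = open(logname)
logs = logfile.readlines()
logfile.close()

for log in logs:
    print(log, end='')
```

Each of the two accesses produces one line in the log file, in the order in which the accesses were made.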
In this chapter we introduce Python text-processing and file-processing tools.
We revisit the string str class that was introduced in Chapter 2 and describe the different ways string values can be defined, using single, double, or triple quotes. We describe how to use escape sequences to define special characters in strings. Finally, we introduce the methods supported by the class str, as only string operators were covered in Chapter 2.
A string method we focus on is method format(), which is used to control the format of the string when printed using the print() function. We explain the syntax of format strings that describe the output format. After having mastered string output formatting, you will be able to focus on more complex aspects of your programs rather than on achieving the desired output format.
This chapter also introduces file-processing tools. We first explain the concepts of a file and of a file system. We introduce methods to open and close a file and methods read(), to read a file, and write(), to write a string to a file. Depending on how a file will be processed, there are different patterns for reading a file, and we describe them.
Programming errors were discussed informally in previous chapters. Because of the higher likelihood of errors when working with files, we formally discuss what errors are and define exceptions. We list the different types of exceptions students are likely to encounter.
In the chapter case study, we put another spotlight on output formatting in the context of developing an application that logs accesses to files. We also introduce the valuable Standard Library module time that provides functions to obtain the time and also formatting functions that output time in a desired format.
(a) s[2:5], (b) s[7:9], (c) s[1:8], (d) s[:4], and (e) s[7:] (or s[-3:]).
4.3 The tab character is used as the separator.
>>> print(last, first, middle, sep='\t')
4.4 The function range() is used to iterate over integers from 2 to n; each such integer is tested and, if divisible by 2 or 3, printed with an end = ‘, ’ argument.
def even(n):
    for i in range(2,n+1):
        if i%2 == 0 or i%3 == 0:
            print(i, end=', ')
4.5 We only need to place a comma and two new line characters appropriately:
>>> fstring = '{} {}\n{} {}\n{}, {} {}'
>>> print(fstring.format(first,last,number,street,city,state,zipcode))
4.6 The solution uses the floating-point presentation type f:
def roster(students):
    'prints average grade for a roster of students'
    print('Last      First     Class     Average Grade')
    for student in students:
        print('{:10}{:10}{:10}{:8.2f}'.format(student[0],
              student[1], student[2], student[3]))
4.7 Making the file content into a string allows the use of string functions to count the number of occurrences of substring target.
def stringCount(filename, target):
    '''returns the number of occurrences of string
       target in content of file filename'''
    infile = open(filename)
    content = infile.read()
    infile.close()
    return content.count(target)
4.8 To remove punctuation from a text, one can use the string translate() method to replace every punctuation character with a blank space:
def words(filename):
    'returns the list of words in file filename, without punctuation'
    infile = open(filename, 'r')
    content = infile.read()
    infile.close()
    # map each punctuation character to a blank space
    table = str.maketrans('!,.:;?', 6*' ')
    content = content.translate(table)
    return content.split()
4.9 Iterating over the lines of the file does the job:
def myGrep(filename, target):
    'prints every line of file filename containing string target'
    infile = open(filename)
    for line in infile:
        if target in line:
            print(line, end='')
    infile.close()
4.10 The causes of the syntax errors and the correct versions are:
>>> for i in range(3):
        print(i)
4.11 The format strings are obtained as shown:
4.12 Start by running, in the interactive shell, this assignment statement:
>>> s = ‘abcdefghijklmnopqrstuvwxyz’
Now write expressions using string s and the indexing operator that evaluate to ‘bcd’, ‘abc’, ‘defghijklmnopqrstuvwx’, ‘wxy’, and ‘wxyz’.
4.13 Let string s be defined as:
s = ‘goodbye’
Write Python Boolean expressions that correspond to these propositions:
4.14 Translate each line into a Python statement:
128.0.0.1 - - [12/Feb/2011:10:31:08 -0600] "GET /docs/test.txt HTTP/1.0"
4.15 For each of the string values of s below, write an expression involving s and the string method split() that evaluates to the list:
['10', '20', '30', '40', '50', '60']
4.16 Implement a program that requests three words (strings) from the user. Your program should print Boolean value True if the words were entered in dictionary order; otherwise nothing is printed.
>>>
Enter first word: bass
Enter second word: salmon
Enter third word: whitefish
True
4.17 Translate each line into a Python statement using appropriate string methods:
4.18 Suppose variable s has been assigned in this way:
s = '''It was the best of times, it was the worst of times; it was the age of wisdom, it was the age of foolishness; it was the epoch of belief, it was the epoch of incredulity; it was…'''
(The beginning of A Tale of Two Cities by Charles Dickens.) Then do the following, in order, each time:
4.19 Write Python statements that print the following formatted outputs using the already assigned variables first, middle, and last:
>>> first = 'Marlena'
>>> last = 'Sigel'
>>> middle = 'Mae'
4.20 Given string values for the sender, recipient, and subject of an email, write a string format expression that uses variables sender, recipient, and subject and that prints as shown here:
>>> sender = '[email protected]'
>>> recipient = '[email protected]'
>>> subject = 'Hello!'
>>> print(???) # fill in
From: [email protected]
To: [email protected]
Subject: Hello!
4.21 Write Python statements that print the values of π and Euler's number e in the shown formats:
4.22 Write a function month() that takes a number between 1 and 12 as input and returns the three-character abbreviation of the corresponding month. Do this without using an if statement, just string operations. Hint: Use a string to store the abbreviations in order.
>>> month(1)
'Jan'
>>> month(11)
'Nov'
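The technique in the hint can be illustrated on a different problem, so as not to give away the solution. The sketch below uses a hypothetical function weekday() that packs three-letter day abbreviations into one string and slices out the right one.

```python
def weekday(n):
    'returns the three-letter abbreviation of day n (1 = Monday, ..., 7 = Sunday)'
    days = 'MonTueWedThuFriSatSun'
    # Every abbreviation is 3 characters long, so day n starts at index 3*(n-1).
    start = 3 * (n - 1)
    return days[start:start + 3]

print(weekday(1))
print(weekday(7))
```

Because all abbreviations have the same length, the starting index can be computed with arithmetic alone and no if statement is needed.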
4.23 Write a function average() that takes no input but requests that the user enter a sentence. Your function should return the average length of a word in the sentence.
>>> average()
Enter a sentence: A sample sentence
5.0
4.24 Implement function cheer() that takes as input a team name (as a string) and prints a cheer as shown:
>>> cheer('Huskies')
How do you spell winner?
I know, I know!
H U S K I E S !
And that's how you spell winner!
Go Huskies!
4.25 Write function vowelCount() that takes a string as input and counts and prints the number of occurrences of vowels in the string.
>>> vowelCount('Le Tour de France')
a, e, i, o, and u appear, respectively, 1, 3, 0, 1, 1 times.
4.26 The cryptography function crypto() takes as input a string (i.e., the name of a file in the current directory). The function should print the file on the screen with this modification: Every occurrence of string ‘secret’ in the file should be replaced with string ‘xxxxxx’.
File: crypto.txt
>>> crypto('crypto.txt')
I will tell you my xxxxxx. But first, I have to explain why it is a xxxxxx. And that is all I will tell you about my xxxxxx.
4.27 Write a function fcopy() that takes as input two file names (as strings) and copies the content of the first file into the second.
File: example.txt
>>> fcopy('example.txt','output.txt')
>>> open('output.txt').read()
'The 3 lines in this file end with the new line character.\n\nThere is a blank line above this line.\n'
4.28 Implement function links() that takes as input the name of an HTML file (as a string) and returns the number of hyperlinks in that file. To do this you will assume that each hyperlink appears in an anchor tag. You also need to know that every anchor tag ends with the substring </a>.
Test your code on HTML file twolinks.html or any HTML file downloaded from the web into the folder where your program is.
File: twolinks.html
>>> links('twolinks.html')
2
4.29 Write a function stats() that takes one input argument: the name of a text file. The function should print, on the screen, the number of lines, words, and characters in the file; your function should open the file only once.
File: example.txt
>>> stats('example.txt')
line count: 3
word count: 20
character count: 98
4.30 Implement function distribution() that takes as input the name of a file (as a string). This one-line file will contain letter grades separated by blanks. Your function should print the distribution of grades, as shown.
File: grades.txt
>>> distribution('grades.txt')
6 students got A
2 students got A-
3 students got B+
2 students got B
2 students got B-
4 students got C
1 student got C-
2 students got F
4.31 Implement function duplicate() that takes as input the name (as a string) of a file in the current directory and returns True if the file contains duplicate words and False otherwise.
File: Duplicates.txt
>>> duplicate('Duplicates.txt')
True
File: noDuplicates.txt
>>> duplicate('noDuplicates.txt')
False
4.32 The function censor() takes the name of a file (a string) as input. The function should open the file, read it, and then write it into file censored.txt with this modification: Every occurrence of a four-letter word in the file should be replaced with string ‘xxxx’.
File: example.txt
>>> censor('example.txt')
Note that this function produces no output, but it does create file censored.txt in the current folder.