The Internet is a treasure trove of information, but its exponential growth can make it hard to manage. Furthermore, most tools currently available for “surfing the Web” are not programmable. Many web-related tasks can be automated quite simply with the tools in the standard Python distribution.
If you want to track the weather in a given location over a period of months, it’s much easier to set up an automated program that fetches the information and collects it in a file than to remember to do it by hand.
Here is a program that finds the weather in a couple of cities and states using the pages of the weather.com web site:
import urllib, urlparse, string, time

def get_temperature(country, state, city):
    # Build the URL of the city's page on weather.com.
    url = urlparse.urljoin('http://www.weather.com/weather/cities/',
                           string.lower(country) + '_' + string.lower(state) + '_' +
                           string.replace(string.lower(city), ' ', '_') + '.html')
    data = urllib.urlopen(url).read()
    # Pull out the number between the 'current temp: ' marker and the degree sign.
    start = string.index(data, 'current temp: ') + len('current temp: ')
    stop = string.index(data, '°F', start-1)
    temp = int(data[start:stop])
    localtime = time.asctime(time.localtime(time.time()))
    print ("On %(localtime)s, the temperature in %(city)s, " +
           "%(state)s %(country)s is %(temp)s F.") % vars()

get_temperature('FR', '', 'Paris')
get_temperature('US', 'RI', 'Providence')
get_temperature('US', 'CA', 'San Francisco')
When run, it produces output like:
~/book:> python get_temperature.py
On Wed Nov 25 16:22:25 1998, the temperature in Paris, FR is 39 F.
On Wed Nov 25 16:22:30 1998, the temperature in Providence, RI US is 39 F.
On Wed Nov 25 16:22:35 1998, the temperature in San Francisco, CA US is 58 F.
The code in get_temperature.py suffers from one flaw: both the URL construction and the temperature extraction depend on the specific HTML produced by the web site. The day the site’s graphic designer decides that “current temp:” should be capitalized, this script will stop working. This is a problem with programmatic parsing of web pages that will go away only when more structural formats (such as XML) are used to produce web pages.[66]
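One way to soften that failure is to check for the marker text before slicing, so the script can report a parse problem instead of dying with an exception. Here is a rough sketch of such a defensive extraction step; the extract_temperature function is not part of get_temperature.py, just an illustration built around the same marker strings:
import string

def extract_temperature(data):
    # Look for the same markers used above; return None if the page layout has changed.
    marker = 'current temp: '
    start = string.find(data, marker)
    if start == -1:
        return None
    start = start + len(marker)
    stop = string.find(data, '°F', start)
    if stop == -1:
        return None
    try:
        return int(data[start:stop])
    except ValueError:        # something other than a number between the markers
        return None
A caller can then decide whether to skip that city, retry later, or warn whoever maintains the script when None comes back.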
One of the big hassles of maintaining a web site is that as the number of links in the site increases, so does the chance that some of them will no longer be valid. Good web-site maintenance therefore includes periodic checking for such stale links. The standard Python distribution includes a tool that does just this: it lives in the Tools/webchecker directory and is called webchecker.py. A companion program in the same directory, websucker.py, uses similar logic to create a local copy of a remote web site. Be careful when trying it out: left unchecked, it will try to download the entire Web onto your machine! The same directory includes two programs, wsgui.py and webgui.py, which are Tkinter-based frontends to websucker and webchecker, respectively. We encourage you to look at the source code for these programs to see how one can build sophisticated web-management systems with Python’s standard toolset.
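For example, a periodic link check of a site can be started by pointing webchecker.py at its root URL from the command line (the URL below is just a placeholder):
~/book:> python Tools/webchecker/webchecker.py http://www.mydomain.com/
The script crawls the site from that starting point and reports the problems it finds; its docstring describes the command-line options it accepts.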
In the Tools/Scripts directory, you’ll find many other small to medium-sized scripts that might be of interest, such as an equivalent of websucker.py for FTP servers called ftpmirror.py.
Electronic mail is probably the most important medium on the Internet today; it’s certainly the one through which most information passes between individuals. Python includes several libraries for processing mail. The one you’ll need to use depends on the kind of mail server you’re using. Modules for interacting with POP3 servers (poplib) and IMAP servers (imaplib) are included. If you need to talk to a Microsoft Exchange server, you’ll need some of the tools in the win32 distribution (see Appendix B for pointers to the win32 extensions web page).
Here’s a simple test of the poplib module, which is used to talk to a mail server running the POP protocol:
>>> from poplib import *
>>> server = POP3('mailserver.spam.org')
>>> print server.getwelcome()
+OK QUALCOMM Pop server derived from UCB (version 2.1.4-R3) at spam starting.
>>> server.user('da')
'+OK Password required for da.'
>>> server.pass_('youllneverguess')
'+OK da has 153 message(s) (458167 octets).'
>>> header, msg, octets = server.retr(152)     # let's get the latest msgs
>>> import string
>>> print string.join(msg[:3], ' ')            # and look at the first three lines
Return-Path: <[email protected]> Received: from gator.bigbad.com by mailserver.spam.org (4.1/SMI-4.1) id AA29605; Wed, 25 Nov 98 15:59:24 PST
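The imaplib module offers a similar session-style interface for IMAP servers. Here is a rough sketch of an equivalent exchange; the host name, account, and password are placeholders, and the mailbox is assumed to be non-empty:
import imaplib, string

server = imaplib.IMAP4('mailserver.spam.org')   # placeholder host name
server.login('da', 'youllneverguess')
server.select()                                 # open the default mailbox, INBOX
typ, data = server.search(None, 'ALL')          # data[0] is a string of message numbers
last = string.split(data[0])[-1]                # number of the most recent message
typ, data = server.fetch(last, '(RFC822)')      # data[0][1] is the raw message text
print data[0][1][:200]                          # peek at the first 200 characters
server.logout()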
In a real application, you’d use a specialized module such as rfc822 to parse the header lines, and perhaps the mimetools and mimify modules to get the data out of the message body (e.g., to process attached files).
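For instance, the list of lines retrieved with server.retr() above could be handed to rfc822 roughly like this (a sketch that assumes msg still holds that list):
import rfc822, string, StringIO

# msg is the list of lines returned by server.retr() in the poplib session above.
fp = StringIO.StringIO(string.join(msg, '\n'))
headers = rfc822.Message(fp)            # parses the header lines, stops at the body
print headers.getheader('subject')      # None if there is no Subject: line
print headers.getaddr('from')           # a (real name, email address) tuple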
[66] XML (eXtensible Markup Language) is a language for marking up structured text files that emphasizes the structure of the document, not its graphical nature. XML processing is an entirely different area of Python text processing, with much ongoing work. See Appendix A for some pointers to discussion groups and software.