#19 Comics Download

Every morning I get up, go to the computer, and read the morning paper. Actually the "paper" is a set of bookmarks in Mozilla. I happen to love editorial cartoons. Unfortunately, editorial cartoonists don't create new works daily, so I'm forced to view a large number of pictures I've seen before.

So I decided to see if Perl could help me and designed a program to download new cartoons from the Web. Old cartoons get skipped.

So now I get up, run the script, and view just the new stuff. It's amazing how a little technology can dejunk your life.

The Code

  1 #!/usr/bin/perl
  2 use strict;
  3 use warnings;
  4
  5 use LWP::Simple;
  6 use HTML::SimpleLinkExtor;
  7 use URI;
  8 use POSIX;
  9
 10 # Information on the comics
 11 my $in_file = "comics.txt";
 12
 13 # File with last download info
 14 my $info_file = "comics.info";
 15
 16 my %file_info;  # Information on the last download
 17
 18 #############################################################
 19 # do_file($name, $page, $link, $index)
 20 #
 21 # Download the given link and store it in a file.
 22 #       If multiple files are present,
 23 #               $index should be different
 24 #       for each file.
 25 #############################################################
 26 sub do_file($$$$)
 27 {
 28     my $name = shift;   # Name of the file
 29     my $page = shift;   # The base page
 30     my $link = shift;   # Link to grab
 31     my $index = shift;  # Index (if multiple files)
 32
 33     # Try and get the extension of the file from the link
 34     $link =~ /(\.[^.]*)$/;
 35
 36     # Define the extension of the file
 37     my $ext;
 38     if (defined($1)) {
 39         $ext = $1;
 40     } else {
 41         $ext = ".jpg";
 42     }
 43
 44     my $uri = URI->new($link);
 45     my $abs_link = $uri->abs($page);
 46
 47     # Get the heading information of the link
 48     # (the modification time goes into $head[2])
 49     my @head = head($abs_link->as_string());
 50     if ($#head == -1) {
 51         print "$name Broken link: ",
 52             $abs_link->as_string(), "\n";
 53         return;
 54     }
 55     if (defined($file_info{$name})) {
 56         # If we've downloaded this one before
 57         if ($head[2] == $file_info{$name}) {
 58             print "Skipping $name\n";
 59             return;
 60         }
 61     }
 62     # Set the file information
 63     $file_info{$name} = $head[2];
 64
 65     # Time of the last modification
 66     my $time = asctime(localtime($head[2]));
 67     chomp($time);       # Stupid POSIX hack
 68
 69     print "Downloading $name (Last modified $time)\n";
 70     # The raw data from the page
 71     my $raw_data = get($abs_link->as_string());
 72     if (not defined($raw_data)) {
 73         print "Unable to download link $link\n";
 74         return;
 75     }
 76     my $out_name;        # Name of the output file
 77
 78     if (defined($index)) {
 79         $out_name = "comics/$name.$index$ext";
 80     } else {
 81         $out_name = "comics/$name$ext";
 82     }
 83     if (not open(OUT_FILE, ">$out_name")) {
 84         print "Unable to create $out_name\n";
 85         return;
 86     }
 87     binmode OUT_FILE;
 88     print OUT_FILE $raw_data;
 89     close OUT_FILE;
 90 }
 91
 92 #------------------------------------------------------------
 93 open INFO_FILE, "<$info_file";
 94 while (1) {
 95     my $line = <INFO_FILE>;     # Get line from info file
 96
 97     if (not defined($line)) {
 98         last;
 99     }
100     chomp($line);
101     # Get the name and time of the last download
102     my ($name, $time) = split /\t/, $line;
103     $file_info{$name} = $time;
104 }
105 close INFO_FILE;
106
107 open IN_FILE, "<$in_file"
108     or die("Could not open $in_file");
109
110
111 while (1) {
112     my $line = <IN_FILE>;       # Get line from the input
113     if (not defined($line)) {
114         last;
115     }
116     chomp($line);
117
118     # Parse the information from the config file
119     my ($name, $page, $pattern) = split /\t/, $line;
120
121     # If the input is bad, fuss and skip
122     if (not defined($pattern)) {
123         print "Illegal input $line\n";
124         next;
125     }
126
127     # Get the text page which points to the image page
128     my $text_page = get($page);
129
130     if (not defined($text_page)) {
131         print "Could not download $page\n";
132         next;
133     }
134
135     # Create a decoder for this page
136     my $decoder = HTML::SimpleLinkExtor->new();
137     $decoder->parse($text_page);
138
139     # Get the image links
140     my @links = $decoder->img();
141     my @matches = grep /$pattern/, @links;
142
143     if ($#matches == -1) {
144         print "Nothing matched pattern for $name\n";
145         print " Pattern: $pattern\n";
146         foreach my $cur_link (@links) {
147             print "     $cur_link\n";
148         }
149         next;
150     }
151     if ($#matches != 0) {
152         print "Multiple matches\n";
153         my $index = 1;
154         foreach my $cur_link (@matches) {
155             print "     $cur_link\n";
156             do_file($name, $page, $cur_link, $index);
157             ++$index;
158         }
159         next;
160     }
161     # One match
162     do_file($name, $page, $matches[0], undef);
163 }
164
165 open INFO_FILE, ">$info_file" or
166    die("Could not create $info_file");
167
168 foreach my $cur_name (sort keys %file_info) {
169     print INFO_FILE "$cur_name\t$file_info{$cur_name}\n";
170 }
171 close (INFO_FILE);

Running the Script

First, create a directory called comics. This is where the images will be stored.

The next thing you'll need to do is create a comics.txt file. Each line in the file has the following format:

name--->url--->pattern

The parts of the format have the following meanings:

--->
    The tab character.
name
    The name of the entry. This name will be used when it comes time to store the result. It should be something simple like dilbert.
url
    The URL of the web page that contains the comic. This is not the URL of the comic image itself, since those URLs change from day to day. For the Dilbert comic strip, this would be http://www.dilbert.com.
pattern
    A regular expression that will be matched against all the links within the web page, as in this example:

    ^/comics/dilbert/archive/images/dilbert\d+\.gif$

That's a lot of information, so how do you fill in each of the fields? The first field is simple: make up a name, a single word describing the comic.

For the next one, visit the website of your favorite comic. Copy the URL from the address box and put it in your file.

Now right-click on the comic and select View Image. You should see a screen with just the image on it. Copy the URL of this image and put it in your file. Now turn it into a regular expression by escaping all the special characters, such as dots (\.), as well as putting a caret (^) at the beginning and a dollar sign ($) at the end. If you see something that looks like a date or serial number, replace the series of digits with the matching regular expression syntax. Thus dilbert2004183061028.gif becomes dilbert\d+\.gif (note the escaped dot (\.) in the string).
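
If you'd rather not do the escaping by hand, Perl can build the pattern for you. Here's a minimal sketch (the sample URL is the one from above):

    # Build a pattern from a sample image URL: escape the
    # metacharacters, then generalize any runs of digits.
    my $sample = '/comics/dilbert/archive/images/dilbert2004183061028.gif';
    my $pattern = quotemeta($sample);   # escapes . and other specials
    $pattern =~ s/\d+/\\d+/g;           # digit runs become \d+
    $pattern =~ s{\\/}{/}g;             # slashes don't need escaping
    print "^$pattern\$\n";
    # prints ^/comics/dilbert/archive/images/dilbert\d+\.gif$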

So the line in your comics.txt file looks like this:

dilbert      http://www.dilbert.com/  ^http://www.dilbert.com/comics/
dilbert/archive/images/dilbert\d+\.gif$

(It's all on one line with tabs separating the three pieces.)

You're not done yet. When you run the script, you'll get an error message:

Nothing matched pattern for dilbert
    Pattern: ^http://www.dilbert.com/comics/dilbert/archive/images/
dilbert\d+\.gif$

     /comics/dilbert/images/small_ad.gif
     /images/clear_dot.gif
     /images/ffffff_dot.gif
     /comics/dilbert/archive/images/dilbert2004183061028.gif
     /images/000000_dot.gif

(This error output has been greatly shortened.)

What's happened is that you put in a pattern that matches an absolute link, but the web page contains a relative link. You now need to go through the list of image links (which the script so thoughtfully spewed out) and find one that looks something like your pattern.

The entry

        /comics/dilbert/archive/images/dilbert2004183061028.gif

looks promising. So you go back to your original file and edit it so that the URL matcher now starts at /comics. The result is this:

^/comics/dilbert/archive/images/dilbert\d+\.gif$

This is now the entry you'll use when you run the script.
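
If you want to check a pattern without running the whole script, a few lines of Perl using the same modules will show what matches (the URL and pattern here are the ones from this example):

    use LWP::Simple;
    use HTML::SimpleLinkExtor;

    my $text_page = get('http://www.dilbert.com/') or die "Could not download page";
    my $decoder = HTML::SimpleLinkExtor->new();
    $decoder->parse($text_page);
    print "$_\n" for grep m{^/comics/dilbert/archive/images/dilbert\d+\.gif$},
                     $decoder->img();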

The Results

Here's the output of a typical run:

Downloading dilbert (Last modified Mon Oct  4 15:58:59 2004)
Downloading shoe (Last modified Fri Oct  1 21:11:32 2004)
Skipping userfriendly
Skipping ed_ann
Skipping ed_luck
Downloading ed_matt (Last modified Mon Oct 25 16:01:04 2004)
Downloading ed_mccoy (Last modified Wed Oct 27 21:01:09 2004)
Skipping ed_ohman

A set of new images is stored in the comics directory. Unfortunately, copyright laws prevent me from including them in this book.

How It Works

The script needs two pieces of information to work: (1) what to download and (2) when it was last downloaded.

The first is stored in the hand-generated configuration file comics.txt. The second is stored in the file comics.info. This file is automatically generated and updated by the script. The format of this file is as follows:

name date

The name component is the name of the comic as defined by the comics.txt file. The date component is the modification date of the image as reported by the web server. The two fields are separated by a tab.
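
So after a typical run the file might contain lines like these (the timestamps are hypothetical examples of the epoch-seconds values the script stores):

    dilbert     1098925139
    shoe        1096686692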

The first step is to read in the comics.info file and store it in the %file_info hash. The keys to this hash are the names of the comics, and the values are the last-modified dates.

 13 # File with last download info
 14 my $info_file = "comics.info";
...
 92 #------------------------------------------------------------
 93 open INFO_FILE, "<$info_file";
 94 while (1) {
 95     my $line = <INFO_FILE>;     # Get line from info file
 96
 97     if (not defined($line)) {
 98         last;
 99     }
100     chomp($line);
101     # Get the name and time of the last download
102     my ($name, $time) = split /\t/, $line;
103     $file_info{$name} = $time;
104 }
105 close INFO_FILE;
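
The open isn't checked here, and that works out: on the very first run there is no comics.info yet, and an empty %file_info is exactly what you want. If you prefer lexical filehandles, the same loop could be written this way (a sketch; the behavior is otherwise unchanged):

    # Same idea with a lexical filehandle; a missing comics.info
    # (e.g., on the first run) just leaves %file_info empty.
    if (open my $info, "<", $info_file) {
        while (my $line = <$info>) {
            chomp $line;
            my ($name, $time) = split /\t/, $line;
            $file_info{$name} = $time;
        }
        close $info;
    }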

Next you start on the configuration file comics.txt:

 10 # Information on the comics
 11 my $in_file = "comics.txt";
...
106
107 open IN_FILE, "<$in_file"
108     or die("Could not open $in_file");

Each line is read in and parsed:

111 while (1) {
112     my $line = <IN_FILE>;       # Get line from the input
113     if (not defined($line)) {
114         last;
115     }
116     chomp($line);
117
118     # Parse the information from the config file
119     my ($name, $page, $pattern) = split /\t/, $line;

Just in case something went wrong, you check to make sure that there are three tab-separated fields on the line. If there's no field #3, you are most likely very upset:

121     # If the input is bad, fuss and skip
122     if (not defined($pattern)) {
123         print "Illegal input $line\n";
124         next;
125     }

The script now grabs the main web page for the entry (i.e., http://www.dilbert.com). This page contains a link to the image, which is what you really want:

127     # Get the text page which points to the image page
128     my $text_page = get($page);
129
130     if (not defined($text_page)) {
131         print "Could not download $page\n";
132         next;
133     }

You have the page; now you need to extract the links so you can attempt to find one that matches your pattern. Fortunately, there is a Perl module that chews up web pages and spits out links. It's called HTML::SimpleLinkExtor. Using this module, you get a set of image links:

135     # Create a decoder for this page
136     my $decoder = HTML::SimpleLinkExtor->new();
137     $decoder->parse($text_page);
138
139     # Get the image links
140     my @links = $decoder->img();
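
The img() method is only one of the extractor's accessors; if you ever hack the script to follow other kinds of links, the same decoder can report them:

    my @anchors = $decoder->a();      # href targets of <a> tags
    my @all     = $decoder->links();  # every link found in the page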

Now all you have to do is check each link against your regular expression to see if it matches. Perl performs this amazing feat with one statement:

141     my @matches = grep /$pattern/, @links;

At this point, you may have zero, one, or more matches. Zero matches means that your regular expression is bad. Here's how to tell the user about it and list all the URLs you did find so they can correct the problem:

143     if ($#matches == -1) {
144         print "Nothing matched pattern for $name\n";
145         print " Pattern: $pattern\n";
146         foreach my $cur_link (@links) {
147             print "     $cur_link\n";
148         }
149         next;
150     }

This produces the very verbose error message you saw earlier. (Incidentally, that error message was cut to 15 percent of its real length.)

The next thing you look for is multiple matches. If you have multiple image links that match your expression, you download them all. The do_file function handles the downloading (see the following code), and all you have to do is call it. You use an index for each call to tell do_file to use different names for each image:

151     if ($#matches != 0) {
152         print "Multiple matches\n";
153         my $index = 1;
154         foreach my $cur_link (@matches) {
155             print "     $cur_link\n";
156             do_file($name, $page, $cur_link, $index);
157             ++$index;
158         }
159         next;
160     }

The only case you haven't handled yet is the one in which only one URL matches. For that, the processing is very simple; it is just a call to do_file:

161     # One match
162     do_file($name, $page, $matches[0], undef);

The do_file function does the actual work of getting the image. The first thing it does is compute the extension of the file you are going to write. The extension will be the same as the one on the URL; if the URL has no extension, you default to .jpg:

 33     # Try and get the extension of the file from the link
 34     $link =~ /(\.[^.]*)$/;
 35
 36     # Define the extension of the file
 37     my $ext;
 38     if (defined($1)) {
 39         $ext = $1;
 40     } else {
 41         $ext = ".jpg";
 42     }
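
For a link that ends in .gif, the capture picks up the dot and everything after it. A quick standalone check (using the sample link from earlier):

    my $link = '/comics/dilbert/archive/images/dilbert2004183061028.gif';
    if ($link =~ /(\.[^.]*)$/) {
        print "$1\n";    # prints ".gif"
    }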

Now comes the only tricky part of the code. You have a relative link, and you need to turn it into an absolute one. Perl has a package for just about everything, but you have to know what to ask for. The language used for specifying web pages is HTML, and the protocol used for web communication is HTTP. It turns out that the package that transforms relative links into absolute ones lives under neither of those two names.

Instead, it's filed under URI, for Uniform Resource Identifier, the name of the format used to specify links. So you use the URI package to turn your relative link into an absolute one:

 44     my $uri = URI->new($link);
 45     my $abs_link = $uri->abs($page);
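
For example, combining the relative Dilbert link with its base page gives back the full URL:

    use URI;

    my $uri = URI->new('/comics/dilbert/archive/images/dilbert2004183061028.gif');
    print $uri->abs('http://www.dilbert.com/')->as_string(), "\n";
    # prints http://www.dilbert.com/comics/dilbert/archive/images/dilbert2004183061028.gif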

Next you get the header of the image. The first thing this tells you is whether or not the link is broken. (On my favorite editorial cartoon site, there is frequently trouble keeping the servers up.) Here's the code:

 47     # Get the heading information of the link
 48     # (the modification time goes into $head[2])
 49     my @head = head($abs_link->as_string());
 50     if ($#head == -1) {
 51         print "$name Broken link: ",
 52             $abs_link->as_string(), "\n";
 53         return;
 54     }

The head function of the LWP::Simple module returns the document type, length, modification time, and other information. The modification time is in field number 2. This is checked against the modification time of the last page you downloaded.

If they are the same, you skip this page:

 55     if (defined($file_info{$name})) {
 56         # If we've downloaded this one before
 57         if ($head[2] == $file_info{$name}) {
 58             print "Skipping $name\n";
 59             return;
 60         }
 61     }
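
If you're curious what head() actually hands back, a tiny standalone test shows it (www.example.com is just a stand-in URL):

    use LWP::Simple;

    # In list context head() returns (content type, document length,
    # modification time, expires, server); the time is in epoch seconds.
    my @head = head('http://www.example.com/');
    if (@head and defined $head[2]) {
        print scalar(localtime($head[2])), "\n";
    }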

A new comic has arrived. Store its modification time for future reference:

 62     # Set the file information
 63     $file_info{$name} = $head[2];

Now download the comic and write it out:

 71     my $raw_data = get($abs_link->as_string());
...
 83     if (not open(OUT_FILE, ">$out_name")) {
 84         print "Unable to create $out_name\n";
 85         return;
 86     }
 87     binmode OUT_FILE;
 88     print OUT_FILE $raw_data;
 89     close OUT_FILE;
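
Incidentally, the binmode call is essential: without it, Perl on Windows would translate line endings inside the binary image data and corrupt the file. With a lexical filehandle and a three-argument open, the same write would look like this (behavior unchanged):

    open my $out, ">", $out_name or do {
        print "Unable to create $out_name\n";
        return;
    };
    binmode $out;          # image data is binary; don't translate newlines
    print $out $raw_data;
    close $out;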

After all the files are closed, the only thing left is a little post-download cleanup. All you need to do is write out the file information (name and modification date pairs) so that only the new stuff gets downloaded on the next run:

165 open INFO_FILE, ">$info_file" or
166    die("Could not create $info_file");
167
168 foreach my $cur_name (sort keys %file_info) {
169     print INFO_FILE "$cur_name\t$file_info{$cur_name}\n";
170 }
171 close (INFO_FILE);

Hacking the Script

Although the script is designed for comics, it can be used any time you need to grab a web page, locate a link, and get content.

Another neat trick would be to not only download the data but also create a web page that displays all your new comics. That way, you create your own morning paper that consists of nothing but comics. After all, comics are the only useful part of the paper. With a little Perl, you can create the perfect web paper: all comics and no news.
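
As a starting point, here's a minimal sketch of that idea. It assumes the comics directory created by the main script; the page layout is purely illustrative:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Build a simple index page from whatever images are in comics/.
    opendir my $dir, "comics" or die("Could not open comics directory");
    my @images = sort grep { /\.(gif|jpe?g|png)$/i } readdir($dir);
    closedir $dir;

    open my $out, ">", "comics/index.html"
        or die("Could not create comics/index.html");
    print $out "<html><head><title>The Morning Comics</title></head><body>\n";
    foreach my $image (@images) {
        print $out "<h2>$image</h2>\n<img src=\"$image\" alt=\"$image\">\n";
    }
    print $out "</body></html>\n";
    close $out;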
