#6 Website Link Checker

One of the most vexing problems facing a webmaster is making sure that all the links on their website are correct. Internal links are difficult to deal with. Every time a file is added, removed, or changed on your website, there is the possibility of generating dead links.

External links are even worse. Not only are they not under your control, but they disappear without a moment's notice.

What's needed is a way of automatically checking a site for links that just don't work. That's where Perl comes in.

The Code

   1 #
   2 # Usage: site-walk.pl <top-url>
   3 #
   4 use strict;
   5 use warnings;
   6
   7 use HTML::SimpleLinkExtor;
   8 use LWP::Simple;
   9 use URI::URL;
  10
  11 my $top_url;    # The URL at the top of the tree
  12
  13 # Indexed by link name
  14 # Value =
  15 #       Internal -- Good internal link
  16 #       External -- Good External link
  17 #       Broken   -- Broken link
  18 my %links;
  19
  20 ##########################################################
  21 # is_ours($url) -- Check to see if a URL is part of this
  22 #       website.
  23 #
  24 # Returns
  25 #       undef -- not us
  26 #       1 -- URL part of this website
  27 ##########################################################
  28 sub is_ours($)
  29 {
  30     my $url = shift;    # The URL to check
  31
  32     if (substr($url, 0, length($top_url)) ne $top_url) {
  33         return (undef);
  34     }
  35     return (1);
  36 }
  37
  38 ########################################################
  39 # process_url($url)
  40 #
  41 # Read an html page and extract the tags.
  42 #
  43 # Set $links{$url} to Broken, Internal, External
  44 # depending on the nature of the url
  45 ########################################################
  46 no warnings 'recursion';       # Turn off recursion warning
  47
  48 sub process_url($);     # Needed because this is recursive
  49 sub process_url($)
  50 {
  51     my $url = shift;    # The file url to process
  52
  53     # Did we do it already
  54     if (defined($links{$url})) {
  55         return;
  56     }
  57     # It's bad unless we know it's OK
  58     $links{$url} = "Broken";
  59
  60     my @head_info = head($url);
  61     if ($#head_info == -1) {
  62         return; # The link is bad
  63     }
  64
  65     $links{$url} = "External";
  66
  67     # Return if it does not belong to this tree
  68     if  (not is_ours($url)) {
  69         return;
  70     }
  71     $links{$url} = "Internal";
  72
  73     # If the document length is not defined then it's
  74     # probably a CGI script
  75     if (not defined($head_info[1])) {
  76         return;
  77     }
  78
  79     # Is this an HTML page?
  80     if ($head_info[0] !~ /^text\/html/) {
  81         return;
  82     }
  83
  84     # The parser object to extract the list
  85     my $extractor = HTML::SimpleLinkExtor->new();
  86
  87     my $data = get($url);
  88     if (not defined($data)) {
  89         $links{$url} = "Broken";
  90         return;
  91     }
  92
  93     # Parse the file
  94     $extractor->parse($data);
  95
  96     # The list of all the links in the file
  97     my @all_links = $extractor->links();
  98
  99     # Check each link
 100     foreach my $cur_link (@all_links) {
 101         # The page as URL object
 102         my $page = URI::URL->new($cur_link, $url);
 103
 104         # The absolute version of the URL
 105         my $full = $page->abs();
 106
 107         # Now go through the URL types we know about
 108         # and check what we can check
 109         if ($full =~ /^ftp:/) {
 110             next;       # Ignore ftp links
 111         } elsif ($full =~ /^mailto:/) {
 112             next;       # Ignore mailto links
 113         } elsif ($full =~ /^http:/) {
 114             process_url($full);
 115         } else {
 116             print "Strange URL: $full -- Skipped.\n";
 117         }
 118     }
 119 }
 120 # Turn deep recursion warning back on
 121 use warnings 'recursion';
 122
 123 if ($#ARGV != 0) {
 124     print STDERR "$0 <top-url>\n";
 125     exit(8);
 126 }
 127 $top_url = $ARGV[0];
 128
 129 process_url($top_url);
 130
 131 my @internal;   # List of internal links
 132 my @external;   # List of external links
 133 my @broken;     # List of broken links
 134 my @strange;    # List of strange links
 135 # If we get any strange links, something broke in the program
 136
 137 # Sort the links into categories
 138 foreach my $cur_key (keys %links) {
 139     if ($links{$cur_key} eq "Internal") {
 140         push(@internal, $cur_key);
 141     } elsif ($links{$cur_key} eq "External") {
 142         push(@external, $cur_key);
 143     } elsif ($links{$cur_key} eq "Broken") {
 144         push(@broken, $cur_key);
 145     } else {
 146         push(@strange, $cur_key);
 147     }
 148 }
 149
 150 #
 151 # Print the results
 152 #
 153 print "Internal\n";
 154 foreach my $cur_url (sort @internal) {
 155     print "\t$cur_url\n";
 156 }
 157 print "External\n";
 158 foreach my $cur_url (sort @external) {
 159     print "\t$cur_url\n";
 160 }
 161 print "Broken\n";
 162 foreach my $cur_url (sort @broken) {
 163     print "\t$cur_url\n";
 164 }
 165 if ($#strange != -1) {
 166     print "Strange\n";
 167     foreach my $cur_url (sort @strange) {
 168         print "\t$cur_url\n";
 169     }
 170 }

Running the Script

The script takes one argument: the top-level URL of the website:

$ site-walk.pl http://www.oualline.com

The script will check the given URL and all URLs on that site; more precisely, it checks the top URL and every URL that begins with the top URL as a prefix.

The Results

Internal
        http://www.oualline.com
        http://www.oualline.com/10/.vimrc
        http://www.oualline.com/10/top_10.html
        http://www.oualline.com/10/vimrc.html
        http://www.oualline.com/cgi-bin/errata.pl?book=c
        http://www.oualline.com/cgi-bin/errata.pl?book=cpp
        http://www.oualline.com/cgi-bin/errata.pl?book=vim
        http://www.oualline.com/col/bully.html
        http://www.oualline.com/col/check.html
        http://www.oualline.com/col/cpm.html
        http://www.oualline.com/col/excuse.html
...more links omitted...

External
        http://www.exam-ta.ac.uk/practicalc.htm
        http://www.nostarch.com/hownotc.htm
        http://www.nostarch.com/images/hownotc_cov.gif
        http://www.openoffice.org/
        http://www.powaymidlandrr.org/
        http://www.vim.org/

Broken
        http://www.amazon.com/exec/obidos/ts/book-reviews/0764531050/thedanubetravelg/002-3438930-8810611
        http://www.newriders.com/appendix/0735710015.pdf
        http://www.newriders.com/books/title.cfm?isbn=0735710015
        http://www.oualline.com/hello/hello1_pl_4.html
        http://www.oualline.com/hello/hello1_pl_a.html
        http://www.oualline.com/ship/ins/ins.sxi
        http://www.oualline.com/teach/slides/port.pdf

How It Works

The process is fairly simple:

  1. Read a web page.

  2. Check to make sure that all the links are correct.

  3. If any link on the page is a link to this website, repeat the process for this link.

In practice things are not quite that simple. There are about 5,000 little details to worry about. Most of the actual checking work is done in the process_url function. Its job is to process a URL and create a hash called %links that contains the results of that processing. The key of %links is the URL itself, and the value is Broken, External, or Internal.

The first thing the function does is check whether it has already processed this URL. After all, there's no reason to do the same work twice:

  53     # Did we do it already
  54     if (defined($links{$url})) {
  55         return;
  56     }

You start by assuming the worst: specifically, that the link is broken. If it later passes all tests, you'll change its status to something else:

  57     # It's bad unless we know it's OK
  58     $links{$url} = "Broken";

The next step is to actually check the link. For this, you use the head function from the LWP::Simple package. This not only checks the link but gives you some information that you use later. However, if the head function returns nothing, the link is broken and you give up at this point (leaving $links{$url} set to Broken):

  60     my @head_info = head($url);
  61     if ($#head_info == -1) {
  62         return; # The link is bad
  63     }
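
For reference, head returns a five-element list on success: the content type, the document length, the last-modified time, the expiration time, and the server name. Here is a minimal standalone sketch of that call (the URL is used only as an illustration):

    use strict;
    use warnings;
    use LWP::Simple;

    # Fetch just the headers for a page
    my ($type, $length, $mod_time, $expires, $server) =
        head("http://www.oualline.com/");

    if (defined($type)) {
        print "Content type: $type\n";
        print "Length:       ",
              defined($length) ? $length : "(unknown)", "\n";
    } else {
        print "Broken link\n";
    }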

At this point, you know the URL is good, so you assume that it is an external link and then test your assumption by calling is_ours. If the assumption is true, you're done and no further processing is needed:

  65     $links{$url} = "External";
  66
  67     # Return if it does not belong to this tree
  68     if  (not is_ours($url)) {
  69         return;
  70     }

The is_ours subroutine is very simple. All it does is check to see if the beginning of the URL matches the top web page you started with:

  28 sub is_ours($)
  29 {
  30     my $url = shift;    # The URL to check
  31
  32     if (substr($url, 0, length($top_url)) ne $top_url) {
  33         return (undef);
  34     }
  35     return (1);
  36 }
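
For example, with $top_url set to http://www.oualline.com, a URL such as http://www.oualline.com/col/check.html passes the prefix test, while http://www.vim.org/ does not. A short standalone illustration, with $top_url assigned by hand:

    use strict;
    use warnings;

    my $top_url = "http://www.oualline.com";

    sub is_ours($)
    {
        my $url = shift;    # The URL to check

        if (substr($url, 0, length($top_url)) ne $top_url) {
            return (undef);
        }
        return (1);
    }

    # Prints "ours" followed by "not ours"
    print is_ours("http://www.oualline.com/col/check.html") ? "ours\n" : "not ours\n";
    print is_ours("http://www.vim.org/") ? "ours\n" : "not ours\n";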

Back to your process_url function: You've figured out that the URL is good and now know that it's one of yours. This means that it is an internal link:

  71    $links{$url} = "Internal";

Your link-checking program now needs to go through this internal URL and look for any links that it may contain. But there are certain types of URLs that you don't want to check, such as dynamically generated data (i.e., the output of CGI scripts). Because the web server does not know the length of dynamic data, the length field of the header ($head_info[1]) is undefined. If you find such a header, you don't process the URL:

  75     if (not defined($head_info[1])) {
  76         return;
  77     }

A website can contain a lot of different types of files, such as images, raw text, and binary data. Only an HTML page can contain links. So you check the header to make sure that the MIME type ($head_info[0]) is "text/html":

  79     # Is this an HTML page?
  80     if ($head_info[0] !~ /^text\/html/) {
  81         return;
  82     }

If you get this far, then you have an internal URL that points to an HTML page. You need to check every link on this page. First you grab the page using the get function from the LWP::Simple module (if this fails, then the link became broken between the time you called the head function and now):

  87     my $data = get($url);
  88     if (not defined($data)) {
  89         $links{$url} = "Broken";
  90         return;
  91     }

You've got the page; now you need the links. Perl has a module called HTML::SimpleLinkExtor that will parse a web page, figure out what links it contains, and return them to you as an array.

  84     # The parser object to extract the list
  85     my $extractor = HTML::SimpleLinkExtor->new();
...
  92
  93     # Parse the file
  94     $extractor->parse($data);
  95
  96     # The list of all the links in the file
  97     my @all_links = $extractor->links();
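
If you want to see the module on its own, here is a small self-contained sketch (the HTML is made up for the example):

    use strict;
    use warnings;
    use HTML::SimpleLinkExtor;

    # A scrap of HTML to parse (invented for illustration)
    my $html = '<html><body>'
             . '<a href="check.html">Check</a>'
             . '<a href="http://www.vim.org/">Vim</a>'
             . '<img src="images/logo.png">'
             . '</body></html>';

    my $extractor = HTML::SimpleLinkExtor->new();
    $extractor->parse($html);

    # links() returns every link-like attribute it found:
    # href values, img src values, and so on
    foreach my $link ($extractor->links()) {
        print "$link\n";
    }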

Now all you have to do is go through each one and check it:

 100    foreach my $cur_link (@all_links) {

Unfortunately, this is not quite as simple as calling process_url on each link. First of all, there are two flavors of links: absolute and relative. An absolute link looks like this:

http://www.oualline.com/vim_cook.html

A relative link looks like this:

check.html

Since you started on the page:

http://www.oualline.com/col/

the actual absolute URL you want to use is:

http://www.oualline.com/col/check.html

Again, there is a Perl module, URI::URL, that can take a relative URL and turn it into an absolute one. Once you have the absolute URL, you can feed it back into the process_url function for checking:

 100     foreach my $cur_link (@all_links) {
 101         # The page as URL object
 102         my $page = URI::URL->new($cur_link, $url);
 103
 104         # The absolute version of the URL
 105         my $full = $page->abs();
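
Here is the same resolution as a standalone snippet (the base URL includes a page name, which is made up for the example):

    use strict;
    use warnings;
    use URI::URL;

    # Resolve a relative link against the page it appeared on
    my $page = URI::URL->new("check.html",
                             "http://www.oualline.com/col/index.html");

    # Prints: http://www.oualline.com/col/check.html
    print $page->abs(), "\n";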

You finally have a URL that you can check. But not all URLs are checkable. For example, there is no way to check a mailto-type URL. So as a final filter, you examine the URL and check only the protocols you know about, specifically HTTP. The FTP and mailto protocols are skipped. When the script encounters a protocol it doesn't know about, such as telnet (e.g., telnet://www.terminalserver.com) or ed2k (e.g., ed2k://ed2k.fileshare.com/moves/5135.ed2k), it prints a message so the user knows that something strange has been seen and can decide what to do about it.

 106
 107         # Now go through the URL types we know about
 108         # and check what we can check
 109         if ($full =~ /^ftp:/) {
 110             next;       # Ignore ftp links
 111         } elsif ($full =~ /^mailto:/) {
 112             next;       # Ignore mailto links
 113         } elsif ($full =~ /^http:/) {
 114             process_url($full);
 115         } else {
 116             print "Strange URL: $full -- Skipped.\n";
 117         }
 118     }
 119 }
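
Since $full is a URI object, an alternative to matching the string is to dispatch on its scheme method. A small sketch of that idea, written as a standalone helper (classify_url is an invented name, not part of the script):

    use strict;
    use warnings;
    use URI;

    # Classify a URL by its scheme: "check", "skip", or "strange"
    sub classify_url
    {
        my $url    = shift;
        my $scheme = URI->new($url)->scheme() || "";

        return "skip"  if $scheme eq "ftp" or $scheme eq "mailto";
        return "check" if $scheme eq "http";
        return "strange";
    }

    print classify_url("http://www.oualline.com/"), "\n";         # check
    print classify_url('mailto:webmaster@example.com'), "\n";     # skip
    print classify_url("telnet://www.terminalserver.com/"), "\n"; # strange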

After process_url does its work, you have a hash called %links that contains the results. You need to sort out the elements of this hash into something more useful, so you go through the hash and produce the arrays @internal, @external, and @broken. If something goes wrong with your program, you stick any unknown hash entry in the @strange array:

 137 # Sort the links into categories
 138 foreach my $cur_key (keys %links) {
 139     if ($links{$cur_key} eq "Internal") {
 140         push(@internal, $cur_key);
 141     } elsif ($links{$cur_key} eq "External") {
 142         push(@external, $cur_key);
 143     } elsif ($links{$cur_key} eq "Broken") {
 144         push(@broken, $cur_key);
 145     } else {
 146         push(@strange, $cur_key);
 147     }
 148 }

What's left is to print the result. First you print the internal links:

 153 print "Internal\n";
 154 foreach my $cur_url (sort @internal) {
 155     print "\t$cur_url\n";
 156 }

The external, broken, and strange links are printed in a similar manner.

Hacking the Script

The script does a good job of checking HTTP-type links. However, it does no checking of mailto and FTP links. Code could be added to verify that a mailto link points to a valid email address, and it's possible to check that the server named in an FTP link exists. With a little more code, you could check the FTP link itself. A rough sketch of the first two checks follows.
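
Here is a minimal sketch of what those checks might look like, assuming the Net::FTP and Net::DNS modules are available (the helper names ftp_server_ok and mailto_domain_ok, and the example URLs, are invented for this illustration):

    use strict;
    use warnings;
    use Net::FTP;
    use Net::DNS;
    use URI;

    # Does the server named in an ftp:// URL answer at all?
    # (This does not verify that the path on the server exists.)
    sub ftp_server_ok
    {
        my $url  = shift;
        my $host = URI->new($url)->host();

        my $ftp = Net::FTP->new($host, Timeout => 15) or return 0;
        $ftp->quit();
        return 1;
    }

    # Does the domain in a mailto: URL have a mail exchanger?
    # (This proves only that the domain accepts mail, not that
    # the individual mailbox exists.)
    sub mailto_domain_ok
    {
        my $url      = shift;
        my ($domain) = $url =~ /\@([^?]+)/;
        return 0 unless defined($domain);

        my @mx = mx(Net::DNS::Resolver->new(), $domain);
        return @mx ? 1 : 0;
    }

    print ftp_server_ok("ftp://ftp.example.com/")         ? "up\n"      : "down\n";
    print mailto_domain_ok('mailto:webmaster@example.com') ? "mail ok\n" : "no mail\n";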

There are other protocols that are not covered by this script, including things like RST, telnet, and HTTPS. These can easily be added.
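
HTTPS, for example, mostly needs the dispatch in process_url widened to accept the https: scheme, plus an LWP installation that can handle SSL (typically via the LWP::Protocol::https module). A sketch of the changed branch, replacing the http: test on lines 113-114 of the listing:

    } elsif ($full =~ /^https?:/) {
        # Follow both http: and https: links
        process_url($full);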

The basic framework is there, and with a little hacking it can easily be expanded.
