One of the most vexing problems facing a webmaster is making sure that all the links on their website are correct. Internal links are difficult to deal with. Every time a file is added, removed, or changed on your website, there is the possibility of generating dead links.
External links are even worse. Not only are they not under your control, but they disappear without a moment's notice.
What's needed is a way of automatically checking a site for links that just don't work. That's where Perl comes in.
  1 #
  2 # Usage: site-walk.pl <top-url>
  3 #
  4 use strict;
  5 use warnings;
  6
  7 use HTML::SimpleLinkExtor;
  8 use LWP::Simple;
  9 use URI::URL;
 10
 11 my $top_url;    # The URL at the top of the tree
 12
 13 # Indexed by link name
 14 # Value =
 15 #   Internal -- Good internal link
 16 #   External -- Good External link
 17 #   Broken   -- Broken link
 18 my %links;
 19
 20 ##########################################################
 21 # is_ours($url) -- Check to see if a URL is part of this
 22 #       website.
 23 #
 24 # Returns
 25 #       undef -- not us
 26 #       1 -- URL part of this website
 27 ##########################################################
 28 sub is_ours($)
 29 {
 30     my $url = shift;    # The URL to check
 31
 32     if (substr($url, 0, length($top_url)) ne $top_url) {
 33         return (undef);
 34     }
 35     return (1);
 36 }
 37
 38 ########################################################
 39 # process_url($url)
 40 #
 41 # Read an html page and extract the tags.
 42 #
 43 # Set $links{$url} to Broken, Internal, External
 44 # depending on the nature of the url
 45 ########################################################
 46 no warnings 'recursion';    # Turn off recursion warning
 47
 48 sub process_url($);    # Needed because this is recursive
 49 sub process_url($)
 50 {
 51     my $url = shift;    # The URL to process
 52
 53     # Did we do it already
 54     if (defined($links{$url})) {
 55         return;
 56     }
 57     # It's bad unless we know it's OK
 58     $links{$url} = "Broken";
 59
 60     my @head_info = head($url);
 61     if ($#head_info == -1) {
 62         return;    # The link is bad
 63     }
 64
 65     $links{$url} = "External";
 66
 67     # Return if it does not belong to this tree
 68     if (not is_ours($url)) {
 69         return;
 70     }
 71     $links{$url} = "Internal";
 72
 73     # If the document length is not defined then it's
 74     # probably a CGI script
 75     if (not defined($head_info[1])) {
 76         return;
 77     }
 78
 79     # Is this an HTML page?
 80     if ($head_info[0] !~ m{^text/html}) {
 81         return;
 82     }
 83
 84     # The parser object to extract the list
 85     my $extractor = HTML::SimpleLinkExtor->new();
 86
 87     my $data = get($url);
 88     if (not defined($data)) {
 89         $links{$url} = "Broken";
 90         return;
 91     }
 92
 93     # Parse the file
 94     $extractor->parse($data);
 95
 96     # The list of all the links in the file
 97     my @all_links = $extractor->links();
 98
 99     # Check each link
100     foreach my $cur_link (@all_links) {
101         # The page as URL object
102         my $page = URI::URL->new($cur_link, $url);
103
104         # The absolute version of the URL
105         my $full = $page->abs();
106
107         # Now go through the URL types we know about
108         # and check what we can check
109         if ($full =~ /^ftp:/) {
110             next;    # Ignore ftp links
111         } elsif ($full =~ /^mailto:/) {
112             next;    # Ignore mailto links
113         } elsif ($full =~ /^http:/) {
114             process_url($full);
115         } else {
116             print "Strange URL: $full -- Skipped.\n";
117         }
118     }
119 }
120 # Turn the deep recursion warning back on
121 use warnings 'recursion';
122
123 if ($#ARGV != 0) {
124     print STDERR "Usage: $0 <top-url>\n";
125     exit(8);
126 }
127 $top_url = $ARGV[0];
128
129 process_url($top_url);
130
131 my @internal;    # List of internal links
132 my @external;    # List of external links
133 my @broken;      # List of broken links
134 my @strange;     # List of strange links
135 # If we get any strange links, something broke in the program
136
137 # Sort the links into categories
138 foreach my $cur_key (keys %links) {
139     if ($links{$cur_key} eq "Internal") {
140         push(@internal, $cur_key);
141     } elsif ($links{$cur_key} eq "External") {
142         push(@external, $cur_key);
143     } elsif ($links{$cur_key} eq "Broken") {
144         push(@broken, $cur_key);
145     } else {
146         push(@strange, $cur_key);
147     }
148 }
149
150 #
151 # Print the results
152 #
153 print "Internal\n";
154 foreach my $cur_url (sort @internal) {
155     print "    $cur_url\n";
156 }
157 print "External\n";
158 foreach my $cur_url (sort @external) {
159     print "    $cur_url\n";
160 }
161 print "Broken\n";
162 foreach my $cur_url (sort @broken) {
163     print "    $cur_url\n";
164 }
165 if ($#strange != -1) {
166     print "Strange\n";
167     foreach my $cur_url (sort @strange) {
168         print "    $cur_url\n";
169     }
170 }
The script takes one argument: the top-level URL for the website:
$ site-check.pl http://www.oualline.com
The script checks the given URL and all URLs on that site -- or, more precisely, the top URL and every URL that begins with the top URL as a prefix.
Internal
    http://www.oualline.com
    http://www.oualline.com/10/.vimrc
    http://www.oualline.com/10/top_10.html
    http://www.oualline.com/10/vimrc.html
    http://www.oualline.com/cgi-bin/errata.pl?book=c
    http://www.oualline.com/cgi-bin/errata.pl?book=cpp
    http://www.oualline.com/cgi-bin/errata.pl?book=vim
    http://www.oualline.com/col/bully.html
    http://www.oualline.com/col/check.html
    http://www.oualline.com/col/cpm.html
    http://www.oualline.com/col/excuse.html
    ...more links omitted...
External
    http://www.exam-ta.ac.uk/practicalc.htm
    http://www.nostarch.com/hownotc.htm
    http://www.nostarch.com/images/hownotc_cov.gif
    http://www.openoffice.org/
    http://www.powaymidlandrr.org/
    http://www.vim.org/
Broken
    http://www.amazon.com/exec/obidos/ts/book-reviews/0764531050/thedanubetravelg/002-3438930-8810611
    http://www.newriders.com/appendix/0735710015.pdf
    http://www.newriders.com/books/title.cfm?isbn=0735710015
    http://www.oualline.com/hello/hello1_pl_4.html
    http://www.oualline.com/hello/hello1_pl_a.html
    http://www.oualline.com/ship/ins/ins.sxi
    http://www.oualline.com/teach/slides/port.pdf
The process is fairly simple:
1. Read a web page.
2. Check to make sure that all the links on it are correct.
3. If any link on the page is a link to this website, repeat the process for that link.
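Those three steps amount to a depth-first walk of the site. Here is a minimal sketch of just that loop; the helper names is_ours and get_links are illustrative placeholders, not functions from the real script:

```perl
use strict;
use warnings;

my %seen;    # URLs already visited

# walk($url, $is_ours, $get_links) -- visit $url, then recurse
# into every link on it that belongs to our own site.
sub walk {
    my ($url, $is_ours, $get_links) = @_;

    return if $seen{$url}++;            # never check a URL twice
    return unless $is_ours->($url);     # only recurse into our site

    walk($_, $is_ours, $get_links) for $get_links->($url);
}
```

The real script folds the "already seen" test and the per-URL classification into a single hash, %links, as described below.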
In practice, things are not quite that simple; there are about 5,000 little details to worry about. Most of the actual checking work is done in the process_url function. Its job is to process a URL and record the result in a hash called %links. The key of %links is the URL itself, and the value is Broken, External, or Internal.
The first thing the function does is check to see if it already has processed this URL. After all, there's no reason to do the same work twice:
 53     # Did we do it already
 54     if (defined($links{$url})) {
 55         return;
 56     }
You start by assuming the worst: specifically, that the link is broken. If it later passes all tests, you'll change its status to something else:
 57     # It's bad unless we know it's OK
 58     $links{$url} = "Broken";
The next step is to actually check the link. For this, you use the head function from the LWP::Simple package. This not only checks the link but gives you some information that you use later. However, if the head function returns nothing, the link is broken and you give up at this point (leaving $links{$url} set to Broken):
 60     my @head_info = head($url);
 61     if ($#head_info == -1) {
 62         return;    # The link is bad
 63     }
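For reference, head() in list context returns five values on success and an empty list on failure; the script uses only the first two. A standalone probe might look like this (it needs network access, and the URL is just an example):

```perl
use strict;
use warnings;
use LWP::Simple;

# head() in list context returns:
#   [0] content type     (e.g. "text/html")
#   [1] document length  (undefined for dynamically generated pages)
#   [2] modified time
#   [3] expires
#   [4] server
my @head_info = head('http://www.oualline.com/');
if ($#head_info == -1) {
    print "Link is broken\n";
} else {
    print "Type:   $head_info[0]\n";
    print "Length: ",
        defined($head_info[1]) ? $head_info[1] : "(unknown)", "\n";
}
```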
At this point, you know the URL is good, so you assume that it is an external link and then test your assumption by calling is_ours. If the assumption is true, you're done and no further processing is needed:
 65     $links{$url} = "External";
 66
 67     # Return if it does not belong to this tree
 68     if (not is_ours($url)) {
 69         return;
 70     }
The is_ours subroutine is very simple. All it does is check to see if the beginning of the URL matches the top web page you started with:
 28 sub is_ours($)
 29 {
 30     my $url = shift;    # The URL to check
 31
 32     if (substr($url, 0, length($top_url)) ne $top_url) {
 33         return (undef);
 34     }
 35     return (1);
 36 }
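One caveat worth knowing: this is purely textual prefix matching, so it can occasionally claim a little too much. The following standalone demonstration (with a hard-coded $top_url) shows the normal cases and the edge case:

```perl
use strict;
use warnings;

my $top_url = 'http://www.oualline.com/col';

sub is_ours {
    my $url = shift;    # The URL to check

    return undef if substr($url, 0, length($top_url)) ne $top_url;
    return 1;
}

# The normal cases work as expected...
print is_ours('http://www.oualline.com/col/check.html') ? "ours\n" : "not ours\n";
print is_ours('http://www.vim.org/') ? "ours\n" : "not ours\n";

# ...but a plain prefix test also accepts this URL, which is
# outside the /col directory:
print is_ours('http://www.oualline.com/columns.html') ? "ours\n" : "not ours\n";
```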
Back to your process_url function: You've figured out that the URL is good and now know that it's one of yours. This means that it is an internal link:
 71     $links{$url} = "Internal";
Your link-checking program now needs to go through this internal URL and look for any links it may contain. But there are certain types of URLs that you don't want to scan. These include dynamically generated data (i.e., CGI scripts). Because the web server does not know the length of dynamic data, the length field of the header ($head_info[1]) is undefined. If you find such a header, you don't scan the URL:
 75     if (not defined($head_info[1])) {
 76         return;
 77     }
A website can contain many different types of files, such as images, raw text, and binary data. Only an HTML page can contain links, so you check the header to make sure that the MIME type ($head_info[0]) is "text/html":
 79     # Is this an HTML page?
 80     if ($head_info[0] !~ m{^text/html}) {
 81         return;
 82     }
If you get this far, you have an internal URL of an HTML page, and you need to check every link on that page. First you grab the page using the get function from the LWP::Simple module (if this fails, the link became broken between the time you called the head function and now):
 87     my $data = get($url);
 88     if (not defined($data)) {
 89         $links{$url} = "Broken";
 90         return;
 91     }
You've got the page; now you need the links. Perl has a module called HTML::SimpleLinkExtor that will parse a web page, figure out what links it contains, and return them to you as an array.
 84     # The parser object to extract the list
 85     my $extractor = HTML::SimpleLinkExtor->new();
...
 93     # Parse the file
 94     $extractor->parse($data);
 95
 96     # The list of all the links in the file
 97     my @all_links = $extractor->links();
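HTML::SimpleLinkExtor can also be exercised on its own, without fetching anything over the network (assuming the module is installed). Note that links() reports every link-like attribute, including <img src=...>, not just <a href=...>:

```perl
use strict;
use warnings;
use HTML::SimpleLinkExtor;

my $html = <<'END_HTML';
<html><body>
<a href="check.html">Check</a>
<img src="logo.gif">
<a href="http://www.vim.org/">Vim</a>
</body></html>
END_HTML

my $extractor = HTML::SimpleLinkExtor->new();
$extractor->parse($html);

# One line for each link found: check.html, logo.gif,
# and http://www.vim.org/
print "$_\n" for $extractor->links();
```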
Now all you have to do is go through each one and check it:
100 foreach my $cur_link (@all_links) {
Unfortunately, this is not as simple as calling process_url on each link. First of all, links come in two flavors, absolute and relative. An absolute link looks like this:
http://www.oualline.com/vim_cook.html
A relative link looks like this:
check.html
Since you started on the page:
http://www.oualline.com/col/
the actual absolute URL you want to use is:
http://www.oualline.com/col/check.html
Again, there is a Perl module, URI::URL, that can take a relative URL and turn it into an absolute one. Once you have the absolute URL, you can feed it back into the process_url function for checking:
100     foreach my $cur_link (@all_links) {
101         # The page as URL object
102         my $page = URI::URL->new($cur_link, $url);
103
104         # The absolute version of the URL
105         my $full = $page->abs();
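The same resolution can be seen in a standalone snippet (assuming the URI distribution, which provides URI::URL, is installed; the base URL here is just an example page):

```perl
use strict;
use warnings;
use URI::URL;

my $base = 'http://www.oualline.com/col/cpm.html';

# A relative link is resolved against the page it appeared on
my $page = URI::URL->new('check.html', $base);
print $page->abs(), "\n";    # http://www.oualline.com/col/check.html

# An absolute link passes through unchanged
print URI::URL->new('http://www.vim.org/', $base)->abs(), "\n";
```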
You finally have a URL that you can check. But not all URLs are checkable. For example, there is no way to check a mailto-type URL. So as a final filter, you examine the URL and check only the protocols you know about, specifically HTTP. The FTP and mailto protocols are not checked. When the script encounters a protocol it doesn't know about, such as telnet (e.g., telnet://www.terminalserver.com) or ed2k (e.g., ed2k://ed2k.fileshare.com/moves/5135.ed2k), it logs the URL. That way the user is aware that something strange has been seen and can decide what to do about it.
107         # Now go through the URL types we know about
108         # and check what we can check
109         if ($full =~ /^ftp:/) {
110             next;    # Ignore ftp links
111         } elsif ($full =~ /^mailto:/) {
112             next;    # Ignore mailto links
113         } elsif ($full =~ /^http:/) {
114             process_url($full);
115         } else {
116             print "Strange URL: $full -- Skipped.\n";
117         }
118     }
119 }
After process_url does its work, you have a hash called %links that contains the results. You need to sort out the elements of this hash into something more useful, so you go through the hash and produce the arrays @internal, @external, and @broken. If something goes wrong with your program, you stick any unknown hash entry in the @strange array:
137 # Sort the links into categories
138 foreach my $cur_key (keys %links) {
139     if ($links{$cur_key} eq "Internal") {
140         push(@internal, $cur_key);
141     } elsif ($links{$cur_key} eq "External") {
142         push(@external, $cur_key);
143     } elsif ($links{$cur_key} eq "Broken") {
144         push(@broken, $cur_key);
145     } else {
146         push(@strange, $cur_key);
147     }
148 }
What's left is to print the result. First you print the internal links:
153 print "Internal\n";
154 foreach my $cur_url (sort @internal) {
155     print "    $cur_url\n";
156 }
The external, broken, and strange links are printed in a similar manner.
The script does a good job of checking HTTP-type links. However, no checking is done of mailto- and FTP-type links. Code could be added to verify that a mailto link points to a valid email address. It's also possible to check that the server in an FTP link exists, and with a little more code, you could check the link itself.
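As a sketch of the FTP idea, you could at least verify that the host in an ftp:// link accepts connections. This is an assumption-laden outline, not part of the original script; it uses Net::FTP (from the standard libnet distribution) and URI to pull the hostname out of the link:

```perl
use strict;
use warnings;
use Net::FTP;
use URI;

# ftp_host_ok($url) -- return 1 if the FTP server named in $url
# answers, 0 otherwise.  This checks only the server, not the
# path within the link.
sub ftp_host_ok {
    my $url  = shift;                 # The ftp:// URL to check
    my $host = URI->new($url)->host;  # Just the server name

    my $ftp = Net::FTP->new($host, Timeout => 10)
        or return 0;    # Could not connect
    $ftp->quit;
    return 1;
}
```

Checking the path as well would mean logging in (probably anonymously) and asking the server about the named file before giving the link a clean bill of health.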
There are other protocols that are not covered by this script, including things like RST, telnet, and HTTPS. These can easily be added.
The basic framework is there, and with a little hacking it can easily be expanded.