You now have a list of all the files accessible (directly or indirectly) from the top web page. This list is made up of the keys of the %file_seen hash. To check for orphans, you also need the names of all the files in the web site's directory tree.
This is accomplished with the File::Find module. Actually, the code that calls this module was generated by your old friend find2perl and then adapted to the program's needs.
The find is kicked off with this statement:

    # Traverse desired filesystems
    File::Find::find({wanted => \&wanted}, dirname($ARGV[0]));
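The wanted function just puts each plain file on the list. File::Find changes into each directory before it calls wanted, so the file test is done on $_ (the name within the current directory), while the full path in $name is what gets recorded:

    sub wanted {
        if (-f $_) {
            push(@full_file_list, $name);
        }
        return (1);
    }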
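As a rough, self-contained sketch of the same idea, the code below collects every plain file under a directory with File::Find. The helper name list_files, the temporary directory, and the file names in it are assumptions made for illustration; they are not part of link-check.pl. The sketch also sets the no_chdir option, which is a different choice from the program's: with it, File::Find stays in the starting directory, so $File::Find::name can be tested directly.

```perl
use strict;
use warnings;
use File::Find ();
use File::Temp qw(tempdir);
use File::Spec::Functions qw(catfile catdir);

# Hypothetical helper: return every plain file under $top_dir, sorted.
sub list_files {
    my ($top_dir) = @_;
    my @found;
    File::Find::find(
        {
            # With no_chdir, find() does not change directory, so the
            # full path in $File::Find::name is valid for the -f test.
            no_chdir => 1,
            wanted   => sub {
                push(@found, $File::Find::name) if -f $File::Find::name;
            },
        },
        $top_dir
    );
    my @sorted = sort @found;
    return @sorted;
}

# Build a throwaway tree (made-up layout, for illustration only).
my $site = tempdir(CLEANUP => 1);
mkdir(catdir($site, 'docs')) or die "mkdir failed: $!";
foreach my $path (catfile($site, 'index.html'),
                  catfile($site, 'docs', 'a.html')) {
    open(my $fh, '>', $path) or die "Cannot create $path: $!";
    close($fh);
}

my @files = list_files($site);
print "$_\n" foreach @files;
```

Directories are visited but not recorded, because only entries that pass the -f test are pushed onto the list.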
Now that you have two lists, one of the files accessible on the web site, and the other containing all the files on the web site, comparing the two lists produces a list of orphans:
    print "Orphan Files\n";
    foreach my $cur_file (sort @full_file_list)
    {
        if (not defined($file_seen{$cur_file})) {
            print "\t$cur_file\n";
        }
    }
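The comparison is a set difference: everything in the full list that is not a key of the seen hash. Here is a minimal sketch of that step on its own, assuming a hypothetical helper named find_orphans and made-up file names. It uses exists rather than defined, which gives the same result here because every seen file is stored with the value 1.

```perl
use strict;
use warnings;

# Hypothetical helper: return the entries of @$all_files that are
# not keys of %$seen, in sorted order.
sub find_orphans {
    my ($seen, $all_files) = @_;
    my @orphans = grep { !exists $seen->{$_} } @$all_files;
    @orphans = sort @orphans;
    return @orphans;
}

# Made-up sample data for illustration.
my %file_seen      = ('site/index.html' => 1, 'site/about.html' => 1);
my @full_file_list = ('site/index.html', 'site/about.html', 'site/old.html');

my @orphans = find_orphans(\%file_seen, \@full_file_list);
print "Orphan Files\n";
print "\t$_\n" foreach @orphans;
```

Because the lookup is a hash access, the cost grows with the number of files on disk rather than with the product of the two list sizes. The paths in both lists must be spelled the same way (for example, both with or both without a leading "./") for the lookup to match.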
During the file processing, you stored away a list of external links (@external_links). To check whether they are valid, you go through the list and test each link with the LWP::Simple module.
Specifically, you use the module's head function to retrieve just the header for each link. If head returns a header, the link is valid; if it returns nothing, the link is reported as broken:
    print "Broken External Links\n";
    foreach my $cur_file (sort @external_links) {
        if (not (head($cur_file->{link}))) {
            print "\t$cur_file->{file} => $cur_file->{link}\n";
        }
    }
You could have grabbed the entire web page using the get function, but the header works just as well and uses less bandwidth.
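The check can be sketched in isolation as follows. The helper name broken_links and the sample links are assumptions for illustration. So that the example runs without a network connection, the helper accepts an optional code reference that decides whether a link is alive; only when none is given does it fall back to LWP::Simple's head function, which returns an empty list on failure.

```perl
use strict;
use warnings;

# Hypothetical helper: return the entries of @$links whose {link}
# fails the check. $is_alive is an optional code ref; by default
# LWP::Simple::head is used (LWP::Simple must then be installed).
sub broken_links {
    my ($links, $is_alive) = @_;
    if (not defined($is_alive)) {
        require LWP::Simple;
        $is_alive = sub {
            # head() returns an empty list when the request fails
            my @header = LWP::Simple::head($_[0]);
            return scalar(@header) ? 1 : 0;
        };
    }
    my @broken = grep { not $is_alive->($_->{link}) } @$links;
    return @broken;
}

# Made-up sample data in the same shape as @external_links.
my @external_links = (
    { file => 'index.html', link => 'http://example.com/' },
    { file => 'about.html', link => 'http://example.invalid/gone' },
);

# Stub checker for the demo: treat hosts ending in .invalid as dead.
my $stub = sub { return $_[0] !~ m{^http://[^/]*\.invalid/} };

my @broken = broken_links(\@external_links, $stub);
print "Broken External Links\n";
print "\t$_->{file} => $_->{link}\n" foreach @broken;
```

Calling broken_links(\@external_links) with no second argument would perform real HEAD requests, as the program does.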
I originally wrote the link-check.pl program in C++. It was several thousand lines long. As you can see in Listing 16.8, the Perl version of this program is only 186 lines long.
The reason for this vast difference is that Perl supplies modules that do most of the work. Also, the rich set of language features means that you can concentrate on getting the logic right and not have to worry about memory allocations, array bounds, and other such details—and therein lies the true power of Perl.

    =pod

    =head1 NAME

    link-check.pl - Check the links in a local web site

    =head1 SYNOPSIS

        perl link-check.pl <top-level-file>

    =head1 DESCRIPTION

    The I<link-check.pl> program checks a web site for:

    =over 4

    =item *

    Broken internal links

    =item *

    Orphan files (files that no one links to)

    =item *

    Broken external links

    =back

    The program first makes a catalog of all the links in
    the I<top-level-file> and all the files linked to by
    all the files accessible from the I<top-level-file>.
    (This includes all the directly accessible files as well
    as all the files that are accessed through multiple links.)

    It then goes through the directory tree in which the
    I<top-level-file> resides and gets a list of all the
    available files.  Any file that's in the directory that's
    not accessible from I<top-level-file> is considered an
    orphan.

    =cut
    #
    # Usage: link_check <top-file>
    #
    use strict;
    use warnings;

    use HTML::SimpleLinkExtor;
    use LWP::Simple;
    use File::Basename;
    use File::Spec::Functions;
    use File::Find ();

    # Generated by find2perl
    # for the convenience of &wanted calls, including -eval statements:
    use vars qw/*name *dir *prune/;
    *name   = *File::Find::name;
    *dir    = *File::Find::dir;
    *prune  = *File::Find::prune;

    my %file_seen = ();     # True if we've seen a file
    my @external_links = ();# List of external links

    my @bad_files = ();     # Files we did not see
    my @full_file_list = ();# List of all the files


    ########################################################
    # wanted -- Called by the find routine, this returns
    #       true if the file is wanted.  As a side effect
    #       it records any normal file seen in "full_file_list".
    ########################################################
    sub wanted {
        if (-f $_) {    # find() has changed directory, so test $_
            push(@full_file_list, $name);
        }
        return (1);
    }

    ########################################################
    # process_file($file)
    #
    # Read an html file and extract the tags.
    #
    # If the file does not exist, put it in the list of
    # bad files.
    ########################################################
    sub process_file($);    # Needed because this is recursive
    sub process_file($)
    {
        my $file_name = shift;  # The file to process
        my $dir_name = dirname($file_name);

        # Did we do it already
        if ($file_seen{$file_name}) {
            return;
        }
        $file_seen{$file_name} = 1;
        if (! -f $file_name) {
            push(@bad_files, $file_name);
            return;
        }

        if (($file_name !~ /\.html$/) and ($file_name !~ /\.htm$/)) {
            return;
        }
        # The parser object to extract the list
        my $extractor = HTML::SimpleLinkExtor->new();

        # Parse the file
        $extractor->parse_file($file_name);

        # The list of all the links in the file
        my @all_links = $extractor->links();

        # Check each link
        foreach my $cur_link (@all_links) {

            # Is the link external
            if ($cur_link =~ /^http:\/\//) {
                # Put it on the list of external links
                push(@external_links, {
                    file => $file_name,
                    link => $cur_link});
                next;
            }
            # Get the name of the file
            my $next_file = "$dir_name/$cur_link";

            # Remove any funny characters in the name
            $next_file = File::Spec->canonpath($next_file);

            # Follow the links in this file
            process_file($next_file);
        }
    }

    if ($#ARGV != 0) {
        print STDERR "Usage: $0 <top-file>\n";
        exit (8);
    }

    # Top level file
    my $top_file = $ARGV[0];
    if (-d $top_file) {
        $top_file .= "/index.html";
    }
    if (! -f $top_file) {
        print STDERR "ERROR: No such file $top_file\n";
        exit (8);
    }


    # Scan all the links
    process_file($top_file);

    # Traverse desired filesystems
    File::Find::find({wanted => \&wanted}, dirname($ARGV[0]));

    print "External links\n";
    foreach my $link (@external_links) {
        print "\t$link->{file} => $link->{link}\n";  # $link is a hash reference
    }

    print "Orphan Files\n";
    foreach my $cur_file (sort @full_file_list)
    {
        if (not defined($file_seen{$cur_file})) {
            print "\t$cur_file\n";
        }
    }

    print "Broken Internal Links\n";
    foreach my $cur_file (sort @bad_files)
    {
        print "\t$cur_file\n";
    }
    print "Broken External Links\n";
    foreach my $cur_file (sort @external_links) {
        if (not (head($cur_file->{link}))) {
            print "\t$cur_file->{file} => $cur_file->{link}\n";
        }
    }