Orphan Detection

You now have a list of all the files reachable (directly or indirectly) from the top web page: the keys of the %file_seen hash. To check for orphans, you also need the names of all the files in the site's directory tree.

This is accomplished with the File::Find module. The code that calls this module was generated by your old friend find2perl and then adapted to the program's needs.

The find is kicked off with this statement, which passes find a reference to the wanted function (\&wanted):

160    # Traverse desired filesystems
161    File::Find::find({wanted => \&wanted}, dirname($ARGV[0]));

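The wanted function just puts the files on the list. By default, find changes into each directory as it goes, so the file test uses $_ (the name relative to the current directory), while the full path in $name is what goes on the list:

76    sub wanted {
77        if (-f $_) {
78            push(@full_file_list, $name);
79        }
80        return (1);
81    }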
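
To see the traversal in isolation, here is a minimal sketch that builds a small throwaway directory tree and collects its plain files with File::Find. The temporary directory and the three file names are hypothetical, chosen only so the example runs on its own; it also uses $File::Find::name directly instead of the aliased $name.

```perl
use strict;
use warnings;
use File::Find ();
use File::Temp qw(tempdir);
use File::Path qw(make_path);

# Build a tiny throwaway site so the example is self-contained.
# (The directory and file names here are made up for illustration.)
my $root = tempdir(CLEANUP => 1);
make_path("$root/docs");
for my $rel ('index.html', 'docs/guide.html', 'docs/old.html') {
    open(my $fh, '>', "$root/$rel") or die "Cannot create $root/$rel: $!";
    print {$fh} "<html></html>\n";
    close($fh);
}

my @full_file_list;

# find() calls this once for every entry it visits.  It has already
# changed into the entry's directory, so test $_ (the bare name) and
# record $File::Find::name (the path including the starting directory).
sub wanted {
    push(@full_file_list, $File::Find::name) if -f $_;
    return;
}

File::Find::find({ wanted => \&wanted }, $root);

# Print paths relative to the temporary root so the output is readable.
for my $path (sort @full_file_list) {
    (my $rel = $path) =~ s{^\Q$root\E/}{};
    print "$rel\n";
}
```

Directories are visited too, which is why the -f test is needed: only plain files end up on the list.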

You now have two lists: the files reachable from the top page, and all the files in the site's directory tree. Any file on the second list that does not appear on the first is an orphan:

168    print "Orphan Files\n";
169    foreach my $cur_file (sort @full_file_list)
170    {
171        if (not defined($file_seen{$cur_file})) {
172            print "\t$cur_file\n";
173        }
174    }
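
The comparison is a set difference, with the hash serving as the lookup table. The following sketch shows the same idea using grep; the file names are hypothetical stand-ins for what the two passes would collect.

```perl
use strict;
use warnings;

# Hypothetical data: files reached by following links ...
my %file_seen = map { $_ => 1 } ('site/index.html', 'site/about.html');

# ... and every file found in the directory tree.
my @full_file_list = (
    'site/about.html',
    'site/index.html',
    'site/old/draft.html',
    'site/logo.png',
);

# A file is an orphan if it is on disk but was never reached.
my @orphans = grep { !exists $file_seen{$_} } sort @full_file_list;

print "Orphan Files\n";
print "\t$_\n" for @orphans;
```

The lookup is an exact string match, so both lists must spell paths the same way. A name such as ./index.html in one list will not match index.html in the other.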

External Link Checking

During the file processing, you stored away a list of external links (@external_links). To find out whether they are valid, you go through the list and test each link with the LWP::Simple module.

Specifically, you use the module's head function to retrieve the header for each link. If head returns a header, the link is valid:

181    print "Broken External Links\n";
182    foreach my $cur_file (sort @external_links) {
183        if (not (head($cur_file->{link}))) {
184            print "\t$cur_file->{file} => $cur_file->{link}\n";
185        }
186    }

You could have grabbed the entire web page using the get function, but the header works just as well and uses less bandwidth.
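
The check can also be wrapped in a small function, which makes it easy to try out without touching the network. The sketch below is one possible arrangement, not the program's own code: the broken_links function and its optional second argument are hypothetical, and the stub simply stands in for head so the example runs offline.

```perl
use strict;
use warnings;

# Return the entries whose link fails the check.  $check is any code
# reference that returns true for a reachable URL.  If none is given,
# LWP::Simple's head() is loaded and used.
sub broken_links {
    my ($links, $check) = @_;
    if (not defined $check) {
        require LWP::Simple;
        $check = \&LWP::Simple::head;
    }
    # The scalar assignment calls the checker in scalar context, where
    # head() returns a true value on success and undef on failure.
    return grep { my $ok = $check->($_->{link}); not $ok } @{$links};
}

# Hypothetical data in the same shape as @external_links.
my @external_links = (
    { file => 'index.html', link => 'http://www.example.com/' },
    { file => 'about.html', link => 'http://www.example.com/missing' },
);

# Stub checker: pretends that any URL containing "missing" is broken.
# Leave out the second argument to issue real HEAD requests instead.
my $stub = sub { my ($url) = @_; return $url !~ /missing/ };

print "Broken External Links\n";
for my $entry (broken_links(\@external_links, $stub)) {
    print "\t$entry->{file} => $entry->{link}\n";
}
```

One caveat when using real requests: a server that does not handle HEAD requests properly may be reported as broken even though the page itself can be fetched.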

Summary of Web Site Checker

I originally wrote the link-check.pl program in C++. It was several thousand lines long. As you can see in Listing 16.8, the Perl version of this program is only 186 lines long.

The reason for this vast difference is that Perl supplies modules that do most of the work. Also, the rich set of language features means that you can concentrate on getting the logic right and not have to worry about memory allocations, array bounds, and other such details. Therein lies the true power of Perl.

Listing 16.8. link-check.pl
  1 
  2    =pod 
  3 
  4    =head1 NAME 
  5 
  6    link-check.pl - Check the links in a local web site
  7 
  8    =head1 SYNOPSIS 
  9 
 10        perl link-check.pl <top-level-file>
 11 
 12    =head1 DESCRIPTION 
 13 
 14    The I<link-check.pl> program checks a web site for:
 15 
 16    =over 4 
 17 
 18    =item * 
 19 
 20    Broken internal links 
 21 
 22    =item * 
 23 
 24    Orphan files (files that no one links to) 
 25 
 26    =item * 
 27 
 28    Broken external links 
 29 
 30    =back 
 31 
 32    The program first makes a catalog of all the links in 
 33    the I<top-level-file> and all the files linked to by
 34    all the files accessible from the I<top-level-file>.
 35    (This includes all the directly accessible files as well
 36    as all the files that are accessed through multiple links.)
 37
 38    It then goes through the directory tree in which the
 39    I<top-level-file> resides and gets a list of all the
 40    available files.   Any file that's in the directory that's
 41    not accessible from I<top-level-file> is considered an
 42    orphan. 
 43 
 44    =cut 
 45    # 
 46    # Usage: link-check.pl <top-file>
 47    # 
 48    use strict; 
 49    use warnings; 
 50 
 51    use HTML::SimpleLinkExtor; 
 52    use LWP::Simple; 
 53    use File::Basename; 
 54    use File::Spec::Functions; 
 55    use File::Find (); 
 56 
 57    # Generated by find2perl
 58    # for the convenience of &wanted calls, including -eval statements:
 59    use vars qw/*name *dir *prune/; 
 60    *name   = *File::Find::name; 
 61    *dir    = *File::Find::dir; 
 62    *prune  = *File::Find::prune; 
 63 
 64    my %file_seen = ();     # True if we've seen a file 
 65    my @external_links = ();# List of external links 
 66 
 67    my @bad_files = ();     # Files we did not see 
 68    my @full_file_list = ();# List of all the files 
 69 
 70 
 71    ######################################################## 
 72    # wanted -- Called by the find routine, this returns
 73    #       true if the file is wanted.  As a side effect 
 74    #       it records any normal file seen in "full_file_list". 
 75    ######################################################## 
 76    sub wanted {
 77        if (-f $_) {    # $_ is relative to the directory find is in
 78            push(@full_file_list, $name); 
 79        } 
 80        return (1); 
 81    } 
 82 
 83    ######################################################## 
 84    # process_file($file) 
 85    # 
 86    # Read an html file and extract the tags. 
 87    # 
 88    # If the file does not exist, put it in the list of 
 89    # bad files. 
 90    ######################################################## 
 91    sub process_file($);    # Needed because this is recursive 
 92    sub process_file($) 
 93    {
 94        my $file_name = shift;      # The file to process 
 95        my $dir_name = dirname($file_name); 
 96 
 97        # Did we do it already 
 98        if ($file_seen{$file_name}) {
 99            return; 
100        } 
101        $file_seen{$file_name} = 1; 
102        if (! -f $file_name) {
103            push(@bad_files, $file_name); 
104            return; 
105        } 
106 
107        if (($file_name !~ /\.html$/) and ($file_name !~ /\.htm$/)) {
108            return; 
109        } 
110        # The parser object to extract the list 
111        my $extractor = HTML::SimpleLinkExtor->new();
112
113        # Parse the file
114        $extractor->parse_file($file_name);
115
116        # The list of all the links in the file
117        my @all_links = $extractor->links();
118 
119        # Check each link 
120        foreach my $cur_link (@all_links) {
121 
122            # Is the link external 
123            if ($cur_link =~ /^http:\/\//) {
124                # Put it on the list of external links 
125                push(@external_links, {
126                    file => $file_name, 
127                    link => $cur_link}); 
128                next; 
129            } 
130            # Get the name of the file 
131            my $next_file = "$dir_name/$cur_link"; 
132 
133            # Remove any funny characters in the name 
134            $next_file = File::Spec->canonpath($next_file);
135 
136            # Follow the links in this file 
137            process_file($next_file); 
138        } 
139    } 
140 
141    if ($#ARGV != 0) {
142        print STDERR "Usage: $0 <top-file>\n";
143        exit (8); 
144    } 
145 
146    # Top level file 
147    my $top_file = $ARGV[0]; 
148    if (-d $top_file) {
149        $top_file .= "/index.html"; 
150    } 
151    if (! -f $top_file) {
152        print STDERR "ERROR: No such file $top_file\n";
153        exit (8); 
154    } 
155 
156 
157    # Scan all the links 
158    process_file($top_file); 
159 
160    # Traverse desired filesystems 
161    File::Find::find({wanted => \&wanted}, dirname($ARGV[0]));
162 
163    print "External links\n";
164    foreach my $link (@external_links) {
165        print "\t$link->{link}\n";
166    } 
167 
168    print "Orphan Files\n";
169    foreach my $cur_file (sort @full_file_list)
170    {
171        if (not defined($file_seen{$cur_file})) {
172            print "\t$cur_file\n";
173        } 
174    } 
175 
176    print "Broken Internal Links\n";
177    foreach my $cur_file (sort @bad_files)
178    {
179        print "\t$cur_file\n";
180    } 
181    print "Broken External Links\n";
182    foreach my $cur_file (sort @external_links) {
183        if (not (head($cur_file->{link}))) {
184            print "\t$cur_file->{file} => $cur_file->{link}\n";
185        } 
186    }
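
The heart of process_file is the decision it makes for each link: external links are set aside, and local links are turned into a path relative to the file that contains them. The sketch below isolates that step. The classify_link function is hypothetical, and matching https and uppercase schemes is an assumption added here; the listing itself matches only a lowercase http://.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Spec;

# Classify one link found in $file_name.
# Returns ('external', $url) or ('local', $path).
sub classify_link {
    my ($file_name, $link) = @_;

    # Assumption: treat both http and https, in any case, as external.
    return ('external', $link) if $link =~ m{^https?://}i;

    # Local links are relative to the directory of the linking file.
    my $path = File::Spec->canonpath(dirname($file_name) . "/$link");
    return ('local', $path);
}

for my $link ('http://www.example.com/',
              './docs//guide.html',
              '../other/index.html') {
    my ($kind, $target) = classify_link('site/index.html', $link);
    print "$kind: $target\n";
}
```

On Unix-like systems, canonpath tidies up doubled slashes and ./ components but leaves .. in place, because it cleans up the path as text without consulting the filesystem. A link that climbs out of its directory therefore keeps the .. in its name, which matters when the name is later compared with the paths that find reports.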
