Allowing users to search for specific information on your web site is a very important and useful feature, and one that can save them from potential frustration trying to locate particular documents. The concept behind creating a search application is rather trivial: accept a query from the user, check it against a set of documents, and return those that match the specified query. Unfortunately, there are several issues that complicate the matter, the most significant of which is dealing with large document repositories. In such cases, it’s not practical to search through each and every document in a linear fashion, much like searching for a needle in a haystack. The solution is to reduce the amount of data we need to search by doing some of the work in advance.
This chapter will teach you how to implement different types of search engines, ranging from the trivial, which search documents on the fly, to the most complex, which are capable of intelligent searches.
The very first example that we will look at is rather trivial in that it does not perform the actual search, but passes the query to the fgrep command and processes the results.
Before we go any further, here’s the HTML form that we will use to get the information from the user:
<HTML>
<HEAD><TITLE>Simple 'Mindless' Search</TITLE></HEAD>
<BODY>
<H1>Are you ready to search?</H1>
<P>
<FORM ACTION="/cgi/grep_search1.cgi" METHOD="GET">
<INPUT TYPE="text" NAME="query" SIZE="20">
<INPUT TYPE="submit" VALUE="GO!">
</FORM>
</BODY>
</HTML>
As we mentioned above, the program is quite simple. It creates a pipe to the fgrep command and passes it the query, as well as options to perform case-insensitive searches and to return the matching filenames without any text. The program beautifies the output from fgrep by converting it to an HTML document and returns it to the browser.
fgrep returns the list of matched files in the following format:
/usr/local/apache/htdocs/how_to_script.html
/usr/local/apache/htdocs/i_need_perl.html
.
.
The program converts this to the following HTML list:
<LI><A HREF="/how_to_script.html">how_to_script.html</A></LI>
<LI><A HREF="/i_need_perl.html">i_need_perl.html</A></LI>
.
.
Let’s look at the program now, shown in Example 12-1.
Example 12-1. grep_search1.cgi
#!/usr/bin/perl -wT
# WARNING: This code has significant limitations; see description
use strict;
use CGI;
use CGIBook::Error;

# Make the environment safe to call fgrep
BEGIN {
    $ENV{PATH} = "/bin:/usr/bin";
    delete @ENV{ qw( IFS CDPATH ENV BASH_ENV ) };
}

my $FGREP = "/usr/local/bin/fgrep";
my( $DOCUMENT_ROOT ) = $ENV{DOCUMENT_ROOT} =~ /^([\w:\/\\-]+)$/
    or die "Unsafe document root!";
my $VIRTUAL_PATH = "";

my $q     = new CGI;
my $query = $q->param( "query" );

$query =~ s/[^\w ]//g;
$query =~ /([\w ]+)/;
$query = $1;

unless ( defined $query ) {
    error( $q, "Please specify a valid query!" );
}

my $results = search( $q, $query );

print $q->header( "text/html" ),
      $q->start_html( "Simple Search with fgrep" ),
      $q->h1( "Search for: $query" ),
      $q->ul( $results || "No matches found" ),
      $q->end_html;

sub search {
    my( $q, $query ) = @_;
    local *PIPE;
    my $matches = "";

    open PIPE, "$FGREP -il '$query' $DOCUMENT_ROOT/* |"
        or die "Cannot open fgrep: $!";

    while ( <PIPE> ) {
        chomp;
        s|.*/||;
        $matches .= $q->li( $q->a( { href => "$VIRTUAL_PATH/$_" }, $_ ) );
    }

    close PIPE;
    return $matches;
}
We initialize three globals, $FGREP, $DOCUMENT_ROOT, and $VIRTUAL_PATH, which store the path to the fgrep binary, the search directory, and the virtual path to that directory, respectively. If you do not want the program to search the web server’s top-level document directory, you should change $DOCUMENT_ROOT to reflect the full path of the directory where you want to enable searches. If you do make such a change, you will also need to modify $VIRTUAL_PATH to reflect the URL path to the directory.
Because Perl will pass our fgrep command through a shell, we need to make sure that the query we send it is not going to cause any security problems. Let’s decide to allow only “words” (represented in Perl as “a-z”, “A-Z”, “0-9”, and “_”) and spaces in the search. We proceed to strip out all characters other than words and spaces and pass the result through an additional regular expression to untaint it. We need to do this extra step because, although we know the substitution really did make the data safe, a substitution is not sufficient to untaint the data for Perl. We could have skipped the substitution and just performed the regular expression match, but it means that if someone entered an invalid character, only that part of their query before the illegal character would be included in the search. By doing the substitution first, we can strip out illegal characters and perform a search on everything else.
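The sanitize-then-untaint idiom can be seen in isolation in the following sketch (a hypothetical standalone run; under -T the form data would start out tainted, but the flow is identical):

```perl
use strict;
use warnings;

# A hypothetical standalone run of the sanitize-then-untaint idiom
# (under -T the data would start out tainted; the flow is the same).
my $query = "perl & cgi!";            # imagine this came from the form
$query =~ s/[^\w ]//g;                # strip all but word characters and spaces
if ( $query =~ /([\w ]+)/ ) {
    $query = $1;                      # capturing via a regex untaints $1
}
print "$query\n";                     # perl  cgi
```

Note that only the capture assignment, not the substitution, satisfies Perl's taint checks.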
After all this, if the query is not provided or is empty, we call our familiar error subroutine to notify the user of the error. We test whether it is defined first to avoid a warning for using an undefined variable.
We open a PIPE to the fgrep command for reading, which is the purpose of the trailing “|”. Notice how the syntax is not much different from opening a file. If the pipe succeeds, we can go ahead and read the results from the pipe.
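On Perl 5.8 and later, the shell can be avoided entirely with the list form of a pipe open, in which each argument is passed directly to the program. The sketch below pipes from perl itself ($^X) so it runs anywhere; in the CGI you would substitute $FGREP, "-il", $query, and the list of files (all assumptions carried over from the example above):

```perl
use strict;
use warnings;

# Sketch: the list form of open runs the command directly,
# so the shell never parses the query and metacharacters are harmless.
# We pipe from perl itself ($^X) so this runs anywhere; in the CGI you
# would substitute $FGREP, "-il", $query and the list of files.
open my $pipe, "-|", $^X, "-e", 'print "matched.html\n"'
    or die "Cannot open pipe: $!";
my @lines = <$pipe>;
close $pipe;
print @lines;
```

One caveat: the list form of a pipe open is not implemented on every platform (notably some older Windows ports), and with no shell involved you must expand the file wildcard yourself, e.g. with Perl’s glob function.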
The -il options force fgrep to perform case-insensitive searches and return the filenames (and not the matched lines). We make sure to quote the string in case the user is searching for a multiple word query.
Finally, the last argument to fgrep is a list of all the files that it should search. The shell expands (globs) the wildcard character into a list of all the files in the specified directory. This can cause problems if the directory contains a large number of files, as some shells have internal glob limits. We will fix this problem in the next section.
The while loop iterates through the results, setting $_ to the current record each time through the loop. We strip the end-of-line character(s) and the directory information so we are left with just the filename. Then we create a list item containing a hypertext link to the file.
Finally, we print out our results.
How would you rate this application? It’s a simple search engine and it works well on a small collection of files, but it suffers from a few problems:
It calls an external application (fgrep) to handle the search, which makes it nonportable; Windows 95, for instance, does not have an fgrep application.
Non-alphanumeric characters are stripped from the search query, due to security concerns.
It could very well run into an internal glob limit when used with certain shells; some shells have limits as low as 256 files.
It does not search multiple directories.
It does not return content, but simply filename(s), although we could have added this functionality by not specifying the -l option.
The search engine we will create in this section is much improved. It no longer depends on fgrep to carry out the search, which also means that we no longer have to use the shell, and thus we will not run into an internal glob limit.
In addition, this application returns the matched content and highlights the query, which makes it much more useful as well.
How does it work? It creates a list of all the HTML files in the specified directory using Perl’s own functions, and then iterates over each file searching for a line that contains a match for the query. All matches are stored in an array and are later converted to HTML.
Example 12-2 contains the new program.
Example 12-2. grep_search2.cgi
#!/usr/bin/perl -wT
use strict;
use CGI;
use CGIBook::Error;

my $DOCUMENT_ROOT = $ENV{DOCUMENT_ROOT};
my $VIRTUAL_PATH  = "";

my $q     = new CGI;
my $query = $q->param( "query" );

unless ( defined $query and length $query ) {
    error( $q, "Please specify a valid query!" );
}

$query = quotemeta( $query );

my $results = search( $q, $query );

print $q->header( "text/html" ),
      $q->start_html( "Simple Perl Search" ),
      $q->h1( "Search for: $query" ),
      $q->ul( $results || "No matches found" ),
      $q->end_html;

sub search {
    my( $q, $query ) = @_;
    my( %matches, @files, @sorted_paths, $results );
    local( *DIR, *FILE );

    opendir DIR, $DOCUMENT_ROOT or error( $q, "Cannot access search dir!" );
    @files = grep { -T "$DOCUMENT_ROOT/$_" } readdir DIR;
    closedir DIR;

    foreach my $file ( @files ) {
        my $full_path = "$DOCUMENT_ROOT/$file";

        open FILE, $full_path or error( $q, "Cannot process $file!" );

        while ( <FILE> ) {
            if ( /$query/io ) {
                $_ = html_escape( $_ );
                s|($query)|<B>$1</B>|gio;

                push @{ $matches{$full_path}{content} }, $_;
                $matches{$full_path}{file} = $file;
                $matches{$full_path}{num_matches}++;
            }
        }

        close FILE;
    }

    @sorted_paths = sort {
                        $matches{$b}{num_matches} <=> $matches{$a}{num_matches}
                            || $a cmp $b
                    } keys %matches;

    foreach my $full_path ( @sorted_paths ) {
        my $file        = $matches{$full_path}{file};
        my $num_matches = $matches{$full_path}{num_matches};

        my $link    = $q->a( { -href => "$VIRTUAL_PATH/$file" }, $file );
        my $content = join $q->br, @{ $matches{$full_path}{content} };

        $results .= $q->p( $q->b( $link ) .
                           " ($num_matches matches)" .
                           $q->br .
                           $content );
    }

    return $results;
}

sub html_escape {
    my( $text ) = @_;

    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;

    return $text;
}
This program starts out like our previous example. Since we are searching for the query without exposing it to the shell, we no longer have to strip out any characters from the query. Instead, we escape any characters that may be interpreted in a regular expression by calling Perl’s quotemeta function.
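A quick look at what quotemeta produces:

```perl
use strict;
use warnings;

# quotemeta backslash-escapes every character that could be special
# in a regex, so the user's text is matched literally.
my $query   = "c++ (beta)";
my $escaped = quotemeta $query;
print "$escaped\n";   # c\+\+\ \(beta\)

# The escaped form now matches only the literal text:
die "no match" unless "learning c++ (beta) today" =~ /$escaped/;
```

Without the escaping, a query such as "c++" would be an invalid regex and could make the script die mid-request.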
The opendir function opens the specified directory and returns a handle that we can use to get a list of all the files in that directory. It’s a waste of time to search through binary files, such as sounds and images, so we use Perl’s grep function (not to be confused with the Unix grep and fgrep applications) to filter them out. In this context, the grep function iterates over a list of filenames returned by readdir, setting $_ for each element, and evaluates the expression specified within the braces, returning only the elements for which the expression is true. We are using readdir in an array context so that we can pass the list of all files in the directory to grep for processing. But there is a problem with this approach. The readdir function simply returns the name of the file and not the full path, which means that we have to construct a full path before we can pass it to the -T operator. We use the $DOCUMENT_ROOT variable to create the full path to the file.

The -T operator returns true if the file is a text file. After grep finishes processing all the files, @files will contain a list of all the text files.
We iterate through the @files array, setting $file to the current value each time through the loop. We proceed to open the file, making sure to return an error if we cannot open it, and iterate through it one line at a time.
For each matching file, the %matches hash stores three elements: file, to store the name of the file; num_matches, to store the number of matches; and a content array to hold all the lines containing matches. We need the filename for output purposes.
We use a simple case-insensitive regex to search for the query. The /o option compiles the regex only once, which greatly improves the speed of the search. Note that this will cause problems for scripts running under mod_perl or FastCGI, which we’ll discuss later in Chapter 17.
If the line contains a match, we escape characters that could be mistaken for HTML tags. We then bold the matched text, increment the match counter by the number of matches, and push that line onto that file’s content array.
After we have finished looking through the files, we sort the results by the number of matches found in decreasing order and then alphabetically by path for those who have the same number of matches.
To generate our results, we walk through our sorted list. For each file, we create a link and display the number of matches and all the lines that matched the query. Since the content exists as individual elements in an array, we join all the elements together into one large string delimited by an HTML break tag.
Now, let us improve on this application a bit by allowing users to specify regular expression searches. We will not present the entire application, since it is very similar to the one we have just covered.
By allowing users to specify regular expressions in their search, we make the search engine much more powerful. For example, a user who wants to search for the recipe for Zwetschgendatschi (a Bavarian plum cake) from your online collection, but is not sure of the exact spelling, could simply enter Zwet.+?chi to find it.
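Applied as a real regex, that partial spelling does find the text:

```perl
use strict;
use warnings;

# The query is treated as a real regex, so an uncertain spelling
# can still locate the document:
my $line  = "Zwetschgendatschi: a Bavarian plum cake";
my $query = 'Zwet.+?chi';
my $found = ( $line =~ /$query/i ) ? 1 : 0;
print "$found\n";   # 1
```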
In order to implement this functionality, we have to add several pieces to the search engine.
First, we need to modify the HTML file to provide an option for the user to turn the functionality on or off:
Regex Searching:
<INPUT TYPE="radio" NAME="regex" VALUE="on">On
<INPUT TYPE="radio" NAME="regex" VALUE="off" CHECKED>Off
Then, we need to check for this value in the application and act accordingly. Here is the beginning of the new search script:
#!/usr/bin/perl -wT
use strict;
use CGI;
use CGIBook::Error;

my $q     = new CGI;
my $regex = $q->param( "regex" );
my $query = $q->param( "query" );

unless ( defined $query and length $query ) {
    error( $q, "Please specify a query!" );
}

if ( defined $regex and $regex eq "on" ) {
    eval { /$query/o };
    error( $q, "Invalid Regex" ) if $@;
} else {
    $query = quotemeta $query;
}

my $results = search( $q, $query );

print $q->header( "text/html" ),
      $q->start_html( "Simple Perl Regex Search" ),
      $q->h1( "Search for: $query" ),
      $q->ul( $results || "No matches found" ),
      $q->end_html;
.
.
The rest of the code remains the same. What we are doing differently here is checking if the user chose the “regex” option and, if so, evaluating the user-specified regex at runtime using the eval function. We can check whether the regex is invalid by looking at the value stored in $@. Perl sets this variable if there is an error in the evaluated code. If the regex is valid, we can go ahead and use it directly, without quoting the specified metacharacters. If the “regex” option was not requested, we perform the search as before.
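The eval-and-check-$@ pattern in isolation:

```perl
use strict;
use warnings;

# Compiling the user's pattern inside eval traps bad regexes:
# $@ is set on failure instead of the script dying mid-request.
my %status;
foreach my $pattern ( 'Zwet.+?chi', '*broken[' ) {
    eval { "" =~ /$pattern/ };
    $status{$pattern} = $@ ? "invalid" : "valid";
    print "$status{$pattern}: $pattern\n";
}
```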
As you can see, both of these applications are much improved over the first one, but neither one of them is perfect. Since both of them are based on a linear search algorithm, the search process will be slow when dealing with directories that contain many files. They also search only one directory. They could be modified to recurse down through subdirectories, but that would decrease the performance even more. In the next section, we will look at an index-based approach that calls for creating a dictionary of relevant words in advance, and then searching it rather than the actual files.
The applications that we’ve looked at so far search through each and every file in the specified directory, looking for particular words or phrases. This is not only time consuming, but will also place a great burden on the server. We clearly need a different approach to searching.
A more efficient approach is to create an index (like the one you can find at the back of this and other books) containing all the words from specific documents and the name of the document in which they appear.
In this section, we will discuss an application that creates an inverted index. The index is inverted in the sense that a particular word is used to find the file(s) in which it appears, rather than the other way around. In the following section, we will look at the CGI script that searches this index and presents the results in a nice format.
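A toy version of the idea, with the documents reduced to in-memory strings, shows why lookups become cheap:

```perl
use strict;
use warnings;

# A toy inverted index: map each word to the list of documents
# containing it (here the "documents" are just in-memory strings).
my %docs = (
    'sprint.html'   => "the fastest sprint of the games",
    'olympics.html' => "the games open tomorrow",
);

my %index;
while ( my( $name, $text ) = each %docs ) {
    my %seen;
    for my $word ( $text =~ /(\w{2,})/g ) {
        push @{ $index{lc $word} }, $name unless $seen{lc $word}++;
    }
}

# A lookup is now a single hash access, not a scan of every file:
print join( ",", sort @{ $index{games} } ), "\n";
```

The real indexer below stores this structure on disk so the expensive pass over the files happens once, in advance.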
Example 12-3 creates the indexer.
Example 12-3. indexer.pl
#!/usr/bin/perl -w
# This is not a CGI, so taint mode is not required
use strict;
use Fcntl;
use File::Find;
use DB_File;
use Getopt::Long;
require "stem.pl";

use constant DB_CACHE      => 0;
use constant DEFAULT_INDEX => "/usr/local/apache/data/index.db";

my( %opts, %index, @files, $stop_words );

GetOptions( \%opts, "dir=s", "cache=s", "index=s",
                    "ignore", "stop=s", "numbers", "stem" );

die usage( ) unless $opts{dir} && -d $opts{dir};

$opts{'index'} ||= DEFAULT_INDEX;
$DB_BTREE->{cachesize} = $opts{cache} || DB_CACHE;

tie %index, "DB_File", $opts{'index'}, O_RDWR|O_CREAT, 0640, $DB_BTREE
    or die "Cannot tie database: $!\n";

$index{"!OPTION:stem"}   = 1 if $opts{'stem'};
$index{"!OPTION:ignore"} = 1 if $opts{'ignore'};

find( sub { push @files, $File::Find::name }, $opts{dir} );

$stop_words = load_stopwords( $opts{stop} ) if $opts{stop};
process_files( \%index, \@files, \%opts, $stop_words );

untie %index;

sub load_stopwords {
    my $file  = shift;
    my $words = {};
    local *INFO;
    local $_;

    die "Cannot find stop file: $file\n" unless -e $file;

    open INFO, $file or die "$!\n";
    while ( <INFO> ) {
        next if /^#/;
        $words->{lc $1} = 1 if /(\S+)/;
    }
    close INFO;

    return $words;
}

sub process_files {
    my( $index, $files, $opts, $stop_words ) = @_;
    local *FILE;
    local $_;
    local $/ = "";    # paragraph mode

    for ( my $file_id = 0; $file_id < @$files; $file_id++ ) {
        my $file = $files->[$file_id];
        my %seen_in_file;

        next unless -T $file;
        print STDERR "Indexing $file\n";

        $index->{"!FILE_NAME:$file_id"} = $file;

        open FILE, $file or die "Cannot open file: $file!\n";
        while ( <FILE> ) {
            tr/A-Z/a-z/ if $opts->{ignore};
            s/<.+?>//gs;    # Whoa! What about < or > in comments or JavaScript?

            while ( /([a-z\d]{2,})/gi ) {
                my $word = $1;
                next if $stop_words->{lc $word};
                next if $word =~ /^\d+$/ && not $opts->{numbers};
                ( $word ) = stem( $word ) if $opts->{stem};

                $index->{$word} = ( exists $index->{$word} ?
                                    "$index->{$word}:" : "" ) . "$file_id"
                    unless $seen_in_file{$word}++;
            }
        }
        close FILE;
    }
}

sub usage {
    my $usage = <<End_of_Usage;
Usage: $0 -dir directory [options]

The options are:
  -cache    DB_File cache size (in bytes)
  -index    Path to index, default: /usr/local/apache/data/index.db
  -ignore   Case-insensitive index
  -stop     Path to stopwords file
  -numbers  Include numbers in index
  -stem     Stem words
End_of_Usage

    return $usage;
}
We will use File::Find to get a list of all the files in the specified directory, as well as files in any subdirectories. The File::Basename module, which we will put to use in the search CGI later in this chapter, provides functions to extract the filename, given the full path. You might be wondering why we need this feature, considering the fact that we could use a simple regular expression to get at the filename. If we used a regex, we would have to determine what platform the application is running on and extract the filename accordingly. This module takes care of that for us.
We use DB_File to create and store the index. Note that we could also store the index in an RDBMS, although a DBM file is certainly adequate for many sites. The method for creating indexes is the same no matter what type of format we use for storage. Getopt::Long helps us handle command-line options, and stem.pl , a Perl 4 library, has algorithms to automatically “stem” (or remove) word suffixes.
We use the DB_CACHE constant to hold the size of the DB_File memory cache. Increasing the size of this cache (up to a certain point) improves the insertion rate at the expense of memory; in other words, it increases the rate at which we store the words in the index. A cache size of 0 is used as the default, which leaves DB_File’s own default cache size in effect. DEFAULT_INDEX contains the default path to the file that will hold our data. The user can specify a different file by using the -index option, as you will see shortly.
The GetOptions function (part of the Getopt::Long module) allows us to extract any command-line options and store them in a hash. We pass a reference to a hash and a list of options to GetOptions. The options that take arguments contain an “s” to indicate that they each take a string.
This application allows you to pass several options that will affect the indexing process. The -dir option is the only one that is required, as it is used to specify the directory that contains the files to be indexed.
You can use the -cache option to specify the cache size and -index to specify the path to the index. The -ignore option creates an index where all the words are turned into lowercase (case-insensitive). This will increase the rate at which the index is created, as well as decrease the size of the index. If you want numbers in documents to be included in the index, you can specify the -numbers option.
You can use the -stop option to specify a file that contains “stop” words—words that are generally found in most of your documents. Typical stop words include “a”, “an”, “to”, “it”, and “the”, but you can also include words that are more specific to your documents.
Finally, the -stem option stems word suffixes before storing them in the index. This will help us find words in documents much more easily. For example, if a user searches for “tomatoes”, our search application will return documents that contain “tomato” as well as “tomatoes”. An important note here is that stemming will also create a case-insensitive index.
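The stem.pl library implements a full stemming algorithm; the crude suffix-stripper below is a stand-in of our own (an assumption, not the real stem() routine) that only illustrates the effect: the document word and the query word reduce to the same stem.

```perl
use strict;
use warnings;

# toy_stem is a hypothetical stand-in for stem.pl's stem() function;
# it only illustrates the idea of reducing variants to a common form.
sub toy_stem {
    my $word = lc shift;
    $word =~ s/(?:es|s|ing|ed)$//;
    return $word;
}

print toy_stem("tomatoes"), "\n";   # tomato
print toy_stem("tomato"),   "\n";   # tomato
```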
Here’s an example of how you would use these various options:
$ ./indexer.pl -dir /usr/local/apache/htdocs/sports -cache 16_000_000 \
      -index /usr/local/apache/data/sports.db -stop my_stop_words.txt -stem
%index is the hash that will hold the index. We use the tie function to bind the hash to the file specified by $opts{index}. This allows us to transparently store a hash in a file, which we can later retrieve and modify. In this example, we are using DB_File, as it is faster and more efficient than other DBM implementations.
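The tie-to-a-file mechanism can be sketched with SDBM_File, a DBM implementation bundled with Perl, so the example runs without Berkeley DB; the indexer simply substitutes DB_File and a BTREE:

```perl
use strict;
use warnings;
use Fcntl;
use SDBM_File;
use File::Temp qw( tempdir );

# The same tie idea, using SDBM_File (bundled with Perl) so this sketch
# runs without Berkeley DB; the indexer substitutes DB_File instead.
my $dir = tempdir( CLEANUP => 1 );

tie my %index, "SDBM_File", "$dir/index", O_RDWR|O_CREAT, 0640
    or die "Cannot tie database: $!";
$index{perl} = "1:3";              # word => colon-joined file IDs
untie %index;

# Tie it again: the hash data was transparently saved to disk.
tie my %again, "SDBM_File", "$dir/index", O_RDONLY, 0640
    or die "Cannot tie database: $!";
my $stored = $again{perl};
untie %again;

print "$stored\n";   # 1:3
```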
If the -stem option was used, we record this in our index so that our CGI script knows whether to apply stemming to the query as well. We could have stored this information in another database file, but that would require opening two files for each search. Instead, we name this key with an exclamation point such that it can’t collide with any of the words we’re indexing.
We use the find function (part of the File::Find module) to get a list of all the files in the specified directory. find expects its first argument to be a code reference, which can either be a reference to a subroutine or an inlined anonymous subroutine, as is the case above. As find traverses the directory (as well as all subdirectories), it executes the code specified by the first argument, setting the $File::Find::name variable to the path of the file. This builds an array of the paths to all the files under the original directory.
If a stop file was specified and it exists, we call the load_stopwords function to read through the file and return a reference to a hash.
The most important function in this application is process_files, which iterates through all the files and stores the words in $index. Finally, we close the binding between the hash and the file and exit. At this point, we will have a file containing the index.
Let’s look at the functions now. The load_stopwords function opens the stop words file, ignores all comments (lines starting with “#”), and extracts the first word found on each line with (\S+). The word is converted to lowercase by the lc function and stored as a key in the hash referenced by $words. Since we are going to find words with mixed case in our files, it is much easier and quicker to compare them to this list if all our stop words are either completely uppercase or completely lowercase.
Before we discuss the process_files function, let’s look at the arguments it expects. The first argument, $index, is a reference to an empty hash that will eventually contain the words from all the files as well as pointers to the documents where they are found. $files is a reference to a list of all the files to parse. $opts is a reference to the hash of our command-line arguments, and the final argument, $stop_words, is a reference to a hash containing our stop words.
If the user chose to ignore case, we convert all words into lowercase, thus creating a case-insensitive index.
We set Perl’s default input record separator, $/, to paragraph mode. In other words, one read on a file handle will return a paragraph, as opposed to a single line. This allows us to index the files at a faster rate.
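Paragraph mode in a nutshell, using an in-memory filehandle so the sketch needs no real file:

```perl
use strict;
use warnings;

# With $/ set to "", each read returns one blank-line-separated
# paragraph instead of one line.
my $text = "first para\nstill first\n\nsecond para\n\nthird\n";
open my $fh, "<", \$text or die $!;

local $/ = "";
my @paragraphs = <$fh>;
close $fh;

print scalar @paragraphs, "\n";   # 3
```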
We iterate through the @$files array with a for loop, using $file_id as the index and setting $file to the current filename each time through the loop. Since this application creates a human-searchable index, we will deal only with text files. We use the -T operator to ignore any non-text files.
The first entry into the %$index hash is a “unique” key that associates a number with the full path to the file. Since this hash will also hold all the words that we find, we use the “!FILE_NAME” string to keep our number-to-file mappings separate from the words.
We start our indexing process by iterating through the file a paragraph at a time; the $_ variable holds the contents. If the -ignore option was specified by the user, we convert the paragraph that we have just read to lowercase.
We also strip all HTML tags from the paragraph, since we don’t want them to be indexed. The regexp will look for a string starting with “<”, followed by one or more characters (including newlines) until it finds the first “>”.
We iterate through the paragraph using a regex that extracts words at least two characters in length, matching letters as well as digits (\d matches “0-9”). The matched word is stored in $1.
Before we check to see if the word we extracted is a stop word, we need to convert it to lowercase, since we converted all the stop words to lowercase earlier in this script. If the word is, indeed, a stop word, we skip it and continue. We also skip numbers if the -numbers option is not specified.
If the -stem option is specified, we call the stem function (part of the stem.pl library) to remove any suffixes from the word and convert it to lowercase.
Finally, we are ready to store the word in the index, where the value represents the file that we are currently parsing. Unfortunately, this isn’t that simple. The last command is a little long and complicated; it helps to read it backwards. First, we check whether we have previously seen the word in this file by using the %seen_in_file hash. The first time through, there will not be an entry in the hash, so the expression will evaluate to false (and thus pass the unless check); thereafter, it will contain the number of times we have seen the word in the file and evaluate to true (and thus fail the unless check). So the first time we see the word in the file, we add it to our index. If the word was previously indexed for another file, then we join the $file_id of this file to the previous entry with a colon. Otherwise, we just add $file_id as this word’s only value thus far.
When this function finishes, the %$index hash will look something like this:
$index = {
    "!FILE_NAME:1" => "/usr/local/apache/htdocs/sports/sprint.html",
    "!FILE_NAME:2" => "/usr/local/apache/htdocs/sports/olympics.html",
    "!FILE_NAME:3" => "/usr/local/apache/htdocs/sports/celtics.html",
    browser        => "1:2",
    code           => "3",
    color          => "2:3",
    comment        => "2",
    content        => "1",
    cool           => "2:3",
    copyright      => "1:2:3"
};
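The colon-joined value strings are built by the statement discussed above; here it is in isolation, fed a fixed stream of (word, file ID) pairs (with the per-file %seen_in_file bookkeeping compressed into one hash for brevity):

```perl
use strict;
use warnings;

# A (word, file_id) pair is appended only the first time that word
# is seen in that file; repeat sightings in the same file are skipped.
my( %index, %seen_in_file );
foreach my $pair ( [ browser => 1 ], [ browser => 1 ], [ browser => 2 ] ) {
    my( $word, $file_id ) = @$pair;
    $index{$word} = ( exists $index{$word} ? "$index{$word}:" : "" ) . $file_id
        unless $seen_in_file{"$file_id:$word"}++;
}
print "$index{browser}\n";   # 1:2
```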
Now, we are ready to implement the CGI application that will search this index.
The indexer application makes our life easier when it comes time to write the CGI application to perform the actual search. The CGI application should parse the form input, open the DBM file created by the indexer, search for possible matches and then return HTML output.
Example 12-4 contains the program.
Example 12-4. indexed_search.cgi
#!/usr/bin/perl -wT
use strict;
use Fcntl;
use DB_File;
use CGI;
use CGIBook::Error;
use File::Basename;
require "stem.pl";

use constant INDEX_DB => "/usr/local/apache/data/index.db";

my( %index, $paths, $path );

my $q     = new CGI;
my $query = $q->param( "query" );
my @words = split /[\s,]+/, $query;

tie %index, "DB_File", INDEX_DB, O_RDONLY, 0640
    or error( $q, "Cannot open database" );

$paths = search( \%index, \@words );

print $q->header,
      $q->start_html( "Inverted Index Search" ),
      $q->h1( "Search for: $query" );

unless ( @$paths ) {
    print $q->h2( $q->font( { -color => "#FF0000" }, "No Matches Found" ) );
}

foreach $path ( @$paths ) {
    next unless $path =~ s/^\Q$ENV{DOCUMENT_ROOT}\E//o;
    $path = to_uri_path( $path );
    print $q->a( { -href => "$path" }, "$path" ), $q->br;
}

print $q->end_html;
untie %index;

sub search {
    my( $index, $words ) = @_;

    my $do_stemming = exists $index->{"!OPTION:stem"}   ? 1 : 0;
    my $ignore_case = exists $index->{"!OPTION:ignore"} ? 1 : 0;
    my( %matches, $word, $file_index );

    foreach $word ( @$words ) {
        my $match;
        if ( $do_stemming ) {
            my( $stem ) = stem( $word );
            $match = $index->{$stem};
        } elsif ( $ignore_case ) {
            $match = $index->{lc $word};
        } else {
            $match = $index->{$word};
        }

        next unless $match;

        foreach $file_index ( split /:/, $match ) {
            my $filename = $index->{"!FILE_NAME:$file_index"};
            $matches{$filename}++;
        }
    }

    my @files = map  { $_->[0] }
                sort { $matches{$b->[0]} <=> $matches{$a->[0]}
                           || $a->[1] <=> $b->[1] }
                map  { [ $_, -M $_ ] }
                keys %matches;

    return \@files;
}

sub to_uri_path {
    my $path = shift;
    my( $name, @elements );

    do {
        ( $name, $path ) = fileparse( $path );
        unshift @elements, $name;
        chop $path;
    } while $path;

    return join '/', @elements;
}
The modules should be familiar to you by now. The INDEX_DB constant contains the path of the index created by the indexer application.

Since a query can include multiple words, we split it on any run of whitespace and commas and store the resulting words in the @words array. We use tie to open the index DBM file in read-only mode; in other words, we bind the index file with the %index hash. If we cannot open the file, we call our error function to return an error to the browser.
The real searching is done, appropriately enough, in the search function, which takes a reference to the index hash and a reference to the list of words we are searching for. The first thing we do is to peek into the index and see if the stem option was set when the index was built. We then proceed to iterate through the @$words array, searching for possible matches. If stemming was enabled, we stem the word and compare that. Otherwise, we check to see whether the particular word exists in the index as-is, or as a lowercase word if the index is not case-sensitive. If any of these comparisons succeeds, we have a match. Otherwise, we ignore the word and continue.
If there is a match, we split the colon-separated list of file IDs where that particular word is found. Since we don’t want duplicate entries in our final list, we store the full path of the matching files in the %matches hash.
After the loop has finished executing, we are left with the matching files in %matches. We would like to add some order to our results and display them according to the number of words matching and then by the file’s modification time. So, we sort the keys by the number of matches in decreasing order and then by the value returned by the -M operator (most recently modified first), and store the sorted paths in the @files array.
We could calculate the modification time of the files during each comparison like this:
my @files = sort {
                $matches{$b} <=> $matches{$a}
                    || -M $a <=> -M $b
            } keys %matches;
However, this is inefficient because we might calculate the modification time for each file multiple times. A more efficient algorithm involves precalculating the modification times as we have done in the program.
This strategy has become known as the Schwartzian Transform, made famous by Randal Schwartz. It’s beyond the scope of this book to explain this, but if you’re interested, see Joseph Hall’s explanation of the Transform, located at: http://www.5sigma.com/perl/schwtr.html. Ours is a slight variation because we perform a two-part sort.
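The map/sort/map shape is easier to see with a cheap stand-in key (string length instead of -M, so the sketch runs without real files):

```perl
use strict;
use warnings;

# Schwartzian Transform: compute each sort key exactly once in the
# inner map, sort on the precomputed key, then unwrap the values.
my @paths = qw( index.html a.html results.html );

my @sorted = map  { $_->[0] }
             sort { $a->[1] <=> $b->[1] }
             map  { [ $_, length $_ ] }     # key computed once per element
             @paths;

print "@sorted\n";   # a.html index.html results.html
```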
We output the HTTP and HTML document headers and proceed to check whether we have any matches. If not, we return a simple message. Otherwise, we iterate through the @files array, setting $path to the current element each time through the loop. We strip off the part of the path that matches the server’s root directory. That should give us the path that corresponds to a URL. However, on non-Unix filesystems, we won’t have forward slashes (“/”) separating directories. So we call the to_uri_path function, which uses the File::Basename module to strip off successive elements of the path and then rebuild it with forward slashes. Note that this will work on many operating systems like Win32 and MacOS, but it will not work on systems that do not use a single character to delimit parts of the path (like VMS; although the chances that you’re actually doing CGI development on a VMS machine are pretty slim).
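fileparse, the File::Basename function to_uri_path relies on, splits a path like this:

```perl
use strict;
use warnings;
use File::Basename;

# fileparse splits a path into name and directory parts using the
# current platform's path conventions:
my( $name, $dir ) = fileparse( "/usr/local/apache/htdocs/sports/sprint.html" );
print "$name\n";   # sprint.html
print "$dir\n";    # /usr/local/apache/htdocs/sports/
```

to_uri_path repeatedly peels off $name, chops the trailing separator from $dir, and repeats until the whole path has been consumed, which is what makes it separator-agnostic.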
We build proper links with this newly formatted path, print the remainder of our results, close the binding between the database and the hash, and exit.