Allowing users to search for specific information on your web site is a very important and useful feature, and one that can save them from potential frustration trying to locate particular documents. The concept behind creating a search application is rather trivial: accept a query from the user, check it against a set of documents, and return those that match the specified query. Unfortunately, there are several issues that complicate the matter, the most significant of which is dealing with large document repositories. In such cases, it’s not practical to search through each and every document in a linear fashion, much like searching for a needle in a haystack. The solution is to reduce the amount of data we need to search by doing some of the work in advance.
This chapter will teach you how to implement different types of search engines, ranging from the trivial, which search documents on the fly, to the most complex, which are capable of intelligent searches.
The very first example that we will look at is rather trivial in that it does not perform the actual search, but passes the query to the fgrep command and processes the results.
Before we go any further, here’s the HTML form that we will use to get the information from the user:
<HTML>
<HEAD><TITLE>Simple 'Mindless' Search</TITLE></HEAD>
<BODY>
<H1>Are you ready to search?</H1>
<P>
<FORM ACTION="/cgi/grep_search1.cgi" METHOD="GET">
<INPUT TYPE="text" NAME="query" SIZE="20">
<INPUT TYPE="submit" VALUE="GO!">
</FORM>
</BODY>
</HTML>
As we mentioned above, the program is quite simple. It creates a pipe to the fgrep command and passes it the query, as well as options to perform case-insensitive searches and to return the matching filenames without any text. The program beautifies the output from fgrep by converting it to an HTML document and returns it to the browser.
fgrep returns the list of matched files in the following format:
/usr/local/apache/htdocs/how_to_script.html
/usr/local/apache/htdocs/i_need_perl.html
.
.
The program converts this to the following HTML list:
<LI><A HREF="/how_to_script.html">how_to_script.html</A></LI>
<LI><A HREF="/i_need_perl.html">i_need_perl.html</A></LI>
.
.
Let’s look at the program now, shown in Example 12-1.
Example 12-1. grep_search1.cgi
#!/usr/bin/perl -wT
# WARNING: This code has significant limitations; see description
use strict;
use CGI;
use CGIBook::Error;

# Make the environment safe to call fgrep
BEGIN {
    $ENV{PATH} = "/bin:/usr/bin";
    delete @ENV{ qw( IFS CDPATH ENV BASH_ENV ) };
}

my $FGREP = "/usr/local/bin/fgrep";
my( $DOCUMENT_ROOT ) = $ENV{DOCUMENT_ROOT} =~ /^([\w:\/\\-]+)$/
    or die "Unsafe document root!";
my $VIRTUAL_PATH = "";

my $q     = new CGI;
my $query = $q->param( "query" );

$query =~ s/[^\w ]//g;
$query =~ /([\w ]+)/;
$query = $1;

unless ( defined $query ) {
    error( $q, "Please specify a valid query!" );
}

my $results = search( $q, $query );

print $q->header( "text/html" ),
      $q->start_html( "Simple Search with fgrep" ),
      $q->h1( "Search for: $query" ),
      $q->ul( $results || "No matches found" ),
      $q->end_html;

sub search {
    my( $q, $query ) = @_;
    local *PIPE;
    my $matches = "";

    open PIPE, "$FGREP -il '$query' $DOCUMENT_ROOT/* |"
        or die "Cannot open fgrep: $!";

    while ( <PIPE> ) {
        chomp;
        s|.*/||;
        $matches .= $q->li( $q->a( { href => "$VIRTUAL_PATH/$_" }, $_ ) );
    }

    close PIPE;
    return $matches;
}
We initialize three globals, $FGREP, $DOCUMENT_ROOT, and $VIRTUAL_PATH, which store the path to the fgrep binary, the search directory, and the virtual path to that directory, respectively. If you do not want the program to search the web server’s top-level document directory, you should change $DOCUMENT_ROOT to reflect the full path of the directory where you want to enable searches. If you do make such a change, you will also need to modify $VIRTUAL_PATH to reflect the URL path to the directory.
Because Perl will pass our fgrep command through a shell, we need to make sure that the query we send it is not going to cause any security problems. Let’s decide to allow only “words” (represented in Perl as “a-z”, “A-Z”, “0-9”, and “_”) and spaces in the search. We proceed to strip out all characters other than words and spaces and pass the result through an additional regular expression to untaint it. We need to do this extra step because, although we know the substitution really did make the data safe, a substitution is not sufficient to untaint the data for Perl. We could have skipped the substitution and just performed the regular expression match, but it means that if someone entered an invalid character, only that part of their query before the illegal character would be included in the search. By doing the substitution first, we can strip out illegal characters and perform a search on everything else.
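The sanitize-then-untaint idiom can be seen in isolation in the following sketch (a hypothetical standalone run; under -T the form data would start out tainted, but the flow is identical):

```perl
use strict;
use warnings;

# A hypothetical standalone run of the sanitize-then-untaint idiom
# (under -T the data would start out tainted; the flow is the same).
my $query = "perl & cgi!";            # imagine this came from the form
$query =~ s/[^\w ]//g;                # strip all but word characters and spaces
if ( $query =~ /([\w ]+)/ ) {
    $query = $1;                      # capturing via a regex untaints $1
}
print "$query\n";                     # perl  cgi
```

Note that only the capture assignment, not the substitution, satisfies Perl's taint checks.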
After all this, if the query is not provided or is empty, we call our familiar error subroutine to notify the user of the error. We test whether it is defined first to avoid a warning for using an undefined variable.
We open a PIPE to the fgrep command for reading, which is the purpose of the trailing “|”. Notice how the syntax is not much different from opening a file. If the pipe succeeds, we can go ahead and read the results from the pipe.
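On Perl 5.8 and later, the shell can be avoided entirely with the list form of a pipe open, in which each argument is passed directly to the program. The sketch below pipes from perl itself ($^X) so it runs anywhere; in the CGI you would substitute $FGREP, "-il", $query, and the list of files (all assumptions carried over from the example above):

```perl
use strict;
use warnings;

# Sketch: the list form of open runs the command directly,
# so the shell never parses the query and metacharacters are harmless.
# We pipe from perl itself ($^X) so this runs anywhere; in the CGI you
# would substitute $FGREP, "-il", $query and the list of files.
open my $pipe, "-|", $^X, "-e", 'print "matched.html\n"'
    or die "Cannot open pipe: $!";
my @lines = <$pipe>;
close $pipe;
print @lines;
```

One caveat: the list form of a pipe open is not implemented on every platform (notably some older Windows ports), and with no shell involved you must expand the file wildcard yourself, e.g. with Perl’s glob function.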
The -il options force fgrep to perform case-insensitive searches and return the filenames (and not the matched lines). We make sure to quote the string in case the user is searching for a multiple word query.
Finally, the last argument to fgrep is a list of all the files that it should search. The shell expands (globs) the wildcard character into a list of all the files in the specified directory. This can cause problems if the directory contains a large number of files, as some shells have internal glob limits. We will fix this problem in the next section.
The while loop iterates through the results, setting $_ to the current record each time through the loop. We strip the end-of-line character(s) and the directory information so we are left with just the filename. Then we create a list item containing a hypertext link to the file.
Finally, we print out our results.
How would you rate this application? It’s a simple search engine and it works well on a small collection of files, but it suffers from a few problems:
It calls an external application (fgrep) to handle the search, which makes it nonportable; Windows 95, for instance, does not have an fgrep application.
Non-alphanumeric characters are stripped from the search query, due to security concerns.
It could very well run into an internal glob limit when used with certain shells; some shells have limits as low as 256 files.
It does not search multiple directories.
It does not return content, but simply filename(s), although we could have added this functionality by not specifying the -l option.
The search engine we will create in this section is much improved. It no longer depends on fgrep to carry out the search, which also means that we no longer have to use the shell, and thus we will not run into an internal glob limit.
In addition, this application returns the matched content and highlights the query, which makes it much more useful as well.
How does it work? It creates a list of all the HTML files in the specified directory using Perl’s own functions, and then iterates over each file searching for a line that contains a match for the query. All matches are stored in an array and are later converted to HTML.
Example 12-2 contains the new program.
Example 12-2. grep_search2.cgi
#!/usr/bin/perl -wT
use strict;
use CGI;
use CGIBook::Error;

my $DOCUMENT_ROOT = $ENV{DOCUMENT_ROOT};
my $VIRTUAL_PATH  = "";

my $q     = new CGI;
my $query = $q->param( "query" );

unless ( defined $query and length $query ) {
    error( $q, "Please specify a valid query!" );
}

$query = quotemeta( $query );

my $results = search( $q, $query );

print $q->header( "text/html" ),
      $q->start_html( "Simple Perl Search" ),
      $q->h1( "Search for: $query" ),
      $q->ul( $results || "No matches found" ),
      $q->end_html;

sub search {
    my( $q, $query ) = @_;
    my( %matches, @files, @sorted_paths, $results );
    local( *DIR, *FILE );

    opendir DIR, $DOCUMENT_ROOT or error( $q, "Cannot access search dir!" );
    @files = grep { -T "$DOCUMENT_ROOT/$_" } readdir DIR;
    closedir DIR;

    foreach my $file ( @files ) {
        my $full_path = "$DOCUMENT_ROOT/$file";

        open FILE, $full_path or error( $q, "Cannot process $file!" );

        while ( <FILE> ) {
            if ( /$query/io ) {
                $_ = html_escape( $_ );
                s|($query)|<B>$1</B>|gio;

                push @{ $matches{$full_path}{content} }, $_;
                $matches{$full_path}{file} = $file;
                $matches{$full_path}{num_matches}++;
            }
        }

        close FILE;
    }

    @sorted_paths = sort {
                        $matches{$b}{num_matches} <=> $matches{$a}{num_matches}
                            || $a cmp $b
                    } keys %matches;

    foreach my $full_path ( @sorted_paths ) {
        my $file        = $matches{$full_path}{file};
        my $num_matches = $matches{$full_path}{num_matches};

        my $link    = $q->a( { -href => "$VIRTUAL_PATH/$file" }, $file );
        my $content = join $q->br, @{ $matches{$full_path}{content} };

        $results .= $q->p( $q->b( $link ) .
                           " ($num_matches matches)" .
                           $q->br .
                           $content );
    }

    return $results;
}

sub html_escape {
    my( $text ) = @_;

    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;

    return $text;
}
This program starts out like our previous example. Since we are searching for the query without exposing it to the shell, we no longer have to strip out any characters from the query. Instead, we escape any characters that may be interpreted in a regular expression by calling Perl’s quotemeta function.
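A quick look at what quotemeta produces:

```perl
use strict;
use warnings;

# quotemeta backslash-escapes every character that could be special
# in a regex, so the user's text is matched literally.
my $query   = "c++ (beta)";
my $escaped = quotemeta $query;
print "$escaped\n";   # c\+\+\ \(beta\)

# The escaped form now matches only the literal text:
die "no match" unless "learning c++ (beta) today" =~ /$escaped/;
```

Without the escaping, a query such as "c++" would be an invalid regex and could make the script die mid-request.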
The opendir function opens the specified directory and returns a handle that we can use to get a list of all the files in that directory. It’s a waste of time to search through binary files, such as sounds and images, so we use Perl’s grep function (not to be confused with the Unix grep and fgrep applications) to filter them out. In this context, the grep function iterates over a list of filenames returned by readdir, setting $_ for each element, and evaluates the expression specified within the braces, returning only the elements for which the expression is true. We are using readdir in an array context so that we can pass the list of all files in the directory to grep for processing. But there is a problem with this approach. The readdir function simply returns the name of the file and not the full path, which means that we have to construct a full path before we can pass it to the -T operator. We use the $DOCUMENT_ROOT variable to create the full path to the file.

The -T operator returns true if the file is a text file. After grep finishes processing all the files, @files will contain a list of all the text files.
We iterate through the @files array, setting $file to the current value each time through the loop. We proceed to open the file, making sure to return an error if we cannot open it, and iterate through it one line at a time.
For each matching file, the %matches hash stores three elements: file, to store the name of the file; num_matches, to store the number of matches; and a content array to hold all the lines containing matches. We need the filename for output purposes.
We use a simple case-insensitive regex to search for the query. The /o option compiles the regex only once, which greatly improves the speed of the search. Note that this will cause problems for scripts running under mod_perl or FastCGI, which we’ll discuss later in Chapter 17.
If the line contains a match, we escape characters that could be mistaken for HTML tags. We then bold the matched text, increment the match counter by the number of matches, and push that line onto that file’s content array.
After we have finished looking through the files, we sort the results by the number of matches found in decreasing order and then alphabetically by path for those who have the same number of matches.
To generate our results, we walk through our sorted list. For each file, we create a link and display the number of matches and all the lines that matched the query. Since the content exists as individual elements in an array, we join all the elements together into one large string delimited by an HTML break tag.
Now, let us improve on this application a bit by allowing users to specify regular expression searches. We will not present the entire application, since it is very similar to the one we have just covered.
By allowing users to specify regular expressions in their search, we make the search engine much more powerful. For example, a user who wants to search for the recipe for Zwetschgendatschi (a Bavarian plum cake) from your online collection, but is not sure of the exact spelling, could simply enter Zwet.+?chi to find it.
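Applied as a real regex, that partial spelling does find the text:

```perl
use strict;
use warnings;

# The query is treated as a real regex, so an uncertain spelling
# can still locate the document:
my $line  = "Zwetschgendatschi: a Bavarian plum cake";
my $query = 'Zwet.+?chi';
my $found = ( $line =~ /$query/i ) ? 1 : 0;
print "$found\n";   # 1
```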
In order to implement this functionality, we have to add several pieces to the search engine.
First, we need to modify the HTML file to provide an option for the user to turn the functionality on or off:
Regex Searching:
<INPUT TYPE="radio" NAME="regex" VALUE="on">On
<INPUT TYPE="radio" NAME="regex" VALUE="off" CHECKED>Off
Then, we need to check for this value in the application and act accordingly. Here is the beginning of the new search script:
#!/usr/bin/perl -wT
use strict;
use CGI;
use CGIBook::Error;

my $q     = new CGI;
my $regex = $q->param( "regex" );
my $query = $q->param( "query" );

unless ( defined $query and length $query ) {
    error( $q, "Please specify a query!" );
}

if ( defined $regex and $regex eq "on" ) {
    eval { /$query/o };
    error( $q, "Invalid Regex" ) if $@;
} else {
    $query = quotemeta $query;
}

my $results = search( $q, $query );

print $q->header( "text/html" ),
      $q->start_html( "Simple Perl Regex Search" ),
      $q->h1( "Search for: $query" ),
      $q->ul( $results || "No matches found" ),
      $q->end_html;
.
.
The rest of the code remains the same. What we are doing differently here is checking if the user chose the “regex” option and, if so, evaluating the user-specified regex at runtime using the eval function. We can check whether the regex is invalid by looking at the value stored in $@. Perl sets this variable if there is an error in the evaluated code. If the regex is valid, we can go ahead and use it directly, without quoting the specified metacharacters. If the “regex” option was not requested, we perform the search as before.
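The eval-and-check-$@ pattern in isolation:

```perl
use strict;
use warnings;

# Compiling the user's pattern inside eval traps bad regexes:
# $@ is set on failure instead of the script dying mid-request.
my %status;
foreach my $pattern ( 'Zwet.+?chi', '*broken[' ) {
    eval { "" =~ /$pattern/ };
    $status{$pattern} = $@ ? "invalid" : "valid";
    print "$status{$pattern}: $pattern\n";
}
```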
As you can see, both of these applications are much improved over the first one, but neither one of them is perfect. Since both of them are based on a linear search algorithm, the search process will be slow when dealing with directories that contain many files. They also search only one directory. They could be modified to recurse down through subdirectories, but that would decrease the performance even more. In the next section, we will look at an index-based approach that calls for creating a dictionary of relevant words in advance, and then searching it rather than the actual files.
The applications that we’ve looked at so far search through each and every file in the specified directory, looking for particular words or phrases. This is not only time consuming, but will also place a great burden on the server. We clearly need a different approach to searching.
A more efficient approach is to create an index (like the one you can find at the back of this and other books) containing all the words from specific documents and the name of the document in which they appear.
In this section, we will discuss an application that creates an inverted index. The index is inverted in the sense that a particular word is used to find the file(s) in which it appears, rather than the other way around. In the following section, we will look at the CGI script that searches this index and presents the results in a nice format.
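A toy version of the idea, with the documents reduced to in-memory strings, shows why lookups become cheap:

```perl
use strict;
use warnings;

# A toy inverted index: map each word to the list of documents
# containing it (here the "documents" are just in-memory strings).
my %docs = (
    'sprint.html'   => "the fastest sprint of the games",
    'olympics.html' => "the games open tomorrow",
);

my %index;
while ( my( $name, $text ) = each %docs ) {
    my %seen;
    for my $word ( $text =~ /(\w{2,})/g ) {
        push @{ $index{lc $word} }, $name unless $seen{lc $word}++;
    }
}

# A lookup is now a single hash access, not a scan of every file:
print join( ",", sort @{ $index{games} } ), "\n";
```

The real indexer below stores this structure on disk so the expensive pass over the files happens once, in advance.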
Example 12-3 creates the indexer.
Example 12-3. indexer.pl
#!/usr/bin/perl -w
# This is not a CGI, so taint mode is not required
use strict;
use Fcntl;
use File::Find;
use DB_File;
use Getopt::Long;
require "stem.pl";

use constant DB_CACHE      => 0;
use constant DEFAULT_INDEX => "/usr/local/apache/data/index.db";

my( %opts, %index, @files, $stop_words );

GetOptions( \%opts, "dir=s", "cache=s", "index=s",
                    "ignore", "stop=s", "numbers", "stem" );

die usage( ) unless $opts{dir} && -d $opts{dir};

$opts{'index'} ||= DEFAULT_INDEX;
$DB_BTREE->{cachesize} = $opts{cache} || DB_CACHE;

tie %index, "DB_File", $opts{'index'}, O_RDWR|O_CREAT, 0640, $DB_BTREE
    or die "Cannot tie database: $!\n";

$index{"!OPTION:stem"}   = 1 if $opts{'stem'};
$index{"!OPTION:ignore"} = 1 if $opts{'ignore'};

find( sub { push @files, $File::Find::name }, $opts{dir} );

$stop_words = load_stopwords( $opts{stop} ) if $opts{stop};
process_files( \%index, \@files, \%opts, $stop_words );

untie %index;

sub load_stopwords {
    my $file  = shift;
    my $words = {};
    local *INFO;
    local $_;

    die "Cannot find stop file: $file\n" unless -e $file;

    open INFO, $file or die "$!\n";
    while ( <INFO> ) {
        next if /^#/;
        $words->{lc $1} = 1 if /(\S+)/;
    }
    close INFO;

    return $words;
}

sub process_files {
    my( $index, $files, $opts, $stop_words ) = @_;
    local *FILE;
    local $_;
    local $/ = "";    # paragraph mode

    for ( my $file_id = 0; $file_id < @$files; $file_id++ ) {
        my $file = $files->[$file_id];
        my %seen_in_file;

        next unless -T $file;
        print STDERR "Indexing $file\n";

        $index->{"!FILE_NAME:$file_id"} = $file;

        open FILE, $file or die "Cannot open file: $file!\n";
        while ( <FILE> ) {
            tr/A-Z/a-z/ if $opts->{ignore};
            s/<.+?>//gs;    # Whoa! What about < or > in comments or JavaScript?

            while ( /([a-z\d]{2,})/gi ) {
                my $word = $1;
                next if $stop_words->{lc $word};
                next if $word =~ /^\d+$/ && not $opts->{numbers};
                ( $word ) = stem( $word ) if $opts->{stem};

                $index->{$word} = ( exists $index->{$word} ?
                                    "$index->{$word}:" : "" ) . "$file_id"
                    unless $seen_in_file{$word}++;
            }
        }
        close FILE;
    }
}

sub usage {
    my $usage = <<End_of_Usage;
Usage: $0 -dir directory [options]

The options are:
  -cache    DB_File cache size (in bytes)
  -index    Path to index, default: /usr/local/apache/data/index.db
  -ignore   Case-insensitive index
  -stop     Path to stopwords file
  -numbers  Include numbers in index
  -stem     Stem words
End_of_Usage

    return $usage;
}
We will use File::Find to get a list of all the files in the specified directory, as well as files in any subdirectories. The File::Basename module, which we will put to use in the search CGI later in this chapter, provides functions to extract the filename, given the full path. You might be wondering why we need this feature, considering the fact that we could use a simple regular expression to get at the filename. If we used a regex, we would have to determine what platform the application is running on and extract the filename accordingly. This module takes care of that for us.
We use DB_File to create and store the index. Note that we could also store the index in an RDBMS, although a DBM file is certainly adequate for many sites. The method for creating indexes is the same no matter what type of format we use for storage. Getopt::Long helps us handle command-line options, and stem.pl , a Perl 4 library, has algorithms to automatically “stem” (or remove) word suffixes.
We use the DB_CACHE constant to hold the size of the DB_File memory cache. Increasing the size of this cache (up to a certain point) improves the insertion rate at the expense of memory; in other words, it increases the rate at which we store the words in the index. A cache size of 0 is used as the default, which leaves DB_File’s own default cache size in effect. DEFAULT_INDEX contains the default path to the file that will hold our data. The user can specify a different file by using the -index option, as you will see shortly.
The GetOptions function (part of the Getopt::Long module) allows us to extract any command-line options and store them in a hash. We pass a reference to a hash and a list of options to GetOptions. The options that take arguments contain an “s” to indicate that they each take a string.
This application allows you to pass several options that will affect the indexing process. The -dir option is the only one that is required, as it is used to specify the directory that contains the files to be indexed.
You can use the -cache option to specify the cache size and -index to specify the path to the index. The -ignore option creates an index where all the words are turned into lowercase (case-insensitive). This will increase the rate at which the index is created, as well as decrease the size of the index. If you want numbers in documents to be included in the index, you can specify the -numbers option.
You can use the -stop option to specify a file that contains “stop” words—words that are generally found in most of your documents. Typical stop words include “a”, “an”, “to”, “it”, and “the”, but you can also include words that are more specific to your documents.
Finally, the -stem option stems word suffixes before storing them in the index. This will help us find words in documents much more easily. For example, if a user searches for “tomatoes”, our search application will return documents that contain “tomato” as well as “tomatoes”. An important note here is that stemming will also create a case-insensitive index.
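The stem.pl library implements a full stemming algorithm; the crude suffix-stripper below is a stand-in of our own (an assumption, not the real stem() routine) that only illustrates the effect: the document word and the query word reduce to the same stem.

```perl
use strict;
use warnings;

# toy_stem is a hypothetical stand-in for stem.pl's stem() function;
# it only illustrates the idea of reducing variants to a common form.
sub toy_stem {
    my $word = lc shift;
    $word =~ s/(?:es|s|ing|ed)$//;
    return $word;
}

print toy_stem("tomatoes"), "\n";   # tomato
print toy_stem("tomato"),   "\n";   # tomato
```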
Here’s an example of how you would use these various options:
$ ./indexer.pl -dir /usr/local/apache/htdocs/sports -cache 16_000_000 \
      -index /usr/local/apache/data/sports.db -stop my_stop_words.txt -stem
%index is the hash that will hold the index. We use the tie function to bind the hash to the file specified by $opts{index}. This allows us to transparently store a hash in a file, which we can later retrieve and modify. In this example, we are using DB_File, as it is faster and more efficient than other DBM implementations.
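The tie-to-a-file mechanism can be sketched with SDBM_File, a DBM implementation bundled with Perl, so the example runs without Berkeley DB; the indexer simply substitutes DB_File and a BTREE:

```perl
use strict;
use warnings;
use Fcntl;
use SDBM_File;
use File::Temp qw( tempdir );

# The same tie idea, using SDBM_File (bundled with Perl) so this sketch
# runs without Berkeley DB; the indexer substitutes DB_File instead.
my $dir = tempdir( CLEANUP => 1 );

tie my %index, "SDBM_File", "$dir/index", O_RDWR|O_CREAT, 0640
    or die "Cannot tie database: $!";
$index{perl} = "1:3";              # word => colon-joined file IDs
untie %index;

# Tie it again: the hash data was transparently saved to disk.
tie my %again, "SDBM_File", "$dir/index", O_RDONLY, 0640
    or die "Cannot tie database: $!";
my $stored = $again{perl};
untie %again;

print "$stored\n";   # 1:3
```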
If the -stem option was used, we record this in our index so that our CGI script knows whether to apply stemming to the query as well. We could have stored this information in another database file, but that would require opening two files for each search. Instead, we name this key with an exclamation point such that it can’t collide with any of the words we’re indexing.
We use the find function (part of the File::Find module) to get a list of all the files in the specified directory. find expects its first argument to be a code reference, which can either be a reference to a subroutine or an inlined anonymous subroutine, as is the case above. As find traverses the directory (as well as all subdirectories), it executes the code specified by the first argument, setting the $File::Find::name variable to the path of the file. This builds an array of the paths to all the files under the original directory.
If a stop file was specified and it exists, we call the load_stopwords function to read through the file and return a reference to a hash.
The most important function in this application is process_files, which iterates through all the files and stores the words in $index. Finally, we close the binding between the hash and the file and exit. At this point, we will have a file containing the index.
Let’s look at the functions now. The load_stopwords function opens the stop words file, ignores all comments (lines starting with “#”), and extracts the first word found on each line with (\S+). The word is converted to lowercase by the lc function and stored as a key in the hash referenced by $words. Since we are going to find words with mixed case in our files, it is much easier and quicker to compare them to this list if all our stop words are either completely uppercase or completely lowercase.
Before we discuss the process_files function, let’s look at the arguments it expects. The first argument, $index, is a reference to an empty hash that will eventually contain the words from all the files as well as pointers to the documents where they are found. $files is a reference to a list of all the files to parse. $opts is a reference to the hash of our command-line arguments, and the final argument, $stop_words, is a reference to a hash containing our stop words.
If the user chose to ignore case, we convert all words into lowercase, thus creating a case-insensitive index.
We set Perl’s default input record separator, $/, to paragraph mode. In other words, one read on a file handle will return a paragraph, as opposed to a single line. This allows us to index the files at a faster rate.
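Paragraph mode in a nutshell, using an in-memory filehandle so the sketch needs no real file:

```perl
use strict;
use warnings;

# With $/ set to "", each read returns one blank-line-separated
# paragraph instead of one line.
my $text = "first para\nstill first\n\nsecond para\n\nthird\n";
open my $fh, "<", \$text or die $!;

local $/ = "";
my @paragraphs = <$fh>;
close $fh;

print scalar @paragraphs, "\n";   # 3
```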
We iterate through the @$files array with a for loop, using $file_id as the index and setting $file to the current filename each time through the loop. Since this application creates a human-searchable index, we will deal only with text files. We use the -T operator to ignore any non-text files.
The first entry into the %$index hash is a “unique” key that associates a number with the full path to the file. Since this hash will also hold all the words that we find, we use the “!FILE_NAME” string to keep our number-to-file mappings separate from the words.
We start our indexing process by iterating through the file a paragraph at a time; the $_ variable holds the contents. If the -ignore option was specified by the user, we convert the paragraph that we have just read to lowercase.
We also strip all HTML tags from the paragraph, since we don’t want them to be indexed. The regexp will look for a string starting with “<”, followed by one or more characters (including newlines) until it finds the first “>”.
We iterate through the paragraph using a regex that extracts words at least two characters in length, matching letters as well as digits (\d matches “0-9”). The matched word is stored in $1.
Before we check to see if the word we extracted is a stop word, we need to convert it to lowercase, since we converted all the stop words to lowercase earlier in this script. If the word is, indeed, a stop word, we skip it and continue. We also skip numbers if the -numbers option is not specified.
If the -stem option is specified, we call the stem function (part of the stem.pl library) to remove any suffixes from the word and convert it to lowercase.
Finally, we are ready to store the word in the index, where the value represents the file that we are currently parsing. Unfortunately, this isn’t that simple. The last command is a little long and complicated; it helps to read it backwards. First, we check whether we have previously seen the word in this file by using the %seen_in_file hash. The first time through, there will not be an entry in the hash, so the expression will evaluate to false (and thus pass the unless check); thereafter, it will contain the number of times we have seen the word in the file and evaluate to true (and thus fail the unless check). So the first time we see the word in the file, we add it to our index. If the word was previously indexed for another file, then we join the $file_id of this file to the previous entry with a colon. Otherwise, we just add $file_id as this word’s only value thus far.
When this function finishes, the %$index hash will look something like this:
$index = {
    "!FILE_NAME:1" => "/usr/local/apache/htdocs/sports/sprint.html",
    "!FILE_NAME:2" => "/usr/local/apache/htdocs/sports/olympics.html",
    "!FILE_NAME:3" => "/usr/local/apache/htdocs/sports/celtics.html",
    browser        => "1:2",
    code           => "3",
    color          => "2:3",
    comment        => "2",
    content        => "1",
    cool           => "2:3",
    copyright      => "1:2:3"
};
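The colon-joined value strings are built by the statement discussed above; here it is in isolation, fed a fixed stream of (word, file ID) pairs (with the per-file %seen_in_file bookkeeping compressed into one hash for brevity):

```perl
use strict;
use warnings;

# A (word, file_id) pair is appended only the first time that word
# is seen in that file; repeat sightings in the same file are skipped.
my( %index, %seen_in_file );
foreach my $pair ( [ browser => 1 ], [ browser => 1 ], [ browser => 2 ] ) {
    my( $word, $file_id ) = @$pair;
    $index{$word} = ( exists $index{$word} ? "$index{$word}:" : "" ) . $file_id
        unless $seen_in_file{"$file_id:$word"}++;
}
print "$index{browser}\n";   # 1:2
```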
Now, we are ready to implement the CGI application that will search this index.
The indexer application makes our life easier when it comes time to write the CGI application to perform the actual search. The CGI application should parse the form input, open the DBM file created by the indexer, search for possible matches and then return HTML output.
Example 12-4 contains the program.
Example 12-4. indexed_search.cgi
#!/usr/bin/perl -wT
use strict;
use Fcntl;
use DB_File;
use CGI;
use CGIBook::Error;
use File::Basename;
require "stem.pl";

use constant INDEX_DB => "/usr/local/apache/data/index.db";

my( %index, $paths, $path );

my $q     = new CGI;
my $query = $q->param( "query" );
my @words = split /[\s,]+/, $query;

tie %index, "DB_File", INDEX_DB, O_RDONLY, 0640
    or error( $q, "Cannot open database" );

$paths = search( \%index, \@words );

print $q->header,
      $q->start_html( "Inverted Index Search" ),
      $q->h1( "Search for: $query" );

unless ( @$paths ) {
    print $q->h2( $q->font( { -color => "#FF0000" }, "No Matches Found" ) );
}

foreach $path ( @$paths ) {
    next unless $path =~ s/^\Q$ENV{DOCUMENT_ROOT}\E//o;
    $path = to_uri_path( $path );
    print $q->a( { -href => "$path" }, "$path" ), $q->br;
}

print $q->end_html;
untie %index;

sub search {
    my( $index, $words ) = @_;

    my $do_stemming = exists $index->{"!OPTION:stem"}   ? 1 : 0;
    my $ignore_case = exists $index->{"!OPTION:ignore"} ? 1 : 0;
    my( %matches, $word, $file_index );

    foreach $word ( @$words ) {
        my $match;
        if ( $do_stemming ) {
            my( $stem ) = stem( $word );
            $match = $index->{$stem};
        } elsif ( $ignore_case ) {
            $match = $index->{lc $word};
        } else {
            $match = $index->{$word};
        }

        next unless $match;

        foreach $file_index ( split /:/, $match ) {
            my $filename = $index->{"!FILE_NAME:$file_index"};
            $matches{$filename}++;
        }
    }

    my @files = map  { $_->[0] }
                sort { $matches{$b->[0]} <=> $matches{$a->[0]}
                           || $a->[1] <=> $b->[1] }
                map  { [ $_, -M $_ ] }
                keys %matches;

    return \@files;
}

sub to_uri_path {
    my $path = shift;
    my( $name, @elements );

    do {
        ( $name, $path ) = fileparse( $path );
        unshift @elements, $name;
        chop $path;
    } while $path;

    return join '/', @elements;
}
The modules should be familiar to you by now. The INDEX_DB constant contains the path of the index created by the indexer application.

Since a query can include multiple words, we split it on any run of whitespace and commas and store the resulting words in the @words array. We use tie to open the index DBM file in read-only mode; in other words, we bind the index file with the %index hash. If we cannot open the file, we call our error function to return an error to the browser.
The real searching is done, appropriately enough, in the search function, which takes a reference to the index hash and a reference to the list of words we are searching for. The first thing we do is to peek into the index and see if the stem option was set when the index was built. We then proceed to iterate through the @$words array, searching for possible matches. If stemming was enabled, we stem the word and compare that. Otherwise, we check to see whether the particular word exists in the index as-is, or as a lowercase word if the index is not case-sensitive. If any of these comparisons succeeds, we have a match. Otherwise, we ignore the word and continue.
If there is a match, we split the colon-separated list of file IDs where that particular word is found. Since we don’t want duplicate entries in our final list, we store the full path of the matching files in the %matches hash.
After the loop has finished executing, we are left with the matching files in %matches. We would like to add some order to our results and display them according to the number of words matching and then by the file’s modification time. So, we sort the keys by the number of matches in decreasing order and then by the value returned by the -M operator (most recently modified first), and store the sorted paths in the @files array.
We could calculate the modification time of the files during each comparison like this:
my @files = sort {
                $matches{$b} <=> $matches{$a}
                    || -M $a <=> -M $b
            } keys %matches;
However, this is inefficient because we might calculate the modification time for each file multiple times. A more efficient algorithm involves precalculating the modification times as we have done in the program.
This strategy has become known as the Schwartzian Transform, made famous by Randal Schwartz. It’s beyond the scope of this book to explain this, but if you’re interested, see Joseph Hall’s explanation of the Transform, located at: http://www.5sigma.com/perl/schwtr.html. Ours is a slight variation because we perform a two-part sort.
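The map/sort/map shape is easier to see with a cheap stand-in key (string length instead of -M, so the sketch runs without real files):

```perl
use strict;
use warnings;

# Schwartzian Transform: compute each sort key exactly once in the
# inner map, sort on the precomputed key, then unwrap the values.
my @paths = qw( index.html a.html results.html );

my @sorted = map  { $_->[0] }
             sort { $a->[1] <=> $b->[1] }
             map  { [ $_, length $_ ] }     # key computed once per element
             @paths;

print "@sorted\n";   # a.html index.html results.html
```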
We output the HTTP and HTML document headers and proceed to check whether we have any matches. If not, we return a simple message. Otherwise, we iterate through the @files array, setting $path to the current element each time through the loop. We strip off the part of the path that matches the server’s root directory. That should give us the path that corresponds to a URL. However, on non-Unix filesystems, we won’t have forward slashes (“/”) separating directories. So we call the to_uri_path function, which uses the File::Basename module to strip off successive elements of the path and then rebuild it with forward slashes. Note that this will work on many operating systems like Win32 and MacOS, but it will not work on systems that do not use a single character to delimit parts of the path (like VMS; although the chances that you’re actually doing CGI development on a VMS machine are pretty slim).
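fileparse, the File::Basename function to_uri_path relies on, splits a path like this:

```perl
use strict;
use warnings;
use File::Basename;

# fileparse splits a path into name and directory parts using the
# current platform's path conventions:
my( $name, $dir ) = fileparse( "/usr/local/apache/htdocs/sports/sprint.html" );
print "$name\n";   # sprint.html
print "$dir\n";    # /usr/local/apache/htdocs/sports/
```

to_uri_path repeatedly peels off $name, chops the trailing separator from $dir, and repeats until the whole path has been consumed, which is what makes it separator-agnostic.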
We build proper links with this newly formatted path, print the remainder of our results, close the binding between the database and the hash, and exit.