Date-Range Searching with a Client-Side Application

Monitor a set of queries for new finds added to the Google index yesterday.

The GooFresh [Hack #42] hack is a simple web form-driven CGI script for building date range [Hack #11] Google queries. A simple web-based interface is fine when you want to search for only one or two items at a time. But what if you want to run multiple searches over time and save the results to your computer for comparative analysis?

A better fit for this task is a client-side application that you run from the comfort of your own computer’s desktop. This Perl script feeds specified queries to Google via the Google Web API, limiting results to those indexed yesterday. New finds are appended to a comma-delimited text file per query, suitable for import into Excel or your average database application.

Tip

This hack requires an additional Perl module, Time::JulianDay (http://search.cpan.org/author/MUIR/); it just won’t work until you have the module installed.

The Queries

First, you’ll need to prepare a few queries to feed the script. Try these out via the Google search interface itself first to make sure you’re receiving the kind of results you’re expecting. Your queries can be anything you’d be interested in tracking over time: topics of long-lasting or current interest, searches for new directories of information [Hack #21] coming online, or unique quotes from articles or other sources that you want to monitor for signs of plagiarism.

Use whatever special syntaxes you like except for link:; as you might remember, link: can’t be used in concert with any other special syntax like daterange:, upon which this hack relies. If you insist on trying anyway (e.g., link:www.yahoo.com daterange:2452421-2452521), Google will simply treat link as yet another query word (e.g., link www.yahoo.com), yielding some unexpected and useless results.
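If you'd rather catch a stray link: query before it reaches Google, a small check along these lines could help. This is only a sketch: uses_link_syntax is a hypothetical helper that is not part of goonow.pl, and it simply looks for link: at the start of a query word.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of goonow.pl): true if any word in the
# query starts with link:, which can't be combined with daterange:.
sub uses_link_syntax {
    my ($query) = @_;
    return $query =~ /(?:^|\s)link:/ ? 1 : 0;
}

my @queries = (
    '"digital archives"',
    'link:www.yahoo.com',
    'intitle:directory intitle:resources',
);

for my $query (@queries) {
    if (uses_link_syntax($query)) {
        warn "Skipping '$query': link: can't be combined with daterange:\n";
        next;
    }
    print "OK: $query\n";
}
```

A check like this would sit at the top of the script's query loop, before the daterange: term is appended.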

Put each query on its own line. A sample query file will look something like this:

"digital archives"
intitle:"state library of"
intitle:directory intitle:resources
"now * * time for all good men * come * * aid * * party"

Save the text file somewhere memorable; alongside the script you’re about to write is as good a place as any.
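A query file edited by hand tends to pick up blank lines and stray spaces. As a rough sketch of how the file could be tidied on the way in, the following reads one query per line, trims surrounding whitespace, and skips empty lines. The read_queries helper is hypothetical, and the sample is read from an in-memory string rather than a real file so that the snippet stands on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: read queries from a filehandle, one per line,
# trimming surrounding whitespace and skipping blank lines.
sub read_queries {
    my ($fh) = @_;
    my @queries;
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/^\s+|\s+$//g;
        next if $line eq '';
        push @queries, $line;
    }
    return @queries;
}

# Sample query file contents, held in memory for illustration only
my $sample = <<'END';
"digital archives"

intitle:"state library of"
END

open(my $fh, '<', \$sample) or die "Couldn't open in-memory file: $!\n";
my @queries = read_queries($fh);
close $fh;

print scalar(@queries), " queries loaded\n";
```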

The Code

#!/usr/local/bin/perl -w
# goonow.pl
# feeds queries specified in a text file to Google, querying
# for recent additions to the Google index.  The script appends
# to CSV files, one per query, creating them if they don't exist.
# usage: perl goonow.pl [query_filename]

# My Google API developer's key
my $google_key='insert key here';

# Location of the GoogleSearch WSDL file
my $google_wdsl = "./GoogleSearch.wsdl";

use strict;

use SOAP::Lite;
use Time::JulianDay;

$ARGV[0] or die "usage: perl goonow.pl [query_filename]\n";

my $julian_date = int local_julian_day(time) - 2;

my $google_search  = SOAP::Lite->service("file:$google_wdsl");

open QUERIES, $ARGV[0] or die "Couldn't read $ARGV[0]: $!";

while (my $query = <QUERIES>) {
  chomp $query;
  warn "Searching Google for $query\n";
   
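  # Name the output file after the bare query, before the date range
  # is added, so each day's finds land in the same CSV file
  (my $outfile = $query) =~ s/\W/_/g;
  $query .= " daterange:$julian_date-$julian_date";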
  open (OUT, ">> $outfile.csv")
    or die "Couldn't open $outfile.csv: $!\n";
   
  my $results = $google_search ->
    doGoogleSearch(
      $google_key, $query, 0, 10, "false", "",  "false",
      "", "latin1", "latin1"
    );
  foreach (@{$results->{'resultElements'}}) {
    print OUT '"' . join('","', (
      map {
        s!\n!!g; # drop spurious newlines
        s!<.+?>!!g; # drop all HTML tags
        s!"!""!g; # double escape " marks
        $_;
      } @$_{'title','URL','snippet'}
    ) ) . "\"\n";
  }
}

You’ll notice that GooNow checks the day before yesterday’s rather than yesterday’s additions (my $julian_date = int local_julian_day(time) - 2;). Google indexes some pages very frequently; these show up in yesterday’s additions and really bulk up your search results. So if you search for yesterday’s results, in addition to updated pages you’ll get a lot of noise, pages that Google indexes every day, rather than the fresh content you’re after. Skipping back one more day is a nice hack to get around the noise.
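To make the date arithmetic concrete, here is a minimal sketch of how a one-day daterange: term can be built for any number of days back. It does not use Time::JulianDay; instead it assumes UTC dates and relies on the fact that January 1, 1970 is Julian day number 2440588, so results can differ by a day from local_julian_day near midnight. Both helper names are hypothetical.

```perl
use strict;
use warnings;
use POSIX qw(floor);

# Hypothetical helper: Julian day number of the UTC date containing $epoch.
# 2440588 is the Julian day number of January 1, 1970 (the Unix epoch).
sub julian_day_utc {
    my ($epoch) = @_;
    return floor($epoch / 86400) + 2440588;
}

# Hypothetical helper: a daterange: term covering the single day that
# lies $days_back days before the date containing $epoch.
sub daterange_term {
    my ($epoch, $days_back) = @_;
    my $day = julian_day_utc($epoch) - $days_back;
    return "daterange:$day-$day";
}

# The day before yesterday, as GooNow uses
print daterange_term(time, 2), "\n";
```

Changing the second argument to 1 would search yesterday's additions instead, noise and all.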

Running the Hack

This script is invoked on the command line like so:

$ perl goonow.pl query_filename

Here, query_filename is the name of the text file holding the queries to feed to the script. The file can live in the current directory or elsewhere; if elsewhere, be sure to include the full path (e.g., /mydocu~1/hacks/queries.txt).

Bear in mind that all output is directed to CSV files, one per query. So don’t expect any fascinating output on the screen.

The Results

Here is a quick look at one of the CSV output files created, intitle__state_library_of_.csv:

"State Library of Louisiana","http://www.state.lib.la.us/"," ... Click here if you have any questions or comments. Copyright © 1998-2001 State Library of Louisiana Last modified: August 07, 2002. "
"STATE LIBRARY OF NEW SOUTH WALES, SYDNEY AUSTRALIA","http://www.slnsw.gov.au/"," ... State Library of New South Wales Macquarie St, Sydney NSW Australia 2000 Phone: +61 2 9273 1414 Fax: +61 2 9273 1255. Your comments You could win a prize! ...  "
"State Library of Victoria","http://www.slv.vic.gov.au/"," ... clicking on our logo. State Library of Victoria Logo with link to homepage State Library of Victoria. A world class cultural resource ...  "
...
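If you want to pull these files back into a script of your own rather than into Excel, the records can be split apart again with a little care over the doubled quote marks. The following is a sketch only: parse_csv_line is a hypothetical helper, it assumes every field is quoted and each record sits on a single line (which is how the script writes them), and the sample record is an abbreviated version of one shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: split one line of fully quoted CSV into fields,
# turning each doubled quote mark ("") back into a single one.
sub parse_csv_line {
    my ($line) = @_;
    my @fields;
    while ($line =~ /\G"((?:[^"]|"")*)"(?:,|$)/g) {
        (my $field = $1) =~ s/""/"/g;
        push @fields, $field;
    }
    return @fields;
}

# Abbreviated sample record, for illustration only
my $record = '"State Library of Victoria","http://www.slv.vic.gov.au/"," ... A world class cultural resource ... "';

my ($title, $url, $snippet) = parse_csv_line($record);
print "$title => $url\n";
```

For anything more demanding than this fixed three-field layout, a dedicated CSV module would be a safer choice than a hand-rolled pattern.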

Hacking the Hack

The script keeps appending new finds to the appropriate CSV output file. If you wish to reset the CSV files associated with particular queries, simply delete them and the script will create them anew.

Or you can make one slight adjustment to have the script create the CSV files anew each time, overwriting the previous version, like so:

...
(my $outfile = $query) =~ s/\W/_/g;
open (OUT, "> $outfile.csv")
  or die "Couldn't open $outfile.csv: $!\n";
my $results = $google_search ->
  doGoogleSearch(
    $google_key, $query, 0, 10, "false", "", "false",
    "", "latin1", "latin1"
  );
...

Notice that the only change in the code is the removal of one of the > characters when the output file is opened: open (OUT, "> $outfile.csv") instead of open (OUT, ">> $outfile.csv").
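If you'd like both behaviors without editing the script each time, the open mode can be chosen at runtime. The sketch below shows the difference between the two modes using a throwaway file in a temporary directory; csv_open_mode, write_row, and count_lines are hypothetical helpers, and wiring the choice to a command-line switch is left as an exercise.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: '>' truncates the file, '>>' appends to it.
sub csv_open_mode {
    my ($overwrite) = @_;
    return $overwrite ? '>' : '>>';
}

# Hypothetical helper: write one row using the chosen mode.
sub write_row {
    my ($path, $overwrite, $row) = @_;
    open(my $out, csv_open_mode($overwrite), $path)
        or die "Couldn't open $path: $!\n";
    print {$out} "$row\n";
    close $out or die "Couldn't close $path: $!\n";
}

# Hypothetical helper: count the lines currently in the file.
sub count_lines {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Couldn't read $path: $!\n";
    my @lines = <$in>;
    close $in;
    return scalar @lines;
}

my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/example_query.csv";

write_row($path, 0, '"first run"');
write_row($path, 0, '"second run"');
my $after_append = count_lines($path);

write_row($path, 1, '"fresh start"');
my $after_overwrite = count_lines($path);

print "after two appends: $after_append lines\n";
print "after overwrite: $after_overwrite lines\n";
```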
