Monitor a set of queries for new finds added to the Google index yesterday.
The GooFresh [Hack #42] hack is a simple web form-driven CGI script for building date range [Hack #11] Google queries. A simple web-based interface is fine when you want to search for only one or two items at a time. But what of performing multiple searches over time, saving the results to your computer for comparative analysis?
A better fit for this task is a client-side application that you run from the comfort of your own computer’s desktop. This Perl script feeds specified queries to Google via the Google Web API, limiting results to those indexed yesterday. New finds are appended to a comma-delimited text file per query, suitable for import into Excel or your average database application.
This hack requires an additional Perl module, Time::JulianDay (http://search.cpan.org/author/MUIR/); it just won’t work until you have the module installed.
First, you’ll need to prepare a few queries to feed the script. Try these out via the Google search interface itself first to make sure you’re receiving the kind of results you’re expecting. Your queries can be anything you’d be interested in tracking over time: topics of long-lasting or current interest, searches for new directories of information [Hack #21] coming online, unique quotes from articles or other sources that you want to monitor for signs of plagiarism.
Use whatever special syntaxes you like except for link:; as you might remember, link: can't be used in concert with any other special syntax like daterange:, upon which this hack relies. If you insist on trying anyway (e.g., link:www.yahoo.com daterange:2452421-2452521), Google will simply treat link as yet another query word (e.g., link www.yahoo.com), yielding some unexpected and useless results.
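If you'd rather catch such queries before spending any of your API quota on them, you could screen the query file up front. What follows is only a minimal sketch and not part of the hack's script; the helper name uses_link_syntax and the sample queries are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if a query contains the link: special syntax,
# which cannot be combined with daterange:.
sub uses_link_syntax {
    my ($query) = @_;
    # Match link: only at the start of the query or after whitespace,
    # so that words merely ending in "link" are left alone.
    return $query =~ /(?:^|\s)link:/ ? 1 : 0;
}

for my $query ('link:www.yahoo.com',
               'intitle:directory intitle:resources',
               '"digital archives"') {
    if (uses_link_syntax($query)) {
        warn "Skipping '$query': link: cannot be mixed with daterange:\n";
        next;
    }
    print "OK: $query\n";
}
```

The check is deliberately crude: it would also flag a quoted phrase that happens to contain the text link:, which is usually an acceptable trade-off for a pre-flight warning.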
Put each query on its own line. A sample query file will look something like this:
"digital archives"
intitle:"state library of"
intitle:directory intitle:resources
"now * * time for all good men * come * * aid * * party"
Save the text file somewhere memorable; alongside the script you’re about to write is as good a place as any.
#!/usr/local/bin/perl -w
# goonow.pl
# feeds queries specified in a text file to Google, querying
# for recent additions to the Google index. The script appends
# to CSV files, one per query, creating them if they don't exist.
# usage: perl goonow.pl [query_filename]

# My Google API developer's key
my $google_key='insert key here';

# Location of the GoogleSearch WSDL file
my $google_wdsl = "./GoogleSearch.wsdl";

use strict;

use SOAP::Lite;
use Time::JulianDay;

$ARGV[0] or die "usage: perl goonow.pl [query_filename]\n";

# Julian date of the day before yesterday
my $julian_date = int local_julian_day(time) - 2;

my $google_search = SOAP::Lite->service("file:$google_wdsl");

open QUERIES, $ARGV[0] or die "Couldn't read $ARGV[0]: $!\n";

while (my $query = <QUERIES>) {
  chomp $query;
  warn "Searching Google for $query\n";

  # Name the CSV file after the bare query, before the date range is
  # added, so each query keeps appending to the same file day after day
  (my $outfile = $query) =~ s/\W/_/g;
  open (OUT, ">> $outfile.csv")
    or die "Couldn't open $outfile.csv: $!\n";

  # Restrict the search to the chosen day
  $query .= " daterange:$julian_date-$julian_date";

  my $results = $google_search ->
    doGoogleSearch(
      $google_key, $query, 0, 10, "false", "", "false",
      "", "latin1", "latin1"
    );

  foreach (@{$results->{'resultElements'}}) {
    print OUT '"' . join('","', (
      map {
        s!\n!!g;    # drop spurious newlines
        s!<.+?>!!g; # drop all HTML tags
        s!"!""!g;   # double escape " marks
        $_;
      } @$_{'title','URL','snippet'}
    ) ) . "\"\n";
  }

  close OUT;
}

close QUERIES;
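The quoting performed in the script's final map block is easy to try on its own. The sketch below is an illustration rather than part of goonow.pl: csv_line is a hypothetical helper and the sample values are made up, but the three clean-up substitutions are meant to mirror those in the script.

```perl
use strict;
use warnings;

# Hypothetical helper: build one CSV line from a list of fields, applying
# the same three clean-up steps as the script's map block.
sub csv_line {
    my @fields = @_;    # work on a copy so the caller's values are untouched
    for (@fields) {
        s/\n//g;        # drop newlines
        s/<.+?>//g;     # drop HTML tags
        s/"/""/g;       # double any embedded quote marks
    }
    return '"' . join('","', @fields) . qq{"\n};
}

print csv_line('The <b>"State"</b> Library',
               'http://www.example.com/',
               "a two-line\nsnippet");
```

Doubling embedded quote marks is the convention most spreadsheet and database importers expect, so a title containing quotation marks should still land in a single column.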
You’ll notice that GooNow checks the day before yesterday’s rather than yesterday’s additions (my $julian_date = int local_julian_day(time) - 2;). Google indexes some pages very frequently; these show up in yesterday’s additions and really bulk up your search results. So if you search for yesterday’s results, in addition to updated pages you’ll get a lot of noise, pages that Google indexes every day, rather than the fresh content you’re after. Skipping back one more day is a nice hack to get around the noise.
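If you're curious what that date arithmetic amounts to, here is a rough sketch using nothing but core Perl. It assumes UTC rather than local time, so near midnight it may land on a different day than Time::JulianDay's local_julian_day; the helper names utc_julian_day and daterange_term are hypothetical.

```perl
use strict;
use warnings;

# Julian day number of the Unix epoch date, 1970-01-01.
use constant EPOCH_JULIAN_DAY => 2440588;

# Hypothetical helper: Julian day number of the UTC date containing $epoch
# (seconds since the Unix epoch, assumed non-negative).
sub utc_julian_day {
    my ($epoch) = @_;
    return EPOCH_JULIAN_DAY + int($epoch / 86400);
}

# Hypothetical helper: a single-day daterange: term for $days_back days ago.
sub daterange_term {
    my ($epoch, $days_back) = @_;
    my $day = utc_julian_day($epoch) - $days_back;
    return "daterange:$day-$day";
}

# The day before yesterday, as the script uses.
print daterange_term(time, 2), "\n";
```

Changing the second argument from 2 to 1 would give yesterday's additions instead, noise and all.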
This script is invoked on the command line like so:
$ perl goonow.pl query_filename
Here, query_filename is the name of the text file holding all the queries to be fed to the script. The file can be located either in the local directory or elsewhere; if the latter, be sure to include the entire path (e.g., /mydocu~1/hacks/queries.txt).
Bear in mind that all output is directed to CSV files, one per query. So don’t expect any fascinating output on the screen.
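To predict which file a given query will end up in, you can run the file-naming substitution by itself. This is a small sketch that assumes the name is built from the query text alone; csv_name_for is a hypothetical helper, not something the script defines.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the script's s/\W/_/g substitution: every
# character that is not a letter, digit, or underscore becomes "_".
sub csv_name_for {
    my ($query) = @_;
    (my $name = $query) =~ s/\W/_/g;
    return "$name.csv";
}

print csv_name_for($_), "\n"
    for '"digital archives"', 'intitle:"state library of"';
```

Note that two queries differing only in punctuation would map to the same file name, so keep the queries in your file reasonably distinct.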
Taking a quick look at one of the CSV output files created, intitle__state_library_of_.csv:
"State Library of Louisiana","http://www.state.lib.la.us/"," ... Click here if you have any questions or comments. Copyright © 1998-2001 State Library of Louisiana Last modified: August 07, 2002. "
"STATE LIBRARY OF NEW SOUTH WALES, SYDNEY AUSTRALIA","http://www.slnsw.gov.au/"," ... State Library of New South Wales Macquarie St, Sydney NSW Australia 2000 Phone: +61 2 9273 1414 Fax: +61 2 9273 1255. Your comments You could win a prize! ... "
"State Library of Victoria","http://www.slv.vic.gov.au/"," ... clicking on our logo. State Library of Victoria Logo with link to homepage State Library of Victoria. A world class cultural resource ... "
...
The script keeps appending new finds to the appropriate CSV output file. If you wish to reset the CSV files associated with particular queries, simply delete them and the script will create them anew.
Or you can make one slight adjustment to have the script create the CSV files anew each time, overwriting the previous version, like so:
...
(my $outfile = $query) =~ s/\W/_/g;
open (OUT, "> $outfile.csv")
  or die "Couldn't open $outfile.csv: $!\n";
...
Notice the only change in the code is the removal of one of the > characters when the output file is created: open (OUT, "> $outfile.csv") instead of open (OUT, ">> $outfile.csv").
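To see the difference between the two modes without touching your real results, you can experiment on a scratch file. The following is a minimal sketch using the three-argument form of open and the core File::Temp module; write_line and count_lines are hypothetical helpers introduced only for this demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write one line to $path using the given open mode
# ('>>' appends to the file, '>' truncates it first).
sub write_line {
    my ($mode, $path, $line) = @_;
    open(my $out, $mode, $path) or die "Couldn't open $path: $!\n";
    print {$out} "$line\n";
    close($out) or die "Couldn't close $path: $!\n";
}

# Hypothetical helper: count the lines currently in $path.
sub count_lines {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Couldn't read $path: $!\n";
    my $count = () = <$in>;
    close($in);
    return $count;
}

# Scratch file, removed automatically when the program exits.
my (undef, $path) = tempfile(SUFFIX => '.csv', UNLINK => 1);

write_line('>>', $path, '"first run"');
write_line('>>', $path, '"second run"');
print "after two appends: ", count_lines($path), " line(s)\n";

write_line('>', $path, '"fresh run"');
print "after overwrite: ", count_lines($path), " line(s)\n";
```

The appends accumulate, while the single overwrite leaves only the latest line behind, which is exactly the trade-off between keeping a running history and starting fresh on every run.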