Summarizing Results by Domain

Getting an overview of the sorts of domains (educational, commercial, foreign, and so forth) found in the results of a Google query.

You want to know about a topic, so you do a search. But what do you have? A list of pages. You can’t get a good idea of the types of pages these are without taking a close look at the list of sites.

This hack is an attempt to get a “snapshot” of the types of sites that result from a query. It does this by taking a “suffix census,” a count of the different domain suffixes that appear in search results.

The hack is best suited to link: queries, where it provides a good idea of what kinds of domains (commercial, educational, military, foreign, etc.) are linking to a particular page.

You could also run it to see where technical terms, slang terms, and unusual words turn up. Which kinds of sites mention a particular singer more often? Or a political figure? Does the word “democrat” come up more often on .com or .edu sites?

Of course this snapshot doesn’t provide a complete inventory; but as overviews go, it’s rather interesting.

The Code

#!/usr/local/bin/perl
# suffixcensus.cgi
# Generates a snapshot of the kinds of sites responding to a
# query.  The suffix is the .com, .net, or .uk part.
# suffixcensus.cgi is called as a CGI with form input

# Your Google API developer's key
my $google_key='insert key here';

# Location of the GoogleSearch WSDL file
my $google_wdsl = "./GoogleSearch.wsdl";

# Number of times to loop, retrieving 10 results at a time
my $loops = 10;

use SOAP::Lite;
use CGI qw/:standard *table/;

print
  header(  ),
  start_html("SuffixCensus"),
  h1("SuffixCensus"),
  start_form(-method=>'GET'),
  'Query: ', textfield(-name=>'query'),
  '   ',
  submit(-name=>'submit', -value=>'Search'),
  end_form(  ), p(  );

if (param('query')) {
  my $google_search  = SOAP::Lite->service("file:$google_wdsl");
  my %suffixes;

  for (my $offset = 0; $offset < $loops*10; $offset += 10) {

    my $results = $google_search ->
      doGoogleSearch(
        $google_key, param('query'), $offset, 10, "false", "",  "false",
        "", "latin1", "latin1"
      );
      
    last unless @{$results->{resultElements}};

    map { $suffixes{ ($_->{URL} =~ m#://.+?\.(\w{2,4})/#)[0] }++ }
      @{$results->{resultElements}};
  }

  print
    h2('Results: '), p(  ),
    start_table({cellpadding => 5, cellspacing => 0, border => 1}),
    map( { Tr(td(uc $_),td($suffixes{$_})) } sort keys %suffixes ),
    end_table(  );
}

print end_html(  );
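
The heart of the script is the suffix match and the hash that tallies it, and both can be tried without a Google API key. What follows is a minimal sketch rather than part of the hack itself: the file name, the helper names suffix_of and census, and the URLs are all made up for illustration.

```perl
#!/usr/local/bin/perl
# suffix_demo.pl (hypothetical name): the suffix-census logic on its own,
# run against a hard-coded list of URLs instead of live Google results.
use strict;
use warnings;

# Return the label just before the first slash that follows a dot
# (com, edu, uk ...), or undef if the URL does not match.
sub suffix_of {
  my ($url) = @_;
  return ($url =~ m#://.+?\.(\w{2,4})/#)[0];
}

# Count suffixes across a list of URLs; returns a hash reference.
sub census {
  my %suffixes;
  for my $url (@_) {
    my $suffix = suffix_of($url);
    $suffixes{$suffix}++ if defined $suffix;
  }
  return \%suffixes;
}

# Made-up URLs, for illustration only.
my @urls = (
  'http://www.example.com/soda.html',
  'http://www.example.edu/~jdoe/pop/',
  'http://news.example.co.uk/drinks/',
  'http://example.com/',
);

my $count = census(@urls);
print "$_\t$count->{$_}\n" for sort keys %$count;
```

Run against these four URLs, the sketch should report com twice and edu and uk once each. Unlike the hack, it skips URLs that fail to match rather than counting them under an empty suffix; that is a choice made for this sketch, not something the original script does.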

Running the Hack

This hack runs as a CGI script. Install it in your cgi-bin or appropriate directory, and point your browser at it.

The Results

Searching for the prevalence of “soda pop” by suffix finds, as one might expect, the most mentions on .com sites, as Figure 6-19 shows.

Figure 6-19. Prevalence of “soda pop” by suffix

Hacking the Hack

There are a couple of ways to hack this hack.

Going back for more

This script, by default, visits Google 10 times, grabbing the top 100 (or fewer, if there aren’t as many) results. To increase or decrease the number of visits, simply change the value of the $loops variable at the top of the script. Bear in mind, however, that while setting $loops = 50 might net you 500 results, it also eats quickly into your daily allotment of 1,000 Google API queries.
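
To make that trade-off concrete, here is a rough sketch of the arithmetic, assuming each pass through the loop costs exactly one API query and returns at most 10 results. The helper name census_budget is hypothetical and is not part of the hack.

```perl
#!/usr/local/bin/perl
# Rough quota arithmetic for a given $loops setting.
use strict;
use warnings;

my $daily_quota = 1000;  # daily allotment of Google API queries
my $per_query   = 10;    # results returned per API query

# Returns (maximum results gathered per run,
#          full runs that fit in one day's quota).
sub census_budget {
  my ($loops) = @_;
  return ($loops * $per_query, int($daily_quota / $loops));
}

for my $loops (1, 10, 50) {
  my ($results, $runs) = census_budget($loops);
  print "loops=$loops: up to $results results per run, $runs runs per day\n";
}
```

With $loops = 50, the sketch works out to at most 500 results per run and only 20 full runs before the day's 1,000 queries are spent.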

Comma-separated

It’s rather simple to adjust this script to run from the command line and return comma-separated output suitable for Excel or your average database. Remove the starting HTML, form, and ending HTML output, and alter the code that prints out the results. In the end, you come to something like this:

#!/usr/local/bin/perl
# suffixcensus_csv.pl
# Generates a snapshot of the kinds of sites responding to a
# query.  The suffix is the .com, .net, or .uk part.
# usage: perl suffixcensus_csv.pl query="your query" > results.csv

# Your Google API developer's key
my $google_key='insert key';

# Location of the GoogleSearch WSDL file
my $google_wdsl = "./GoogleSearch.wsdl";

# Number of times to loop, retrieving 10 results at a time
my $loops = 1;

use SOAP::Lite;
use CGI qw/:standard/;

param('query')
  or die qq{usage: suffixcensus_csv.pl query="{query}" [> results.csv]\n};

print qq{"suffix","count"\n};
 
my $google_search  = SOAP::Lite->service("file:$google_wdsl");

my %suffixes;

for (my $offset = 0; $offset < $loops*10; $offset += 10) {

  my $results = $google_search ->
    doGoogleSearch(
      $google_key, param('query'), $offset, 10, "false", "",  "false",
      "", "latin1", "latin1"
    );

  last unless @{$results->{resultElements}};

  map { $suffixes{ ($_->{URL} =~ m#://.+?\.(\w{2,4})/#)[0] }++ }
    @{$results->{resultElements}};
}

print map { qq{"$_", "$suffixes{$_}"\n} } sort keys %suffixes;

Invoke the script from the command line like so:

$ perl suffixcensus_csv.pl query="query" > results.csv

Searching for mentions of “colddrink,” the South African version of “soda pop,” sending the output straight to the screen rather than a results.csv file, looks like this:

$ perl suffixcensus_csv.pl query="colddrink" 
"suffix","count"
"com", "12"
"info", "1"
"net", "1"
"za", "6"
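
The script lists suffixes alphabetically. If you would rather see the biggest suffixes first, one possible variation is sketched below. It is an illustration rather than part of the hack: the helper name csv_by_count is hypothetical, the counts are copied from the “colddrink” run above, and the count is written as a bare number with no space after the comma, which is a deliberate departure from the script's own output.

```perl
#!/usr/local/bin/perl
# Sketch: print a suffix census as CSV, largest count first.
use strict;
use warnings;

# Counts copied from the "colddrink" run shown above.
my %suffixes = (com => 12, info => 1, net => 1, za => 6);

# Build CSV lines ordered by count (largest first), then by suffix name.
sub csv_by_count {
  my ($counts) = @_;
  my @order = sort { $counts->{$b} <=> $counts->{$a} or $a cmp $b }
              keys %$counts;
  my @lines;
  push @lines, qq{"$_",$counts->{$_}\n} for @order;
  return @lines;
}

print qq{"suffix","count"\n}, csv_by_count(\%suffixes);
```

For the counts above, the rows come out in the order com, za, info, net.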