Chapter 25. Correcting Typos with Perl

Dave Cross

Symbol::Approx::Sub is a Perl module that allows us to call subroutines even if we spell their names wrong. Using it can be as simple as adding this to your programs:

use Symbol::Approx::Sub;

Once we’ve done this, we never have to worry about spelling our subroutine names correctly again. For example, this program prints This is the foo subroutine!, even though &foo was misspelled as &few.

use Symbol::Approx::Sub;

sub foo {
    print "This is the foo subroutine!\n";
}

&few;

Why Was It Written?

This is obviously a very dangerous thing to want, so what made me decide to write Symbol::Approx::Sub?

In July 2000 I attended the O’Reilly Perl Conference and took Mark Jason Dominus’s “Tricks of the Wizards” tutorial. He explained a number of concepts that can take your Perl programs to a new level of complexity and elegance. The most important of these concepts are typeglobs and the AUTOLOAD function. It was the first time that I’d really tried to understand either of these concepts and, thanks to Dominus’s clear explanations, I began to understand their power.

One example that Dominus used in this class was a demonstration of how we can use AUTOLOAD to catch misspelled subroutine names and perhaps do something about it. He showed a slide containing code like this:

sub AUTOLOAD {
    my ($sub) = $AUTOLOAD =~ /.*::(.*)/;
    # Work out what sub the user really meant
    $sub = get_real_name_of_sub($sub);

    goto &$sub;
}

On the following slide, he went into some detail about what a really bad idea this would be and how it would make your code completely unmaintainable. But it was too late; I was already thinking about how I could write a “get the real name of the subroutine” function and put it into a module that could be used in any Perl program.

How Does It Work?

During the twelve-hour flight home from California to England I thrashed out the implementation details. Here are the four required stages:

  • When the module is loaded, it needs to install an AUTOLOAD function in the package that called it.

  • When AUTOLOAD is called (as the result of invoking a non-existent subroutine) it needs to get a list of all the subroutines in our calling package.

  • The AUTOLOAD function needs to compare each of those subroutine names with what the user actually called, and choose the most likely candidate.

  • It then invokes the chosen subroutine.

The key to the first two stages was the other main topic of Dominus’s talk—typeglobs.

In every Perl package, there is something called a stash (“symbol table hash”) that contains the package’s variables and subroutines. The stash is like a normal hash, with the keys being the names of the variables, and the values being the typeglobs. A typeglob is a data structure containing references to all of the objects with the same name. We know that in a Perl program you can have $a, @a, %a, and &a, and they are all completely separate, but they all live in the same typeglob.

The first stage is achieved with a useful typeglob trick. We can assign values (which should be references) to the various slots of a typeglob. This has the effect of aliasing the typeglob’s name to the referenced value. For example, if we execute the following line of code:

*a = \@array_with_a_really_long_name;

@a will become an alias to @array_with_a_really_long_name and any changes we make to @a will actually happen to the other array.
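To see the aliasing in action, here is a minimal sketch. It assumes both arrays are package variables (declared with our), since typeglobs do not hold lexical variables.

```perl
use strict;
use warnings;

# Package variables; typeglob aliasing does not apply to "my" variables.
our @array_with_a_really_long_name = (1, 2, 3);
our @a;

# Assigning an array reference fills only the ARRAY slot of *a.
*a = \@array_with_a_really_long_name;

push @a, 4;    # modifies the original array through the alias

print "@array_with_a_really_long_name\n";    # prints "1 2 3 4"
```

Because only the ARRAY slot is replaced, $a and %a are left alone.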

Furthermore, we can do this with any typeglob object, not just arrays. In particular, we can do it with subroutines, which is what I needed for Symbol::Approx::Sub. The two objects don’t even have to be in the same package, as we can see from the following code:

package other;

sub foo { print "This is &other::foo\n" }

*main::bar = \&foo;

package main;

&bar;

In this example we create a subroutine called foo in the package other. We then alias that subroutine to &main::bar. This means that within the main package, if we call bar we actually call &other::foo. (This is how the Exporter module works.)

When Symbol::Approx::Sub is loaded, we alias our caller’s AUTOLOAD function to the one in our module. We know what our AUTOLOAD needs to do, but how do we get a list of subroutines in the calling package?
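A minimal sketch of that installation step might look like the following. The package name My::Approx and the sub name catch_all are made up for illustration, and this is not the module's actual source. Here the installed routine only reports what was called; it makes no attempt to guess the intended subroutine.

```perl
use strict;
use warnings;

package My::Approx;    # hypothetical name, not the real module

our $AUTOLOAD;         # Perl stores the requested sub name here

sub import {
    my $caller = caller;    # the package that asked for our AUTOLOAD
    no strict 'refs';       # the glob name is built from a string
    *{"${caller}::AUTOLOAD"} = \&catch_all;
}

sub catch_all {
    my ($name) = $AUTOLOAD =~ /.*::(.*)/;    # strip the package prefix
    return "caught call to '$name'";
}

package main;

# With a real module file, "use My::Approx;" would call import for us.
My::Approx->import;

my $result = few();    # there is no sub few, so AUTOLOAD is invoked
print "$result\n";     # prints "caught call to 'few'"
```

Note that Perl sets $AUTOLOAD in the package where the routine was compiled (My::Approx here), not in the package it was aliased into.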

Let’s look at a simple typeglob example. The next piece of code declares three package variables and a subroutine. We then write a simple foreach loop to print out the keys of %main::, the stash of the main package. If we run this program we’ll see the names of our package objects a, b, c, and d. (We’ll also see the standard filehandles STDIN and STDOUT and other built-in Perl variables like @INC and %ENV.)

use vars qw($a @b %c);

sub d { print "Hello, world!\n" };

foreach (keys %main::) {
    print "$_\n";
}

Having listed the typeglobs, our next task is to work out which of them contain subroutines. For this, we can use the *FOO{THING} syntax. In the same way that scalar names always start with a $ and array names always start with a @, typeglob names always start with a *. *FOO therefore refers to the typeglob called FOO (which will contain $FOO, @FOO, %FOO, and &FOO). With the *FOO{THING} syntax, we can find out whether the typeglob FOO contains an object of type THING, where THING can be SCALAR, ARRAY, HASH, IO, FORMAT, CODE, or GLOB. The next piece of code uses this syntax to show which of the typeglobs in our current package contain a subroutine:

#!/usr/bin/perl -w

use strict;
use vars qw($a @b %c);

sub d { print "sub d" };

while (my ($name, $glob) = each %main::) {
    print "$name contains a sub\n" if defined *$glob{CODE};
}

We now know enough to create an AUTOLOAD function that generates a list of the subroutines that exist in the package.
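As a rough sketch, the listing step can be wrapped in a helper that works for any package. The helper name subs_in_package and the Demo package are invented for this example. It tests each name with defined &{...} instead of *FOO{CODE}, on the assumption that newer Perls do not always keep a full typeglob in every stash entry.

```perl
use strict;
use warnings;

# Return the sorted names of the subroutines defined in a package.
sub subs_in_package {
    my ($package) = @_;
    no strict 'refs';    # the stash and the subs are looked up by name
    my @names = grep { !/::\z/ && defined &{"${package}::$_"} }
                keys %{"${package}::"};    # keys ending in :: are nested stashes
    return sort @names;
}

package Demo;    # hypothetical package to inspect

our ($x, @y, %z);
sub alpha { 1 }
sub beta  { 2 }

package main;

print join(', ', subs_in_package('Demo')), "\n";    # prints "alpha, beta"
```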

Inside the AUTOLOAD function, the name of the subroutine that the program attempted to invoke will be available in the $AUTOLOAD variable. All we need to do is carry out some sort of fuzzy matching on the set of subroutine names and the misspelled subroutine name to find the best match.

Unfortunately, this isn’t as simple as it sounds. I didn’t want to write my own fuzzy matching algorithm, so I decided to borrow someone else’s. Perl comes with a Text::Soundex module that converts any word to a single letter and three digits that collectively correspond to the pronunciation of the string. This is what I initially used for my fuzzy matching.

The module computes the Soundex value for the misspelled subroutine, and then computes the Soundex values for each of the subroutines in the caller’s package. If none match, it mimics Perl’s standard “undefined subroutine called” error message. If one matches, it’s assumed to be the right subroutine. But what if there are multiple matches? This can happen, since Soundex compression can map two similar-sounding subroutine names to the same Soundex value. I thought about this for a while before deciding that the only option would be to pick one at random. I really couldn’t see any other reasonable approach.
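Putting the stages together, a stripped-down AUTOLOAD might look like the sketch below. It is not the module's code. To stay self-contained it does not use Text::Soundex; rough_key is a crude stand-in invented for this example, which keeps the first letter of a name and drops vowels plus h, w, and y from the rest.

```perl
use strict;
use warnings;

our $AUTOLOAD;

# Crude stand-in for Soundex (an assumption of this sketch): keep the
# first letter, then drop vowels and h, w, y from the rest of the name.
sub rough_key {
    my ($name) = @_;
    my ($first, $rest) = lc($name) =~ /\A(.)(.*)\z/s;
    $rest =~ tr/aeiouhwy//d;
    return $first . $rest;
}

sub AUTOLOAD {
    my ($wanted) = $AUTOLOAD =~ /.*::(.*)/;
    return if $wanted eq 'DESTROY';    # never guess at destructors

    # List the subroutines defined in this package.
    my @subs = do {
        no strict 'refs';
        grep { !/::\z/ && defined &{"main::$_"} } keys %main::;
    };

    # Keep the names whose key equals the misspelled name's key.
    my $key     = rough_key($wanted);
    my @matches = grep { $_ ne 'AUTOLOAD' && rough_key($_) eq $key } @subs;

    # No match: mimic Perl's usual error message.
    die "Undefined subroutine &$AUTOLOAD called\n" unless @matches;

    # Several matches: pick one at random, then jump to it with @_ intact.
    my $chosen = $matches[ rand @matches ];
    no strict 'refs';
    goto &{"main::$chosen"};
}

sub foo { return "This is the foo subroutine!" }

my $message = few();    # misspelled; both names reduce to the key "f"
print "$message\n";     # prints "This is the foo subroutine!"
```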

The Sub::Approx Module

That’s pretty much how the original version of the module worked. I called it Sub::Approx and released it to CPAN. People started to talk to me about the module, and one of the most common things they said was, “Really interesting idea, but you should do the fuzzy matching using Some::Other::Module.”

So Version 0.05 of Sub::Approx included what I called “fuzzy configurability” (or “configurable fuzziness”) and with the help of Leon Brocard, we made the process of matching a subroutine more modular. We introduced the concept of a matcher, which is a subroutine called with two things: the name of a subroutine that we’re trying to match, and the list of subroutines in the package. The matcher returns an array of the subroutine names that match the required name. We supplied a matcher for each of Text::Soundex, Text::Metaphone, and String::Approx. You can therefore now use Sub::Approx like this:

use Sub::Approx (matcher => 'text_metaphone');

This tells the module to do its matching with Text::Metaphone instead of Text::Soundex.

To make it even more flexible, we allowed programmers to define their own matching subroutines; the subroutines are passed by reference into Sub::Approx. Here, we provide our own subroutine, named reverse:

use Sub::Approx (matcher => \&reverse);

sub reverse {
    my $sub = reverse shift;
    return grep { $_ eq $sub } @_;
}
sub abc {
    print "In sub abc!\n";
}

&cba;

If our subroutine doesn’t exist, this matcher searches for a subroutine whose name is the reverse of the subroutine we tried to call.

One last feature was the ability to define a chooser function, which decides what to do if more than one subroutine matches. This function, when passed a list of matching subroutine names, should return the name of the one it chooses. The default chooser still picks one at random, but you can define your own like this:

use Sub::Approx (chooser => \&first);

sub first {
    return shift;
}

This example isn’t very bright—it’ll always choose the first item in the list of matching subroutines.
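To make the two interfaces concrete without needing the module installed, here is a small sketch in which a matcher and a chooser are called as plain subroutines. The resolve driver and the sub names are hypothetical. They show only the calling conventions described above, not how Sub::Approx wires things together internally.

```perl
use strict;
use warnings;

# Matcher: called with the wanted name followed by the candidate names;
# returns every candidate equal to the wanted name spelled backward.
sub reverse_matcher {
    my $wanted = scalar reverse shift;
    return grep { $_ eq $wanted } @_;
}

# Chooser: called with the matching names; returns the one to invoke.
sub first_chooser {
    return shift;
}

# Hypothetical driver: run the matcher, then let the chooser decide.
sub resolve {
    my ($wanted, $matcher, $chooser, @candidates) = @_;
    my @matches = $matcher->($wanted, @candidates);
    return unless @matches;
    return $chooser->(@matches);
}

my $name = resolve('cba', \&reverse_matcher, \&first_chooser, qw(abc xyz cab));
print "$name\n";    # prints "abc"
```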

The Symbol::Approx::Sub Module

This was how things remained until I gave a lightning talk on Sub::Approx at YAPC::Europe 2000. Afterward, a number of discussions took place that changed the shape of Sub::Approx, resulting in four changes:

  • Perl RFC 324 was drafted, which suggested that in Perl 6, the AUTOLOAD function should be renamed to AUTOGLOB and invoked when any typeglob object that doesn’t exist is called. This would allow us to create Scalar::Approx, Array::Approx, and so on.

  • A mailing list was set up to discuss Sub::Approx and related matters. You can subscribe to the list at http://www.astray.com/mailman/listinfo/subapprox/.

  • The typeglob walking code from Sub::Approx was abstracted out into a new module called GlobWalker so that it could be reused in Scalar::Approx and friends. (Later, I discovered that the Devel::Symdump module on CPAN did much the same thing and switched to that.)

  • We realized that to produce Scalar::Approx and friends, we would be polluting a number of module namespaces. After some discussion on the modules and subapprox mailing lists, we decided on the name Symbol::Approx::Sub.

Symbol::Approx::Sub Version 1.60 is currently on CPAN.

Robin Houston has started work on a Symbol::Approx::Scalar module. Variables are trickier than subroutines for two reasons. First, there is currently no AUTOLOAD facility for variables the way there is for subroutines; Robin gets around this by tying the scalar variables. Second, most variables (at least in good programs) are lexical variables, rather than package variables, and therefore don’t live in typeglobs. Robin (who knows more about Perl internals than I do) has written a PadWalker module that does the same for lexical variables as GlobWalker (or Devel::Symdump) does for typeglobs.

Future Plans

On the mailing list, we are already planning Symbol::Approx::Sub Version 2.0. Planned features include:

  • Separating the matcher component out into two separate stages: canonization and matching. Canonization takes a subroutine name and returns some kind of canonical version, which might include removing underscores or converting all characters to lower case. This suggests having chained canonizers, each of which carries out one transformation in sequence.

  • Developing a plugin architecture for canonizers, matchers, and choosers. This would make it easy for other people to produce their own modules that work with Symbol::Approx::Sub.

  • Trying to accommodate calling packages that already define an AUTOLOAD function.
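The chained-canonizer idea can be sketched in a few lines. Everything here (the canonizer names, canonize, and match_canonical) is hypothetical and only illustrates the plan; it is not the Version 2.0 interface.

```perl
use strict;
use warnings;

# Two hypothetical canonizers; each takes a name and returns a new one.
sub strip_underscores { my ($name) = @_; $name =~ tr/_//d; return $name }
sub lowercase         { my ($name) = @_; return lc $name }

# Apply each canonizer in turn to the result of the previous one.
sub canonize {
    my ($name, @canonizers) = @_;
    $name = $_->($name) for @canonizers;
    return $name;
}

# Matching then compares canonical forms rather than raw names.
sub match_canonical {
    my ($wanted, $canonizers, @candidates) = @_;
    my $key = canonize($wanted, @$canonizers);
    return grep { canonize($_, @$canonizers) eq $key } @candidates;
}

my @chain   = (\&strip_underscores, \&lowercase);
my @matches = match_canonical('getname', \@chain, qw(get_name GetName set_name));
print "@matches\n";    # prints "get_name GetName"
```

Splitting the work this way means a matcher never has to know how many transformations were applied before it saw the names.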

Even with all of this development, I have yet to find a real use for the module. As far as I can see, it’s simply a very good demonstration of just how easy it is to do things in Perl that would be impossible in other languages. If you think you have an interesting use for Symbol::Approx::Sub, please let the mailing list know.

Afterword

Development of Symbol::Approx::Sub continues. Version 2.00 of the module was released during the Open Source Convention in July 2001. This version implements the plug-in architecture discussed in the article. When Google released the API to their search engine in April 2002, Tatsuhiko Miyagawa combined it with the Symbol::Approx::Sub plug-in architecture to create Symbol::Approx::Sub::Google, which uses Google’s spellcheck feature to do the fuzzy matching.

In the summer of 2001 I gave a talk called “Perl for the People” at both the Open Source Convention and YAPC::Europe. In it I looked at some of the more extreme things that will be possible with Symbol::Approx::Sub. The slides for this talk are online at http://www.mag-sol.com/talks/ppl/.

And we’re eagerly awaiting Larry Wall’s Apocalypse 10, which will tell us whether or not RFC 324 has been accepted for implementation in Perl 6.
