In my first semester of Spanish class in high school, I went to look up an unfamiliar word, “chaleco,” in a Spanish-English dictionary. I looked under “C”, and found that the dictionary went right from “cetro” to “cía”. Someone had expurgated all the “ch” words! I had a brief nightmarish vision of a world without chorizo, chimichangas, chicharrones, or churros. After some frantic page-turning, I discovered that the “ch” words were in a separate section, “Ch”, between “C” and “D”. I asked the teacher about this, and he explained that it was normal practice for Spanish alphabetical order to consider “Ch” a letter after “C”. But it seemed ludicrous to me—two letters that counted as one. “How pointlessly complicated!” I thought. “Why not just keep it simple, A to Z, like normal? Like English.”
I later learned that every language has its own particular idea of what “alphabetical order” means; the fact that English’s conception of it seems so “normal” is partly because English doesn’t use any accents and partly because of accidents of history.
But many other languages use accented characters that have to be sorted with the 26
letters of the “normal” A-through-Z alphabet. And with other languages,
some combinations of characters, like the “ch” in Spanish, count as
letters on their own. But in almost every case, if you want to sort
according to the conventions of a particular language, the default
behavior of Perl’s sort
won’t sort
that way. This article is about how to get Perl to sort according to the
conventions of whatever language you have in mind—even if it’s
English!
Let’s say you want to sort a list of words (or phrases) in what
you think of as normal English alphabetical order. So you try using
normal sort
:
@stuff = ("canine", "cantaloupe", "cant", # As in an underworld jargon "Canberra", "can't", "Cantonese", "cannery", "Cannery Row", "canonicity", "Cañon de Chelly" # In north-eastern Arizona, also # spelled "Canyon de Chelly" and # "Cañon de Chelle" ); @sorted = sort @stuff; # The sorting happens here print map "[$_] ", @sorted;
That prints:
[Canberra] [Cannery Row] [Cantonese] [Cañon de Chelly] [can't] [canine] [cannery] [canonicity] [cant] [cantaloupe]
Whoa. All the capitals are sorting first. That’s because sort
’s default behavior (what you get
without a “sort criterion” or without use
locale
, both of which we’ll discuss later) is ASCIIbetical sorting—where the sorting is based on ASCII order. Since “C” comes before
“c” in ASCII, all the “C” items in @stuff
(like “Cantonese”) get sorted before
all the “c” items (like “cantaloupe”).
So you happen to remember an idiom for case-insensitive sorting, and you change the line that
sets @sorted
to:
@sorted = sort { lc($a) cmp lc($b) } @stuff;
and then you rerun the code. It prints:
[can't] [Canberra] [canine] [cannery] [Cannery Row] [canonicity] [cant] [cantaloupe] [Cantonese] [Cañon de Chelly]
Closer. Here’s what we want:
[Canberra] [canine] [cannery] [Cannery Row] [Cañon de Chelly] [canonicity] [cant] [can't] [cantaloupe] [Cantonese]
The phrases “can’t” and “Cañon de Chelly” are out of place.
“can’t” is out of place because { lc($a) cmp
lc($b) }
treats “can’t” as a five character string that
sorts before anything else in @stuff
. Consider this code:
print ( "can't" cmp "canal" );
That prints -1
, meaning that
“can’t” comes before “canal”. This is because cmp
is doing simple ASCIIbetical comparison, and when it compares “can’t” to
“canal” it gets as far as comparing “can’” to “cana”. At that point it
sees that the apostrophe character comes before a, because the
apostrophe is ASCII 39, and “a” is ASCII 97.
Now, this is also why “Cañon de Chelly” is coming last: because “ñ” is a character after “n”. For sake of argument, I’ll assume that you are, like me, using Latin-1 as opposed to UTF8, so I can say that “ñ” is a single byte: byte 241, in particular. (If you’re using MacPerl and therefore probably using MacASCII, it’s a different code, but it’s still one byte with a value over 127, so my point stands. If you don’t know what encoding you’re using, you’re probably using Latin-1.)
So what you want is to sort this list according to your idea of English alphabetical order—ignoring apostrophes, treating “ñ” and “n” as the same letter, and of course ignoring case.
What you need is a subroutine we can call like this:
@sorted = sort { normalize($a) cmp normalize($b) } @stuff;
where your normalize
subroutine lc
’s things, turns “ñ”s
to “n”s, and removes apostrophes (in no particular order). That
function could consist of:
sub normalize { my $in = $_[0]; $in =~ tr/Ññ/Nn/; $in =~ tr/'//d; # d for delete return lc($in); }
Paste that into our original code, run it, and it’ll display this (lined up vertically just for better perusal):
[Canberra] [canine] [cannery] [Cannery Row] [Cañon de Chelly] [canonicity] [can't] [cant] [cantaloupe] [Cantonese]
And that’s basically right. Now, the only peculiarity there is
“cant” versus “can’t”. It so happens that when you feed both into your
normalize
subroutine, you get
“cant”. So when your sort criterion compares them using { normalize($a) cmp normalize($b) }
, it’s
performing “cant” cmp “ant”
, which
returns 0, meaning that these two sort identically. But since your use
of sort
produces a list where
“can’t” either comes before or after “cant”, having your sort
criterion return a 0
means that you
don’t care which of the two items comes first in
the output, which effectively means that you can’t predict which will
end up first. Personally, I don’t want my sort criterion to ever be
unpredictable, so I add something that kicks in to avoid returning 0
when comparing different strings:
@sorted = sort { normalize($a) cmp normalize($b) or $a cmp $b } @stuff;
In other words, when normalize($a) cmp
normalize($b)
evaluates to 0, the routine falls through to
returning the value of $a cmp $b
.
That makes this a completely predictable sort criterion, since
$a cmp $b
never returns 0 for
different strings.
However, if you wanted something smarter than just $a cmp $b
, you could use some second
subroutine, normalize2
, that could
be a bit more fine-grained than normalize
. Maybe it would implement the idea
that, in case of a tie in normalize
, words with apostrophes (like
“can’t”) should always come after words without them (like “cant”), or
that “Chile” (the country) should always be after “chile” (the hot
sauce), and so on. You’d call that second part of your sort criterion as:
@sorted = sort { normalize($a) cmp normalize($b) or normalize2($a) cmp normalize2($b) } @stuff;
To be really thorough, you could add in a cmp
at the end:
@sorted = sort { normalize($a) cmp normalize($b) or normalize2($a) cmp normalize2($b) or $a cmp $b } @stuff;
Incidentally, falling back on a second comparison as a sort of “tie-breaker” in a sort criterion is basically what people mean when they refer to “bi-level sorting.” We’ll return to this idea later.
The idea of being able to sort things according to the
conventions of other languages is not a new one. The
perllocale documentation bundled with Perl
describes how to take advantage of locales built into many OSes.
Ideally, you’d set the locale to the language that sorts the way you
want to sort, and then your calls to sort
or cmp
do the right thing. So if I set my
locale to fr_CA.ISO8859-1
(meaning
“French Canadian, using Latin-1”), “étude” will sort (correctly) with
the “e”’s, instead of after the “z”’s, which is how it’d be sorted
ASCIIbetically.
But locales might not be available on all computers. As perllocale points out: “The available locales, the location in which they are kept, and the manner in which they are installed, vary from system to system. Some systems provide only a few, hard-wired locales, and do not allow more to be added; others allow you to add ‘canned’ locales provided by the system supplier; still others allow you or the system administrator to define and add arbitrary locales.”
In other words, if you want to sort a list of French words according to French sorting conventions, even if you can get a French locale to work on one system, and even if that locale’s idea of French sort order is the same as your idea of French sort order, there’s still no guarantee that your locale-based sorting will be able to use the same locale on someone else’s system.
Because of these basic problems with locales, I consider locale-based sorting (even where available) to be fine for one-shot programs, but these portability problems make it unacceptable for use in code that I’d actually want to distribute.
So, in short, much of the code in this article basically duplicates the functionality of some of the sorting you might be able to get from locales, but in a more portable and flexible way.
Earlier, we saw how to treat “ñ” as just an alternate form for “n”, which is appropriate
for English. But suppose you actually wanted to sort a list of Spanish words according to Spanish sorting conventions. In that case, you want to treat “ñ”
not as an alternate form for “n”, but instead as a letter between “n”
and “o”. In that case, you’d develop a sort criterion, as above, based
on a normalize
subroutine, wherein
you’d have to move the letters of the alphabet around like so:
tr<abcdefghijklmnñopqrstuvwxyz> # Map from this ... <abcdefghijklmnopqrstuvwxyz[>; # ...to this
In other words, you want to keep “a” thru “n” as they are, and then have “ñ”—and this means bumping “o” thru “z” down by one character code to make way for the “ñ”. That “[” is there just because it’s the character after “z” in ASCII.
That’s one way to do it. However, I find it a bit confusing, since that way makes the Spanish alphabet look like a strange decoder-ring substitution cypher. What I prefer is this:
tr<abcdefghijklmnñopqrstuvwxyz> <x01-x1B>; # 1B is hex for decimal 27, for the 27 letters
If I add or remove characters from the alphabet on that first
line, all I have to remember to do is change the 1B
there to reflect however many characters
are in the alphabet of characters that I’m starting with. (Since I’m
just going to end up feeding the output of this normalize
subroutine to cmp
, it doesn’t matter whether I’m mapping
the alphabet to the range a-[
or to
the range x01-x1B
.)
Here’s how you’d work this into your normalize
subroutine:
sub normalize { my $in = $_[0]; $in = lc($in); $in =~ tr/Ñ/ñ/; # lc probably didn't catch this $in =~ tr<abcdefghijklmnñopqrstuvwxyz><x01-x1B>; # 1B = 27 return $in; }
Then you can test it with this:
@stuff = ("cana", "Cantata", "caña", "cantina", "canoso", "cañonero", "capa"); @sorted = sort { normalize($a) cmp normalize($b) or $a cmp $b } @stuff; print map "[$_] ", @sorted;
When run, this program returns:
[cana] [canoso] [Cantata] [cantina] [caña] [cañonero] [capa]
which is right! But change @stuff
to these:
@stuff = ("cana", "caña", "cánula", "cantina", "cantó", "canto", "cantor");
and you (re)discover a problem:
[cana] [cantina] [canto] [cantor] [cantó] [caña] [cánula]
And that’s quite wrong. Spanish, you see, uses acute accents (like over the “o” in “cantó”)—but “ó”
isn’t considered a separate letter from “o”. This is the same problem
you faced in the English data set from the start of the article,
except that here it’s not “n” and “ñ” we want to treat as alternates, but “o” and “ó”—and,
while we’re at it, “á/a”, “é/e”, “í/i”, “ú/u”, and the somewhat rare
“ü”. So you change the normalize
subroutine:
sub normalize { my $in = $_[0]; $in = lc($in); $in =~ tr/Ñ/ñ/; # lc probably didn't catch this # lc probably failed to turn É to é, etc. $in =~ tr<áéíóúüÁÉÍÓÚÜ> <aeiouuaeiouu>; $in =~ tr<abcdefghijklmnñopqrstuvwxyz><x01-x1B>; # 1B = 27 return $in; }
Run this code, and you get:
[cana] [cantina] [canto] [cantó] [cantor] [cánula] [caña]
Which (ta-daa!) is The Right Thing.
We now have a sort criterion and an associated subroutine
(normalize
) that together implement
Spanish sorting conventions as far as treatment of ñ, Ñ, and the
accented vowels. But recall from the start of this
article that Spanish has a letter “ch” between “c” and “d”.
So far we’ve been massaging all the data using
character-to-character substitution (using the tr
operator), so that cmp
’s ASCIIbetical character-by-character comparison would do
what we want. However, that all assumes that sorting is about single characters.
But since “ch” consists of two ASCII characters, it won’t fit well
into our plan of using normal cmp
.
And “ch” is not alone: Spanish has one other two-character letter, the
double-ell “ll”, as in “llama”, “quesadilla”, and so on. Now, you
could break down and write a subroutine that basically does the same
work as Perl’s built-in cmp
but
considers character-clusters like “ch” that you want to treat as
single elements. However, that would be very
inefficient compared to the speed of Perl’s builtin cmp
. A more efficient way of doing it
consists of simply turning the clusters into single characters, so
that cmp
can be made to work right
on them. So if you simply turn all occurrences of “ch” to, say, “¢”
(which is presumably not to be found in any of the items we’re
sorting), you can pretend that “chimichanga” is really
“¢imi¢anga” and then you can treat ”¢“ as just another strange letter,
like “ñ” is. Similarly, you could turn “ll” to “£”, say. This
would look like:
sub normalize { my $in = $_[0]; $in = lc($in); $in =~ s/ch/¢/g; # chimichanga => ¢imi¢anga $in =~ s/ll/£/g; # llama => £ama $in =~ tr/Ñ/ñ/; $in =~ tr<áéíóúüÁÉÍÓÚÜ> <aeiouuaeiouu>; # 1D = 29, for the 29 letters we now have $in =~ tr<abc¢defghijkl£mnñopqrstuvwxyz><x01-x1D>; return $in; }
And then you can test it with:
@stuff = ("enchufe", "Enciclopedia de México", "endibia", "enchilada", "encogido", "encanto"); @sorted = sort { normalize($a) cmp normalize($b) or $a cmp $b } @stuff; print map "[$_] ", @sorted;
The output, which is correct is:
[encanto] [Enciclopedia de México] [encogido] [enchilada] [enchufe] [endibia]
Your normalize
subroutine now
correctly implements Spanish-style sorting.
There is a problem with our approach so far—and it might not even be a real problem for you, depending on why you’re sorting your data.
Earlier, I talked about what happens when a sorting subroutine
returns the same value for a pair of items, like “cant” and “can’t”
for an English normalize
, or
“canto” and “cantó” for a Spanish normalize
, or “Chile” and “chile” with
either. So far we’ve been sort of cheating, with criteria like
these:
@sorted = sort { normalize($a) cmp normalize($b) or $a cmp $b } @stuff;
This worked because the last expression, $a cmp $b
, just happens to have correctly
resolved ties that arise with normalize($a)
cmp normalize($b)
. That was just dumb luck. And if, like
many dictionaries, you want “Chile” to come after “chile,” then plain
old cmp
as a tiebreaker does the
wrong thing. So we need bi-level sorting with a normalize2
function as a better
tiebreaker:
@sorted = sort { normalize($a) cmp normalize($b) or normalize2($a) cmp normalize2($b) or $a cmp $b } @stuff;
Let’s implement a normalize2
subroutine that correctly breaks normalize
ties. Let’s continue with Spanish,
and let’s suppose that given a tie between variants of the letter “e”,
the order they should come out in is:
e E é É
Now, you could use the same sort of code as in normalize
, this time implementing an
alphabet consisting of:
a A á Á b B c C ch Ch CH d D e E é É ...
However, consider that normalize2
is just a tie-breaker—it doesn’t
need to distinguish “a” from “b”. It would never be called in a case
where an “a” in one position would need to be compared to a “b” in
another, since that would not have resulted in a
tie between normalized strings. In other words, all normalize2
needs to do is distinguish
letters that normalize
obliterated
the difference between—letters in the same “family”. In other words
(grouping these letters into families), you need only map from
these:
a A á Á b B c C ch Ch CH d D e E é É ...
onto these:
1 2 3 4 1 2 1 2 1 2 3 1 2 1 2 3 4 ...
And you can implement that this way:
sub normalize2 { my $in = $_[0]; # Digraph things... $in =~ s/ch/¢/g; # chimichanga => ¢imi¢anga $in =~ s/Ch/*/g; # Chimichanga => *imi¢anga $in =~ s/CH/*/g; # CHIMICHANGA => *IMI*ANGA $in =~ s/ll/£/g; # llama => £ama $in =~ s/Ll/§/g; # Llama => §ama $in =~ s/LL/¶/g; # LLAMA => ¶AMA # Now the big whammy... $in =~tr<aAbBcC¢**dDeEéÉfFgGhHiIíÍjJkKlL£§¶mMnNoOóÓpPqQrRsStTuUúÚüÜvVwWxXyYzZ> <12121212312123412121212341212121231212123412121212121234561212121212>; return $in; }
To get a better feeling for the output of this function, consider:
normalize2("chile") is "1111" normalize2("Chile") is "2111" normalize2("CHILES RELLENOS!") is "32222 2232222!" normalize2("cantó") is "11113" normalize2("Canto") is "21111" normalize2("CANTÓ") is "22224"
Consider what happens when sorting “chile” and “Chile”; the sort criterion considers the expression:
normalize("chile") cmp normalize("Chile") or normalize2("chile") cmp normalize2("Chile") or "chile" cmp "Chile"
This simplifies to:
"chile" cmp "chile" # First subexpression or "1111" cmp "2111" # Second subexpression or "chile" cmp "Chile" # Last subexpression
The first cmp
subexpression
evaluates to 0, falling through to the expression consisting of the
two values from normalize2
. Between
them, “1111” (from “chile”) comes first ASCIIbetically, so “1111” cmp “2111”
returns -1, to signal that
“chile” should come before “Chile”. (Perl never gets as far as
evaluating the last subexpression, “chile”
cmp “Chile”
.)
Now, this whole business of bi-level sorting may all seem very abstract and, well, foreign, if the only thing you’ve ever sorted is English. But consider if you’re sorting this list of English words:
rot résumé resume rabble return
and you want it to sort correctly:
rabble resume résumé return rot
In other words, you want “resume” to always sort before “résumé”. If you use a one-level sort like this:
@sorted = sort { normalize($a) cmp normalize($b) } @stuff;
then you have a choice. Either treat “e” and “é” as the same
letter (as with “ñ” and “n” in our Canberra/canine/cannery example),
or treat “é” as a letter after “e”. If you treat “e” and “é” as the
same letter, then the ordering of “resume” and “résumé” would be
unpredictable, since your normalize
will return the same value for both.
But if you treat “é” as a letter after “e” (and that seems to be many people’s first guess at a solution, I’ve found), that means that “é” will be a letter between “e” and “f”, and all the “ré-” words will come after all the “re-” words—so that your list will sort as:
rabble resume return résumé rot
That’s wrong. So if you want this to sort right, you need at least two levels in your sorting. And since I’ve yet to see a case where more than two levels of sorting were needed, that pretty much leaves you with bi-level sorting.
Like it or not, the only way to get really correct sorting in English is to use bi-level sorting. And this is not just a problem with English having foreign words like “résumé”—the same problem arises with wanting to sort “Bath” and “bath”, say.
As Perl evaluates a sort criterion while sorting a list, it will ask that criterion to compare several of the items against each other. To see it at work, you can run:
@stuff = sort { print "$a & $b ; "; $a cmp $b } qw(A B C D E F);
and you’ll see something along the lines of:
A & D ; B & D ; C & D ; D & F ; D & E ; E & F ; A & B ; B & C ;
If your criterion, like most of the ones in this article, will
have to call normalize
(and maybe
normalize2
) for whatever items
they’re asked to compare, then you can see that you’re going to be
calling normalize(“D”)
several
times. There’s no point in re-computing it, since
normalize(“D”)
always gives the
same answer, so all the calls after the first are just wasted effort.
To make your sort criterion more efficient, you can cache the results
of the function calls. Caching the results of a function like this is
commonly called memoization.
In other words, instead of evaluating the expression normalize($a)
, you look to see if you
computed it earlier and saved the result. otherwise, you compute the
value and stow it in the cache for next time. So wherever you
have:
function($INPUT)
you would use:
exists($cache{$INPUT}) ? $cache{$INPUT} : ($cache{$INPUT} = function($INPUT))
Worked into our bi-level sort criterion, this would look like:
{ my(%cache, %cache2); @sorted = sort { ( exists($cache{$a}) ? $cache{$a} : ($cache{$a} = normalize($a)) ) cmp ( exists($cache{$b}) ? $cache{$b} : ($cache{$b} = normalize($b)) ) or ( exists($cache2{$a}) ? $cache2{$a} : ($cache2{$a} = normalize2($a)) ) cmp ( exists($cache2{$b}) ? $cache2{$b} : ($cache2{$b} = normalize2($b)) ) or $a cmp $b } @stuff; }
It’s not pretty, but it does avoid having to needlessly
recompute normalize(ITEM)
several
times for each item being sorted. And the only thing better than
correct sorting is faster correct sorting.
(Note: I’ve presented memoization only in the context of sorting. For more general applications, see Mark Jason Dominus’s article on the topic in Computer Science and Perl Programming: Best of the Perl Journal, or the Memoize module in CPAN.)
Now, I’ve heard that in the years since I took my last Spanish class, the Spanish Academy has decided to stop giving “ch” special treatment, so that “churro” will be, they decree, under “C”, somewhere between “cesto” and “cicatriz”. However, I don’t know to what degree this has been accepted by the average Spanish speaker, much less the people who make the phone books and dictionaries in all the Spanish-speaking countries.
But even if everyone’s idea of Spanish sorting conventions suddenly gets simpler (by doing away with those “ch” and “ll” digraphs), it’ll still need bi-level sorting, just like English.
In fact, because implementing the bi-level sorting presented in this article is so common, I’ve written a module called Sort::ArbBiLex that does it for you. The module allows you to specify a sort order (possibly including multi-character letters like Spanish “ch”) for which it builds a subroutine that sorts that way. The internals of Sort::ArbBiLex are frightening, but they’re just an elaborate version of the techniques discussed in this article, adapted to the kinds of sorting found in most languages.
Here’s some example code that defines then uses a sort order for Spanish:
use strict; use Sort::ArbBiLex; *sort_es = Sort::ArbBiLex::maker( # defines &sort_es "a A á Á b B c C ch Ch CH d D e E é É f F g G h H i I í Í j J k K l L ll Ll LL m M n N ñ Ñ o O ó Ó p P q Q r R s S t T u U ú Ú ü Ü v V w W x X y Y z Z " ); my @stuff = ("cana", "caña", "cánula", "cantina", "Canal", "cantó", "canto", "cantor"); print map "[$_] ", sort_es(@stuff);
This code prints:
[cana] [Canal] [cantina] [canto] [cantó] [cantor] [cánula] [caña]
And there we have it: a simple sort
that sorts Spanish text
correctly.