CHAPTER 23

Unicode, Locale, and Internationalization

There are many different languages in the world, each with its own unique set of characters; grammatical sentence structure; local conventions for the representation of dates, times, and decimal points; and myriad other details. In this chapter, we cover two of the major tools at our disposal for handling different languages and country-specific formats, Unicode and locale.

The Unicode Standard is the result of an attempt to encode all the known symbols in the world, ancient and modern, into a single unified database. Perl provides excellent support for handling Unicode, including an implicit understanding of multibyte characters, commonly known as wide characters. Wide characters are necessary because traditional encodings, including Latin 1, the default encoding for the Web as well as Perl source, are designed to fit each character they handle into a single byte, giving only 256 possible characters. Clearly the entire world's symbology is going to need something a little larger. As we discuss Unicode, we will also revisit Perl's support for character encodings and see how to translate Unicode strings to and from other character sets. We will even see how to encode Perl source code.

Localization, abbreviated L10N (because there are ten letters between the "l" and "n"), is concerned with not just the native language of the user of an application, but also where they are in the world—a French speaker can be in France, Belgium, Switzerland, or Canada, to name just a few of the countries where French is spoken. The combination of language and country is called a locale, and each locale has different conventions and standard formats for representing quantities like time and currency. By making use of the locale support provided by the operating system, we can write Perl programs that are locale-aware. Perl can automatically make use of some details of a locale to reconfigure the behavior of built-in functions, but we can also query additional details of a locale to discover the words for "yes" and "no" or the name of the local currency. Locale is primarily supported by Unix platforms, but while Windows handles locale differently, it can still be taught to support locale-oriented programs.

We also take a look at internationalization, abbreviated I18N, which takes the concept of locales and adds to it the ability to generate messages for the correct language and, if necessary, country. This is a lot more complex than it might at first seem, since different languages have different ideas about sentence structure, what is single and what is plural, and how the number zero is handled, to name some of the simpler problems that arise. To internationalize an application, we need to create lexicons of messages and message construction routines that are capable of handling all the various different ways even the simplest message needs to be rendered under different languages.

We start off with an examination of character sets. In order to develop software that is aware of alternative character sets, character encodings, or Unicode, it is helpful to know a little about the background of how character sets came to be. From there we can look at the Unicode standard, what it provides, and the problems that it was designed to solve.

A Brief History of Symbols

A collection of distinct characters or symbols is a character repertoire. It is defined by giving a name to each character and associating that name with a visible representation. For many characters, for instance alphanumeric letters, their name when written down is their visible representation, which makes it tricky to draw a distinction. Punctuation is a better example: "~" is "a tilde" and "!" is "an exclamation mark."

On its own, a character repertoire is not something a computer can understand. A character code is a numeric value, specifically an integer, associated with a character. Since computers fundamentally only understand numbers, this provides the means to translate each member of a character repertoire into a form that a computer is able to process.

Finally, a character set is the combination of these two concepts, a table that maps every character in a character repertoire to a numeric value. Implicitly, it also provides the inverse mapping of every numeric value (in a given range) to a character. It follows from this that every integer counting from zero is a code for a character, and so the numeric value associated with a character is also known as its code position or code point, since it serves as an index for the character set table. For instance, in standard ASCII a capital A has the character code 65, making it the 66th entry in the character set when we start counting from zero.
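
As a quick illustration of the relationship between characters and their codes, Perl's built-in ord and chr functions (which we return to later in this chapter) expose this mapping directly:

print ord('A'), "\n";   # prints 65, the code position of capital A
print chr(65), "\n";    # prints 'A', the character at code position 65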

Building Character: ASCII and ISO 8859

Historically, the most established character set was the ASCII standard, a character set represented by numeric codes in the range 0 to 127 and storable in a single byte. This was important, as a byte was the optimum size for 8-bit computers to handle. ISO 8859 later built on ASCII to provide a family of related character sets with codes 0–127 defined identically, and 128–159 explicitly undefined (they do not map to printable characters and are typically used as control characters). Codes 160–255 provided different characters for different language groups, but still in a single byte.

ISO 8859-1, also known as Latin 1, is the default encoding for many applications, including the Web, and is also called Western European. It defines characters suitable for a large number of European countries and has room for a couple of African ones too, but it has no Euro symbol. Other members of the family—there are 16 siblings in all—replace various subsets of Latin 1 with different characters to provide support for other language groups. Some languages, like German, appear more than once (and English is effectively supported in all 16), but different members of the family make different trade-offs between characters and symbols. ISO 8859-15 and ISO 8859-16, the newest members, throw out symbols to make room for more language support.

The ISO 8859 family is sufficiently pervasive that it is worth reviewing what it actually contains. Table 23-1 offers the full list of defined character sets.

Table 23-1. The ISO 8859 Family

Encoding Alphabet Languages
ISO 8859-1 Latin 1 West European: Albanian, Basque, Catalan, Danish, Dutch (partial), English, Faeroese, Finnish (partial), French (partial), German, Icelandic, Irish, Italian, Norwegian, Portuguese, Rhaeto-Romanic, Scottish, Spanish, Kurdish, Swedish
African: Afrikaans, Swahili
No Euro symbol
ISO 8859-2 Latin 2 Central and Eastern European: Czech, German, Hungarian, Polish, Slovak, Slovenian
ISO 8859-3 Latin 3 South European: Turkish, Maltese, and Esperanto
Superseded by Latin 5 for Turkish
ISO 8859-4 Latin 4 North European (Baltic): Estonian, Latvian, Lithuanian, Greenlandic, Sámi
ISO 8859-5 Cyrillic Russian, Ukrainian, Belarusian
ISO 8859-6 Arabic Arabic (partial)
ISO 8859-7 Greek Greek (ancient and modern)
ISO 8859-8 Hebrew Hebrew
ISO 8859-9 Latin 5 Modified Latin 1: added Turkish, no Icelandic
ISO 8859-10 Latin 6 Modified Latin 4 (Nordic): Sámi, Inuit, Icelandic
ISO 8859-11 Thai Thai
ISO 8859-12 Undefined Not defined
ISO 8859-13 Latin 7 Baltic Rim: Modified Latin 4/6, more glyphs
ISO 8859-14 Latin 8 Celtic: Gaelic, Breton
ISO 8859-15 Latin 9 Modified Latin 1, includes Euro and all French/Finnish letters, but fewer glyphs
ISO 8859-16 Latin 10 South-East European: Albanian, Croatian, Finnish, French, German, Hungarian, Irish Gaelic, Italian, Polish, Romanian, Slovenian

Naturally, this is not the whole story. Windows uses its own standard, for instance, where character sets are known as code pages. Of the many available, code page 1252 loosely corresponds to ISO 8859-1, but uses character codes 128–159 to represent special characters (ISO 8859 states that these codes should not represent printable characters).

Meanwhile, elsewhere in the world, languages with character sets totally unrelated to ASCII cannot be represented with any member of the ISO 8859 family. Japanese has several character sets, including ISO 2022, Shift-JIS, and others. Taiwanese uses Big5, Hindi uses ISCII, and so on. All of these character sets have many more than 256 characters, so the character codes must necessarily be bigger than a single byte can encode.

To deal with all these different character sets, the concept of character encodings was invented. Since character codes alone do not tell us which character set they belong to, the generator of a piece of textual data must include with the data the details of the character set in which it is encoded. For example, a web server does this by including a charset attribute in the Content-Type: header of every document it sends to a client, and a client can tell a server what character sets it will accept with an Accept-Charset: header. In some cases, character set information can be embedded into text so that the rendering application knows to render one piece of text as Latin 1, another as Greek (ISO 8859-7), and a third as Hiragana, all in the same document. But this is really a makeshift solution, and so a better one was designed—Unicode.

Unicode and the Universal Character Set

The modern Unicode Standard is the culmination of two separate efforts to create a true universal character set. The ISO 10646 project defined one single Universal Character Set or UCS. In this scheme, character codes are 32-bit values, subdivided into 65,536 planes of 65,536 characters (16 bits) each. Plane 0, where the top 16 bits are all zero, is known as the Basic Multilingual Plane. It defines the character codes for practically all modern languages, as well as a selection of special characters, sigils, and miscellaneous symbols.

Meanwhile, the first Unicode Standard was published by the Unicode Consortium, an alliance of software developers. It defined not just character codes, but also metadata like the direction of text (left-to-right or right-to-left). Seeing the benefit of synergy, the two standards merged and the Unicode Standard became the definition for all character codes—called code points in Unicode terminology—of the Basic Multilingual Plane, Plane 0 of the UCS. For this reason, Unicode is sometimes thought of as a standard for 16-bit character codes, but as we will see, this is not exactly true, because the standard itself only defines an abstract character code and says nothing directly about how that character code should be stored or transmitted.

Blocks and Scripts

Plane 0 is where all the character sets defined by previous standards like the ISO 8859 family end up, and it is itself subdivided into blocks of contiguous characters. Many blocks are 128 or 256 characters in size, but by no means all. For the curious, the complete list is defined in the file Blocks.txt, which is located in the unicode or unicore (from Perl 5.8 onwards) subdirectory within Perl's standard library.

Within a block, Unicode character codes, also sometimes called code points, are defined with the notation U+code where code is the character code in hexadecimal. The first block, U+0000 to U+007F, is called Basic Latin and corresponds to ASCII. The second, U+0080 to U+00FF, is called Latin 1 Supplement, and together with Basic Latin it corresponds to Latin 1. The fact that the numeric value of these codes exactly aligns with ISO 8859-1 is not an accident, as we will see when we come to discuss encoding Unicode.

Unicode and the UCS break the assumption that every code has a character and every character a single code. While blocks are simple contiguous sequences of characters, scripts are an abstract association of related characters that may span more than one block, while only using selected characters from each. To pick a simple example, Unicode defines two blocks as Cyrillic:

0400..04FF; Cyrillic
0500..052F; Cyrillic Supplement

The first of these blocks corresponds to ISO 8859-5. The second, as its name suggests, provides characters for which there was insufficient room in that standard. While this is convenient for backwards compatibility with 8-bit encodings, it does not allow us to refer to a single collection of character codes as Cyrillic.

This problem is solved by scripts. A script (here in the sense of a character font, rather than a program) may include characters from several blocks, but it may only include some characters from each. Where two blocks contain encodings for the same character, a script will select a code from just one of them to represent the character.

The Cyrillic script includes most but not all of the codes from the preceding two blocks, plus U+1D2B. The Latin script is a good deal more complex, including characters from many different blocks. By doing this, it provides a consistent and complete definition for Latin alphabets that is not limited by the constraints of the ISO 8859 standard. But the script defines only letters—notably, it does not include numbers or punctuation.

Other scripts include Common, where common symbols reside. Every language represented by ISO 8859 (and others) has a script, and even Braille and Linear B are supported. The file Scripts.txt in the unicode/unicore directory contains the complete list, as understood by Perl.

While Plane 0 is filling up, vast stretches of empty space in the UCS do not as yet have any kind of character associated with them. They may be assigned in the future, and indeed the work of codifying the world's symbology is not yet complete. This means that character codes we specify may not actually be legal according to the ISO 10646 and Unicode Standards.

The Encoding of Unicode

Since Unicode character codes are 4-digit hexadecimal numbers, we can see that they can be represented by 16-bit values, which is correct assuming that we are currently in Plane 0 of the UCS, where Unicode reigns. But this code is an abstract one, and it does not necessarily mean that all Unicode characters are encoded, recorded, or transmitted as 16-bit values. That's determined by the character encoding. Previously, encodings switched between different character sets. For Unicode, encodings may apply to the same Unicode character codes, but they now determine how those codes are stored and transmitted, and in particular how many bytes for each character are needed to do it.

For many languages, we only need 8 bits to encode each character, as provided by ISO 8859. For others, notably Asian languages, at least 16-bit values are necessary. As a result, there are several encoding schemes available, some fixed length, with all characters taking up the same number of bytes, and some variable length. For instance, the UCS-2 and UCS-4 fixed-length encodings defined by ISO 10646 allow 2-byte and 4-byte character codes, respectively. In UCS-2, a mechanism is defined to switch the current plane to provide access to characters in other planes. UCS-4 provides access to all planes simultaneously, but it is twice as big.

Both UCS-2 and UCS-4 also have the major disadvantage that programs not expecting Unicode character input will completely fail to comprehend text encoded in either standard, since traditionally characters only took up one byte. UTF-8 addresses this issue by defining a variable length encoding where the most common characters are encoded using 1 byte, while less common characters require more. UTF-8 allows for encodings of up to six bytes per character, although only encodings up to 4 bytes long are currently defined.

UTF-8 solves the backwards-compatibility problem because Unicode assigns the characters of ISO 8859-1 the very same code positions they had in the earlier standard, and UTF-8 encodes the first 128 of those codes as single bytes with exactly their ASCII values. As a result, if a UTF-8-encoded character stream does not include any characters that require an encoding of more than one byte, it is byte-for-byte identical to ASCII (and hence the lower half of Latin 1), and it will be understood by programs expecting single-byte character input. Only if a multibyte character occurs does special action need to be taken.
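
To make the variable-width nature of UTF-8 concrete, here is a small sketch using the Encode module (covered later in this chapter); the %vd and %vX printf flags simply print the ordinal of each byte of the result, separated by dots:

use Encode qw(encode);

printf "%vd\n", encode('UTF-8', 'A');        # 65 - one byte, the same as ASCII
printf "%vX\n", encode('UTF-8', "\x{470}");  # D1.B0 - two bytes for Cyrillic capital Psi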

It is important to remember that UTF-8 is not Unicode, just a particular encoding that is currently the most popular and practical for compatibility reasons. We have already mentioned UCS-2 and UCS-4. Other encodings include the following:

  • UTF-16: A variable-length encoding that encodes characters using either 16-bit or 32-bit values. Unlike UTF-8, byte order is significant, and UTF-16LE and UTF-16BE define the little-endian and big-endian versions of this encoding, respectively. Alternatively, the Byte-Order-Mark (BOM) character U+FEFF may be prefixed to the data to define the byte order (its bytes arrive as 0xFE 0xFF for big-endian UTF-16 and 0xFF 0xFE for little-endian UTF-16), as illustrated in the sketch following this list.
  • UTF-32: A 32-bit fixed-length encoding that is identical in effect to UCS-4. UTF-32LE and UTF-32BE determine the byte order, or alternatively the BOM may be used, arriving as the bytes FF FE 00 00 (little endian) or 00 00 FE FF (big endian).
  • UTF-7: A 7-bit encoding for use where the transmission or storage medium is not 8-bit safe. This typically means legacy ASCII-aware infrastructure that predates ISO 8859.
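
As a brief, hedged sketch of how the BOM is used in practice, the following peeks at the first two bytes of a file (the file name in $file is assumed) to guess the byte order of UTF-16 data:

open my $fh, '<:raw', $file or die "Cannot open $file: $!\n";
read $fh, my $bom, 2;
if    ($bom eq "\xFE\xFF") { print "looks like big-endian UTF-16\n" }
elsif ($bom eq "\xFF\xFE") { print "looks like little-endian UTF-16\n" }
else                       { print "no UTF-16 BOM present\n" }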

More information on Unicode can be found at http://www.unicode.org. The UTF-8 encoding is described in http://www.ietf.org/rfc/rfc2279.txt.

Unicode in Perl

Having examined Unicode's background, it is time to look at how Perl manages Unicode. The good news is that, at least as of Perl 5.8, Perl intuitively understands Unicode and can transparently handle much of the hard work.

Perl has comprehensive support for Unicode built into the interpreter and is capable of storing wide characters, requiring more than one byte to store their character code, in an internal variable-width format that corresponds to UTF-8. String data is stored as single-byte character codes using the native character encoding when possible, but if Perl encounters a character that cannot be stored in that encoding, the string in which it occurs is transparently and automatically upgraded to a UTF-8 encoded format.

For the most part, we can leave it to Perl to figure out when support for wide characters is necessary, and so we do not need to worry about what form Perl is storing string data internally. However, for the curious, the Unicode data files that Perl uses as the basis of all Unicode knowledge can be found either in a directory under the standard library called unicore in Perl 5.8 onwards or unicode prior to Perl 5.8—on a Unix platform a typical location is /usr/lib/perl5/5.6.1/unicode or /usr/lib/perl5/5.8.5/unicore, while on Windows look for a directory like C:/Perl/lib/unicore. We will return to these files later in the chapter.

Unicode support in Perl 5.6 was a code-oriented feature, and the use utf8 pragma was used to switch Perl code into and out of Unicode-aware operation. This proved to be inelegant, so Perl 5.8 instead makes the data itself Unicode-aware and sets a UTF-8 flag on each string or filehandle that contains or needs to be aware of wide characters. Perl's string-based functions and operators now transparently adapt to manage wide characters whenever they encounter a string or filehandle that demands it. As a result, we rarely need to query the UTF-8 flag or explicitly convert strings between byte-oriented and wide-character interpretations.
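
As a small illustration of this transparency, the following snippet combines a byte-oriented string with a wide-character one and checks the result with utf8::is_utf8, a function we come back to later in the chapter:

my $bytes = "abc";          # a plain 8-bit string
my $wide  = "\x{470}";      # contains a wide character (Cyrillic capital Psi)
my $both  = $bytes . $wide; # the result is transparently upgraded
print utf8::is_utf8($both) ? "wide-character string\n" : "byte string\n";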

Writing in Unicode

Perl supports the generation of Unicode characters through interpolation, for which two interpolation sequences are provided: \x{code}, which interpolates a hexadecimal Unicode character code, and \N{name}, which interpolates from a name defined by the Unicode Standard. With either notation, we can create strings with Unicode characters.

As a practical demonstration, take the following example, which prints a string containing a Psi character in the Cyrillic alphabet using the \x{code} notation:

#!/usr/bin/perl
# unicodex.pl
binmode STDOUT,":utf8";
print "\x{470} is a Cyrillic capital Psi\n";

To print this string out, we have to tell Perl that standard output is capable of handling Unicode characters; otherwise, Perl will conclude that the string cannot be printed, and we will get a "Wide character" warning instead of output. We enable this feature by setting the UTF-8 layer on the STDOUT filehandle with binmode, as we saw back in Chapter 12.

We can also create Unicode characters with the built-in chr function. Given a character code greater than 255, chr will automatically deduce we want a wide character and generate a wide-character string:

my $Psi=chr(0x470);

When \x is used with braces, it defines a Unicode character, so the string must by necessity be promoted to a UTF-8 string, even if the encodings in question are only single byte. \x{40} is not the same as \x40—the former represents a Unicode character that will promote the string that contains it to a wide-character internal format, whereas the latter represents a character code in the currently active native 8-bit format. The fact that both encodings may happen to represent the same character (@) is immaterial to whether Perl subsequently treats the string as 8-bit or variable-width encoded.

The Unicode Standard defines a character repertoire, which as we discussed at the start of the chapter means that every character has a name. We can use these names in place of the character codes with the charnames pragma, which gives us the ability to specify characters using full names, short names, or names within individual scripts. To use the full official names defined by the standard, we specify a :full argument, like so:

#!/usr/bin/perl
# unicodefull.pl
use charnames qw(:full);    # use Unicode with full names
print "\N{CYRILLIC CAPITAL LETTER PSI} is a Cyrillic capital Psi\n";
print "\N{GREEK CAPITAL LETTER PSI} is a Greek capital Psi\n";
print "\N{CYRILLIC SMALL LETTER PSI} is a Cyrillic small psi\n";

As the term suggests, full names are always verbose and in uppercase (or more correctly are case insensitive), and so we must spell out exactly what character we mean with keywords like SMALL and CAPITAL for case. Alternatively, we can import short names, which provide shortcuts to the official names and which use case in the name to determine the case of the resulting character:

#!/usr/bin/perl
# unicodeshort.pl
use charnames qw(:short);   # use Unicode with short names
print "\N{Cyrillic:Psi} is a capital Psi in Cyrillic\n";
print "\N{Greek:Psi} is a capital Psi in Greek\n";
print "\N{Cyrillic:psi} is a lowercase psi in Cyrillic\n";

Finally, we can import character names from specified scripts, in which case we can use an even shorter name. The script prefix only serves to help Perl disambiguate which script we mean, so we can omit it unless we happen to import two scripts and use a character with the same name in each. The following program allows us to just say Psi because we exclusively select Cyrillic character names:

#!/usr/bin/perl
# unicodescript.pl
use charnames qw(cyrillic); # use explicit Unicode script
print "\N{Psi} is a capital Psi in Cyrillic\n";
print "\N{psi} is a lowercase psi in Cyrillic\n";

If we do not fully specify a letter character, then Unicode defines a resolution order that determines which of several possible characters is selected. For instance, if we ask for a letter with both upper- and lowercase forms but do not specify which we want, the uppercase variant will be selected. In terms of the full character name, the search order is

SCRIPT CAPITAL LETTER NAME
SCRIPT SMALL LETTER NAME
SCRIPT LETTER NAME

With short names in effect, this search order determines which letter is found within a given script, with the qualification that a NAME that is all lowercase ignores a CAPITAL variant, and any other name ignores the SMALL variant. The case used for a short name is disregarded for case-invariant characters, such as numerals.

These names, and their numeric equivalents, will also work in Perl's regular expression engine, which understands wide-character search patterns just as functions like length do. For example:

#!/usr/bin/perl
# whiteknight.pl
use charnames ":full";

# interpolate Unicode character into string
my $chess_move = "White moves \N{WHITE CHESS KNIGHT}";

# match UTF-8 string against UTF-8 pattern
print "Check!\n" if $chess_move =~ /\N{WHITE CHESS KNIGHT}/;

This short program generates a string containing a white knight chess piece symbol, then matches it with a regular expression containing the same symbol.

Converting Between Names and Codes

The charnames pragma also supplies a pair of useful functions that allow us to retrieve the full Unicode name for a given character code or the character code for a supplied Unicode name. The viacode function takes an integer code value and returns the name for it or undef if the code is invalid. We can use it to create a simple name listing utility like this:

#!/usr/bin/perl
# uninames.pl
use strict;
use warnings;
use charnames ':full';

die "Usage: $0 <start> <end>\n" unless @ARGV>=1;
my ($start,$end)=@ARGV;
$end=$start unless $end;

die "Bad range $start..$end\n"
    if ($start.$end)=~/\D/ or $end<$start;

for (my $code=$start; $code<=$end; $code++) {
    printf "%6d = U+%04X : %s\n", $code, $code,
      charnames::viacode($code) || '*Invalid*';
}

We can use this tool to produce either a single name or a list of sequential names like this:

> uninames.pl 1005 1010


  1005 = U+03ED : COPTIC SMALL LETTER SHIMA
  1006 = U+03EE : COPTIC CAPITAL LETTER DEI
  1007 = U+03EF : COPTIC SMALL LETTER DEI
  1008 = U+03F0 : GREEK KAPPA SYMBOL
  1009 = U+03F1 : GREEK RHO SYMBOL
  1010 = U+03F2 : GREEK LUNATE SIGMA SYMBOL

If we try an invalid code, we instead get the following:

> uninames.pl 567


   567 = U+0237 : *Invalid*

Here is a script that goes the other way, displaying the character code for the specified name:

#!/usr/bin/perl
# getcode.pl
use strict;
use warnings;
use charnames ':full';

die "Usage: $0 <name>\n" unless @ARGV;
my $name=uc(join ' ',@ARGV);

my $code=charnames::vianame($name);
if ($code) {
    printf "%6d = %04X : %s\n", $code, $code, $name;
} else {
    print "  $name is *Invalid*\n";
}

This translator is unforgiving, though—it will not return a result unless we give it a complete full name:

> ./getcode.pl A


 A is *Invalid*

> ./getcode.pl LATIN CAPITAL A


  LATIN CAPITAL A is *Invalid*

> ./getcode.pl LATIN CAPITAL LETTER A


  65 = 0041 : LATIN CAPITAL LETTER A

Accessing the Unicode Database

Perl keeps all the data files and scripts that provide support for Unicode in a subdirectory called unicore in the Perl standard library directory. (In Perl 5.6, the subdirectory was called unicode, but it has since been renamed to avoid conflicting with the Unicode:: module family on case-insensitive file systems.)

Many of the files in this directory are text files that together comprise the Unicode Standard specification in machine-parsable format, taken from the Unicode Standard repository at http://www.unicode.org. Script names can be found in Scripts.txt, blocks are defined in Blocks.txt, and so on. Other files here, notably the Perl scripts and the contents of the To and lib subdirectories, are created as part of the Perl installation by the mktables script also located in this directory. If we ever had a reason to modify one of the text files, this script can regenerate all the secondary files for us. Taken together, they comprise the Unicode database.

Modules like the charnames pragma allow us to use the database without worrying about its low-level details. If we need more direct access, we use the Unicode::UCD module, which provides a low-level programmatic view of the Unicode data files. Of most immediate interest are the charblocks, charscripts, charblock, and charscript routines, which as their names suggest access block and script information.

Block and Script Information

The charblocks function returns details of all defined blocks, organized into a hash of arrays: the key of each entry is the block name, and the value is an array of one or more ranges, each itself an array with the start and end character codes as its first and second elements. (The block name appears again as the third element, but we already have it from the key.) The following program dumps this structure out in legible form:

#!/usr/bin/perl
# charblocks.pl
use strict;
use warnings;
use Unicode::UCD qw(charblocks);
my $blkref=charblocks();

foreach my $block (sort keys %$blkref) {
    printf "%40s :", $block;
    my @ranges=@{$blkref->{$block}};
    foreach my $range (@ranges) {
        printf " U+%04X..U+%04X", @$range;
    }
    print "\n";
}

Running this program generates a long list of blocks with entries like these:

     Cypriot Syllabary : U+10800..U+1083F
              Cyrillic : U+0400..U+04FF
Cyrillic Supplementary : U+0500..U+052F
               Deseret : U+10400..U+1044F
            Devanagari : U+0900..U+097F
              Dingbats : U+2700..U+27BF

Each block name can only ever have one range associated with it. However, we can use the charscripts function to generate a similar data structure, and a script can contain many ranges (with a single character being a range from itself to itself). If we replace the call to charblocks in the last script with this line:

my $blkref=charscripts();

we now get a list of scripts with associated ranges, similar to this:

Cherokee : U+13A0..U+13F4
 Cypriot : U+10800..U+10805 U+10808..U+10808 U+1080A..U+10835
           U+10837..U+10838 U+1083C..U+1083C U+1083F..U+1083F
Cyrillic : U+0400..U+0481 U+0483..U+0486 U+048A..U+04CE U+04D0..U+04F5
           U+04F8..U+04F9 U+0500..U+050F U+1D2B..U+1D2B
 Deseret : U+10400..U+1044F

To retrieve information for a specific block or script, we can use the singular versions of these functions, charblock and charscript. Both take a single argument: either a block or script name, respectively, or a character code. With a name, the corresponding range (or ranges, in the case of scripts) is looked up and returned as an array of range arrays, just like the values of the hashes generated by the charblocks and charscripts routines. With a code value, the corresponding block or script name is returned. The value may be specified either as an integer or in the U+NNNN format, in which case NNNN is interpreted as a hexadecimal code. Here is a program that looks up block names and code ranges using charblock:

#!/usr/bin/perl
# charblock.pl
use strict;
use warnings;
use Unicode::UCD qw(charblock);

die "Usage: $0 <block name|code value|U+NNNN> ...\n"
    unless @ARGV;

foreach my $block (@ARGV) {
    # get the block for a code or the code ranges for a block
    my $ranges=charblock($block);

    if ($ranges) {
        if (ref $ranges) {
            print "Block $block:";
            # mapped name to array of ranges
            foreach my $range (@$ranges) {
                printf " U+%04X..U+%04X", @$range;
            }
        } else {
            # mapped a code to a block name
            print "Code $block is in the block '$ranges'";
        }
    } else {
        print "$block : *Invalid code value or block name*";
    }
    print "\n";
}

Using this script, we can generate output like this:

> charblock.pl Cyrillic 1234 U+1234 Latin "Basic Latin"


Block Cyrillic: U+0400..U+04FF
Code 1234 is in the block 'Cyrillic'
Code U+1234 is in the block 'Ethiopic'
Latin : *Invalid code value or block name*
Block Basic Latin: U+0000..U+007F

The charscript version of this program is identical in all respects other than replacing the call to charblock with a call to charscript. If we do this (and rename block to script throughout), we produce output like this:

> charscript.pl Cyrillic 1234 U+1234 Latin "Basic Latin"


Script Cyrillic: U+0400..U+0481 U+0483..U+0486 U+048A..U+04CE U+04D0..U+04F5
U+04F8..U+04F9 U+0500..U+050F U+1D2B..U+1D2B
Code 1234 is in the script 'Cyrillic'
Code U+1234 is in the script 'Ethiopic'
Script Latin: U+0041..U+005A U+0061..U+007A U+00AA..U+00AA U+00BA..U+00BA
U+00C0..U+00D6 U+00D8..U+00F6 U+00F8..U+01BA U+01BB..U+01BB
U+01BC..U+01BF U+01C0..U+01C3 U+01C4..U+0236 U+0250..U+02AF U+02B0..U+02B8
U+02E0..U+02E4 U+1D00..U+1D25 U+1D2C..U+1D5C U+1D62..U+1D65 U+1D6B..U+1D6B
U+1E00..U+1E9B U+1EA0..U+1EF9 U+2071..U+2071 U+207F..U+207F U+212A..U+212B
U+FB00..U+FB06 U+FF21..U+FF3A U+FF41..U+FF5A
Basic Latin : *Invalid code value or script name*

The charinrange routine combines the features of the preceding functions to provide the ability to test whether a given character is a member of a given script or block. It has the same effect as the \p{...} notation in regular expressions, when the property is a script or block, but it operates on the character code rather than the character (or a string containing the character). For example:

#!/usr/bin/perl
use strict;
use warnings;
use charnames ':short';
use Unicode::UCD qw(charinrange charblock);

my $char="\N{Cyrillic:Psi}";
print "In block Cyrillic!\n" # string via interpolation
    if $char =~ /\p{InCyrillic}/;

print "In block Cyrillic!\n" # character code via charinrange
    if charinrange(charblock("Cyrillic"), ord($char));

The two tests in this short script are effectively equivalent, but the charinrange version is more efficient if we are interested in testing individual characters. Conversely, the property is easier for testing strings.

Detailed Character Information

The charinfo routine of Unicode::UCD extracts every piece of information the Unicode specification has to say about a given character code, including its block and script, its full name, its general category (which is also a property that it can be matched to), and the upper- or lowercase equivalent, if any exists. This short script uses charinfo to print out all defined values for the supplied character code or codes:

#!/usr/bin/perl
# charinfo.pl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

die "Usage: $0 <code value|U+NNNN> ...\n"
    unless @ARGV;

foreach my $char (@ARGV) {
    my $info=eval { charinfo($char) };

    if ($info) {
        print "Character code U+",(delete $info->{code}),":\n";
        print map {
            $info->{$_} ? sprintf "%13s = %s\n", $_, $info->{$_} : ()
        } sort keys %$info;
    } else {
        print "$char : *Invalid code*\n";
    }
}

We can use this script like this:

> charinfo.pl 1234


Character code U+04D2:
        bidi = L
       block = Cyrillic
    category = Lu
decomposition = 0410 0308
        lower = 04D3
     mirrored = N
         name = CYRILLIC CAPITAL LETTER A WITH DIAERESIS
       script = Cyrillic

From this we can see that this is an uppercase letter (both from the name and the category) and that it has a lowercase variant. We can also see from its decomposition that it can be made up from two other Unicode characters, which if we look them up turn out to be CYRILLIC CAPITAL LETTER A and COMBINING DIAERESIS.

Its default rendering order is left to right, and it is not mirrored (reflected) if forced into a right-to-left order. By contrast, this is what the Basic Latin numeral 6 returns:


> ./charinfo.pl 54
Character code U+0036:
         bidi = EN
        block = Basic Latin
     category = Nd
      decimal = 6
        digit = 6
     mirrored = N
         name = DIGIT SIX
      numeric = 6

As 6 is a digit and not a letter, it is in the Nd category and does not belong to a script. Its default rendering order is neutral, so it can be controlled by the prevailing direction. Being a number, it also has a numeric value but no uppercase or lowercase variant.

Case Mapping, Folding, and Composition Exclusion

The Unicode::UCD module provides three more routines for export, all of which are concerned with case manipulation and conversion, that we will only mention briefly here:

  • compexcl returns true if the specified character has a composition exclusion—that is, it should not be combined with a modifier to create a composite character. We can add an accent to a u, for example, but we can't necessarily add one to an ü. Accents are usually available both bundled with characters and as combining characters that modify the appearance of the preceding character, but only if the Unicode database says that such a modification is acceptable.
    This is more complex than it might seem: depending on what script they come from, characters with the same visual representation may have different restrictions. See also the Unicode::Normalize module for routines to decompose characters into their constituents, recompose constituents back into composite characters, and more information on character composition in general.
  • casefold returns a structure containing information on how the character's case may be folded, in a locale-independent way (see the sketch following this list).
  • casespec returns the valid case mappings of the specified character, possibly modified by the locale currently in effect.
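
As a brief, hedged sketch of casefold (the exact set of keys in the returned hash has varied between Perl versions, though mapping is common to them), the following looks up the case folding of the German sharp s, which folds to two 's' characters:

use Unicode::UCD qw(casefold);

my $fold = casefold(0xDF);   # LATIN SMALL LETTER SHARP S
if ($fold) {
    # for U+00DF the mapping is "0073 0073", i.e., two 's' characters
    print "U+00DF folds to: $fold->{mapping}\n";
}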

In the rare cases that we might need to check which version of the Unicode Standard Perl is using, we can also extract it with the nonexportable function Unicode::UCD::UnicodeVersion. For example:

> perl -MUnicode::UCD -e 'print Unicode::UCD::UnicodeVersion()'


4.0.1

Implications of Wide Characters in Perl

Before using Unicode in a Perl application or module, it is important to keep in mind that while Perl will for the most part transparently handle wide-character strings just as easily as traditional 8-bit ones, some features of Perl will not behave in quite the same way once wide-character support is enabled. In particular:

  • The chr and ord functions no longer necessarily map between single characters and single bytes, and the return value of ord may exceed 255.
  • When an 8-bit encoded string is used in an operation with a UTF-8 encoded string, a copy of the 8-bit encoding is automatically made and upgraded to UTF-8. This is rather like tainting in strings, and it works the same way: a wide-character string "widens" any string that is generated from an operation in which it takes part, even if the resulting string does not contain any wide characters.
  • The length of a string is no longer necessarily equal to the number of bytes in the string (see the sketch after this list).
  • The . any-character pattern in regular expressions now potentially matches more than one byte. To explicitly match a byte, we must instead use the \C pattern metacharacter.
  • Functions like length, index, and substr take longer to execute on UTF-8 strings because they must scan through the string and look for multibyte characters in order to count characters to find the desired position in the string. This is true even if the string only contains single-byte characters and so is compatible with Latin 1, since Perl still has to check.
  • Not all Perl features recognize wide-character text. The major exceptions are filing system operations, since the file system is most likely not using Unicode for file names, even if their contents are encoded in UTF-8. The %ENV hash and @ARGV array are also not marked as wide-character, even if Unicode is enabled everywhere else. The -C command-line option can be used to tell Perl that @ARGV is encoded in UTF-8, but %ENV is always interpreted as 8-bit Latin 1. The pack and unpack functions also remain byte-oriented but provide the U placeholder to pack and unpack UTF-8 encoded strings.
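
To illustrate the point about length, this small sketch compares the character length and the byte length of a string containing a wide character, using the bytes pragma to obtain the byte-oriented view:

my $string = "Psi: \x{470}";
print length($string), " characters\n";   # 6 characters
{
    use bytes;                             # byte-oriented view within this block
    print length($string), " bytes\n";     # 7 bytes - the Psi takes two
}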

Reading and Writing Unicode

The PerlIO system provided by Perl 5.8 onwards allows any filehandle to be marked as accepting or not accepting wide characters by setting the :utf8 layer for it. As was discussed back in Chapter 12, layers can be added and removed in a number of different ways. Encoding layers, including the :utf8 layer, are no different.

If we are opening a filehandle for the first time, we can set the layer at the same time, with open:

open INUTF8, "<:utf8", "unicode.txt";

If we already have a filehandle, like STDOUT, we can use binmode to set the layer on it, as we saw in our first Unicode example:

binmode STDOUT,":utf8";

Alternatively, we can set the default filehandle layers for all new filehandles with the open pragma. Both of these statements tell Perl to set the :utf8 layer on all new filehandles for both input and output:

use open ":utf8"; # mark all new filehandles as UTF8
use open IO => ":utf8"; # likewise

If we add an argument of :std to this statement, Perl also marks the STDIN, STDOUT, and STDERR filehandles with whatever the default encoding is for new filehandles.

use open ":utf8", ":std"; # also mark STDIN/OUT/ERR

The open pragma also has the useful property that if we specify :locale, which implies :std, Perl will enable whatever encoding is set for the locale (as determined from the LC_ALL, LC_CTYPE, or LANG environment variables) as the default for input and output on all filehandles:

use open ':locale';

We can also use the encoding pragma to automatically mark standard input, standard output, and literal string text (including any text in the __DATA__ section of the source file) as UTF-8. However, while the leading colon is optional for binmode, it is not permitted in the argument to the pragma:

#!/usr/bin/perl
# unicodex2.pl

use encoding "utf8";

print "\x{470} is a Cyrillic capital Psi\n";

Of course, unless we happen to be using a Unicode-aware shell, the actual output from running any of these scripts is quite likely to be a pair of strange characters corresponding to the 2 bytes that make up the wide character (0x04 and 0x71 in this case) and not an actual Psi character.

The encoding pragma provides an interface between PerlIO and the Encode module, which contains definitions for all character encodings known to Perl. Any encoding can be applied, so we can use the pragma to convert UTF-8 to and from other character encodings. By mentioning a filehandle and a secondary encoding—in the same way as the open pragma—we can produce the same effects as a binmode statement to have different encodings for input, output, and in-code text:

use encoding 'iso-8859-5', STDOUT => 'utf8';

This statement tells Perl that string literals are encoded in ISO 8859-5 (8-bit Cyrillic), but any data sent to standard output should be encoded as Unicode characters using the UTF-8 encoding.

We can also use the pragma from the command line. Changing the internal encoding of the script is not very practical, but we can make the input and output UTF-8 encoded with

> perl -Mencoding=STDIN,utf8,STDOUT,utf8 myscript.pl

From Perl 5.8, we also have the -C option to switch on UTF-8 encodings for filehandles and command-line arguments, but not literal text. It takes either a collection of letters or a numeric value as its argument. These arguments are listed in Table 23-2.

Table 23-2. Arguments to Perl's -C Option

Letter Numeric Value Description
I 1 STDIN marked as :utf8.
O 2 STDOUT marked as :utf8.
E 4 STDERR marked as :utf8.
i 8 All new filehandles marked as :utf8 for input.
o 16 All new filehandles marked as :utf8 for output.
A 32 The contents of the @ARGV array are marked as UTF-8 (that is, wide-character) strings.
L 64 Enable the preceding only if the locale indicates UTF-8 is acceptable (determined, in order, from the environment variables LC_ALL, LC_CTYPE, and LANG).
S 7 Shorthand for IOE.
D 24 Shorthand for io.

The previous example can be rewritten using the -C option:

> perl -CIO myscript.pl

Or, adding the numeric values for input (1) and output (2) together, we can equivalently use

> perl -C3 myscript.pl

A -C option with no argument switches on all features except A (@ARGV processing) and is equivalent to -CIOEioL or -CLSD. The S and D options are just shorthand for combinations of other letters.

The environment variable PERL_UNICODE also controls this setting. If -C is not specified, PERL_UNICODE is examined for a series of option letters or a numeric value. For example, in most Unix shells, we can type the following:

> PERL_UNICODE=3 perl myscript.pl

Of course, the real point of this variable is to set the desired behavior as part of the default environment.

Within Perl, the setting of the -C flag can be read as a numeric value with the special variable ${^UNICODE}. This reflects the settings that were in effect when Perl started, and it never changes. In particular, it is not affected by any use of open (pragma or function) or binmode to set the default layers or the layers on the standard filehandles.
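
For example, we can check the value from the command line; with -CIO the result is 3, simply the numeric values of I and O from Table 23-2 added together:

> perl -CIO -le 'print ${^UNICODE}'


3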

If we print a wide-character string to a filehandle that is not so marked, Perl will generate a wide-character warning unless there is a valid mapping for the character into the selected output encoding. Since not every character in a Unicode script can be represented by a more restricted encoding like the ISO 8859 family, this may not always work, even if the output encoding is for the same language.

Unicode and Other Character Sets

Since we can set different encodings for internal text and for filehandles, we can transparently map from one encoding to another in Perl, just by setting the encodings appropriately. For example, this Perl one-line command converts text files written in ISO 8859-5 into Unicode:

> perl -CO -Mencoding=STDIN,iso-8859-5 -ne "print" < cyrillic.txt > utf8.txt

The piconv utility that comes with Perl is a version of the standard iconv tool on many Unix platforms that makes use of Perl's Encode module to do the work instead of the operating system's native libraries. We can achieve the same effect as the preceding command with

> piconv --from iso-8859-5 --to utf8 < cyrillic.txt > utf8.txt

The script is not complex, as it simply wraps Perl's standard encoding functionality with a few command-line options, and the main part of it amounts to less than a hundred lines. Programmers interested in making more direct use of Perl's Encode module may find it worth examining.
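
For instance, here is a minimal sketch of the kind of conversion piconv performs, using Encode directly; the variable $cyrillic_bytes, assumed to hold ISO 8859-5 encoded data, is hypothetical:

use Encode qw(decode encode);

my $text       = decode('iso-8859-5', $cyrillic_bytes); # bytes into Perl's internal form
my $utf8_bytes = encode('utf8', $text);                 # internal form into UTF-8 bytes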

It is also interesting to examine what Perl is actually doing with strings encoded in non-Latin 1 character sets. The built-in ord function comes in useful here as it can tell us what character code Perl has chosen to use internally. The short program that follows is written in Cyrillic, as defined by ISO 8859-5. The only practical effect of this is that the character with code 0xD0 assigned to $char is evaluated as a member of ISO 8859-5:

#!/usr/bin/perl
# map88595touni.pl
use encoding 'cyrillic';

my $char="\xD0"; # no {} braces = 8-bit encoding
printf "U+%04X\n", ord($char);

Internally, Perl maps the character to a 16-bit Unicode character when it compiles the string. We can see this when we run the program and get the following output:


U+0430

For 8-bit encodings, it is relatively simple to write a script that shows the mapping for each of the 256 possible values to its Unicode equivalent. Since the bottom half always maps directly for ISO 8859 character sets, we can reduce this to the top 128 codes:

#!/usr/bin/perl
# mapall88595uni.pl
use encoding 'cyrillic';

foreach my $code (128..255) {
    my $char=chr($code);
    printf "%02X => U+%04X\n", $code, ord($char);
}

This generates 128 lines of output of the following form:


...
EE => U+044E
EF => U+044F
F0 => U+2116
F1 => U+0451
F2 => U+0452
...

Notice that the mapping is not linear and that some characters in this character set are not in the Cyrillic block or the Cyrillic script in the Unicode Standard. U+2116 is, as it turns out, a NUMERO SIGN.

It is almost always possible to map a character in any other character encoding into Unicode. However, the reverse transform is not always possible, even for characters that are clearly members of the language for which the encoding is designed. This is because encodings like the ISO 8859 family cannot always provide an encoding for every known character of a script. The Unicode character U+0470, which we have been using as an example in this chapter, is a member of both the Cyrillic script and of the Cyrillic block. Even so, it does not have a direct mapping to the 8-bit ISO 8859-5 encoding for Cyrillic:

> perl -Mencoding=utf8,STDOUT,iso-8859-5 -e 'print "\x{470}"'


"\x{0470}" does not map to iso-8859-5.

Character code U+0430 (which charinfo tells us is a CYRILLIC SMALL LETTER A), on the other hand, does map:

> perl -Mencoding=utf8,STDOUT,iso-8859-5 -e 'print "\x{430}"'


a

Of course, unless the shell is currently using ISO 8859-5, we won't actually see the character in the encoding. If we were to now use ord to find the character code of this character, we would find that it was 208, or D0 in hexadecimal, which is the value we began with at the start of this section.
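
We can also perform that reverse mapping explicitly with the Encode module rather than through a filehandle layer; a brief sketch:

use Encode qw(encode);

my $byte = encode('iso-8859-5', "\x{430}");      # CYRILLIC SMALL LETTER A
printf "%d (0x%02X)\n", ord($byte), ord($byte);  # 208 (0xD0)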

Encoding Perl Source

So far we have looked at strings and filehandles and how they deal with wide-character data. But we can also encode Perl source itself in Unicode or any other native character set that the underlying platform supports. This allows us to write variable and subroutine names—but not the names of packages—using characters outside the usual Latin 1 range, even using wide characters if we want to.

To encode Perl source itself, we again use the encoding pragma, but now we make use of the Filter option. This assigns the specified encoding to a source filter (provided through the Filter::Util::Call interface described in Chapter 18) so Perl code is passed through the Encode module before it even gets compiled.

For example, to use Unicode characters in our variable and subroutines, we can use

use encoding 'utf8', Filter => 1;

Although the Filter option fundamentally changes the way the pragma works, it maintains compatibility with the nonfiltered uses too, so we can still set different encodings for STDIN and STDOUT as before. So, we can write our code—including string literals—in ISO 8859-5 (the 8-bit encoding for Cyrillic), but have Perl print out in UTF-8 with either of the following:

use encoding 'iso8859-5', Filter => 1, STDOUT => 'utf8';
use encoding 'cyrillic', Filter => 1, STDOUT => 'utf8';

Perl compiles variable and subroutine names into either Latin 1 or UTF-8 names, depending on whether the characters that make up each name can be represented in Latin 1 or not, just as it does for string literals. This detail is, however, mostly moot—the result is that we can write source in any encoding we choose.


Note If all we want to do is write literal strings in a native 8-bit encoding, we do not need to use the Filter option—the encoding pragma with a single encoding as its argument does that. We only need to use the Filter option if we want to create variables and subroutines using characters outside the Latin 1 encoding.


Sorting in Unicode

While Perl's sort function can sort wide-character strings just as it can traditional 8-bit ones, it only provides one possible interpretation of an appropriate sort order, based purely on comparison of character codes via the cmp operator, which it uses by default when no explicit sort subroutine is supplied.

Unicode is of course more complex than this, and there is no official relationship between the character code of a character and its "order" in relation to other characters. The Unicode::Collate module provides a complete implementation of the Unicode sorting algorithm and can be used to sort UTF-8 encoded strings under a variety of different criteria, such as the following:

  • Whether uppercase ranks higher or lower than lowercase
  • Which characters should be ignored entirely for the purpose of sorting
  • Whether composite characters should be compared according to their composite forms or decomposed into multiple constituent characters first

A simple use of the module is to create a collation object that can be used to sort characters using the default behavior defined by the Unicode Standard. For example:

my $sorter=new Unicode::Collate;
my @sorted = $sorter->sort(@unsorted);

We can also make use of the cmp method inside Perl's built-in sort function:

my @sorted=sort { $sorter->cmp($a,$b) } @unsorted;

The collation object can be customized in many ways when it is created by passing in one or more key-value pairs to adjust its behavior. For example, this collator object ignores whitespace, strips a leading "the " if present, and ranks uppercase letters after lowercase ones:

my $sorter=new Unicode::Collate(
    ignoreChar => qr/\s/,
    upper_before_lower => 0,
    preprocess => sub { my $str=$_[0]; $str=~s/^the\s+//i; $str },
);

This is just a simple example of what we can do with the collator object. For more advanced configuration options, see the Unicode::Collate manual page.

Detecting and Converting Wide-Character Strings

Perl's implementation of wide-character support is designed so that, for the most part, we can ignore the way that it works and just use wide characters without any extra effort. However, if we really need to find out whether a given string is marked as UTF-8 or byte-encoded, we can use the is_utf8 function:

#!/usr/bin/perl
# is_utf8.pl

my $tee=chr(0x74);
print "is t wide? ",utf8::is_utf8($tee)?"yes":"no","\n";
my $Psi=chr(0x470);
print "is Psi wide? ",utf8::is_utf8($Psi)?"yes":"no","\n";

This script will let us know that, as suspected, the letter "t" is not a wide character, but a Cyrillic Psi is. Of course, it is really the string that is being checked, but these strings only contain one character.

The is_utf8 function and the other functions mentioned in this section are provided by Perl in the utf8 package. They are not provided by the utf8 pragma, and as the preceding example shows, we do not invoke use utf8 to make use of them.

One problem with wide-character support is that it forces Perl to scan through strings that are marked as potentially containing wide characters in order to determine positions. However, we might happen to know that there are no wide characters in a string, and so there is no need for Perl to carry out this extra work. Alternatively, we might want to treat the string as a byte sequence even if it does have wide characters. Either way, we can convert a string to its byte-oriented form with the encode function. To go the other way and turn a byte-sequence into a wide character string, we use decode. Both functions carry out an in-place conversion of the original string.

#!/usr/bin/perl
# utf8encode.pl

my $Psi=chr(0x470); # a one character wide-character string
print "Wide-character length: ",length($Psi),"\n";
print "First character code : ",ord(substr $Psi,0,1),"\n";
utf8::encode $Psi; # convert to byte-oriented form
print "Byte length: ",length($Psi),"\n";
print "First character code : ",ord(substr $Psi,0,1),"\n";

my $decoded_ok=utf8::decode $Psi; # convert back
print "Wide-character again: ",length($Psi),"\n";
print "First character code : ",ord(substr $Psi,0,1),"\n";

When run, the output from this script is


Wide-character length: 1
First character code : 1136
Byte length: 2
First character code : 209
Wide-character again: 1
First character code : 1136

As expected, when we encode the string, the length changes to 2 and ord on the start of the string returns a number less than 256. Clearly, although the string has not actually changed contents, it will now behave very differently with Perl's string functions, including regular expressions like \w.

The encode function always succeeds, since any wide-character string must obviously have an equivalent byte-oriented form, and it clears the UTF-8 flag to mark the string as a byte-encoded one.

The decode function can fail, however, if the sequence of bytes in the string corresponds to invalid wide-character codes. We should therefore check the return value, which is true on success and false on failure. If the decoding succeeds, the resulting string still may not be marked with the UTF-8 flag, though. This will only happen if a wide character was found in the string, since there is no point slowing down string functions without reason.

There are three other functions in the utf8:: namespace, but we are less likely to find a use for them. upgrade converts a byte-oriented string to the wide-character internal format, treating each byte as a native (Latin 1) character; unlike decode, it always succeeds. The corresponding downgrade function performs the opposite operation and, like decode, it can fail. It takes an additional argument, which if not true will cause Perl to die on error. Finally, valid checks whether a string can be successfully converted by decode or downgrade.
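
As a small sketch of downgrade's failure mode, the optional second argument suppresses the fatal error so we can test the result ourselves:

my $str = "Psi: \x{470}";            # contains a wide character
unless (utf8::downgrade($str, 1)) {  # 1 = do not die if conversion fails
    warn "string contains characters that do not fit in single bytes\n";
}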

Unicode in Regular Expressions

The integration of Unicode into the regular expression engine allows us to match text according to Unicode properties, shortcuts for various character classes defined by the Unicode Standard. Properties come in several flavors depending on what part of the standard they implement, for example General, BiDi, or Script. A complete list of available properties can be found in the perlunicode manual page. This list is occasionally updated as new properties and scripts are added to the Unicode specification.

To match properties in regular expressions, we use either the \p{PROPERTY} notation for a positive match or \P{PROPERTY} for a negative match, where PROPERTY is the Unicode property we want to match. For example, this matches a control character:

print "text contains a control character"
    if $string =˜ /p{Control}/;

General properties have short and long names, so Control can also be specified as Cc. Perl 5.6 prefixed general property names with Is, so for backwards compatibility Perl 5.8 onwards allows either form to be optionally prefixed with Is; IsControl and IsCc are therefore also valid ways to specify this property.
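To illustrate (the sample string is invented for this example), all of the following match the tab character, which belongs to the Control category:

my $text = "column one\tcolumn two"; # contains a tab, a control character

print "long name matched\n"  if $text =~ /\p{Control}/;
print "short name matched\n" if $text =~ /\p{Cc}/;
print "Is prefix matched\n"  if $text =~ /\p{IsControl}/;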

Many properties come in general and more specific variants, so for example, we can ask for a letter or specify an uppercase letter. Here are a few examples of standard Unicode properties, with a few general categories and some of their more specific variants:

  • Letter, L: Letter (equivalent to \w)
    • UppercaseLetter, Lu: Uppercase letter
    • LowercaseLetter, Ll: Lowercase letter
  • Mark, M: Marked character (for example, accented)
  • Number, N: Number
    • DecimalNumber, Nd: Decimal number (equivalent to \d)
  • Punctuation, P: Punctuation
    • DashPunctuation, Pd: Dash character
  • Symbol, S: Symbol character
    • MathSymbol, Sm: Mathematical symbol
  • Separator, Z: Separator
    • LineSeparator, Zl: Line separator
    • ParagraphSeparator, Zp: Paragraph separator
    • SpaceSeparator, Zs: Space separator
  • Other, C: Other character (none of L, M, N, P, S, or Z)
    • Control, Cc: Control character

Single-letter properties like M can be specified without braces, so we can say equally \p{M} or \pM.

Extended properties implement combinations or aliases for standard Unicode properties and basic pattern elements. A few examples are

  • ASCII: Equivalent to the character class [\x00-\x7f]
  • Math: A mathematical symbol in any script
  • WhiteSpace: A whitespace character in any script

Script properties can be used to determine if a character belongs to a particular script:

# match on Cyrillic script
if ($text =~ /\p{Cyrillic}/) { ... }

As with general properties, we can use an Is prefix, so IsCyrillic is also valid.

Block properties by contrast can be used to determine if a character belongs to a particular block. Since some blocks have the same names as scripts, block properties are prefixed with In. To test that a character is in the Cyrillic block, as opposed to the Cyrillic script, we use the following:

# match on Cyrillic block
if ($text =~ /\p{InCyrillic}/) { ... }

Finally, characters with different directional representations (Hebrew and Arabic are written right to left, for instance) can be matched on their bidirectional or Bidi properties, all of which start with Bidi. For example, the BidiL property matches characters normally written in a left-to-right order, while BidiR matches characters normally written in right-to-left order.

Custom Character Properties

New character properties can be implemented simply by defining a subroutine with an Is or In prefix in the main package. A property subroutine should return a text string defining the Unicode characters it matches, using either code ranges or other named properties, blocks, or scripts. For example:

#!/usr/bin/perl
use strict;
use warnings;
use charnames qw(:short);

sub IsCyrillicBlock {
return <<_END_;
0400 04FF
0500 0520
1D2B
_END_
}

my $string="N{Cyrillic:Psi}";

print "IsCyrillicBlock " if $string=˜/p{IsCyrillicBlock}/;

The number ranges correspond to the Cyrillic and Cyrillic Supplement blocks, plus the one lone character that happens to fall outside either block, so this property will match a character in either block plus that extra character. It is more extensive than the Cyrillic script property, because the script only includes some of the characters in each block. In this case it happens that the blocks are contiguous, so we could also just have written

sub IsCyrillicBlock {
return <<_END_;
0400 0520
1D2B
_END_
}

Unlike built-in properties, however, the Is is not optional for these custom-made properties, and we must specify it in the property name for Perl to find our custom property.

We can also use named properties to describe what character codes we want to use. Any general, extended, script, or block property can be used. Additionally, we may negate a property to subtract one from another. This example uses named properties and subtraction to match the Cyrillic Script, but not any of the characters in the Cyrillic Supplement block:

sub IsScriptNotSupplement {
return <<_END_;
+utf8::IsCyrillic
-utf8::InCyrillicSupplement
_END_
}

To negate the overall sense of a property, we can of course use the \P{...} notation, but we can also negate the property itself by adding an exclamation mark to the front of the first specification:

sub IsNotCyrillicBlock {
return <<_END_;
!0400 0520
-1D2B
_END_
}

The first range defines all possible characters except those in the range U+0400 to U+0520. Following it, we can add or subtract further ranges, character codes, or properties as usual using a plus or minus prefix. Here, we subtract U+1D2B to complete the negated version of the property definition.

We can refer to our own property subroutines by prefixing them with main:: rather than utf8::, so we can reuse them in other custom properties, such as this example, which matches any character in the blocks that is not included in the script:

sub IsBlockNotScript {
return <<_END_;
+main::IsCyrillicBlock
-utf8::IsCyrillic
_END_
}

We can also define custom transformations for the lc, lcfirst, uc, and ucfirst functions by creating subroutines with the names ToUpper, ToLower, and ToTitle. For the default mappings, see the definitions in the To subdirectory of the unicore (or, on older releases, unicode) directory in the Perl library.

Locale

A locale is a collection of contextual data that describes the preferred language, character set, and formatting styles for decimal numbers, dates, monetary values, and other quantities for a given location. By querying the locale, applications can transparently customize their output to conform to local conventions, whatever they might be. This process is called localization, abbreviated to L10N, because there are 10 letters between the "l" and "n." Locales are an important part of internationalization, similarly abbreviated to I18N, the process of making applications generate grammatically and contextually correct messages for different languages and countries.

Locale is supported by most Unix-like platforms, and it consists of a number of operating system libraries and a collection of data files that provides locale information for the locales supported on the platform. Any operating system that provides locale functionality also provides a default locale called the C locale, or sometimes the equivalent POSIX locale (these are essentially the same, but they are defined by different standards). Other locales may or may not be available, depending on whether the data files have been installed or not. On Windows, locales are handled differently, but fortunately modules available from CPAN enable some locale information to be handled the same way irrespective of how the platform manages it.

The locale system is divided into categories, each of which can be configured through a different environment variable. In addition, some variables are used as defaults for others if they happen not to be defined, and there are one or two additional variables supported by particular platforms. There are six primary categories, and the variables that control them are listed in Table 23-3.

Table 23-3. Locale System Categories

Category Variable Description
Collation LC_COLLATE Sorting (collation) order.
Character type LC_CTYPE Character set and encoding.
Messages LC_MESSAGES Language for OS and external library messages.
Currency LC_MONETARY Symbols and formatting for currency values.
Number formats LC_NUMERIC Formatting of numeric values (decimal point, thousands separator).
Dates and times LC_TIME Local format of time and date strings.
Global override LC_ALL Set all the preceding categories, overriding any more specific variable.
Global default LANG Set all the preceding categories, unless a more specific category variable is defined.
Messages override LANGUAGE Override LC_ALL for category LC_MESSAGES (for the GNU C library only).

In addition to these, we may also set the category variables LC_PAPER, LC_NAME, LC_ADDRESS, LC_TELEPHONE, LC_MEASUREMENT, and LC_IDENTIFICATION. These provide specialized information about preferred paper sizes, and the formatting of names, addresses, and telephone numbers.

In each case, the value of the variable is a locale specification indicating a language, a country, an optional character encoding, and an optional dialect. The language code is taken from ISO 639, which defines standard short codes for languages, and the country code from ISO 3166, which does the same for countries. In order to make it easy to tell them apart, the language is always lowercased, and the country is always uppercased. Here are some examples:

  • en: English
  • en_GB: British English
  • fr_BE: Belgian French
  • fr_CH: Swiss French
  • en_GB.iso885915: British English, ISO 8859-15 encoding
  • ru_RU.iso88595: Russian, ISO 8859-5 encoding
  • fr_BE@euro: Belgian French (Euro dialect)

These specifications are increasingly specific, with looser locales applying defaults where needed. For instance, while Russian has more than one character set encoding, ISO 8859-5 is the default. To override this and ask for the KOI8-R encoding instead, we would use ru_RU.koi8r. All of these locale settings are dependent on the operating system's understanding of locale and the locales it has installed.

Possibly the most important aspect of locale is the preferred language and (8-bit) character encoding. In these modern times, these aspects can now be more easily handled by Unicode, but it is still useful for a Unicode application to be aware of the locale's preferred encoding. Perl provides two ways to do this: first, the open pragma allows the locale to be used to determine the default encoding for filehandles with the :locale argument, as covered earlier in the chapter; second, the -C option (introduced in Perl 5.8) can be given an L argument to have the other arguments (O, E, and so on) enabled only if the locale supports UTF-8.
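As a brief sketch of the first approach, inside a program we can write:

use open ':locale'; # default filehandle encoding taken from the locale

For the second, the L flag makes the S and D options conditional on the locale declaring a UTF-8 encoding (the script name here is hypothetical):

> perl -CSDL myscript.pl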

If all we are interested in doing is importing details of the locale's language and character set, then this is all the interaction we need with the locale system. If we want to make use of other aspects of the locale, such as the formatting of floating-point numbers, then we need to enable additional locale support.

Enabling Locales In Perl

To enable locale support in Perl, we merely need to invoke the locale pragma:

use locale;

This will tell Perl to use locale information, where possible and appropriate, in its built-in functions and modules. For this to work, however, not only must the platform support locales, but Perl must also have been built with that support enabled. This is a build-time configuration issue, and we can check for locale support by looking for d_setlocale, either through the Config module or via the command line:

> perl -V:d_setlocale
d_setlocale='define';

The value of this parameter will be UNKNOWN if locales are not supported and define if they are.
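The same check can be made from inside a program through the Config module:

use Config;
print "locale support built in\n" if $Config{d_setlocale};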

Perl will read the currently configured locale, as described by the environment variables that control locale settings, use it to read in the data files for that locale, and configure the interpreter accordingly. It will also make the locale details available within the program, so we can query the locale for information beyond that which Perl can apply automatically.

If no appropriate locale is configured, the default locale, C, or POSIX, is used.

Availability Of Locales

To find out which locales the operating system supports, we can use the locale utility (available on most Unix platforms) to list them with locale -a. We can similarly extract a list of available character encodings with locale -m. (The exact arguments may vary from one OS to another.) By way of example, here are a few lines of output from running locale with the -a option, with all the different locales of the French language:


fr_BE
fr_BE@euro
fr_BE.utf8
fr_CA
fr_CA.utf8
fr_CH
fr_CH.utf8
fr_FR
fr_FR@euro
fr_FR.utf8
fr_LU
fr_LU@euro
fr_LU.utf8

Notice here that each locale has an 8-bit and a UTF-8 variant, and for countries that have joined the Euro, there is a euro dialect available too, which affects the currency symbol selected by the LC_MONETARY category. It is easy to identify all of these as French from their lowercased language code of fr. Similarly, here are locales for the languages spoken in Switzerland, distinguished by the uppercased country code of CH:


de_CH
de_CH.utf8
fr_CH
fr_CH.utf8
it_CH
it_CH.utf8

We can also use the I18N::LangTags module to determine the legality of language tags (which are the language and country portion of the locale, but with a hyphen in place of an underscore). It can also derive alternate and fallback languages and convert locales to language tags:

#!/usr/bin/perl
# langtags.pl
use strict;
use warnings;

use I18N::LangTags qw(
    implicate_supers
    locale2language_tag
    is_language_tag
);

# check for valid syntax (does not look up)
print "Tag ok\n" if is_language_tag("fr-CH");

# extract the language tag from a locale - returns 'ru-RU'
print "Tag is ",locale2language_tag("ru_RU.koi8r"),"\n";

# expand a list of tags to include generic supersets
my @expanded=implicate_supers('fr-CH','de-CH');
print "Expanded tags: @expanded\n";

This last example would print the following:


Expanded tags: fr-CH fr de-CH de

Windows does not support Unix-style locale, but information can be derived from other sources to generate some of the same data. In particular, the Win32::Locale module available from CPAN provides various functions to extract locale information from the operating system, even though Windows handles many aspects of locale very differently from Unix and other platforms:

print "Windows 'locale' is equivalent to ",
    Win32::Locale::get_locale()," ";

While it is directly usable, Win32::Locale is more useful to have installed so that other locale-oriented modules may make use of it if they detect it is present.

Another useful module is I18N::LangTags::Detect, which attempts to infer the user's preferred language from not just locale variables, but also other possible sources such as CGI environment variables like HTTP_ACCEPT_LANGUAGE. It will also use Win32::Locale on Windows systems (if it is installed). Here is one way to use it to derive a list of acceptable languages:

my @ok_langs = I18N::LangTags::implicate_supers(
    I18N::LangTags::Detect::detect()
);

Also provided in the I18N::LangTags:: family is I18N::LangTags::List. This provides a mapping from each language tag to an English name, which can then be further used with modules like Locale::Language, described later in the chapter.

How Locales Affect Perl

Assuming that Perl is able to parse the configured locale and load the locale data, programs that invoke the locale pragma become transparently aware of the locale.

  • LC_COLLATE controls the behavior of the sort function without a sort subroutine and the le, lt, ge, gt, and cmp operators.
  • LC_CTYPE controls the characters that are considered letters and therefore the characters matched by the regular expression metacharacter \w. It may also affect the operation of lc, lcfirst, uc, and ucfirst.
  • LC_TIME has no effect in Perl, but it does affect the POSIX strftime function.
  • LC_MESSAGES and LANGUAGE have no effect in Perl, but they may affect the text of error messages returned into $! by C libraries called via extension modules.
  • LC_NUMERIC controls the decimal point character used by print, printf, sprintf, and write (see the one-liner example after this list).

    Additionally, and irrespective of whether the locale pragma has been used or not:

  • Report formats, as invoked through the write function, always use the current locale (see LC_NUMERIC earlier).
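As a quick demonstration of the LC_NUMERIC effect noted in the list, this one-liner should print 3,14 rather than 3.14, assuming a German locale such as de_DE is installed on the system:

> LC_NUMERIC=de_DE perl -Mlocale -e 'printf "%.2f\n", 3.14'
3,14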

If Perl sees locale variables in the environment but cannot deduce a correct locale from them, it will generate a warning message and fall back to the default locale, C or POSIX. To disable this warning, we can set the PERL_BADLANG environment variable to a false value (an empty string or zero):

> PERL_BADLANG=0 perl mylocaleapp.pl

However, a bad locale setting may indicate a more serious problem that should really be fixed rather than worked around. In particular, working around a bad locale in Perl will not help C library functions that access locale information.

Locale And Tainting

When locale support is enabled in Perl through the locale pragma, all functions and operators that are influenced by locale, and which return strings as their result, mark those strings as tainted. This is because a badly (or maliciously) configured locale can cause bizarre behavior, for example, by modifying the date and time formats used by the POSIX strftime function to use literal numbers where the values should be.

The tainting effects of locale include interpolated strings that use the case-mapping metacharacters \l, \L, \u, and \U, and their functional equivalents lc, lcfirst, uc, and ucfirst. It also taints extracted text from matches on patterns using the \w regular expression metacharacter—including special variables like $& as well as parentheses—and the original string after modification by a substitution using \w. This can be important, since extracting text from a regular expression is the primary means for untainting tainted data. To successfully untaint (assuming of course that this is safe to do), use a pattern that does not involve \w.
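Here is a minimal sketch of such an untainting match, using an explicit character class in place of \w so that the extraction does not depend on the locale (the variable names are invented for this example):

# $input is tainted; extract only characters we explicitly trust
if ($input =~ /^([A-Za-z0-9_.-]+)$/) {
    my $safe = $1; # untainted, without relying on a locale-dependent class
}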

Getting And Setting The Locale

Rather than have Perl read locale configuration information from the environment, we can request a specific locale with the setlocale function. This is a standard library function provided by the operating system and accessed from Perl through the POSIX module. Here is how we can use it to switch four locale categories to different locales:

#!/usr/bin/perl
use strict;
use warnings;
use locale;
use POSIX 'locale_h';
setlocale(LC_COLLATE, 'fr_CH');
setlocale(LC_NUMERIC, 'en_US');
setlocale(LC_MONETARY, 'fr_CH');
setlocale(LC_TIME, 'fr_FR');

Slightly counterintuitively, the setlocale function is also used to find out what the locale is currently set to for a given category. To do this, we provide it with only one argument. For example, to find out the current LC_MONETARY locale:

print "The monetary locale is ",setlocale(LC_MONETARY), " ";

Although it looks similar, this is not the same as looking up the corresponding environment variable. First, this returns the derived value after all of the locale environment variables have been taken into account. Second, if we feed LC_ALL to this function, we get back a long semicolon-delimited list of all locale settings:


LC_CTYPE=en_GB;LC_NUMERIC=C;LC_TIME=en_GB;LC_COLLATE=C;
LC_MONETARY=en_GB;LC_MESSAGES=en_GB;LC_PAPER=en_GB;LC_NAME=en_GB;
LC_ADDRESS=en_GB;LC_TELEPHONE=en_GB;LC_MEASUREMENT=en_GB;
LC_IDENTIFICATION=en_GB

Since this is a report of the actual locale settings, it does not reflect how they were set, so we do not see LC_ALL here, nor LANG or LANGUAGE, even if they were used to establish the locale.

Querying The Current Locale

Perl can only deduce appropriate behavior for a small subset of locale-dependent situations, and so there are many more locale settings than Perl is able to make use of automatically. For instance, Perl cannot know when we ask to print a dollar sign whether we are referring to currency or some other use of the symbol.

To find the correct currency symbol for the current locale, we need to query the locale ourselves with the localeconv function provided by the operating system, by way of the POSIX module. The localeconv function returns a reference to a hash of key-value pairs, of which currency_symbol is the one of interest to us. Here is a command that sets the locale in the environment and then asks localeconv for the currency symbol, via Perl:

> LC_ALL=fr_FR perl -MPOSIX=localeconv -e "print localeconv()->{currency_symbol}"

The symbol may not actually turn out to be a symbol, of course. For example, the preceding command will print EUR since the fr_FR locale does not know about the Euro symbol. If we use fr_FR@euro instead, the locale switches to a character encoding that includes the Euro symbol and will duly print the character corresponding to the Euro character code instead (whether we actually get a Euro or not depends on the encoding supported by the terminal, of course).

localeconv, as its name suggests, returns conversion information defined by the LC_MONETARY and LC_NUMERIC categories. Since it queries the locale directly, it does not rely on Perl understanding the locale implicitly, and so it does not need the locale pragma to be invoked in order to work. This short program prints out all the settings that localeconv can access:

#!/usr/bin/perl
# localeconv.pl
use strict;
use warnings;
use POSIX qw(localeconv);

my $conv = localeconv();
for my $var (sort keys %$conv) {
    printf "%17s => %s ", $var, $conv->{$var};
}

This is what this program prints out under the Swiss French locale:

> LC_ALL=fr_CH ./localeconv.pl


  currency_symbol => Fr.
    decimal_point => .
      frac_digits => 2
  int_curr_symbol => CHF
  int_frac_digits => 2
mon_decimal_point => .
     mon_grouping =>
mon_thousands_sep =>
    n_cs_precedes => 1
   n_sep_by_space => 1
      n_sign_posn => 4
    negative_sign => -
    p_cs_precedes => 1
   p_sep_by_space => 1
      p_sign_posn => 4

We can also use the I18N::Langinfo module to query the locale for day and month names, the local words for "yes" and "no," and the preferred format for date and time strings. I18N::Langinfo accesses the parts of the locale controlled by LC_TIME and LC_MESSAGES and provides one function, langinfo, plus a collection of symbol names for values that we can query. The available values are system dependent, but they usually include DAY_1 to DAY_7 and MON_1 to MON_12 for day and month names, plus their ABDAY_ and ABMON_ abbreviated counterparts. The full list can usually be found in the /usr/include/langinfo.h header.

By way of an example, this short program prints out all the day and month names, plus the local words for, and the regular expression to match valid input for, "Yes" and "No":

#!/usr/bin/perl
# langinfo.pl
use strict;
use warnings;
use I18N::Langinfo qw(
    /DAY/ /MON/
    YESSTR NOSTR YESEXPR NOEXPR CODESET
    langinfo
);

print "Code set = ",langinfo(CODESET)," ";
print "Yes='",langinfo(YESSTR),"' regex='",langinfo(YESEXPR),"' ";
print " No='",langinfo(NOSTR) ,"' regex='",langinfo(NOEXPR),"' ";

no strict 'refs';
foreach my $day (1..7) {
    print "Day $day is ", langinfo(&{'DAY_'.$day}),
      " (",langinfo(&{'ABDAY_'.$day}),")\n";
}
foreach my $month (1..12) {
    print "Month $month is ", langinfo(&{'MON_'.$month}),
      " (",langinfo(&{'ABMON_'.$month}),")\n";
}

Given a locale like fr_FR@euro, this program will print out the following:

> LC_ALL=fr_FR@euro ./langinfo.pl


Code set = ISO-8859-15
Yes='Oui' regex='^[oOyY].*'
 No='Non' regex='^[nN].*'
Day 1 is dimanche (dim)
Day 2 is lundi (lun)
Day 3 is mardi (mar)
Day 4 is mercredi (mer)
Day 5 is jeudi (jeu)
Day 6 is vendredi (ven)
Day 7 is samedi (sam)
Month 1 is janvier (jan)
Month 2 is février (fév)
Month 3 is mars (mar)
Month 4 is avril (avr)
Month 5 is mai (mai)
Month 6 is juin (jun)
Month 7 is juillet (jui)
Month 8 is août (aoû)
Month 9 is septembre (sep)
Month 10 is octobre (oct)
Month 11 is novembre (nov)
Month 12 is décembre (déc)

Notice that because we asked for the euro dialect of fr_FR, the character set switched to ISO 8859-15, which includes the Euro symbol. Without the dialect, we would have gotten ISO 8859-1. Like localeconv, I18N::Langinfo is independent of Perl's built-in locale support, and we do not need to use the locale pragma in order to make use of it.

Perl's standard library also supplies a number of modules in the Locale:: family. The most important of these, Locale::Maketext, is covered later in the section "Internationalization." The others all provide a map of English names to standard abbreviations for countries, languages, currencies, and character scripts, each defined by a different ISO standard, along with importable subroutines to access the data in each table. Here is a summary of the available modules:

Module                 Standard     Functions
Locale::Country        ISO 3166-1   code2country, country2code, country_code2code, all_country_codes, all_country_names
Locale::Currency       ISO 4217     code2currency, currency2code, all_currency_codes, all_currency_names
Locale::Language       ISO 639      code2language, language2code, all_language_codes, all_language_names
Locale::Script         ISO 15924    code2script, script2code, script_code2code, all_script_codes, all_script_names
I18N::LangTags::List   RFC 3066     name

For example, we can list out all the currency codes and their (English) names, as defined by ISO 4217, with this short script:

#!/usr/bin/perl
# currencies.pl
use strict;
use warnings;
use Locale::Currency qw(all_currency_codes code2currency);

my @codes=all_currency_codes();
print map {
    "'$_' is the ".code2currency($_)." "
} @codes;

This produces output starting with


'adp' is the Andorran Peseta
'aed' is the UAE Dirham
'afa' is the Afghani
'all' is the Lek
'amd' is the Armenian Dram

Additionally, Locale::Constants provides constants for use in the Locale::Country and Locale::Script modules, which know how to convert between the 2-letter, 3-letter, and numeric conventions for country and script codes:

#!/usr/bin/perl
# scripts.pl
use strict;
use warnings;
use Locale::Script qw(
    all_script_names script2code
    LOCALE_CODE_ALPHA_2 LOCALE_CODE_ALPHA_3 LOCALE_CODE_NUMERIC
);

my @scripts = all_script_names();
foreach my $script (sort @scripts) {
    printf "%-37s: 2-ltr='%s' 3-ltr='%s' numeric='%d' ", $script,
      script2code($script => LOCALE_CODE_ALPHA_2),
      script2code($script => LOCALE_CODE_ALPHA_3),
      script2code($script => LOCALE_CODE_NUMERIC);
};

This program generates a listing with lines like these:

Cypro-Minoan                         : 2-ltr='cm' 3-ltr='cmn' numeric='402'
Cyrillic                             : 2-ltr='cy' 3-ltr='cyr' numeric='220'
Deseret (Mormon)                     : 2-ltr='ds' 3-ltr='dsr' numeric='250'
Devanagari (Nagari)                  : 2-ltr='dv' 3-ltr='dvn' numeric='315'
Egyptian demotic                     : 2-ltr='ed' 3-ltr='egd' numeric='70'

Also in the preceding table is I18N::LangTags::List, which provides a single routine, name, that provides the English language name for a language tag. It provides very similar results to Locale::Language and also draws on definitions originating outside the ISO 639 standard. The full list of language codes it recognizes is defined in the manual page, along with the standard from which they originate. The returned language names can in general then be converted to a different form using the language2code subroutine of Locale::Language.
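Here is a minimal sketch of chaining the two modules together; the values in the comments assume that the tag 'fr' resolves to 'French':

use I18N::LangTags::List;
use Locale::Language qw(language2code);

my $name = I18N::LangTags::List::name('fr'); # presumably 'French'
my $code = language2code($name);             # presumably 'fr'
print "$name => $code\n";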

Internationalization

Locales, with or without the help of Unicode, are an essential first step towards making an application or module aware of the language and country of its user, but they cannot provide more than basic help with the problem of communicating with a user in their native language. Issues of grammar, sentence structure, and basic divergences of language use mean that it is rarely possible to simply translate a message into the desired target languages and then print it out according to the locale. The process of fully educating an application to provide grammatically correct and context-sensitive messages is known as internationalization, or I18N for short, because there are 18 letters between the first "i" and the last "n."

The Locale::Maketext module provides a better solution to the problem of internationalization. It gives us the ability to create catalogs of messages, also called lexicons, one for each language that we intend to support. A message can be a literal text string, a text string with substitutions, or a subroutine that can implement more advanced behavior in those situations that really demand it. It is aware of locales, can recognize and configure itself from environment variables like HTTP_ACCEPT_LANGUAGE, and will make use of the Win32::Locale module to derive locale information on Windows platforms.

Configuring For Internationalization

The first step to setting up a language lexicon is to define a subclass of Locale::Maketext. This should generally be within the namespace of the module or module family to be internationalized, so if the core module of an application is called My::Application, we would choose a name like My::Application::Lexicon or My::Application::I18N. The basic contents of this module are minimal:

package My::Application::I18N;
use base qw(Locale::Maketext);

1;

The real work is in the lexicon modules. We create one for each locale, inheriting from the preceding module, and named for the lowercased locale. Within each lexicon, message keys and values are provided in a hash variable called %Lexicon. For example, the US English lexicon might look like this:

package My::Application::I18N::en_us;
use base qw(My::Application::I18N);

%Lexicon=(
    Welcome => "Welcome!";
    ...
);

1;

If we don't want to define lexicons for multiple English locales, we can use just the language rather than the language plus country, in this case en rather than en_us. If Locale::Maketext cannot find a module for the precise locale, it will look for generic language lexicons as a substitute. This means we can define a general-purpose English lexicon and then define specific lexicons for particular dialects as needed.

Lexicon modules can also inherit from other lexicons through the usual @ISA mechanism. If Locale::Maketext sees that a lexicon has parent classes, they are also searched for %Lexicon hash variables if the child class does not supply a requested message. The parents' ancestors are searched in turn, until a message is found or all possible locations for the message turn up empty. So we can, for example, define an en_gb lexicon with British customizations of a general-purpose English lexicon written in en_us. This also makes it very simple to create catalogs of shared messages for use in many different applications.
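For example, a British English lexicon that inherits from a hypothetical general-purpose ::en lexicon and overrides a single invented message key might look like this:

package My::Application::I18N::en_gb;
use base qw(My::Application::I18N::en); # assumes a general ::en lexicon exists

%Lexicon=(
    # British-specific overrides only; everything else is found in the parent
    colour_prompt => "Choose a colour",
);

1;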

If the default fallback behavior is not enough, we can define our own list of fallback languages by passing a list of language tags and/or languages to the fallback_languages class method, which we can invoke from our master lexicon class. For example, to fall back to English from any locale, including non-English ones:

My::Application::I18N->fallback_languages("en-US","en");

Notice that the arguments are language tags, identical to the short form of a locale except that a hyphen (minus sign) is used rather than an underscore. (As the use of language tags implies, the fallback functionality is courtesy of the I18N::LangTags module, covered earlier in the chapter.)

We can also, if we prefer, instantiate a lexicon object and pass it to fallback_language_classes:

my $fallback_en=My::Application::I18N::en->new();
My::Application::I18N->fallback_language_classes($fallback_en);

This has an advantage over fallback_languages: once the lexicon object has been instantiated, it is guaranteed to be usable in the event that no other lexicon can be found, since we have already created it.

Using Internationalization

To use the newly defined lexicons in a module or application, we create a lexicon object with the get_handle method. With an argument like en or fr, this will generate a lexicon for the requested language. But it is more interesting when no argument is supplied—in this case, the locale is inspected and variables like HTTP_ACCEPT_LANGUAGE are consulted to infer the correct language from context. For example:

use My::Application::I18N;
my $lexicon=My::Application::I18N->get_handle();
die "No language support available! "
  unless $lexicon;

Once a lexicon object has been found and instantiated, we can extract messages from it through their key with the maketext method. This is the core method of Locale::Maketext, because it is the mechanism by which the lexicons we define are invoked to return locale-dependent messages. For example, to extract the simple welcome message earlier, we could use

print $lexicon->maketext("Welcome")," ";

As the language lexicon represented by $lexicon is determined by the locale, this will print Welcome in any English locale, assuming we defined a lexicon for en rather than en_us; otherwise, the lookup will work for US English but fail for other variants like British or Canadian English. Similarly, we can define a de lexicon to return Willkommen in German locales. Any languages not directly supported are handled through the fallback mechanism described previously.

Writing Messages

The bulk of the L10N work in internationalization is, of course, writing the actual messages. For simple messages with no variable components, a simple string will do. But more often, we need to generate a message that varies according to context, with placeholders standing in for values to be filled in when the message is generated. For example, take this message returned by a theoretical search engine:

Found 16 results for 'Perl'

There are two places in this message where a value needs to be filled in. To mark up places in a message that need to be expanded when the message is generated, Locale::Maketext offers bracket notation, where variable text is marked by square brackets. The simplest use of bracket notation is to provide numbered placeholders for arguments passed to the maketext method. When the message is retrieved from the lexicon, the values are placed into the message string before it is returned. For example, this would be the English language value for the preceding message:

Found [_1] results for '[_2]'

The key that triggers this message can be anything we like, as it serves only to retrieve the message value. For example, if the key we used was found_results, the en lexicon would have a key-value pair like this:

%Lexicon = (
    found_results => "Found [_1] results for '[_2]'",
    ...
);

And it would be retrieved in the program with the following:

print $lexicon->maketext(found_results => 16, 'Perl');

We would then complete the job by adding all the appropriate translations to the other language lexicons we intend to support. An alternative choice of key is to use the message value for the "primary" language (most likely the language of the developer writing the program):

print $lexicon->maketext("Found [_1] results for '[_2]' => 16, 'Perl'),

This makes the code more legible, since we can see the bracket notation in the key and see, at least for one language, what the result will look like without having to run the program to find out. It also comes in handy if we set up a lexicon to autogenerate missing message values from the key, as described later.

Parameter placeholders work like array indices, with negative numbers counting from the end, so '[_-1]' would expand into the last argument passed. '[_*]' is a special case—it expands into the whole list. If we actually want to use a literal left square bracket, we can escape it with a tilde (not a backslash): ~[. To get a literal tilde, double it up: ~~.
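The following invented lexicon entries illustrate these forms:

%Lexicon = (
    last_arg => "The last argument passed was [_-1]",
    all_args => "All of the arguments were: [_*]",
    literal  => "A literal bracket ~[ and a literal tilde ~~",
    ...
);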

The more interesting use of bracket notation is to invoke a method, which we can do by simply naming it after the opening bracket. Any parameters we want to pass to it come after it, separated with commas, before the closing bracket. The method can be defined anywhere in the lexicon class or a superclass, depending on whether we need it in only one lexicon or in several. In our example, My::Application::I18N::en and My::Application::I18N are both valid places for a method to be defined.

A common requirement for special handling is correctly pluralizing a noun, and the Locale::Maketext module (which lexicons all inherit from) provides a general-purpose method quant to handle this problem, at least for most Western languages. Here is how we can use it in bracket notation:

Found [quant,_1,result] for '[_2]'

Found [quant,_1,result,results] for '[_2]'

The first version selects between the singular and plural of result based on the value of the first argument passed (16 in our earlier example). The second version simply appends an "s" if the quantity is plural, which happens to be correct for "result" but, of course, would not work for "sheep" or "bunny." In all cases, whitespace is significant, so do not be tempted to use spaces around the commas.

We might also want to handle the special case of zero matches, which we can do with a fourth argument:

Found [quant,_1,result,results,no results] for '[_2]'

The quant method is used sufficiently frequently that it can also be abbreviated to an asterisk in bracket notation:

Found [*,_1,result,results,no results] for '[_2]'

Locale::Maketext supplies a small handful of other useful methods too:

encoding The character encoding of the current locale, for example: $lexicon->encoding().
language_tag The language tag of the current locale, for example: $lexicon->language_tag().
numf Format the supplied number according to the local conventions for the decimal point and thousands separator, for example: [numf,12345.678].
sprintf A direct interface to Perl's sprintf function, to make use of sprintf-style format strings, for example: [sprintf,"%20s : %-20s",_1,_2].

All of these methods can be called in code or via bracket notation, but as the example suggests, numf and sprintf are more likely candidates for the latter. Like quant, numf also has a shorthand name, #, so the example in the table could also be written as

[#,12345.678]

If all else fails, we can assign a code reference to a method to be called directly by the maketext method. For example, for a lexicon entry that returns a correctly constructed "results" phrase, we could use this:

"[_1] results" => &count_results

This is just a more direct (and more efficient) way of saying

"[_1] results" => "[count_results,_*]"

Here is one way this method might be implemented. It just calls quant to return the correct phrase for a number of results, and illustrates that quant and other methods like it can be just as easily called in code as through bracket notation:

sub count_results {
    my ($lexicon,$count)=@_;
    return $lexicon->quant($count,"result","results","no results");
}

Intriguingly, there is nothing to stop us calling maketext itself from one of these subroutines, so we can devise very intricate mechanisms for handling even the trickiest scenarios.

Handling Failed Lookups

In the event that a message is missing in the lexicon of a particular language, Locale::Maketext will ordinarily die with a fatal error. However, if a lexicon contains the special key fail, which can be set explicitly or via the fail_with method, the named method is invoked as an error handler. Depending on our needs, we can define this method in the language lexicon module itself or place it in the master module to be invoked from all lexicons. For example, to install the handler at runtime, we can use

$lexicon->fail_with("lexicon_failure");

Or in the master module:

sub lexicon_failure {
    my ($lexicon, $key, @params)=@_;
    return $fallback_en->maketext($key => @params);
}

This particular error handler calls the English fallback lexicon we created earlier, and provides an alternative way to manage fallback situations to the fallback_languages and fallback_language_classes methods. In a real error handler, we would also likely log the failed lookup so we can take preventative action in a future release.

A convenient error handler called failure_handler_auto is actually provided by Locale::Maketext itself:

$lexicon->fail_with("failure_handler_auto");

This handler will try to make sense of the requested key, compile it into a value to evaluate any bracket notation if possible, or just return the key as a last resort. Every failed lookup is recorded in a hash reference assigned to the key failure_lex in the lexicon in which the lookup failed. This can be dumped out or otherwise recorded, for example, in an END block, to get a log of all failed lookups during the lifetime of the application.
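For instance, here is a sketch of dumping that record at program exit, assuming the lexicon handle is held in a variable that is still in scope inside the END block:

use Data::Dumper;

END {
    my $failed = $lexicon && $lexicon->{failure_lex};
    print STDERR "Messages missing from the lexicon:\n", Dumper($failed) if $failed;
}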

We can also have Locale::Maketext automatically handle missing messages by defining them with an automatic lexicon.

Automatic Lexicons

If the %Lexicon hash contains the special entry _AUTO => 1, then missing messages in the lexicon are invented by creating a new entry with the lookup key as the value:

%Lexicon = (
    _AUTO => 1,
    ...
);

While this is no substitute for actually defining the message in each supported language, it does allow for a fallback message rather than generating an error. However, it bypasses the regular fallback mechanism, and it does not even consult lexicons in superclasses. It is most useful for application development, as it frees us from the need to define the lexicon as we develop the application—we only need to make calls to the maketext method as usual, and the application will use the keys we supply as the message values for all languages, until such time as we pass the lexicon to translators to fill in the values for the other languages we intend to support.

Summary

In this chapter, we looked at Perl's support for character encodings and Unicode in particular. We saw how Perl implements wide characters, how to handle strings and filehandles that are wide-character enabled, and how to convert between Unicode and other native character encodings. We also looked at Unicode character names and querying the Unicode database for details of blocks, scripts, and character properties.

We then turned to the subject of localization, or L10N, and locales, collections of related contextual information that describe the preferred character set of different languages and locations; the preferred local format of dates, times, and numbers; the name of the local currency; names of days and months; and even the local words for "yes" and "no." Locales are well supported in Perl on Unix systems. For Windows, the Win32::Locale module can be installed to provide much of the same information.

Finally, we took a brief look at internationalization, or I18N, the process of writing applications so that they are fully aware of their locale and can communicate with the user in their native language. This is trickier than it might at first seem, since different languages have very different ideas about sentence structure, whether zero is singular or plural (or neither), and a myriad of other concerns. In Perl, we can use Locale::Maketext, a flexible module that enables us to define catalogs or lexicons of messages and, where necessary, message-returning subroutines to generate context-sensitive responses in multiple languages.
