If you do not yet know what Unicode is, you will soon--even if you skip reading this chapter--because working with Unicode is becoming a necessity. (Some people think of it as a necessary evil, but it's really more of a necessary good. In either case, it's a necessary pain.)
Historically, people made up character sets to reflect what they needed to do in the context of their own culture. Since people of all cultures are naturally lazy, they've tended to include only the symbols they needed, excluding the ones they didn't need. That worked fine as long as we were only communicating with other people of our own culture, but now that we're starting to use the Internet for cross-cultural communication, we're running into problems with the exclusive approach. It's hard enough to figure out how to type accented characters on an American keyboard. How in the world (literally) can one write a multilingual web page?
Unicode is the answer, or at least part of the answer (see also XML). Unicode is an inclusive rather than an exclusive character set. While people can and do haggle over the various details of Unicode (and there are plenty of details to haggle over), the overall intent is to make everyone sufficiently happy[1] with Unicode so that they'll willingly use Unicode as the international medium of exchange for textual data. Nobody is forcing you to use Unicode, just as nobody is forcing you to read this chapter (we hope). People will always be allowed to use their old exclusive character sets within their own culture. But in that case (as we say), portability suffers.
The Law of Conservation of Suffering says that if we reduce the suffering in one place, suffering must increase elsewhere. In the case of Unicode, we must suffer the migration from byte semantics to character semantics. Since, through an accident of history, Perl was invented by an American, Perl has historically confused the notions of bytes and characters. In migrating to Unicode, Perl must somehow unconfuse them.
Paradoxically, by getting Perl itself to unconfuse bytes and characters, we can allow the Perl programmer to confuse them, relying on Perl to keep them straight, just as we allow programmers to confuse numbers and strings and rely on Perl to convert back and forth as necessary. To the extent possible, Perl's approach to Unicode is the same as its approach to everything else: Just Do The Right Thing. Ideally, we'd like to achieve these four Goals:
Goal #1: Old byte-oriented programs should not spontaneously break on the old byte-oriented data they used to work on.
Goal #2: Old byte-oriented programs should magically start working on the new character-oriented data when appropriate.
Goal #3: Programs should run just as fast in the new character-oriented mode as in the old byte-oriented mode.
Goal #4: Perl should remain one language, rather than forking into a byte-oriented Perl and a character-oriented Perl.
Taken together, these Goals are practically impossible to reach. But we've come remarkably close. Or rather, we're still in the process of coming remarkably close, since this is a work in progress. As Unicode continues to evolve, so will Perl. But our overarching plan is to provide a safe migration path that gets us where we want to go with minimal casualties along the way. How we do that is the subject of the next section.
In releases of Perl prior to 5.6, all strings were viewed as sequences of bytes.[2] In versions 5.6 and later, however, a string may contain characters wider than a byte. We now view strings not as sequences of bytes, but as sequences of numbers in the range 0 .. 2**32-1 (or in the case of 64-bit computers, 0 .. 2**64-1). These numbers represent abstract characters, and the larger the number, the "wider" the character, in some sense; but unlike many languages, Perl is not tied to any particular width of character representation. Perl uses a variable-length encoding (based on UTF-8), so these abstract character numbers may, or may not, be packed one number per byte. Obviously, character number 18,446,744,073,709,551,615 (that is, "\x{ffff_ffff_ffff_ffff}") is never going to fit into a byte (in fact, it takes 13 bytes), but if all the characters in your string are in the range 0..127 decimal, then they are certainly packed one per byte, since UTF-8 is the same as ASCII in the lowest seven bits.
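As a minimal sketch of the distinction between character numbers and bytes, the following compares character counts with byte counts. It assumes a Perl recent enough to ship the bytes module with bytes::length; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use bytes ();    # loads bytes::length without turning on byte semantics

my $ascii = "Perl";        # every character number is below 128
my $wide  = "\x{263A}";    # a single character, number 0x263A

my $ascii_chars = length $ascii;            # counts characters
my $ascii_bytes = bytes::length($ascii);    # counts bytes of storage
my $wide_chars  = length $wide;
my $wide_bytes  = bytes::length($wide);
my $wide_number = ord $wide;                # the abstract character number

printf "ascii: %d chars in %d bytes\n", $ascii_chars, $ascii_bytes;
printf "wide:  %d char  in %d bytes (number %d)\n",
    $wide_chars, $wide_bytes, $wide_number;
```

The ASCII string occupies one byte per character, while the single wide character occupies three bytes yet still counts as one character.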
Perl uses UTF-8 only when it thinks it is beneficial, so if all the characters in your string are in the range 0..255, there's a good chance the characters are all packed in bytes--but in the absence of other knowledge, you can't be sure because internally Perl converts between fixed 8-bit characters and variable-length UTF-8 characters as necessary. The point is, you shouldn't have to worry about it most of the time, because the character semantics are preserved at an abstract level regardless of representation.
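To illustrate that the abstract characters stay the same whichever representation is in use, here is a small sketch. It relies on utf8::upgrade and utf8::is_utf8, which are an assumption on our part: they belong to Perl releases later than the 5.6 described here, and they exist mainly for peeking at internals.

```perl
use strict;
use warnings;
use bytes ();    # for bytes::length only; byte semantics stay off

my $packed   = "caf\xE9";    # all characters in 0..255
my $upgraded = $packed;      # same characters...
utf8::upgrade($upgraded);    # ...forced into the variable-length form

my $same_text      = ($packed eq $upgraded)                 ? 1 : 0;
my $same_length    = (length($packed) == length($upgraded)) ? 1 : 0;
my $packed_bytes   = bytes::length($packed);      # one byte per character
my $upgraded_bytes = bytes::length($upgraded);    # the last character needs two
my $packed_flag    = utf8::is_utf8($packed)   ? 1 : 0;
my $upgraded_flag  = utf8::is_utf8($upgraded) ? 1 : 0;

print "equal=$same_text same_length=$same_length ",
      "bytes=$packed_bytes/$upgraded_bytes flags=$packed_flag/$upgraded_flag\n";
```

The two strings compare equal and have the same length in characters; only the storage differs.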
In any event, if your string contains any character numbers larger than 255 decimal, the string is certainly stored in UTF-8. More accurately, it is stored in Perl's extended version of UTF-8, which we call utf8, in honor of a pragma by that name, but mostly because it's easier to type. (And because "real" UTF-8 is only allowed to contain character numbers blessed by the Unicode Consortium. Perl's utf8 is allowed to contain any character numbers you need to get your job done. Perl doesn't give a rip whether your character numbers are officially correct or just correct.)
We said you shouldn't worry about it most of the time, but people like to worry anyway. Suppose you use a v-string to represent an IPv4 address:
    $locaddr = v127.0.0.1;      # Certainly stored as bytes.
    $oreilly = v204.148.40.9;   # Might be stored as bytes or utf8.
    $badaddr = v2004.148.40.9;  # Certainly stored as utf8.
Everyone can figure out that $badaddr will not work as an IP address. So it's easy to think that if O'Reilly's network address gets forced into a UTF-8 representation, it will no longer work. But the characters in the string are abstract numbers, not bytes. Anything that uses an IPv4 address, such as the gethostbyaddr function, should automatically coerce the abstract character numbers back into a byte representation (and fail on $badaddr).
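The coercion described above can be sketched by hand. The addr_as_bytes helper below is hypothetical, not something Perl provides, and it leans on utf8::upgrade and utf8::downgrade, which come from Perl releases later than 5.6. It only imitates what a byte-oriented interface might do with its argument.

```perl
use strict;
use warnings;

# Hypothetical helper: return the address in byte form, or undef if
# some character number is too large to fit in a byte.
sub addr_as_bytes {
    my ($addr) = @_;    # a copy, so the caller's string is untouched
    return utf8::downgrade($addr, 1) ? $addr : undef;
}

my $oreilly = v204.148.40.9;
my $badaddr = v2004.148.40.9;
utf8::upgrade($oreilly);    # pretend it was forced into utf8 somewhere

my $good   = addr_as_bytes($oreilly);
my $bad    = addr_as_bytes($badaddr);
my $dotted = sprintf '%vd', $good;    # character numbers joined by dots

print "good address: $dotted\n";
print "bad address rejected\n" unless defined $bad;
```

The upgraded address comes back as the same four characters in byte form, while the address containing character number 2004 is refused.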
The interfaces between Perl and the real world have to deal with the details of the representation. To the extent possible, existing interfaces try to do the right thing without your having to tell them what to do. But you do occasionally have to give instructions to some interfaces (such as the open function), and if you write your own interface to the real world, it will need to be either smart enough to figure things out for itself or at least smart enough to follow instructions when you want it to behave differently than it would by default.[3]
Since Perl worries about maintaining transparent character semantics within the language itself, the only place you need to worry about byte versus character semantics is in your interfaces. By default, all your old Perl interfaces to the outside world are byte-oriented, so they produce and consume byte-oriented data. That is to say, on the abstract level, all your strings are sequences of numbers in the range 0..255, so if nothing in the program forces them into utf8 representations, your old program continues to work on byte-oriented data just as it did before. So put a check mark by Goal #1 above.
If you want your old program to work on new character-oriented data, you must mark your character-oriented interfaces such that Perl knows to expect character-oriented data from those interfaces. Once you've done this, Perl should automatically do any conversions necessary to preserve the character abstraction. The only difference is that you've introduced some strings into your program that are marked as potentially containing characters higher than 255, so if you perform an operation between a byte string and utf8 string, Perl will internally coerce the byte string into a utf8 string before performing the operation. Typically, utf8 strings are coerced back to byte strings only when you send them to a byte interface, at which point, if the string contains characters larger than 255, you have a problem that can be handled in various ways depending on the interface in question. So you can put a check mark by Goal #2.
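Here is a rough sketch of both halves of that paragraph: an operation that mixes a byte string with a utf8 string, and an interface marked as character-oriented. The ":encoding(UTF-8)" layer and in-memory filehandles are assumptions drawn from Perl releases later than 5.6, so treat this as one possible way of marking an interface rather than the mechanism described here.

```perl
use strict;
use warnings;

my $byte_str = "caf\xE9";     # all characters in 0..255
my $wide_str = " \x{263A}";   # contains a character above 255

# Mixing the two: the byte string is coerced, the characters are unchanged.
my $mixed       = $byte_str . $wide_str;
my $mixed_chars = length $mixed;
my $fourth_char = ord substr($mixed, 3, 1);

# An output interface marked as character-oriented (in-memory file here).
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open for write: $!";
print {$out} $mixed;
close $out or die "close: $!";
my $encoded_bytes = length $buffer;    # $buffer holds raw bytes

# An input interface marked the same way gives the characters back.
open my $in, '<:encoding(UTF-8)', \$buffer or die "open for read: $!";
my $round_trip = do { local $/; <$in> };
close $in;

printf "%d characters written as %d bytes\n", $mixed_chars, $encoded_bytes;
```

Had the handle been left byte-oriented, printing the wide character would have been the "problem" case mentioned above; Perl typically reports it with a "Wide character" warning.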
Sometimes you want to mix code that understands character semantics with code that has to run with byte semantics, such as I/O code that reads or writes fixed-size blocks. In this case, you may put a use bytes declaration around the byte-oriented code to force it to use byte semantics even on strings marked as utf8 strings. You are then responsible for any necessary conversions. But it's a way of enforcing a stricter local reading of Goal #1, at the expense of a looser global reading of Goal #2.
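A small sketch of the lexical scope of use bytes, with made-up variable names: inside the block, length counts bytes of storage; outside it, the same string is measured in characters again.

```perl
use strict;
use warnings;

my $record = "\x{263A}abc";    # one wide character plus three ASCII ones

my $chars_before = length $record;    # character semantics

my $bytes_inside = do {
    use bytes;                        # byte semantics, this block only
    length $record;
};

my $chars_after = length $record;     # character semantics again

print "characters: $chars_before, bytes: $bytes_inside, ",
      "characters afterwards: $chars_after\n";
```

Keeping the declaration inside the smallest possible block limits how much code has to take responsibility for its own conversions.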
Goal #3 has largely been achieved, partly by doing lazy conversions between byte and utf8 representations and partly by being sneaky in how we implement potentially slow features of Unicode, such as character property lookups in huge tables.
Goal #4 has been achieved by sacrificing a small amount of interface compatibility in pursuit of the other Goals. By one way of looking at it, we didn't fork into two different Perls; but by another way of looking at it, revision 5.6 of Perl is a forked version of Perl with regard to earlier versions, and we don't expect people to switch from earlier versions until they're sure the new version will do what they want. But that's always the case with new versions, so we'll allow ourselves to put a check mark by Goal #4 as well.
[1] Or in some cases, insufficiently unhappy.
[2] You may prefer to call them "octets"; that's okay, but we think the two words are pretty much synonymous these days, so we'll stick with the blue-collar word.
[3] On some systems, there may be ways of switching all your interfaces at once. If the -C command-line switch is used (or the global ${^WIDE_SYSTEM_CALLS} variable is set to 1), all system calls will use the corresponding wide character APIs. (This is currently only implemented on Microsoft Windows.) The current plan of the Linux community is that all interfaces will switch to UTF-8 mode if $ENV{LC_CTYPE} is set to "UTF-8". Other communities may take other approaches. Our mileage may vary.