In a pattern match, you may match any character that
has--or that does not have--a particular property. There are four ways
to specify character classes. You may specify a character class in
the traditional way using square brackets and enumerating the possible
characters, or you may use any of three mnemonic shortcuts: the
classic Perl classes, the new Perl Unicode properties, or the standard
POSIX classes. Each of these shortcuts matches only one character from
its set. Quantify them to match larger expanses, such as \d+
to match one or more digits. (An easy mistake is to think that \w
matches a word. Use \w+ to match a word.)
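
The difference between a bare shortcut and a quantified one is easy to see by capturing what each matches. The following is a minimal sketch; the sample string and variable names are ours, chosen only for illustration.

```perl
use strict;
use warnings;

my $text = "order 1234 shipped";

# \d matches exactly one digit; \d+ matches the whole run of digits.
my ($one_digit)  = $text =~ /(\d)/;
my ($all_digits) = $text =~ /(\d+)/;

# \w matches a single word character; \w+ matches a whole "word".
my ($one_char) = $text =~ /(\w)/;
my ($word)     = $text =~ /(\w+)/;

print "$one_digit $all_digits $one_char $word\n";   # 1 1234 o order
```
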
An enumerated list of characters in square brackets
is called a character class and matches any one
of the characters in the list. For example,
[aeiouy]
matches a letter that can be a vowel in
English. (For Welsh add a "w", for Scottish an "r".) To match a right square bracket, either
backslash it or place it first in the list.
Character ranges may be indicated using a hyphen and
the a-z notation. Multiple ranges may be
combined; for example, [0-9a-fA-F] matches one
hex "digit". You may use a backslash to protect a hyphen that would
otherwise be interpreted as a range delimiter, or just put it at the
beginning or end of the class (a practice which is arguably less
readable but more traditional).
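
Here is a small sketch of those rules for ranges, literal hyphens, and literal right brackets. The pattern names and test strings are our own and are not part of any library.

```perl
use strict;
use warnings;

# One hexadecimal "digit", built from three ranges.
my $hex_digit = qr/[0-9a-fA-F]/;

# A hyphen is literal when backslashed or placed at either end of the class.
my $sign_escaped = qr/^[+\-]$/;
my $sign_at_end  = qr/^[+-]$/;

# A right square bracket is literal when backslashed or placed first.
my $closer_escaped = qr/^[)\]}]$/;
my $closer_first   = qr/^[])}]$/;

my $all_hex     = ("DEADbeef42" =~ /^$hex_digit+$/) ? 1 : 0;
my $not_all_hex = ("xyzzy"      =~ /^$hex_digit+$/) ? 1 : 0;
my $minus_ok    = ("-" =~ $sign_escaped  && "-" =~ $sign_at_end)  ? 1 : 0;
my $bracket_ok  = ("]" =~ $closer_escaped && "]" =~ $closer_first) ? 1 : 0;

print "$all_hex $not_all_hex $minus_ok $bracket_ok\n";   # 1 0 1 1
```
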
A caret (or
circumflex, or hat, or up arrow) at the front of the character class
inverts the class, causing it to match any single character
not in the list. (To match a caret, either
don't put it first, or better, escape it with a
backslash.) For example, [^aeiouy] matches any
character that isn't a vowel. Be careful with character class
negation, though, because the universe of characters is expanding.
For example, that character class matches consonants--and also
matches spaces, newlines, and anything (including vowels) in
Cyrillic, Greek, or nearly any other script, not to mention every
ideograph in Chinese, Japanese, and Korean. And someday maybe even
Cirth, Tengwar, and Klingon. (Linear B and Etruscan, for sure.) So
it might be better to specify your consonants explicitly, such as
[cbdfghjklmnpqrstvwxyz], or [b-df-hj-np-tv-z] for short. (This also solves
the issue of "y" needing to be in two places at once, which a set
complement would preclude.)
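
To see how much more a negated class lets through, compare it with the explicit list on a few sample characters. This is only a sketch; the samples (including U+0430, CYRILLIC SMALL LETTER A) are our own choice.

```perl
use strict;
use warnings;

my $not_vowel = qr/^[^aeiouy]$/;
my $consonant = qr/^[b-df-hj-np-tv-z]$/;

# "t" is a consonant; the rest are a space, a newline, and U+0430
# (CYRILLIC SMALL LETTER A, which is a vowel).
my @samples = ("t", " ", "\n", "\x{0430}");

# grep in scalar context counts the matching samples.
my $negated_hits  = grep { $_ =~ $not_vowel } @samples;
my $explicit_hits = grep { $_ =~ $consonant } @samples;

print "negated: $negated_hits, explicit: $explicit_hits\n";   # negated: 4, explicit: 1
```
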
Normal character metasymbols are supported inside a
character class (see "Specific Characters"), such as \n, \t, \cX,
\NNN, and \N{NAME}. Additionally, you may use \b within a character
class to mean a backspace, just as it does in a double-quoted
string. Normally, in a pattern match, it means a word boundary. But
zero-width assertions don't make any sense in character classes, so
here \b returns to its normal meaning in strings.
You may also use any predefined character class described later in
the chapter (classic, Unicode, or POSIX), but don't try to use them
as endpoints of a range--that doesn't make sense, so the
"-" will be interpreted literally.
All other metasymbols lose their special meaning
inside square brackets. In particular, you can't use any of the
three generic wildcards: ".", \X, or \C. The first often surprises people, but it
doesn't make much sense to use the universal character class within
a restricted one, and you often want to match a literal dot as part
of a character class--when you're matching filenames, for instance.
It's also meaningless to specify quantifiers, assertions, or
alternation inside a character class, since the characters are
interpreted individually. For example, [fee|fie|foe|foo]
means the same thing as [feio|].
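
A short sketch can confirm how metasymbols behave inside square brackets: \b as a backspace, the dot as a plain dot, and "|" as an ordinary member of the class. The sample strings are ours.

```perl
use strict;
use warnings;

# Inside brackets, \b is a backspace character (ASCII 8), not a word boundary.
my $has_backspace = ("ab\x{08}cd" =~ /[\b]/) ? 1 : 0;   # string contains a backspace
my $no_backspace  = ("abcd"       =~ /[\b]/) ? 1 : 0;   # no backspace, so no match

# Inside brackets, a dot is just a dot.
my $dot_matches_dot    = ("." =~ /^[.]$/) ? 1 : 0;
my $dot_matches_letter = ("x" =~ /^[.]$/) ? 1 : 0;

# The "|" is an ordinary member of the class, so these two classes agree
# on every sample, including the three-character string "fee".
my $long  = qr/^[fee|fie|foe|foo]$/;
my $short = qr/^[feio|]$/;
my $disagreements = grep { ($_ =~ $long ? 1 : 0) != ($_ =~ $short ? 1 : 0) }
                    ("f", "e", "i", "o", "|", "x", "fee");

print "$has_backspace $no_backspace $dot_matches_dot $dot_matches_letter $disagreements\n";
# 1 0 1 0 0
```
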
Since the beginning, Perl has provided a number of
character class shortcuts. These are listed in Table 5.8. All of them are
backslashed alphabetic metasymbols, and in each case, the uppercase
version is the negation of the lowercase version. The meanings of
these are not quite as fixed as you might expect; the meanings can
be influenced by locale settings. Even if you don't use locales, the
meanings can change whenever a new Unicode standard comes out,
adding scripts with new digits and letters. (To keep the old byte
meanings, you can always use bytes. For
explanations of the utf8 meanings, see "Unicode Properties" later in
this chapter. In any case, the utf8 meanings are a superset of the
byte meanings.)
Table 5-8. Classic Character Classes
Symbol | Meaning | As Bytes | As utf8 |
---|---|---|---|
\d | Digit | [0-9] | \p{IsDigit} |
\D | Nondigit | [^0-9] | \P{IsDigit} |
\s | Whitespace | [ \t\n\r\f] | \p{IsSpace} |
\S | Nonwhitespace | [^ \t\n\r\f] | \P{IsSpace} |
\w | Word character | [a-zA-Z0-9_] | \p{IsWord} |
\W | Non-(word character) | [^a-zA-Z0-9_] | \P{IsWord} |
(Yes, we know most words don't have numbers or underscores in
them; \w is for matching "words" in the sense of
tokens in a typical programming language. Or Perl, for that
matter.)
These metasymbols may be used either outside or inside square brackets, that is, either standalone or as part of a constructed character class:
if ($var =~ /\D/)       { warn "contains non-digit" }
if ($var =~ /[^\w\s.]/) { warn "contains non-(word, space, dot)" }
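
As a further sketch of the same idea, the shortcuts can be combined inside a negated class to pull out every character that is not a word character, whitespace, or a dot. The helper name odd_characters and the sample strings are hypothetical, made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the characters that are not word
# characters, whitespace, or dots.
sub odd_characters {
    my ($string) = @_;
    return $string =~ /([^\w\s.])/g;   # list of all captured characters
}

my @odd = odd_characters("report_v2.txt (final)!");
print "odd: @odd\n";   # odd: ( ) !

# \D is the negation of \d, so "no \D anywhere" means "digits only".
my $digits_only     = ("20240131"   !~ /\D/) ? 1 : 0;
my $not_digits_only = ("2024-01-31" !~ /\D/) ? 1 : 0;
```
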
Unicode properties are available using \p{PROP}
and its set complement, \P{PROP}.
For the rare properties with one-character names, braces are
optional, as in \pN to indicate a numeric
character (not necessarily decimal--Roman numerals are numeric
characters too). These property classes may be used by themselves or
combined in a constructed character class:
if ($var =~ /^\p{IsAlpha}+$/)      { print "all alphabetic" }
if ($var =~ s/[\p{Zl}\p{Zp}]/\n/g) { print "fixed newline wannabes" }
Some properties are directly defined in the Unicode standard,
and some properties are composites defined by Perl, based on the
standard properties. Zl and Zp
are standard Unicode properties representing line separators and
paragraph separators, while IsAlpha is defined by
Perl to be a property class combining the standard properties
Ll, Lu, Lt, and Lo (that is, letters that are lowercase,
uppercase, titlecase, or other). As of version 5.6.0 of Perl, you
need to use utf8 for these properties to work.
This restriction will be relaxed in the future.
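
The sketch below exercises these properties on a few code points written as \x{...} escapes. It assumes a later Perl (5.8 or newer) in which the properties work without the utf8 pragma; the sample characters are our own.

```perl
use strict;
use warnings;

# U+2167 is ROMAN NUMERAL EIGHT: numeric (\pN) but not a decimal digit (Nd).
my $roman        = "\x{2167}";
my $is_numeric   = ($roman =~ /^\pN$/)     ? 1 : 0;
my $is_dec_digit = ($roman =~ /^\p{Nd}$/)  ? 1 : 0;

# Greek alpha, beta, gamma: all alphabetic.
my $greek     = "\x{3b1}\x{3b2}\x{3b3}";
my $all_alpha = ($greek =~ /^\p{IsAlpha}+$/) ? 1 : 0;

# Replace line (Zl) and paragraph (Zp) separators with newlines.
# In scalar context s///g returns the number of substitutions made.
my $text     = "one\x{2028}two\x{2029}three";
my $replaced = ($text =~ s/[\p{Zl}\p{Zp}]/\n/g);

print "$is_numeric $is_dec_digit $all_alpha $replaced\n";   # 1 0 1 2
```
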
There are a great many properties. We'll list the ones we know about, but the list is necessarily incomplete. New properties are likely to be in new versions of Unicode, and you can even define your own properties. More about that later.
The Unicode Consortium produces the online resources
that turn into the various files Perl uses in its Unicode
implementation. For more about these files, see Chapter 15. You can get a nice
overview of Unicode in the document
PATH_TO_PERLLIB/unicode/Unicode3.html
where PATH_TO_PERLLIB is what is printed out by:
perl -MConfig -le 'print $Config{privlib}'
Most Unicode properties are of the form \p{IsPROP}.
The Is is optional, since it's so common, but you
may prefer to leave it in for readability.
First, Table 5.9 lists Perl's composite properties. They're defined to be reasonably close to the standard POSIX definitions for character classes.
Table 5-9. Composite Unicode Properties
Property | Equivalent |
---|---|
IsASCII | [\x00-\x7f] |
IsAlnum | [\p{IsLl}\p{IsLu}\p{IsLt}\p{IsLo}\p{IsNd}] |
IsAlpha | [\p{IsLl}\p{IsLu}\p{IsLt}\p{IsLo}] |
IsCntrl | \p{IsC} |
IsDigit | \p{Nd} |
IsGraph | [^\pC\p{IsSpace}] |
IsLower | \p{IsLl} |
IsPrint | \P{IsC} |
IsPunct | \p{IsP} |
IsSpace | [\t\n\f\r\p{IsZ}] |
IsUpper | [\p{IsLu}\p{IsLt}] |
IsWord | [_\p{IsLl}\p{IsLu}\p{IsLt}\p{IsLo}\p{IsNd}] |
IsXDigit | [0-9a-fA-F] |
Perl also provides the following composites for each of the main categories of standard Unicode properties (see the next section):
Property | Meaning | Normative |
---|---|---|
IsC | Crazy control codes and such | Yes |
IsL | Letters | Partly |
IsM | Marks | Yes |
IsN | Numbers | Yes |
IsP | Punctuation | No |
IsS | Symbols | No |
IsZ | Separators (Zeparators?) | Yes |
Table 5.10 lists the most basic standard Unicode properties, derived from each character's category. No character is a member of more than one category. Some properties are normative; others are merely informative. See the Unicode Standard for the standard spiel on just how normative the normative information is, and just how informative the informative information isn't.
Table 5-10. Standard Unicode Properties
Property | Meaning | Normative |
---|---|---|
IsCc | Other, Control | Yes |
IsCf | Other, Format | Yes |
IsCn | Other, Not assigned | Yes |
IsCo | Other, Private Use | Yes |
IsCs | Other, Surrogate | Yes |
IsLl | Letter, Lowercase | Yes |
IsLm | Letter, Modifier | No |
IsLo | Letter, Other | No |
IsLt | Letter, Titlecase | Yes |
IsLu | Letter, Uppercase | Yes |
IsMc | Mark, Combining | Yes |
IsMe | Mark, Enclosing | Yes |
IsMn | Mark, Nonspacing | Yes |
IsNd | Number, Decimal digit | Yes |
IsNl | Number, Letter | Yes |
IsNo | Number, Other | Yes |
IsPc | Punctuation, Connector | No |
IsPd | Punctuation, Dash | No |
IsPe | Punctuation, Close | No |
IsPf | Punctuation, Final quote | No |
IsPi | Punctuation, Initial quote | No |
IsPo | Punctuation, Other | No |
IsPs | Punctuation, Open | No |
IsSc | Symbol, Currency | No |
IsSk | Symbol, Modifier | No |
IsSm | Symbol, Math | No |
IsSo | Symbol, Other | No |
IsZl | Separator, Line | Yes |
IsZp | Separator, Paragraph | Yes |
IsZs | Separator, Space | Yes |
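
Because no character belongs to more than one of these categories, testing a character against several of them should produce exactly one hit. The following sketch checks that for a handful of sample characters. The helper categories_of is hypothetical, the property names are written without the optional Is prefix, and a later Perl (5.8 or newer) is assumed.

```perl
use strict;
use warnings;

# A few of the standard categories, each paired with a precompiled pattern.
my @categories = (
    [ Lu => qr/^\p{Lu}$/ ], [ Ll => qr/^\p{Ll}$/ ], [ Lt => qr/^\p{Lt}$/ ],
    [ Nd => qr/^\p{Nd}$/ ], [ Nl => qr/^\p{Nl}$/ ], [ Pd => qr/^\p{Pd}$/ ],
    [ Sc => qr/^\p{Sc}$/ ], [ Zs => qr/^\p{Zs}$/ ],
);

# Hypothetical helper: names of every listed category the character is in.
sub categories_of {
    my ($char) = @_;
    return map { $_->[0] } grep { $char =~ $_->[1] } @categories;
}

# U+01C5 is a titlecase letter; U+2167 is ROMAN NUMERAL EIGHT.
for my $char ("A", "a", "\x{01C5}", "7", "\x{2167}", "-", "\$", " ") {
    my @found = categories_of($char);
    printf "U+%04X: %s\n", ord($char), join(",", @found);
}
```

Each line of output should name a single category, such as Lu for U+0041 and Nl for U+2167.
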
Another useful set of properties has to do with whether a given character can be decomposed (either canonically or compatibly) into other simpler characters. Canonical decomposition doesn't lose any formatting information. Compatibility decomposition may lose formatting information such as whether a character is a superscript.
Property | Information Lost |
---|---|
IsDecoCanon | Nothing |
IsDecoCompat | Something (one of the following) |
IsDCcircle | Circle around character |
IsDCfinal | Final position preference (Arabic) |
IsDCfont | Variant font preference |
IsDCfraction | Vulgar fraction characteristic |
IsDCinitial | Initial position preference (Arabic) |
IsDCisolated | Isolated position preference (Arabic) |
IsDCmedial | Medial position preference (Arabic) |
IsDCnarrow | Narrow characteristic |
IsDCnoBreak | Nonbreaking preference on space or hyphen |
IsDCsmall | Small characteristic |
IsDCsquare | Square around CJK character |
IsDCsub | Subscription |
IsDCsuper | Superscription |
IsDCvertical | Rotation (horizontal to vertical) |
IsDCwide | Wide characteristic |
IsDCcompat | Identity (miscellaneous) |
Here are some properties of interest to people doing bidirectional rendering:
Property | Meaning |
---|---|
IsBidiL | Left-to-right |
IsBidiLRE | Left-to-right embedding |
IsBidiLRO | Left-to-right override |
IsBidiR | Right-to-left |
IsBidiAL | Right-to-left Arabic |
IsBidiRLE | Right-to-left embedding |
IsBidiRLO | Right-to-left override |
IsBidiPDF | Pop directional format |
IsBidiEN | European number |
IsBidiES | European number separator |
IsBidiET | European number terminator |
IsBidiAN | Arabic number |
IsBidiCS | Common number separator |
IsBidiNSM | Nonspacing mark |
IsBidiBN | Boundary neutral |
IsBidiB | Paragraph separator |
IsBidiS | Segment separator |
IsBidiWS | Whitespace |
IsBidiON | Other Neutrals |
IsMirrored | Reverse when used right-to-left |
The following properties classify various syllabaries according to vowel sounds:
IsSylA IsSylE IsSylO IsSylWAA IsSylWII IsSylAA IsSylEE IsSylOO IsSylWC IsSylWO IsSylAAI IsSylI IsSylU IsSylWE IsSylWOO IsSylAI IsSylII IsSylV IsSylWEE IsSylWU IsSylC IsSylN IsSylWA IsSylWI IsSylWV
For example, \p{IsSylA} would match \N{KATAKANA LETTER KA}
but not \N{KATAKANA LETTER KU}.
Now that we've basically told you all these Unicode 3.0 properties, we should point out that a few of the more esoteric ones aren't implemented in version 5.6.0 of Perl because its implementation was based in part on Unicode 2.0, and things like the bidirectional algorithm were still being worked out. However, by the time you read this, the missing properties may well be implemented, so we listed them anyway.
Some Unicode properties are of the form \p{InSCRIPT}.
(Note the distinction between Is and In.) The In properties are
for testing block ranges of a particular SCRIPT. If you have a
character, and you wonder whether it was written in Greek script,
you could test with:
print "It's Greek to me!\n" if chr(931) =~ /\p{InGreek}/;
That works by checking whether a character is "in" the valid
range of that script type. This may be negated with
\P{InSCRIPT} to find out whether something isn't in a
particular script's block, such as \P{InDingbats}
to test whether a string contains a non-dingbat. Block
properties include the following:
InArabic InCyrillic InHangulJamo InMalayalam InSyriac InArmenian InDevanagari InHebrew InMongolian InTamil InArrows InDingbats InHiragana InMyanmar InTelugu InBasicLatin InEthiopic InKanbun InOgham InThaana InBengali InGeorgian InKannada InOriya InThai InBopomofo InGreek InKatakana InRunic InTibetan InBoxDrawing InGujarati InKhmer InSinhala InYiRadicals InCherokee InGurmukhi InLao InSpecials InYiSyllables
Not to mention jawbreakers like these:
InAlphabeticPresentationForms InHalfwidthandFullwidthForms InArabicPresentationForms-A InHangulCompatibilityJamo InArabicPresentationForms-B InHangulSyllables InBlockElements InHighPrivateUseSurrogates InBopomofoExtended InHighSurrogates InBraillePatterns InIdeographicDescriptionCharacters InCJKCompatibility InIPAExtensions InCJKCompatibilityForms InKangxiRadicals InCJKCompatibilityIdeographs InLatin-1Supplement InCJKRadicalsSupplement InLatinExtended-A InCJKSymbolsandPunctuation InLatinExtended-B InCJKUnifiedIdeographs InLatinExtendedAdditional InCJKUnifiedIdeographsExtensionA InLetterlikeSymbols InCombiningDiacriticalMarks InLowSurrogates InCombiningHalfMarks InMathematicalOperators InCombiningMarksforSymbols InMiscellaneousSymbols InControlPictures InMiscellaneousTechnical InCurrencySymbols InNumberForms InEnclosedAlphanumerics InOpticalCharacterRecognition InEnclosedCJKLettersandMonths InPrivateUse InGeneralPunctuation InSuperscriptsandSubscripts InGeometricShapes InSmallFormVariants InGreekExtended InSpacingModifierLetters
And the winner is:
InUnifiedCanadianAboriginalSyllabics
See PATH_TO_PERLLIB/unicode/In/*.pl
to get an up-to-date listing of all of these character block
properties. Note that these In properties are
only testing to see if the character is in the block of characters
allocated for that script. There is no guarantee that all characters
in that range are defined; you also need to test against one of
the Is properties discussed earlier to see if
the character is defined. There is also no guarantee that a
particular language doesn't use characters outside its assigned
block. In particular, many European languages mix extended Latin
characters with Latin-1 characters.
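
The gap between "in the block" and "actually defined" can be sketched with the Hiragana block. We assume here that U+3040 is still unassigned in the Unicode version your perl ships with, and that you are running a later Perl (5.8 or newer) that accepts these properties without the utf8 pragma. The two helper subroutines are hypothetical.

```perl
use strict;
use warnings;

# U+3042 is HIRAGANA LETTER A. U+3040 lies inside the Hiragana block but
# (assumption: in your perl's Unicode version) has no character assigned.
my $assigned   = "\x{3042}";
my $unassigned = "\x{3040}";

# Block membership alone.
sub in_hiragana_block {
    return $_[0] =~ /^\p{InHiragana}$/ ? 1 : 0;
}

# Block membership plus a check that the code point is not unassigned (Cn).
sub is_defined_hiragana {
    return ($_[0] =~ /^\p{InHiragana}$/ && $_[0] !~ /^\p{Cn}$/) ? 1 : 0;
}

printf "U+%04X: block=%d defined=%d\n",
    ord($_), in_hiragana_block($_), is_defined_hiragana($_)
    for $assigned, $unassigned;
```
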
But hey, if you need a particular property that isn't provided, that's not a big problem. Read on.
To define your own property, you need to write a subroutine with the name of the property you want (see Chapter 6). The subroutine should be defined in the package that needs the property (see Chapter 10), which means that if you want to use it in multiple packages, you'll either have to import it from a module (see Chapter 11), or inherit it as a class method from the package in which it is defined (see Chapter 12).
Once you've got that all settled, the subroutine should
return data in the same format as the files in the
PATH_TO_PERLLIB/unicode/Is
directory. That is, just return a list of characters or character
ranges in hexadecimal, one per line. If there is a range, the two
numbers are separated by a tab. Suppose you wanted a property that
would be true if your character is in the range of either of the
Japanese syllabaries, known as hiragana and katakana. (Together
they're known as kana.) You can just put in the two ranges like
this:
sub InKana {
    return <<'END';
3040	309F
30A0	30FF
END
}
Alternatively, you could define it in terms of existing property names:
sub InKana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
END
}
You can also do set subtraction using a "-" prefix. Suppose you only wanted the actual
characters, not just the block ranges of characters. You could
weed out all the undefined ones like this:
sub IsKana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}
You can also start with a complemented character
set using the "!" prefix:
sub IsNotKana {
    return <<'END';
!utf8::InHiragana
-utf8::InKatakana
+utf8::IsCn
END
}
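
Once defined, these properties are used like any built-in one. The sketch below repeats two of the definitions so that it runs on its own and then matches a few sample code points against them. It writes the tab as \t in an interpolating here-document to keep the whitespace visible, and it assumes a later Perl (5.8 or newer) and that U+3040 is unassigned in that perl's Unicode tables.

```perl
use strict;
use warnings;

# Block ranges given directly in hexadecimal, endpoints separated by a tab.
sub InKana {
    return <<"END";
3040\t309F
30A0\t30FF
END
}

# The same blocks by name, minus unassigned code points.
sub IsKana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}

# HIRAGANA LETTER A, KATAKANA LETTER KA, an unassigned slot, and a Latin letter.
my @samples = ("\x{3042}", "\x{30AB}", "\x{3040}", "a");

my @in_kana = map { /^\p{InKana}$/ ? 1 : 0 } @samples;
my @is_kana = map { /^\p{IsKana}$/ ? 1 : 0 } @samples;

print "InKana: @in_kana\n";
print "IsKana: @is_kana\n";
```

The subroutines are defined before the patterns that use them and in the same package, which keeps the lookup straightforward.
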
Perl itself uses exactly the same tricks to define
the meanings of its "classic" character classes (like \w)
when you include them in your own custom
character classes (like [-.\w\s]). You might
think that the more complicated you get with your rules, the
slower they will run, but in fact, once Perl has calculated the
bit pattern for a particular 64-bit swatch of your property, it
caches it so it never has to recalculate the pattern again. (It
does it in 64-bit swatches so that it doesn't even have to decode
your utf8 to do its lookups.) Thus, all character classes,
built-in or custom, run at essentially the same speed (fast) once
they get going.
Unlike Perl's other character class shortcuts, the
POSIX-style character-class syntax notation, [:CLASS:],
is available for use only when constructing
other character classes, that is, inside an additional pair of
square brackets. For example, /[.,[:alpha:][:digit:]]/
will search for one
character that is either a literal dot (because it's in a character
class), a comma, an alphabetic character, or a digit.
The POSIX classes available as of revision 5.6 of Perl are shown in Table 5.11.
Table 5-11. POSIX Character Classes
Class | Meaning |
---|---|
alnum | Any alphanumeric, that is, an alpha or a digit. |
alpha | Any letter. (That's a lot more letters than you think, unless you're thinking Unicode, in which case it's still a lot.) |
ascii | Any character with an ordinal value between 0 and 127. |
cntrl | Any control character. Usually characters that don't produce output as such, but instead control the terminal somehow; for example, newline, form feed, and backspace are all control characters. Characters with an ord value less than 32 are most often classified as control characters. |
digit | A character representing a decimal digit, such as 0 to 9. Equivalent to \d. |
graph | Any alphanumeric or punctuation character. |
lower | A lowercase letter. |
print | Any alphanumeric or punctuation character or space. |
punct | Any punctuation character. |
space | Any space character. Includes tab, newline, form feed, and carriage return (and a lot more under Unicode). Equivalent to \s. |
upper | Any uppercase (or titlecase) letter. |
word | Any identifier character, either an alnum or underline. |
xdigit | Any hexadecimal digit. Though this may seem silly ([0-9a-fA-F] works just fine), it is included for completeness. |
You can negate the POSIX character classes by
prefixing the class name with a ^ following the [:.
(This is a Perl extension.) For example, [:^digit:] corresponds
to \D, [:^space:] to \S, and [:^word:] to \W.
If the use utf8 pragma is not requested,
but the use locale pragma is, the classes
correlate directly with the equivalent functions in the C library's
isalpha(3) interface (except for word, which is a Perl extension,
mirroring \w).
If the utf8 pragma is used, POSIX character
classes are exactly equivalent to the corresponding
Is properties listed in Table 5.9. For example,
[:lower:] and \p{Lower} are
equivalent, except that the POSIX classes may only be used within
constructed character classes, whereas Unicode properties have no
such restriction and may be used in patterns wherever Perl shortcuts
like \s and \w may be used.
The brackets are part of the POSIX-style [::]
construct, not part of the whole character
class. This leads to writing patterns like
/^[[:lower:][:digit:]]+$/, to match a string
consisting entirely of lowercase letters or digits (plus an optional
trailing newline). In particular, this does not work:
42 =~ /^[:digit:]$/ # WRONG
That's because it's not inside a character class. Rather, it
is a character class, the one representing the
characters ":", "i", "t", "g", and "d". Perl doesn't
care that you specified ":" twice.
Here's what you need instead:
42 =~ /^[[:digit:]]+$/
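
The following sketch puts the right and wrong forms side by side, along with the two ways of negating a POSIX class. The variable names are ours. More recent versions of Perl warn about the bracketless form, so the sketch switches that warning off around the deliberately wrong pattern.

```perl
use strict;
use warnings;

# Correct: the POSIX class sits inside an enclosing character class.
my $digits = qr/^[[:digit:]]+$/;
my $mixed  = qr/^[.,[:alpha:][:digit:]]+$/;

# Negation, two ways: the Perl extension and an ordinary negated class.
my $nondigit_posix = qr/^[[:^digit:]]$/;
my $nondigit_caret = qr/^[^[:digit:]]$/;

# Without the outer brackets this is just the class of ":", "d", "i", "g", "t".
my $wrong = do { no warnings 'regexp'; qr/^[:digit:]$/ };

my $right_matches_42 = ("42" =~ $digits) ? 1 : 0;   # 1
my $wrong_matches_42 = ("42" =~ $wrong)  ? 1 : 0;   # 0
my $wrong_matches_d  = ("d"  =~ $wrong)  ? 1 : 0;   # 1

print "$right_matches_42 $wrong_matches_42 $wrong_matches_d\n";   # 1 0 1
```
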
The POSIX character classes [.cc.] and
[=cc=] are recognized but produce an error
indicating they are not supported. Trying to use
any POSIX character class in older versions of
Perl is likely to fail miserably, and perhaps even silently. If
you're going to use POSIX character classes, it's best to require a
new version of Perl by saying:
use 5.6.0;