Inside a pattern or subpattern, use the | metacharacter to specify a set of possibilities, any one of which could match. For instance:
/Gandalf|Saruman|Radagast/
matches Gandalf or Saruman or Radagast. The alternation extends only as far as the innermost enclosing parentheses (whether capturing or not):
/prob|n|r|l|ate/       # Match prob, n, r, l, or ate
/pro(b|n|r|l)ate/      # Match probate, pronate, prorate, or prolate
/pro(?:b|n|r|l)ate/    # Match probate, pronate, prorate, or prolate
The second and third forms match the same strings, but the
second form captures the variant character in $1
and the third form does not.
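As a minimal sketch of that difference, the following script runs both grouped forms against the same words and reports what lands in $1. The helper name first_capture and the variable names are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the contents of $1 after a successful match,
# or undef if the match fails or the pattern has no capture group.
sub first_capture {
    my ($string, $regex) = @_;
    return $string =~ $regex ? $1 : undef;
}

my $capturing     = qr/pro(b|n|r|l)ate/;     # parentheses capture
my $non_capturing = qr/pro(?:b|n|r|l)ate/;   # parentheses only group

for my $word (qw(probate pronate prorate prolate)) {
    my $with    = first_capture($word, $capturing);
    my $without = first_capture($word, $non_capturing);
    printf "%s: capturing gives '%s', non-capturing gives %s\n",
        $word, $with, defined $without ? "'$without'" : "undef";
}
```

Both patterns accept all four words; only the capturing form leaves the variant letter behind for later use.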
At any given position, the Engine tries to match the first alternative, and then the second, and so on. The relative length of the alternatives does not matter, which means that in this pattern:
/(Sam|Samwise)/
$1 will never be set to Samwise no matter what string it's matched against, because Sam will always match first. When you have overlapping matches like this, put the longer ones at the beginning.
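A short sketch of the ordering rule, using the same two names. The variable names are illustrative only; the match is done in list context so that the captured text is returned directly.

```perl
use strict;
use warnings;

my $name = "Samwise";

# Shorter alternative first: the Engine accepts "Sam" and never tries "Samwise".
my ($short_first) = $name =~ /(Sam|Samwise)/;

# Longer alternative first: "Samwise" gets its chance before "Sam".
my ($long_first) = $name =~ /(Samwise|Sam)/;

print "short first: $short_first\n";   # Sam
print "long first:  $long_first\n";    # Samwise
```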
But the ordering of the alternatives only matters at a given position. The outer loop of the Engine does left-to-right matching, so the following always matches the first Sam:
"'Sam I am,' said Samwise" =~ /(Samwise|Sam)/; # $1 eq "Sam"
But you can force right-to-left scanning by making use of greedy quantifiers, as discussed earlier in "Quantifiers":
"'Sam I am,' said Samwise" =~ /.*(Samwise|Sam)/; # $1 eq "Samwise"
You can defeat any left-to-right (or right-to-left) matching by including any of the various positional assertions we saw earlier, such as \G, ^, and $. Here we anchor the pattern to the end of the string:
"'Sam I am,' said Samwise" =~ /(Samwise|Sam)$/; # $1 eq "Samwise"
That example factors the $ out of the alternation (since we already had a handy pair of parentheses to put it after), but in the absence of parentheses you can also distribute the assertions to any or all of the individual alternatives, depending on how you want them to match. This little program displays lines that begin with either a __DATA__ or __END__ token:
#!/usr/bin/perl
while (<>) {
    print if /^__DATA__|^__END__/;
}
But be careful with that. Remember that the first and last alternatives (before the first | and after the last one) tend to gobble up the other elements of the regular expression on either side, out to the ends of the expression, unless there are enclosing parentheses. A common mistake is to ask for:
/^cat|dog|cow$/
when you really mean:
/^(cat|dog|cow)$/
The first matches "cat" at the beginning of the string, or "dog" anywhere, or "cow" at the end of the string. The second matches any string consisting solely of "cat" or "dog" or "cow". It also captures $1, which you may not want. You can also say:
/^cat$|^dog$|^cow$/
We'll show you another solution later.
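To see how differently these patterns behave, here is a small sketch that tries all three against a few sample words. The hash keys (loose, grouped, repeated) and the sample words are invented for this illustration.

```perl
use strict;
use warnings;

my %pattern = (
    loose    => qr/^cat|dog|cow$/,        # anchors bind only to cat and cow
    grouped  => qr/^(cat|dog|cow)$/,      # anchors apply to every alternative
    repeated => qr/^cat$|^dog$|^cow$/,    # anchors spelled out per alternative
);

for my $word ("dog", "hotdog", "catalog", "moo cow") {
    # Collect the names of the patterns that match this word.
    my @hits = grep { $word =~ $pattern{$_} } sort keys %pattern;
    print "$word: ", (@hits ? "@hits" : "no match"), "\n";
}
```

Only "dog" satisfies all three patterns. The other words are accepted by the loose pattern alone, because each contains one of its three independently matched alternatives.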
An alternative can be empty, in which case it always matches.
/com(pound|)/;         # Matches "compound" or "com"
/com(pound(s|)|)/;     # Matches "compounds", "compound", or "com"
This is much like using the ? quantifier, which matches 0 times or 1 time:
/com(pound)?/;         # Matches "compound" or "com"
/com(pound(s?))?/;     # Matches "compounds", "compound", or "com"
/com(pounds?)?/;       # Same, but doesn't use $2
There is one difference, though. When you apply the ? to a subpattern that captures into a numbered variable, that variable will be undefined if there's no string to go there. If you used an empty alternative instead, the variable would still be false, but it would be a defined null string.
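The distinction can be checked with defined. This is a minimal sketch with illustrative variable names; both patterns are matched against the bare string "com" in list context, which returns the contents of the first capture group.

```perl
use strict;
use warnings;

# Empty alternative: the group takes part in the match, so $1 is a defined
# empty string.
my ($with_empty_alt) = "com" =~ /com(pound|)/;

# Optional group: the group is skipped entirely, so $1 is undefined.
my ($with_optional) = "com" =~ /com(pound)?/;

print "empty alternative: ", (defined $with_empty_alt ? "defined" : "undef"), "\n";
print "? quantifier:      ", (defined $with_optional  ? "defined" : "undef"), "\n";
```

This matters mostly under use warnings, where interpolating or comparing the undefined value produces an "uninitialized value" warning and the empty string does not.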