Regular Expressions
The patterns used in pattern matching are regular
expressions such as those supplied in the Version 8 regexp
routines. (In fact, the routines are derived from Henry
Spencer's freely redistributable reimplementation of the V8
routines.) In addition, \w matches an alphanumeric character
(including "_") and \W a nonalphanumeric. Word boundaries
may be matched by \b, and non-boundaries by \B. A
whitespace character is matched by \s, non-whitespace by \S. A
numeric character is matched by \d, non-numeric by \D. You
may use \w, \s and \d within character classes. Also, \n,
\r, \f, \t and \NNN have their normal interpretations.
Within character classes \b represents backspace rather than
a word boundary. Alternatives may be separated by |. The
bracketing construct ( ... ) may also be used, in which case
\<digit> matches the digit'th substring, where digit can
range from 1 to 9. (Outside of the pattern, always use $
instead of \ in front of the digit. The scope of $<digit>
(and $`, $& and $') extends to the end of the enclosing
BLOCK or eval string, or to the next pattern match with
subexpressions. The \<digit> notation sometimes works
outside the current pattern, but should not be relied upon.) $+
returns whatever the last bracket match matched. $& returns
the entire matched string. ($0 used to return the same
thing, but not any more.) $` returns everything before the
matched string. $' returns everything after the matched
string. Examples:
    s/^([^ ]*) *([^ ]*)/$2 $1/;    # swap first two words
    if (/Time: (..):(..):(..)/) {
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }
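
The match variables and backreferences described above can be
sketched as follows. This is only an illustration: the
subroutine names (swap_first_two_words, match_parts,
has_doubled_word) are hypothetical and do not appear elsewhere
in this manual, and the code is written for a current perl
rather than the one documented here.

```perl
use strict;
use warnings;

# Swap the first two space-separated words. The bracketed
# substrings are referred to as $2 and $1 in the replacement,
# that is, outside the pattern itself.
sub swap_first_two_words {
    my ($line) = @_;
    $line =~ s/^([^ ]*) *([^ ]*)/$2 $1/;
    return $line;
}

# Return ($`, $&, $', $+) for the first match of $pattern,
# or an empty list if the pattern does not match.
sub match_parts {
    my ($string, $pattern) = @_;
    return () unless $string =~ /$pattern/;
    return ($`, $&, $', $+);
}

# Inside the pattern, \1 must repeat whatever the first
# bracketed substring matched; \b keeps the match on word
# boundaries.
sub has_doubled_word {
    my ($string) = @_;
    return $string =~ /\b(\w+) \1\b/ ? 1 : 0;
}
```

For instance, match_parts("Time: 12:34:56 now",
'(\d\d):(\d\d):(\d\d)') yields the text before the match, the
match itself, the text after it, and the last bracket match.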
By default, the ^ character is only guaranteed to match at
the beginning of the string, the $ character only at the end
(or before the newline at the end) and perl does certain
optimizations with the assumption that the string contains
only one line. The behavior of ^ and $ on embedded newlines
will be inconsistent. You may, however, wish to treat a
string as a multi-line buffer, such that the ^ will match
after any newline within the string, and $ will match before
any newline. At the cost of a little more overhead, you can
do this by setting the variable $* to 1. Setting it back to
0 makes perl revert to its old behavior.
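
The default anchoring can be checked with a small sketch. It
deliberately does not set $*, so it only exercises the
guaranteed cases: ^ at the beginning of the string and $ at the
end or before a final newline. The subroutine names are
hypothetical, chosen for illustration.

```perl
use strict;
use warnings;

# True if the string begins with "Time:".
sub starts_with_time {
    my ($string) = @_;
    return $string =~ /^Time:/ ? 1 : 0;
}

# True if the string ends with "done", optionally followed
# by a single trailing newline.
sub ends_with_done {
    my ($string) = @_;
    return $string =~ /done$/ ? 1 : 0;
}
```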
To facilitate multi-line substitutions, the . character
never matches a newline (even when $* is 0). In particular,
the following leaves a newline on the $_ string:
    $_ = <STDIN>;
    s/.*(some_string).*/$1/;
If the newline is unwanted, try one of
    s/.*(some_string).*\n/$1/;
    s/.*(some_string)[^\000]*/$1/;
    s/.*(some_string)(.|\n)*/$1/;
    chop; s/.*(some_string).*/$1/;
    /(some_string)/ && ($_ = $1);
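
As a rough sketch, the alternatives above can be compared on a
fixed string instead of a line read from input. The subroutine
names are hypothetical; each one wraps one of the statements
shown above and returns the resulting string.

```perl
use strict;
use warnings;

# The . never matches the newline, so it is left behind.
sub keep_newline {
    my ($s) = @_;
    $s =~ s/.*(some_string).*/$1/;
    return $s;
}

# Match the trailing newline explicitly.
sub strip_with_explicit_newline {
    my ($s) = @_;
    $s =~ s/.*(some_string).*\n/$1/;
    return $s;
}

# [^\000] matches any character except NUL, newline included.
sub strip_with_negated_class {
    my ($s) = @_;
    $s =~ s/.*(some_string)[^\000]*/$1/;
    return $s;
}

# (.|\n) matches either a non-newline or a newline.
sub strip_with_alternation {
    my ($s) = @_;
    $s =~ s/.*(some_string)(.|\n)*/$1/;
    return $s;
}

# Remove the last character first, then substitute.
sub strip_with_chop {
    my ($s) = @_;
    chop $s;
    $s =~ s/.*(some_string).*/$1/;
    return $s;
}

# Skip the substitution and assign the bracket match directly.
sub strip_by_assignment {
    my ($s) = @_;
    $s =~ /(some_string)/ && ($s = $1);
    return $s;
}
```

Given "xx some_string yy\n", keep_newline returns
"some_string\n" and the other five return "some_string".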
Any item of a regular expression may be followed with digits
in curly brackets of the form {n,m}, where n gives the
minimum number of times to match the item and m gives the
maximum. The form {n} is equivalent to {n,n} and matches
exactly n times. The form {n,} matches n or more times.
(If a curly bracket occurs in any other context, it is
treated as a regular character.) The * modifier is
equivalent to {0,}, the + modifier to {1,} and the ?
modifier to {0,1}. There is no limit to the size of n or m, but
large numbers will chew up more memory.
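
The counted forms can be illustrated with a short sketch. The
subroutine names are hypothetical, and the test strings are
assumed to contain no newlines.

```perl
use strict;
use warnings;

# {n,m}: between two and four digits, and nothing else.
sub is_two_to_four_digits {
    my ($s) = @_;
    return $s =~ /^\d{2,4}$/ ? 1 : 0;
}

# {n}: exactly three of the letter a.
sub is_exactly_three_a {
    my ($s) = @_;
    return $s =~ /^a{3}$/ ? 1 : 0;
}

# {n,}: two or more of the letter a.
sub is_at_least_two_a {
    my ($s) = @_;
    return $s =~ /^a{2,}$/ ? 1 : 0;
}

# b{0,} and b* accept the same strings; return both results
# so they can be compared.
sub star_and_braces {
    my ($s) = @_;
    my $with_star   = $s =~ /^ab*c$/    ? 1 : 0;
    my $with_braces = $s =~ /^ab{0,}c$/ ? 1 : 0;
    return ($with_star, $with_braces);
}
```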
You will note that all backslashed metacharacters in perl
are alphanumeric, such as \b, \w, \n. Unlike some other
regular expression languages, there are no backslashed
symbols that aren't alphanumeric. So anything that looks like
\\, \(, \), \<, \>, \{, or \} is always interpreted as a
literal character, not a metacharacter. This makes it
simple to quote a string that you want to use for a pattern but
that you are afraid might contain metacharacters. Simply
quote all the non-alphanumeric characters:
    $pattern =~ s/(\W)/\\$1/g;
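
A minimal sketch of that quoting step in use follows. The
names quote_pattern and contains_literal are hypothetical
helpers introduced for illustration only.

```perl
use strict;
use warnings;

# Put a backslash in front of every non-alphanumeric
# character, using the substitution shown above.
sub quote_pattern {
    my ($pattern) = @_;
    $pattern =~ s/(\W)/\\$1/g;
    return $pattern;
}

# True if $string contains $literal as plain text, with any
# metacharacters in $literal quoted first.
sub contains_literal {
    my ($string, $literal) = @_;
    my $quoted = quote_pattern($literal);
    return $string =~ /$quoted/ ? 1 : 0;
}
```

Without the quoting step, the pattern a.b*c would also match a
string such as "axbbc"; with it, only the literal text a.b*c
matches.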