Regexes

Pattern matching against strings

Regular expressions are a computer science concept where simple patterns describe the format of text. Pattern matching is the process of applying these patterns to actual text to look for matches.

Most modern regular expression facilities are more powerful than traditional regular expressions due to the influence of languages such as Perl, but the short-hand term regex has stuck and continues to mean regular expression-like pattern matching.

In Perl 6, although they are capable of much more than regular languages, we continue to call them regex.

Lexical conventions

Perl 6 has special syntax for writing regexes:

m/abc/;         # a regex that is immediately matched against $_
rx/abc/;        # a Regex object
/abc/;          # a Regex object

The first two can use delimiters other than the slash:

m{abc};
rx{abc};

Note that neither the colon : nor round parentheses can be delimiters; the colon is forbidden because it clashes with adverbs, such as rx:i/abc/ (case insensitive regexes), and round parentheses indicate a function call instead.

Whitespace in regexes is generally ignored (except with the :s or :sigspace adverb).

As in the rest of Perl 6, comments in regexes start with a hash character # and run to the end of the line.
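As a small illustration of the two points above, insignificant whitespace and comments let a regex be spread over several lines. This is only a sketch; the variable name $re and the sample strings are made up for the example.

```raku
# Whitespace and comments inside the regex are ignored by the matcher.
my $re = rx/
    \d+     # one or more digits
    '-'     # a literal dash, quoted because it is not alphanumeric
    \d+     # one or more digits again
/;
say so '555-1234' ~~ $re;   # True
say so '555 1234' ~~ $re;   # False
```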

Literals

The simplest case of a regex is a constant string. Matching a string against that regex searches for that string:

if 'properly' ~~ m/ perl / {
    say "'properly' contains 'perl'";
}

Alphanumeric characters and the underscore _ are literal matches. All other characters must either be escaped with a backslash (for example \: to match a colon), or included in quotes:

/ 'two words' /     # matches 'two words' including the blank
/ "a:b"       /     # matches 'a:b' including the colon

The hash character # cannot be escaped with a backslash, because that collides with the unspace syntax. So to match a hash character, you need to quote it:

/'#'/               # hashes must be quoted, cannot be escaped

Strings are searched left to right for the regex, so it's enough if a substring matches the regex:

if 'abcdef' ~~ / de / {
    say ~$/;            # de
    say $/.prematch;    # abc
    say $/.postmatch;   # f
    say $/.from;        # 3
    say $/.to;          # 5
};

Match results are stored in the $/ variable, and are also returned from the match. The result is of type Match, if the match was successful. Otherwise it is Nil.

Wildcards and character classes

Dot to match any character

An unescaped dot . in a regex matches any single character.

So these all match:

'perl' ~~ / per . /;    # matches the whole string
'perl' ~~ /per./;       # the same; whitespace is ignored
'perl' ~~ / pe.l /;     # the . matches the r
'speller' ~~ / pe.l/;   # the . matches the first l

This doesn't match:

'perl' ~~ /. per /

because there is no character to match before per in the target string.

Backslashed, predefined character classes

There are predefined character classes of the form \w. Their negations are written with an upper-case letter, for example \W.

\d matches a single digit (Unicode property Nd), and \D matches a single character that is not a digit.

'ab42' ~~ /\d/ and say ~$/;     # 4
'ab42' ~~ /\D/ and say ~$/;     # a

Note that not only the Arabic digits (commonly used in the Latin alphabet) match \d, but also digits from other scripts.

Examples of digits:

U+0035 5 DIGIT FIVE
U+07C2 ߂ NKO DIGIT TWO
U+0E53 ๓ THAI DIGIT THREE
U+1B56 ᭖ BALINESE DIGIT SIX
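A quick way to convince yourself of this is to try a few of the characters listed above. The loop below is just a sketch; the letter x is added as a counter-example.

```raku
# Each string holds a single character; only the first three are digits.
for '5', '๓', '᭖', 'x' -> $c {
    say $c, ': ', so $c ~~ / \d /;
}
# 5: True
# ๓: True
# ᭖: True
# x: False
```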

\h matches a single horizontal whitespace character. \H matches a single character that is not a horizontal whitespace character.

Examples of horizontal whitespace characters:

U+0020 SPACE
U+00A0 NO-BREAK SPACE
U+0009 CHARACTER TABULATION
U+2001 EM QUAD

Vertical whitespace characters such as newlines are explicitly excluded; those can be matched with \v, and \s matches any kind of whitespace.

\n matches a single, logical newline character. \n is supposed to also match a Windows CR LF codepoint pair, though it is unclear whether the magic happens at the time that external data is read, or at regex match time. \N matches a single character that's not a logical newline.

\s matches a single whitespace character. \S matches a single character that is not a whitespace character.

TODO: examples

\t matches a single tab/tabulation character, U+0009. (Note that exotic tabs like the U+000B VERTICAL TABULATION character are not included here). \T matches a single character that is not a tab.

\v matches a single vertical whitespace character. \V matches a single character that is not a vertical whitespace character.

Examples for vertical whitespace characters:

U+000A LINE FEED
U+000B VERTICAL TABULATION
U+000D CARRIAGE RETURN
U+0085 NEXT LINE
U+2029 PARAGRAPH SEPARATOR

Use \s to match any kind of whitespace, not just vertical whitespace.
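The differences between the whitespace classes are easiest to see side by side. The strings below are made up for this sketch; \t and \n inside the double-quoted strings are an ordinary tab and newline.

```raku
say so "a b"  ~~ / a \s b /;    # True  (a space is whitespace)
say so "a\tb" ~~ / a \s b /;    # True  (so is a tab)
say so "a\nb" ~~ / a \s b /;    # True  (and a newline)
say so "a\tb" ~~ / a \h b /;    # True  (a tab is horizontal whitespace)
say so "a\nb" ~~ / a \h b /;    # False (a newline is not horizontal)
say so "a\nb" ~~ / a \v b /;    # True  (a newline is vertical whitespace)
say so "a\tb" ~~ / a \t b /;    # True  (\t matches exactly the tab)
```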

\w matches a single word character, that is a letter (Unicode category L), a digit or an underscore. \W matches a single character that isn't a word character.

Examples of word characters:

U+0041 A LATIN CAPITAL LETTER A
U+0031 1 DIGIT ONE
U+03B4 δ GREEK SMALL LETTER DELTA
U+03F3 ϳ GREEK LETTER YOT
U+0409 Љ CYRILLIC CAPITAL LETTER LJE

Unicode properties

The character classes so far are mostly for convenience; a more systematic approach is the use of Unicode properties. They are called in the form <:property>, where property can be a short or long Unicode property name.

The following list is stolen from the Perl 5 perlunicode documentation:

Short  Long
L      Letter
LC     Cased_Letter
Lu     Uppercase_Letter
Ll     Lowercase_Letter
Lt     Titlecase_Letter
Lm     Modifier_Letter
Lo     Other_Letter
M      Mark
Mn     Nonspacing_Mark
Mc     Spacing_Mark
Me     Enclosing_Mark
N      Number
Nd     Decimal_Number (also Digit)
Nl     Letter_Number
No     Other_Number
P      Punctuation (also Punct)
Pc     Connector_Punctuation
Pd     Dash_Punctuation
Ps     Open_Punctuation
Pe     Close_Punctuation
Pi     Initial_Punctuation (may behave like Ps or Pe depending on usage)
Pf     Final_Punctuation (may behave like Ps or Pe depending on usage)
Po     Other_Punctuation
S      Symbol
Sm     Math_Symbol
Sc     Currency_Symbol
Sk     Modifier_Symbol
So     Other_Symbol
Z      Separator
Zs     Space_Separator
Zl     Line_Separator
Zp     Paragraph_Separator
C      Other
Cc     Control (also Cntrl)
Cf     Format
Cs     Surrogate
Co     Private_Use
Cn     Unassigned

So for example <:Lu> matches a single, upper-case letter.

Negation works as <:!category>, so <:!Lu> matches a single character that isn't an upper-case letter.
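For instance, using only short property names from the list above (the string 'Perl 6' is just sample input for this sketch):

```raku
'Perl 6' ~~ / <:Lu>  / and say ~$/;     # P    (first upper-case letter)
'Perl 6' ~~ / <:!Lu> / and say ~$/;     # e    (first character that is not one)
'Perl 6' ~~ / <:Ll>+ / and say ~$/;     # erl  (a run of lower-case letters)
'Perl 6' ~~ / <:N>   / and say ~$/;     # 6    (a number)
```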

Several categories can be combined with one of these infix operators:

Operator  Meaning
+         set union
|         set union
&         set intersection
-         set difference (first minus second)
^         symmetric set difference / XOR

So for example to either match a lower-case letter or a number, one can write <:Ll+:N> or <:Ll+:Number> or <+ :Lowercase_Letter + :Number>.
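A short sketch of the union in action, restricted to short property names and made-up input:

```raku
# A run of characters that are each either a lower-case letter or a number.
'AB c1 D' ~~ / <:Ll+:N>+ / and say ~$/;     # c1
# The same with upper-case letters instead of lower-case ones.
'AB c1 D' ~~ / <:Lu+:N>+ / and say ~$/;     # AB
```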

(Grouping of set operations with round parentheses inside character classes is supposed to work, but is not supported by Rakudo at the time of writing.)

Enumerated character classes and ranges

Sometimes the predefined wildcards and character classes are just not enough. Fortunately, defining your own is simple enough. Between <[ and ]>, you can put any number of single characters and ranges of characters (expressed with two dots between the end points), with or without whitespace between them.

"abacabadabacaba" ~~ / <[ a .. c 1 2 3 ]> /

Between the < and >, you can also use the same operators as for categories (+, |, &, -, ^) to combine multiple range definitions, and even mix in some of the Unicode categories above. You can also write the backslashed forms of character classes between the [ and ].

/ <[\d] - [13579]> /
# not quite the same as
/ <[02468]> /
# because the first one also contains "weird" unicodey digits
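To make the difference concrete, here is a small sketch that runs the classes above against a few made-up one-character strings, including the Thai digit from the earlier list:

```raku
# A run of characters taken from a, b, c, 1, 2, 3 stops at the first d.
'abacabadabacaba' ~~ / <[ a .. c 1 2 3 ]>+ / and say ~$/;   # abacaba

say so '8' ~~ / <[\d] - [13579]> /;     # True
say so '9' ~~ / <[\d] - [13579]> /;     # False (explicitly subtracted)
say so '๓' ~~ / <[\d] - [13579]> /;     # True  (a Thai digit is still a digit)
say so '๓' ~~ / <[02468]> /;            # False (not one of the listed characters)
```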

Quantifier

A quantifier makes a preceding atom match not exactly once, but rather a variable number of times. For example a+ matches one or more a characters.

Quantifiers bind tighter than concatenation, so ab+ matches one a followed by one or more bs. Quoted strings behave differently: a quoted string is a single atom, so 'ab'+ matches the strings ab, abab, ababab, etc.
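The following sketch shows the difference on made-up input; the comment after each line is the matched text:

```raku
'abbb abab' ~~ / ab+   / and say ~$/;   # abbb  (+ applies to the b only)
'abab abbb' ~~ / ab+   / and say ~$/;   # ab    (the second a ends the match)
'abab abbb' ~~ / 'ab'+ / and say ~$/;   # abab  (+ applies to the quoted string)
```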

One or more: +

The + quantifier makes the preceding atom match one or more times, with no upper limit.

For example to match strings of the form key=value, you can write a regex like this:

/ \w+ '=' \w+ /

Zero or more: *

The * quantifier makes the preceding atom match zero or more times, with no upper limit.

For example, to allow optional whitespace between a and b, you can write

/ a \s* b /

Zero or one match: ?

The ? quantifier makes the preceding atom match zero or one time.
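A brief sketch of the * and ? quantifiers, using made-up words as input:

```raku
'color'  ~~ / colou?r  / and say ~$/;   # color   (the u is absent)
'colour' ~~ / colou?r  / and say ~$/;   # colour  (the u is present)
'ab'     ~~ / a \s* b  / and say ~$/;   # ab      (zero whitespace characters)
say so 'a   b' ~~ / a \s* b /;          # True    (several whitespace characters)
```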

General quantifier: ** min..max

To quantify an atom an arbitrary number of times, you can say for example a ** 2..5 to match the character a at least twice and at most 5 times.

If minimal and maximal number of matches are the same, a single integer is possible: a ** 5 to match a exactly five times.
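For illustration, here is how the general quantifier behaves on a few made-up strings:

```raku
'xaaaaaaay' ~~ / a ** 2..5 / and say ~$/;   # aaaaa (stops at the upper limit)
'xaaay'     ~~ / a ** 2..5 / and say ~$/;   # aaa
say so 'xay' ~~ / a ** 2..5 /;              # False (only one a available)
'xaaaaaaay' ~~ / a ** 3    / and say ~$/;   # aaa   (exactly three)
```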

Alternation

To match one of several possible alternatives, separate them by ||; the first matching alternative wins.

For example ini files look like this:

[section]
key = value

So if you parse a single line of an ini file, it can be either a section or a key-value pair, and the regex would be (in first approximation):

/ '[' \w+ ']' || \S+ \s* '=' \s* \S* /

That is, either a word surrounded by brackets, or a string of non-whitespace characters, followed by optional whitespace, followed by the equals sign =, followed again by optional whitespace, followed by another (possibly empty) string of non-whitespace characters.
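As a sketch, the regex can be stored in a variable and tried against a few sample lines. The name $line and the third input are made up for this example.

```raku
my $line = rx/ '[' \w+ ']' || \S+ \s* '=' \s* \S* /;
for '[section]', 'key = value', 'no assignment here' -> $text {
    say $text, ': ', so $text ~~ $line;
}
# [section]: True
# key = value: True
# no assignment here: False
```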

Anchors

The regex engine tries to find a match inside a string, by searching from left to right.

say so 'properly' ~~ / perl/;   # True
#          ^^^^

But sometimes this is not what you want, and you want to match the whole string, or a whole line, or one or several whole words. Anchors or assertions can help you with that, by limiting where they match.

Anchors need to match successfully in order for the whole regex to match, but they do not use up characters while matching.

^, Start of String

The ^ assertion only matches at the start of the string.

say so 'properly' ~~ /perl/;        # True
say so 'properly' ~~ /^ perl/;      # False
say so 'perly'    ~~ /^ perl/;      # True
say so 'perl'     ~~ /^ perl/;      # True

^^, Start of Line and $$, End of Line

The ^^ assertion matches at the start of a logical line. That is, either at the start of the string, or after a newline character.

$$ matches only at the end of a logical line, that is, before a newline character, or at the end of the string when the last character is not a newline character.

(To understand the following example, it is important to know that the q:to/EOS/...EOS "heredoc" syntax removes leading indentation to the same level as the EOS marker, so that the first, second and last lines have no leading space, and the third and fourth lines have two leading spaces each.)

my $str = q:to/EOS/;
    There was a young man of Japan
    Whose limericks never would scan.
      When asked why this was,
      He replied "It's because
    I always try to fit as many syllables into the last line as ever I possibly can."
    EOS
say so $str ~~ /^^ There/;          # True  (start of string)
say so $str ~~ /^^ limericks/;      # False (not at the start of a line)
say so $str ~~ /^^ I/;              # True  (start of the last line)
say so $str ~~ /^^ When/;           # False (there are blanks between
                                    #        start of line and the "When")
say so $str ~~ / Japan $$/;         # True  (end of first line)
say so $str ~~ / scan $$/;          # False (there is a . between "scan"
                                    #        and the end of line)
say so $str ~~ / '."' $$/;          # True  (at the last line)

<< and >> , left and right word boundary

<< matches a left word boundary, that is, a position where there is a non-word character (or the start of the string) to the left, and a word character to the right.

>> matches a right word boundary, so positions where at the left there is a word character, and at the right there is a non-word character, or the end of the string.

my $str = 'The quick brown fox';
say so $str ~~ /br/;                # True
say so $str ~~ /<< br/;             # True
say so $str ~~ /br >>/;             # False
say so $str ~~ /own/;               # True
say so $str ~~ /<< own/;            # False
say so $str ~~ /own >>/;            # True

Grouping and Capturing

In regular (non-regex) Perl 6, you can use parentheses to group things together, usually to override operator precedence:

say 1 + 4 * 2;      # 9, because it is parsed as 1 + (4 * 2)
say (1 + 4) * 2;    # 10

The same grouping facility is available in regexes:

/ a || b c /        # matches 'a' or 'bc'
/ ( a || b ) c /    # matches 'ac' or 'bc'

The same grouping applies to quantifiers:

/ a b+ /            # Matches an 'a' followed by one or more 'b's
/ (a b)+ /          # Matches one or more sequences of 'ab'
/ (a || b)+ /       # Matches a sequence of 'a's and 'b's, at least one long

An unquantified capture produces a Match object. When a capture is quantified (except with the ? quantifier), the capture becomes a list of Match objects instead.
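The difference shows up when you inspect $0. This is a minimal sketch on made-up input:

```raku
if 'abc' ~~ / (a b) c / {
    say $0.^name;       # Match (unquantified capture)
}
if 'ababc' ~~ / (a b)+ c / {
    say $0.^name;       # Array (quantified capture: a list of Match objects)
    say $0.elems;       # 2
    say ~$0[1];         # ab
}
```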

Capturing

Round parentheses don't just group, they also capture; that is, they make the string that is matched by the grouped part available:

my $str =  'number 42';
if $str ~~ /'number ' (\d+) / {
    say "The number is $0";
}

Pairs of parentheses are numbered left to right, starting from zero.

if 'abc' ~~ /(a) b (c)/ {
    say "0: $0; 1: $1";     # 0: a; 1: c
}

The $0 and $1 etc. syntax is actually just a short-hand; these captures are canonically available from the match object $/ by using it as a list, so $0 is actually a short way to write $/[0].

Coercing the match object to a list gives an easy way to programmatically access all elements:

if 'abc' ~~ /(a) b (c)/ {
    say $/.list.join: ', '  # a, c
}

Non-capturing grouping

The parentheses in regexes perform a double role: they group the regex elements inside, and they capture what is matched by the sub-regex inside.

To get only the grouping behavior, you can use brackets [ ... ] instead.

if 'abc' ~~ / [a||b] (c) / {
    say ~$0;                # c
}

If you do not need the captures, using non-capturing groups provides three benefits: it communicates the intent more clearly, it makes it easier to count the capturing groups that you do care about, and it is a bit faster.

Capture Numbers

It was stated above that captures are numbered from left to right. While true in principle, that is overly simplistic.

The following rules are listed for the sake of completeness; when you find yourself using them regularly, it is worth considering named captures (and possibly subrules) instead.

Alternations reset the capture count:

/ (x) (y)  || (a) (.) (.) /
# $0  $1      $0  $1  $2

Example:

if 'abc' ~~ /(x)(y) || (a)(.)(.)/ {
    say ~$1;            # b
}

Captures can be nested, in which case they are numbered per level:

if 'abc' ~~ / ( a (.) (.) ) / {
    say "Outer: $0";              # Outer: abc
    say "Inner: $0[0] and $0[1]"; # Inner: b and c
}

Named Captures

Instead of numbering captures, you can also give them names. The generic, and slightly verbose way of giving out names is like this:

if 'abc' ~~ / $<myname> = [ \w+ ] / {
    say ~$<myname>      # abc
}

The access to the named capture, $<myname>, is a shortcut for indexing the match object as a hash, so $/{ 'myname' } or $/<myname>.
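All three spellings reach the same capture, as this small sketch shows:

```raku
if 'abc' ~~ / $<myname> = [ \w+ ] / {
    say ~$<myname>;         # abc
    say ~$/<myname>;        # abc
    say ~$/{ 'myname' };    # abc
}
```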

Coercing the match object to a hash gives you easy programmatic access to all named captures:

if 'count=23' ~~ / $<variable>=\w+ '=' $<value>=\w+ / {
    my %h = $/.hash;
    say %h.keys.sort.join: ', ';        # value, variable
    say %h.values.sort.join: ', ';      # 23, count
    for %h.kv -> $k, $v {
        say "Found value '$v' with key '$k'";
        # outputs two lines:
        #   Found value 'count' with key 'variable'
        #   Found value '23' with key 'value'
    }
}

But there is a more convenient way to get named captures, discussed in the next section.

Subrules

Just as you can put pieces of code into subroutines, you can also put pieces of regex into named rules.

my regex line { \N*\n }
if "abc\ndef" ~~ /<line> def/ {
    say "First line: ", $<line>.chomp;      # First line: abc
}

A named regex can be declared with my regex thename { body here }, and called with <thename>. At the same time, calling a named regex installs a named capture with the same name.

If the capture should be of a different name, that can be achieved with the syntax <capturename=regexname>. If no capture at all is desired, a leading dot will suppress it: <.regexname>.
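Renaming is handy when the same named regex is called more than once. In this sketch the regex name word and the capture names first and second are made up for illustration:

```raku
my regex word { \w+ }
if 'hello world' ~~ / <first=word> \s+ <second=word> / {
    say ~$<first>;      # hello
    say ~$<second>;     # world
}
```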

Here is a bit more complete (yet still fairly limited) code for parsing ini files:

my regex header { \s* '[' (\w+) ']' \h* \n+ }
my regex identifier  { \w+ }
my regex kvpair { \s* <key=identifier> '=' <value=identifier> \n+ }
my regex section {
    <header>
    <kvpair>*
}
my $contents = q:to/EOI/;
    [passwords]
        jack=password1
        joy=muchmoresecure123
    [quotas]
        jack=123
        joy=42
EOI
my %config;
if $contents ~~ /<section>*/ {
    for $<section>.list -> $section {
        my %section;
        for $section<kvpair>.list -> $p {
            say $p<value>;
            %section{ $p<key> } = ~$p<value>;
        }
        %config{ $section<header>[0] } = %section;
    }
}
say %config.perl;
# ("passwords" => {"jack" => "password1", "joy" => "muchmoresecure123"},
#    "quotas" => {"jack" => "123", "joy" => "42"}).hash

Adverbs

Adverbs modify how regexes work, and give very convenient shortcuts for certain kinds of recurring tasks.

There are two kinds of adverbs: regex adverbs apply at the point where a regex is defined, and matching adverbs apply at the point that a regex matches against a string.

This distinction often blurs, because matching and declaration are often textually close, but using the method form of matching makes the distinction clear.

'abc' ~~ /../ is roughly equivalent to 'abc'.match(/../), or even more clearly written in separate lines:

my $regex = /../;           # definition
if 'abc'.match($regex) {    # matching
    say "'abc' has at least two characters";
}

Regex adverbs like :i go into the definition line, and matching adverbs like :overlap go along with the matching:

my $regex = /:i . a/;
for 'baA'.match($regex, :overlap) -> $m {
    say ~$m;
}
# output:
#     ba
#     aA

Regex Adverbs

Adverbs that appear at the time of a regex declaration are part of the actual regex, and influence how the Perl 6 compiler translates the regex into binary code.

For example the :ignorecase or short :i adverb tells the compiler to ignore the distinction between upper case, lower case and title case letters.

So 'a' ~~ /A/ is false, but 'a' ~~ /:i A/ is a successful match.

Regex adverbs can come before or inside a regex declaration, and only affect the part of the regex that comes afterwards, lexically.

These two regexes are equivalent:

my $rx1 = rx:i/a/;      # before
my $rx2 = rx/:i a/;     # inside

Whereas these two are not:

my $rx3 = rx/a :i b/;   # matches only the b case insensitively
my $rx4 = rx/:i a b/;   # matches completely case insensitively

Brackets and parentheses limit the scope of an adverb:

/ (:i a b) c /          # matches 'ABc' but not 'ABC'
/ [:i a b] c /          # matches 'ABc' but not 'ABC'
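The scoping rules can be checked directly. This sketch reuses the regexes from above with made-up input strings:

```raku
say so 'ABc' ~~ / [:i a b] c /;     # True  (c is outside the :i scope and matches)
say so 'ABC' ~~ / [:i a b] c /;     # False (C is outside the :i scope)
say so 'aB'  ~~ rx/a :i b/;         # True  (only the b is case insensitive)
say so 'AB'  ~~ rx/a :i b/;         # False (the a comes before the adverb)
```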

Ratchet

The :ratchet or :r adverb causes the regex engine not to backtrack.

Without that adverb, parts of a regex will try different ways to match a string, in order to make it possible for other parts of the regex to match. For example in 'abc' ~~ /\w+ ./, the \w+ first eats up the whole string, abc, but then the . fails. Thus \w+ gives up a character, matching only ab, and the . can successfully match the string c. This process of giving up characters (or in the case of alternations, trying a different branch) is known as backtracking.

say so 'abc' ~~ / \w+ . /;      # True
say so 'abc' ~~ / :r \w+ . /;   # False

Ratcheting can be an optimization, because backtracking is costly. But more importantly, it closely corresponds to how humans parse a text. If you have a regex my regex identifier { \w+ } and my regex keyword { if | else | endif }, you intuitively expect the identifier to gobble up a whole word, and not give up its end to the next rule if the next rule otherwise fails. You don't expect the word motif to be parsed as the identifier mot followed by the keyword if; rather, you expect motif to be parsed as one identifier, and if the parser expects an if afterwards, you would rather have it fail than parse the input in a way you don't expect.

Since ratcheting behavior is so often desirable in parsers, there is a shortcut for declaring a ratcheting regex:

my token thing { ... }
# short for
my regex thing { :r ... }
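The following sketch repeats the \w+ . example from above with the two declarators. The names with-backtracking and without-backtracking are made up for illustration.

```raku
my regex with-backtracking    { \w+ . }
my token without-backtracking { \w+ . }

# The regex lets \w+ give back the final c so that the dot can match it.
say so 'abc' ~~ / <with-backtracking> /;        # True
# The token keeps everything \w+ has matched, so the dot finds nothing left.
say so 'abc' ~~ / <without-backtracking> /;     # False
```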

Sigspace

The :sigspace or :s adverb makes whitespace significant in a regex.

say so "I used Photoshop®"  ~~ m:i/   photo shop /; # True
say so "I used a photo shop" ~~ m:i:s/ photo shop /; # True
say so "I used Photoshop®"  ~~ m:i:s/ photo shop /; # False

m:s/ photo shop / acts just the same as if one had written m/ photo <ws> shop <ws> /. By default, <ws> makes sure that words are separated, so a b and ^& will match <ws> in the middle, but ab won't.

If a regex is declared with the rule keyword, both the :sigspace and :ratchet adverbs are implied.
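A small sketch of the difference between token and rule, using the same body for both. The names pair-token and pair-rule are made up for illustration.

```raku
my token pair-token { \w+ '=' \w+ }     # whitespace in the body is ignored
my rule  pair-rule  { \w+ '=' \w+ }     # whitespace in the body stands for <ws>

say so 'a = b' ~~ / <pair-token> /;     # False (blanks in the input are not allowed for)
say so 'a = b' ~~ / <pair-rule>  /;     # True
say so 'a=b'   ~~ / <pair-rule>  /;     # True  (<ws> may match nothing between a and =)
```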

Matching Adverbs

In contrast to regex adverbs, which are tied to the declaration of a regex, matching adverbs only make sense while matching a string against a regex.

They can never appear inside a regex, only on the outside, either as part of an m/.../ match, or as arguments to a match method.

Continue

The :continue or short :c adverb takes an argument. The argument is the position where the regex should start to search. By default, it searches from the start of the string, but :c overrides that.

given 'a1xa2' {
    say ~m/a./;         # a1
    say ~m:c(2)/a./;    # a2
}

Exhaustive

TODO

Global

TODO

Pos

TODO

Overlap

TODO