Random notes on Perl Regular Expressions
It’s 2022, Perl isn’t as popular as it used to be, and for a moment I questioned its relevance. Until I had a task requiring a lot of pattern matching, which reminded me why Perl is that loyal companion that always has an on-spot solution to whatever I need.
These are a few notes I took as I discovered the more advanced, and well-needed, features of Perl regexps.
- If a regex is passed as a value generated by qr//, the modifiers in this qr// have a significance. So e.g. if the match should be case-insensitive, add it after the qr//.
- Quantifiers can be used on regex groups, whether they capture or not. For example, \d+(?:\.\d+)+ means one or more digits followed by one or more patterns of a dot and one or more digits. Think BNF.
- Complex regular expressions can be created relatively easily by breaking them down into smaller pieces and assigning each a variable with qr//. The complex expression becomes fairly readable this way. Almost needless to say, quantifiers can be applied on each of these subexpressions.
- It’s possible to give capture elements names, e.g. $t =~ /^(?<pre>.*?)(?<found>[ \t\n]*${regex}[ \t\n]*)(?<post>.*)$/s. The capture results then appear in e.g. $+{pre}, $+{found} and $+{post}. This is useful in particular if the regex in the middle may have capture elements of its own, so the usual counting method doesn’t work.
- Captured elements can be used in the regex itself, e.g. /([\'\"])(.*?)\1/ so \1 stands for either a single or double quote, whichever was found.
- Even better, there’s e.g \g{-1} instead of numeric grouping, which in this case means that last group captured. Once again, useful in a regex that can be used in more complicated contexts.
- When there are nested unnamed capture parentheses, the outer parenthesis gets the first capture number.
- If there are several capture parentheses with a ‘|’ between them, all of them produce a capture position, but those that weren’t in use for matching get undef.
- (?:…) grouping can be followed by a quantifier, so this makes perfect sense ((?:[^\\\{\}]|\\\\|\\\{|\\\})*) for any number of characters that aren’t a backslash or a curly bracket, or any of these followed by an escape.
- Quantifiers can be super-greedy in the sense that they don’t allow backtracking. So e.g. /a++b/ is exactly like /a+b/, but with the former the computer won’t attempt to consume less a’s (if such are found) in order to try to find a “b”. This is just an optimization for speed. All of these extra-greedy quantifiers are made with an extra plus sign.
- There’s lookbehind and lookahead assertions, which are really great. In particular, the negative assertions. E.g. /(?<![ \t\n\r])(d+)/ captures a number that isn’t after a whitespace, and /(\d+)(?![ \t\n\r])/ captures a number that isn’t followed by a whitespace. Note that the parentheses around these assertions are for grouping, but not capturing, so in these examples only the number was captured.
- Lookaheads and lookbehinds also work inside grouping parentheses (whether capturing or not), as grouping is treated as an independent regex.
Reader Comments
:-)