Random notes on Perl Regular Expressions

This post was written by eli on July 10, 2022
Posted Under: perl

It’s 2022, Perl isn’t as popular as it used to be, and for a moment I questioned its relevance. That lasted until I had a task requiring a lot of pattern matching, which reminded me why Perl is the loyal companion that always has a spot-on solution to whatever I need.

These are a few notes I took as I discovered the more advanced, and much-needed, features of Perl regexps.

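If a regex is passed as a value generated by qr//, the modifiers of that qr// stay in effect wherever the regex ends up being used. So if, e.g., the match should be case-insensitive, add the i modifier to the qr// itself (as in qr/…/i). A modifier on the enclosing match doesn’t reach into it.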
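To make the point about qr// modifiers concrete, here is a minimal sketch. The patterns and strings are made up for illustration; the only thing it relies on is that a qr// object carries its own flags when interpolated into another regex.

```perl
use strict;
use warnings;

# Hypothetical patterns, for illustration only.
my $ci = qr/hello/i;   # case-insensitive: the flag is part of the compiled regex
my $cs = qr/hello/;    # case-sensitive

# The /i travels with $ci, even though the enclosing match has no /i.
my $r1 = ("say HELLO" =~ /say $ci/)  ? 1 : 0;   # matches
my $r2 = ("say HELLO" =~ /say $cs/)  ? 1 : 0;   # no match
# The flag covers only the interpolated part: the outer "say" is still case-sensitive.
my $r3 = ("SAY hello" =~ /say $ci/)  ? 1 : 0;   # no match
# An outer /i does not leak into $cs either.
my $r4 = ("say HELLO" =~ /say $cs/i) ? 1 : 0;   # no match

print "r1=$r1 r2=$r2 r3=$r3 r4=$r4\n";
```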
Quantifiers can be used on regex groups, whether they capture or not. For example, \d+(?:\.\d+)+ means one or more digits followed by one or more patterns of a dot and one or more digits. Think BNF.
Complex regular expressions can be created relatively easily by breaking them down into smaller pieces and assigning each to a variable with qr//. The complex expression becomes fairly readable this way. Almost needless to say, quantifiers can be applied to each of these subexpressions.
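A small sketch of this building-block approach, using the dotted-number pattern from the previous note. The variable names ($num, $dotnum, $version) are my own choice for the example.

```perl
use strict;
use warnings;

# Hypothetical building blocks for a dotted number such as "1.22.333".
my $num      = qr/\d+/;
my $dotnum   = qr/\.$num/;
my $version  = qr/$num(?:$dotnum)+/;   # a number, then one or more ".number"
my $version2 = qr/$num$dotnum+/;       # same idea: a qr// value acts as a group,
                                       # so the quantifier applies to all of it

my (@results, @results2);
for my $s ("1.22.333", "42", "v2.0") {
    push @results,  ($s =~ /\A$version\z/  ? "yes" : "no");
    push @results2, ($s =~ /\A$version2\z/ ? "yes" : "no");
}
print "@results\n";    # yes no no
print "@results2\n";   # yes no no
```

This works because an interpolated qr// value behaves like a non-capturing group, so a quantifier right after the variable covers the whole subexpression.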
It’s possible to give capture elements names, e.g. $t =~ /^(?<pre>.*?)(?<found>[ \t\n]*${regex}[ \t\n]*)(?<post>.*)$/s. The capture results then appear in e.g. $+{pre}, $+{found} and $+{post}. This is useful in particular if the regex in the middle may have capture elements of its own, so the usual counting method doesn’t work.
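Here is how that named-capture regex behaves with an inner regex that has a capture group of its own. The value of $regex and the input string are assumptions made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical inner regex, with its own capture group.
my $regex = qr/(foo|bar)/;
my $t = "alpha \n foo \t omega";

my ($pre, $found, $post);
if ($t =~ /^(?<pre>.*?)(?<found>[ \t\n]*${regex}[ \t\n]*)(?<post>.*)$/s) {
    # The names stay valid no matter how many groups $regex adds in the middle.
    ($pre, $found, $post) = ($+{pre}, $+{found}, $+{post});
}

print "pre=[$pre] post=[$post]\n";   # pre=[alpha] post=[omega]
```

With plain numbering, the part after the match would be $4 here rather than $3, and that number would shift again whenever $regex changes.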
Captured elements can be used in the regex itself, e.g. /([\'\"])(.*?)\1/ so \1 stands for either a single or double quote, whichever was found.
Even better, there’s e.g. \g{-1} instead of absolute numbering. This is a relative backreference, which in this case means the capture group immediately before it. Once again, this is useful in a regex that may be embedded in more complicated contexts.
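A quick sketch of both kinds of backreference. The strings and the $quoted helper are hypothetical, chosen only to show the difference.

```perl
use strict;
use warnings;

# \1: the closing quote must be the same character as the opening one.
my $s = q{say "it's fine" now};
my ($quote, $body) = $s =~ /(['"])(.*?)\1/;
print "body=[$body]\n";   # body=[it's fine]

# \g{-1}: refers to the group just before it, so this reusable piece still
# works when the enclosing regex has capture groups of its own in front.
my $quoted = qr/(['"])[^'"]*\g{-1}/;
my ($key, $q) = q{name='eli'} =~ /(\w+)=$quoted/;
print "key=[$key] quote=[$q]\n";   # key=[name] quote=[']
```

Had $quoted used \1, it would have referred to (\w+) in the second match, and the match would have failed.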
When there are nested unnamed capture parentheses, the outer parenthesis gets the first capture number: groups are numbered by the order of their opening parentheses, left to right.
If there are several capture parentheses with a ‘|’ between them, all of them produce a capture position, but those that weren’t in use for matching get undef.
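The two numbering rules above can be checked with a couple of one-liners (the input strings are made up):

```perl
use strict;
use warnings;

# Nested groups are numbered by their opening parenthesis, left to right.
my @nested = "2022-07" =~ /((\d+)-(\d+))/;
print "@nested\n";   # 2022-07 2022 07

# Each alternated group gets its own number; the branch not taken yields undef.
my @alt = "cat" =~ /(dog)|(cat)/;
print defined $alt[0] ? "defined" : "undef", " / $alt[1]\n";   # undef / cat
```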
(?:…) grouping can be followed by a quantifier, so this makes perfect sense: ((?:[^\\\{\}]|\\\\|\\\{|\\\})*) for any number of characters that aren’t a backslash or a curly bracket, or any of these preceded by a backslash (i.e. escaped).
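To see where that pattern stops, here is a small sketch with made-up input. The strings are single-quoted, so each backslash in them is literal.

```perl
use strict;
use warnings;

# The pattern from the note above, kept in a variable.
my $body = qr/((?:[^\\\{\}]|\\\\|\\\{|\\\})*)/;

# Escaped brackets are consumed; the first unescaped "}" ends the match.
my $in = 'a\{b\}c' . '}rest';
my ($got) = $in =~ /\A$body/;
print "$got\n";    # a\{b\}c

# A backslash followed by anything else is not accepted, so the match ends before it.
my ($got2) = 'x\y' =~ /\A$body/;
print "$got2\n";   # x
```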
Quantifiers can be super-greedy (“possessive” in Perl’s documentation) in the sense that they don’t allow backtracking. So e.g. /a++b/ matches exactly like /a+b/, but with the former the engine won’t attempt to consume fewer a’s (if such are found) in order to try to find a “b”. In this example it’s just an optimization for speed, but in general it can change the result: /a++a/ never matches, because the a++ leaves no “a” for what follows. All of these extra-greedy quantifiers are made with an extra plus sign.
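A short sketch of the extra-plus quantifiers. In the first pair the outcome is the same either way. In the second pair it isn’t, because the part after the quantifier could only match something the quantifier refuses to give back.

```perl
use strict;
use warnings;

# Same outcome: a "b" can never be among the consumed a's.
my $b_plain = ("aaab" =~ /a+b/)  ? 1 : 0;   # matches
my $b_poss  = ("aaab" =~ /a++b/) ? 1 : 0;   # matches

# Different outcome: a+ gives one "a" back, a++ keeps them all.
my $a_plain = ("aaa" =~ /a+a/)  ? 1 : 0;    # matches
my $a_poss  = ("aaa" =~ /a++a/) ? 1 : 0;    # no match

print "$b_plain $b_poss $a_plain $a_poss\n";   # 1 1 1 0
```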
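There are lookbehind and lookahead assertions, which are really great. In particular, the negative assertions. E.g. /(?<![ \t\n\r])(\d+)/ captures a number that isn’t after a whitespace, and /(\d+)(?![ \t\n\r])/ captures a number that isn’t followed by a whitespace. Note that the parentheses around these assertions are for grouping, but not capturing, so in these examples only the number was captured. Also note that an assertion only tests one position, so the engine may settle for part of a number (e.g. start at its second digit); add \d to the character class if only whole numbers should match.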
Lookaheads and lookbehinds also work inside grouping parentheses (whether capturing or not), as grouping is treated as an independent regex.
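Here is a sketch of the assertions in action, on a made-up string. It also shows how an assertion that only checks for whitespace can end up matching part of a number, and how adding \d to the character class avoids that.

```perl
use strict;
use warnings;

my $s = "x12 34y 56";

# Digits not preceded by whitespace. Only the number is captured.
my @a = $s =~ /(?<![ \t\n\r])(\d+)/g;      # 12, 4, 6 ("4" and "6" start mid-number)
my @b = $s =~ /(?<![ \t\n\r\d])(\d+)/g;    # 12

# Digits not followed by whitespace.
my @c = $s =~ /(\d+)(?![ \t\n\r])/g;       # 1, 34, 56 ("12" backtracks to "1")
my @d = $s =~ /(\d+)(?![ \t\n\r\d])/g;     # 34, 56

# A lookahead inside a capturing group: it adds nothing to the capture.
my ($e) = "foobar foobaz" =~ /(foo(?=baz))/;

print "a=@a b=@b c=@c d=@d e=$e\n";
```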

Reader Comments

:-)

Written By Albin A James on July 13th, 2022 @ 15:07
