Random notes on Perl Regular Expressions

This post was written by eli on July 10, 2022
Posted Under: perl

It’s 2022, Perl isn’t as popular as it used to be, and for a moment I questioned its relevance. Then I had a task requiring a lot of pattern matching, which reminded me why Perl is that loyal companion that always has a spot-on solution to whatever I need.

These are a few notes I took as I discovered the more advanced, and much-needed, features of Perl regexps.

  • If a regex is passed around as a value generated by qr//, the modifiers on that qr// are part of the value and stay with it wherever it is used. So e.g. if the match should be case-insensitive, put the i after the qr// itself, rather than on the match that uses it.
  • Quantifiers can be used on regex groups, whether they capture or not. For example, \d+(?:\.\d+)+ means one or more digits followed by one or more patterns of a dot and one or more digits. Think BNF.
  • Complex regular expressions can be created relatively easily by breaking them down into smaller pieces and assigning each to a variable with qr//. The complex expression becomes fairly readable this way. Almost needless to say, quantifiers can be applied to each of these subexpressions.
  • It’s possible to give capture elements names, e.g. $t =~ /^(?<pre>.*?)(?<found>[ \t\n]*${regex}[ \t\n]*)(?<post>.*)$/s. The capture results then appear in e.g. $+{pre}, $+{found} and $+{post}. This is useful in particular if the regex in the middle may have capture elements of its own, so the usual counting method doesn’t work.
  • Captured elements can be used in the regex itself, e.g. /([\'\"])(.*?)\1/ so \1 stands for either a single or double quote, whichever was found.
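  • Even better, there’s e.g. \g{-1} instead of absolute numbering. This is a relative backreference, which in this case means the capture group that was opened most recently before it. Once again, useful in a regex that can be used in more complicated contexts, where the absolute group numbers aren’t known in advance.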
  • When there are nested unnamed capture parentheses, the outer parenthesis gets the first capture number: groups are numbered by the order of their opening parentheses.
  • If there are several capture parentheses with a ‘|’ between them, all of them produce a capture position, but those that weren’t in use for matching get undef.
  • (?:…) grouping can be followed by a quantifier, so this makes perfect sense: ((?:[^\\\{\}]|\\\\|\\\{|\\\})*) matches any number of characters that aren’t a backslash or a curly bracket, or any of these escaped with a preceding backslash.
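  • Quantifiers can be possessive (super-greedy) in the sense that they don’t allow backtracking. So e.g. /a++b/ matches exactly like /a+b/, but with the former the regex engine won’t attempt to consume fewer a’s (if such are found) in order to try to find a “b”. In this example it’s just an optimization for speed, but in general a possessive quantifier can make a match fail where the plain one would succeed: /a++a/ never matches. All of these possessive quantifiers are made with an extra plus sign (*+, ++, ?+ and {n,m}+).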
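  • There are lookbehind and lookahead assertions, which are really great. In particular, the negative assertions. E.g. /(?<![ \t\n\r])(\d+)/ captures a number that doesn’t come right after a whitespace, and /(\d+)(?![ \t\n\r])/ captures a number that isn’t followed by a whitespace. Mind that the match may begin or end in the middle of a number in order to satisfy the assertion (the first regex finds “23” in “ 123”), so add \d to the character class if only whole numbers are wanted. Note that the parentheses around these assertions are for grouping, but not capturing, so in these examples only the number was captured.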
  • Lookaheads and lookbehinds also work inside grouping parentheses (whether capturing or not), as grouping is treated as an independent regex.
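
To make the notes on qr//, quantified groups and named captures a bit more concrete, here is a minimal sketch. The variable names ($int, $version, $keyword) and the sample text are made up for illustration, they aren't from the task that prompted these notes.

```perl
use strict;
use warnings;

# Small building blocks, each compiled with qr// (hypothetical names).
my $int     = qr/\d+/;
my $version = qr/$int(?:\.$int)+/;   # digits, then one or more ".digits"
my $keyword = qr/perl/i;             # the /i travels with this value

my $text = "Requires PERL 5.36.0 or later";
my %m;

# No /i on the match itself: only $keyword is case-insensitive.
if ($text =~ /^(?<pre>.*?)(?<found>$keyword\s+(?<ver>$version))(?<post>.*)$/s) {
    %m = %+;   # copy the named captures before another match clobbers them
}

print "$_ = '$m{$_}'\n" for sort keys %m;
```

Because the pieces are addressed by name, it doesn't matter how many capture groups $version or $keyword happen to contain.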
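
The notes on backreferences and capture numbering can be tried out with something like the following. It's only a sketch with invented sample strings, and $quoted is a hypothetical helper regex.

```perl
use strict;
use warnings;

# \1 stands for whichever quote character opened the string.
my ($quote, $body) = q{say "it's fine" now} =~ /(['"])(.*?)\1/;

# \g{-1} keeps pointing at the nearest group before it, even when the
# regex is embedded after other capture groups (\1 would mean (\w+) here).
my $quoted = qr/(['"]).*?\g{-1}/;
my ($key, $val) = q{name='Eli'} =~ /(\w+)=($quoted)/;

# Nested groups are numbered by their opening parenthesis: outer first.
my @nested = "2022-07" =~ /((\d+)-(\d+))/;

# Each alternative has its own capture slot; the unused one is undef.
my @alt = "cat" =~ /(dog)|(cat)/;

print "quote=$quote body=$body\n";
print "key=$key val=$val\n";
print "nested: @nested\n";
print "alt: ", join(",", map { $_ // 'undef' } @alt), "\n";
```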
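
Finally, a small sketch of the possessive quantifiers and the negative assertions mentioned above, plus the escaped-bracket pattern. The input strings are invented for the example. Note how a plain \d+ can give back digits to satisfy an assertion, while the possessive \d++ can't.

```perl
use strict;
use warnings;

# Characters that aren't a backslash or curly bracket, or escaped ones.
my ($chunk) = 'a\{b\}c}rest' =~ /^((?:[^\\\{\}]|\\\\|\\\{|\\\})*)/;

# Possessive vs. plain quantifier.
my $plain_ok = "aaa" =~ /a+a/  ? 1 : 0;   # a+ gives one "a" back
my $poss_ok  = "aaa" =~ /a++a/ ? 1 : 0;   # a++ keeps all of them

# Negative lookbehind: the match may start in the middle of a number...
my ($part)  = " 123" =~ /(?<![ \t\n\r])(\d+)/;
# ...unless digits are excluded as well.
my ($whole) = " 123" =~ /(?<![ \t\n\r\d])(\d+)/;

# Negative lookahead: backtracking shortens the number, possessive doesn't.
my ($short) = "123 " =~ /(\d+)(?![ \t\n\r])/;
my ($none)  = "123 " =~ /(\d++)(?![ \t\n\r])/;

print "chunk=$chunk plain=$plain_ok possessive=$poss_ok\n";
print join(" ", map { $_ // 'undef' } $part, $whole, $short, $none), "\n";
```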
