Perl: “$” doesn’t really mean end of string
Who ate my newline?
It’s 2023, and Perl is ranked below COBOL, but I still consider it my loyal workhorse. But even the most loyal horse will give you a grand kick in the backside every now and then.
So let’s jump to the problematic code:
#!/usr/bin/perl
use warnings;
use strict;
my $str = ".\n\n";
my $nonn = qr/[ \t]|(?<!\n)\n(?!\n)/;
my ($pre, $match, $post) = ($str =~ /^($nonn*)(.*?)($nonn*)$/s);
print "pre = \"$pre\"\n";
print "match = \"$match\"\n";
print "post = \"$post\"\n";
print "This doesn't add up!\n"
unless ($str eq "$pre$match$post");
For now, never mind what I tried to do here. Let’s just note that $nonn doesn’t capture anything: its two parenthesized expressions are a negative lookbehind and a negative lookahead, and hence don’t capture.
So now let’s look at
my ($pre, $match, $post) = ($str =~ /^($nonn*)(.*?)($nonn*)$/s);
The whole pattern is anchored between ^ and $, and everything in between falls into one of three capture groups. So no matter what, the concatenation of these three captures should equal $str, shouldn’t it? Let’s give it a test run:
$ ./try.pl
pre = ""
match = ".
"
post = ""
This doesn't add up!
So $pre and $post are empty. OK, fine. Hence $match should equal $str, which is “.\n\n”. But I see only one newline. Where’s the other one?
RTFM
The one thing that I really like about Perl is that even when it plays a dirty trick, the answer is in the plain manual. As in “man perlre”, where it says, in black and white, in the description of $:
Match the end of the string (or before newline at the end of the string; or before any newline if /m is used)
So there we have it. “$” can also match just before a newline, provided that this newline is the last character of the string. Note that “$” is a zero-width assertion and consumes nothing, so even if there’s a capture on the “$” itself, as in “($)”, that last newline is still not captured. It’s a Perl quirk. One of those things that make Perl do exactly what you really want, except for when you’re surgical about it.
I’ve been using Perl a lot for 20 years, and I wasn’t aware that “$” could match anywhere but at the end of the string (let alone what the “/m” modifier does to it).
So that’s what happened above: $ matched at the position just before the last newline, treating it as the end, and one newline went up in smoke.
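To see the quirk in isolation, here is a minimal sketch that compares the length of a string with the length of what a lazy capture in front of “$” picks up. The helper name capture_length is made up for this illustration.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper: length of what (.*?) captures when anchored by ^ and $
sub capture_length {
    my ($s) = @_;
    my ($captured) = ($s =~ /^(.*?)$/s);
    return length($captured);
}

for my $s ("abc", "abc\n", "abc\n\n") {
    # If "$" meant only "end of string", both numbers would always be equal
    printf "length of string = %d, length of capture = %d\n",
        length($s), capture_length($s);
}
```

Whenever the string ends with a newline, the capture comes out one character short: “$” is satisfied just before that final newline, so the lazy (.*?) never has to swallow it.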
Use \z instead
The second thing that I really like about Perl is that even when it’s quirky, there’s always a simple solution. The same “man perlre” also says:
\z Match only at end of string
Simple, isn’t it? From now on and until the end of time, always use \z if you really mean the end of string. Like, character-wise. And if I change “$” to “\z” in the code above, I get:
my ($pre, $match, $post) = ($str =~ /^($nonn*)(.*?)($nonn*)\z/s);
and the test run gives:
$ ./try.pl
pre = ""
match = ".

"
post = ""
The workhorse is back on track again.
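For a side-by-side comparison, the sketch below runs the same string through both variants of the regex and checks whether the three captures add up to the original. The helper round_trips and the variable names $with_dollar and $with_z are made up for this illustration.

```perl
#!/usr/bin/perl
use warnings;
use strict;

my $nonn = qr/[ \t]|(?<!\n)\n(?!\n)/;

# The same pattern twice, differing only in the end anchor
my $with_dollar = qr/^($nonn*)(.*?)($nonn*)$/s;
my $with_z      = qr/^($nonn*)(.*?)($nonn*)\z/s;

# Hypothetical helper: true if the captured parts concatenate back to the input
sub round_trips {
    my ($str, $re) = @_;
    my @parts = ($str =~ $re) or return 0;
    return join('', @parts) eq $str;
}

my $str = ".\n\n";
print "with \$:  ", (round_trips($str, $with_dollar) ? "adds up" : "doesn't add up"), "\n";
print "with \\z: ", (round_trips($str, $with_z)      ? "adds up" : "doesn't add up"), "\n";
```

Only the \z variant is forced to account for every character up to the very end of the string.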
What I really wanted to do
Since I messed up with this regex, I should maybe explain what it does:
my $nonn = qr/[ \t]|(?<!\n)\n(?!\n)/;
First, let’s note that $nonn matches exactly one character: It’s either a plain space, a tab or a newline. (It’s the “*” after it in the bigger regex that allows for none or several of these.) But what’s the mess with the newline?
The “(?<!\n)\n(?!\n)” part says this: Match a \n character that isn’t preceded by a \n, and isn’t followed by a \n. Or in other words, match a newline only if it isn’t part of a sequence of newlines. Only if it’s one, isolated \n.
No double \n. Or for short, “nonn”.
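A quick way to convince yourself of this is to count how many characters $nonn picks out of a few sample strings. This is only a sketch; the helper count_nonn is made up for the purpose.

```perl
#!/usr/bin/perl
use warnings;
use strict;

my $nonn = qr/[ \t]|(?<!\n)\n(?!\n)/;

# Hypothetical helper: how many characters in the string does $nonn match?
sub count_nonn {
    my ($s) = @_;
    my $count = () = $s =~ /$nonn/g;
    return $count;
}

print count_nonn("a\nb"), "\n";      # 1: an isolated newline matches
print count_nonn("a\n\nb"), "\n";    # 0: neither newline of a pair matches
print count_nonn("a\n\n\nb"), "\n";  # 0: same goes for longer runs
print count_nonn("a \tb"), "\n";     # 2: a space and a tab
```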
I needed this for a script that handles multiple newlines later on (in LaTeX, a double newline means a new paragraph, that’s the reason).
And it actually worked. The “\n\n” part in the string wasn’t matched into either $pre or $post. But the (.*?), which attempts to match as little as possible, sold off the last newline to $. Tricky stuff.