Perl: Matching an apparently plain space in HTML with a regular expression

This post was written by eli on January 5, 2022
Posted Under: Internet,perl

I’ve been using a plain space character in Perl regular expressions for ages, and it has always worked. Something like this for finding double spaces:

my @doubles = ($t =~ / {2,}/g);

or for emphasis on the space character, equivalently:

my @doubles = ($t =~ /[ ]{2,}/g);

But then I began processing the HTML representation produced by the Mojo::DOM module (or TinyMCE’s output directly), and this just didn’t work. That is, \s detected the spaces (with Perl 5.26), but the plain space character didn’t.

As it turns out, TinyMCE put &nbsp; instead of the first space (when there was a pair of them), which Mojo::DOM correctly translated to the Unicode character U+00A0 (0xc2, 0xa0 in UTF-8). Hence there is no chance that a plain space, i.e. 0x20, will match it. Perl was clever enough to match it as whitespace (with \s).
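To see the mismatch in isolation, here is a minimal sketch. The sample string is made up for illustration, and the /u flag is my addition: it forces Unicode rules, because whether \s matches U+00A0 can otherwise depend on how Perl stores the string internally.

```perl
use strict;
use warnings;

# Hypothetical sample text: a non-breaking space (U+00A0) followed by a
# plain space, roughly what decoding "&nbsp; " would give.
my $t = "foo\x{a0} bar";

# A literal space in the pattern only matches 0x20, so the pair is not found.
my $plain_hit = ($t =~ / {2,}/) ? 1 : 0;

# \s under Unicode rules (/u) also matches U+00A0, so the pair is found.
my $ws_hit = ($t =~ /\s{2,}/u) ? 1 : 0;

# Dump the code points to reveal what the "space" really is.
my $codes = join ' ', map { sprintf('U+%04X', ord($_)) } split //, $t;

print "plain: $plain_hit, \\s: $ws_hit\n";
print "$codes\n";
```

Dumping the code points like this is a quick way to check whether a space that looks plain really is one.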

Solution: Simple. Just go

my @doubles = ($t =~ /[ \xa0]{2,}/g);

In other words, match either the good old space or the non-breaking space.
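As a quick check of the character class, the following sketch runs it on a made-up string that mixes both kinds of space. The sample text and the variable names other than @doubles are only for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample: "one", NBSP + space, "two", two plain spaces,
# "three", a single space, "four".
my $t = "one\x{a0} two  three four";

# Match runs of two or more characters, each a plain space or U+00A0.
my @doubles = ($t =~ /[ \xa0]{2,}/g);

# Number of runs found, and the length of each run.
my $count   = scalar @doubles;
my @lengths = map { length($_) } @doubles;

print "found $count runs, lengths: @lengths\n";
```

Both the mixed pair and the plain pair should be reported, while the single space between "three" and "four" is left alone.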
