Perl: Matching an apparently plain space in HTML with a regular expression
I’ve been using a plain space character in Perl regular expressions for ages, and it has always worked. Something like this for finding double spaces:
my @doubles = ($t =~ / {2,}/g);
or for emphasis on the space character, equivalently:
my @doubles = ($t =~ /[ ]{2,}/g);
but then I began processing the HTML representation from the Mojo::DOM module (or TinyMCE’s output directly) and this just didn’t work. That is, \s detected the spaces (with Perl 5.26), but the plain space character didn’t.
As it turns out, TinyMCE put &nbsp; in place of the first space (when there was a pair of them), which Mojo::DOM correctly translated to the Unicode character 0xa0 (0xc2, 0xa0 in UTF-8). Hence there is no chance that a plain space, i.e. a 0x20, will match it. Perl was clever enough to match it as whitespace (with \s).
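To see what is actually hiding in such a string, it helps to dump the code points of a whitespace run. The following is only a minimal sketch: the sample text is made up, and decoding the raw bytes with Encode stands in for the character string that Mojo::DOM would hand back.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: made-up sample text. The bytes 0xc2 0xa0 (a non-breaking
# space in UTF-8) followed by a plain space, decoded into characters.
my $t = decode('UTF-8', "foo\xc2\xa0 bar");

my @plain = ($t =~ / {2,}/g);     # runs of plain spaces only
my @any   = ($t =~ /\s{2,}/g);    # runs of any whitespace

# Show the code point of each character in the first whitespace run.
my @codes = map { sprintf('U+%04X', ord($_)) } split //, $any[0];

printf "plain: %d, \\s: %d, run: %s\n", scalar @plain, scalar @any, "@codes";
# prints: plain: 0, \s: 1, run: U+00A0 U+0020
```

The run that looks like two spaces is really U+00A0 followed by U+0020, which is why the plain-space pattern finds nothing.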
Solution: Simple. Just go
my @doubles = ($t =~ /[ \xa0]{2,}/g);
In other words, match either the good old space or the non-breaking space.
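A quick way to check the character class is to run it on a string containing both kinds of runs. Again a sketch under assumptions: the sample text is invented, the string is decoded to characters first (as it would be coming from Mojo::DOM), and the collapsing step at the end is just one possible use of the pattern.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: made-up sample text with one NBSP + space pair and one
# pair of plain spaces.
my $t = decode('UTF-8', "foo\xc2\xa0 bar  baz");

# Either a plain space or a non-breaking space, two or more in a row.
my @doubles = ($t =~ /[ \xa0]{2,}/g);
print scalar(@doubles), "\n";    # prints 2

# Optional: collapse every such run into a single plain space.
(my $clean = $t) =~ s/[ \xa0]{2,}/ /g;
print "$clean\n";                # prints "foo bar baz"
```

Note that \xa0 in the pattern means the character U+00A0, so this relies on the string holding decoded characters rather than raw UTF-8 bytes.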