Perl script rectifying the encoding of a mixed UTF-8 / windows-1255 file

This post was written by eli on January 13, 2013
Posted Under: Linux,perl,Software

Suppose that some hodge-podge of scripts and template files creates a single file containing Hebrew text in more than one encoding, that is, both UTF-8 and windows-1255. As long as the two encodings never appear on the same line, the following script converts it all to UTF-8:

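#!/usr/bin/perl
use strict;
use warnings;

use Encode;

# Encode everything printed to STDOUT as UTF-8
binmode(STDOUT, ":utf8");

while (defined (my $l = <>)) {
  # First attempt: the line is already valid UTF-8 (or plain ASCII)
  eval { print decode("utf-8", $l, Encode::FB_CROAK); };
  next unless ($@);

  # Second attempt: treat the line as windows-1255
  eval { print decode("windows-1255", $l, Encode::FB_CROAK); };

  if ($@) {
    # Neither encoding fits: warn on STDERR and pass the line through undecoded
    print STDERR "Failed to decode line $.: $l\n";
    print $l;
  }
}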

The binmode() call tells Perl to encode everything printed to standard output as UTF-8. This also mutes the "Wide character in print" warnings that would otherwise show up.
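To see what the :utf8 layer does with decoded text, here is a minimal sketch. An in-memory filehandle stands in for STDOUT so that the output can be inspected; the variable names and the sample word are mine, not part of the script above.

```perl
use strict;
use warnings;
use Encode;

# Eight bytes: the Hebrew word "shalom" encoded as UTF-8 (sample data)
my $bytes = "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d";
my $chars = decode("UTF-8", $bytes);    # four Perl characters

# An in-memory filehandle plays the role of STDOUT
my $buf = '';
open(my $out, '>', \$buf) or die "Cannot open in-memory handle: $!";
binmode($out, ":utf8");
print $out $chars;                      # the layer encodes the characters again
close($out);

printf "%d characters in, %d bytes out, %s\n",
  length($chars), length($buf),
  ($buf eq $bytes ? "identical to the input" : "different from the input");
```

The bytes that come out should match the bytes that went in, which is why decoding a valid UTF-8 line and printing it leaves that line unchanged.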

The first decode() call effectively does nothing if the line contains only ASCII or UTF-8 encoded characters: the bytes are decoded into Perl’s internal characters, and print encodes them right back into UTF-8. The interesting part is the third argument, CHECK, which is set to FB_CROAK (man Encode for details). This tells decode() to die() if a malformed character is encountered. Being enclosed in an eval { }, the call doubles as a test. If this no-operation decoding goes by peacefully (that is, $@ is empty), we know that the line was printed, and the script goes on to the next line.

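If it does fail, a second attempt to decode() takes place, this time from windows-1255. If that fails as well, a warning is issued to standard error, and the line is printed undecoded. Note that since such a line is still raw bytes, the :utf8 layer on standard output treats each byte above 0x7F as a Latin-1 character, so it doesn’t come out byte-for-byte identical.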
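The same two-step test can be turned into a diagnostic that only reports which encoding a line matches, without converting anything. This is a sketch under my own assumptions: classify_line() is a made-up helper name, and LEAVE_SRC is added so that decode() doesn’t overwrite the input between attempts.

```perl
use strict;
use warnings;
use Encode;

# Hypothetical helper: name the first encoding that decodes $bytes cleanly
sub classify_line {
  my ($bytes) = @_;

  for my $enc ("utf-8", "windows-1255") {
    # FB_CROAK makes decode() die on malformed input; eval catches that.
    # LEAVE_SRC keeps decode() from modifying $bytes in place.
    my $ok = eval {
      decode($enc, $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
      1;
    };
    return $enc if $ok;
  }

  return "unknown";
}

print classify_line("plain ASCII\n"), "\n";                      # utf-8
print classify_line("\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d\n"), "\n"; # utf-8
print classify_line("\xf9\xec\xe5\xed\n"), "\n";                 # windows-1255
```

Pure ASCII lines are reported as utf-8, since ASCII is valid UTF-8. Running something like this over the file first gives an idea of how many lines of each kind there are before anything is rewritten.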

As a final note, it looks like this script could be improved by using FB_QUIET instead. This option makes decode() run as far as it can, return whatever it managed to decode, and overwrite the input variable with the undecoded remainder. So this could be a way to munch through a string chunk by chunk, trying a different encoding each time. Or so says the manual page. I never tried it.
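A minimal sketch of how the FB_QUIET idea might look, going by the documented behavior. The decode_mixed() helper is hypothetical, and falling back to windows-1255 for only one byte at a time is my own choice: windows-1255 accepts nearly any byte, so letting it run freely would swallow the rest of the line, including any valid UTF-8 that follows.

```perl
use strict;
use warnings;
use Encode;

# Hypothetical helper: decode bytes in which UTF-8 and windows-1255
# may be mixed within the same line
sub decode_mixed {
  my ($bytes) = @_;
  my $text = '';

  while (length $bytes) {
    # Decode as much valid UTF-8 as possible. With FB_QUIET, decode()
    # leaves the unprocessed remainder in $bytes.
    $text .= decode("UTF-8", $bytes, Encode::FB_QUIET());
    last unless length $bytes;

    # UTF-8 decoding stalled: treat the offending byte as windows-1255.
    # Without a CHECK argument, an unmapped byte becomes U+FFFD.
    my $byte = substr($bytes, 0, 1, '');
    $text .= decode("windows-1255", $byte);
  }

  return $text;
}

# "shalom" in UTF-8, a space, then the same word in windows-1255
my $mixed = "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d \xf9\xec\xe5\xed";

binmode(STDOUT, ":utf8");
print decode_mixed($mixed), "\n";
```

This is a heuristic, not a guarantee: a windows-1255 byte sequence that happens to form valid UTF-8 gets decoded as UTF-8.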

 
