Fixing a file consisting of both UTF-8 and Windows-1252

遇见更好的自我 2020-12-01 19:16

I have an application that produces a UTF-8 file, but some of the contents are incorrectly encoded. Some of the characters are encoded as iso-8859-1 (aka iso-latin-1) or cp1252 (Windows-1252).

3 Answers
  • 2020-12-01 19:29

    This is one of the reasons I wrote Unicode::UTF8. With Unicode::UTF8 this is trivial using the fallback option in Unicode::UTF8::decode_utf8().

    use Unicode::UTF8 qw[decode_utf8];
    use Encode        qw[decode];
    
    print "UTF-8 mixed with Latin-1 (ISO-8859-1):\n";
    for my $octets ("\xD0 \x92 \xD0\x92\n", "\xD0\x92\n") {
        no warnings 'utf8';
        printf "U+%v04X\n", decode_utf8($octets, sub { $_[0] });
    }
    
    print "\nUTF-8 mixed with CP-1252 (Windows-1252):\n";
    for my $octets ("\xD0 \x92 \xD0\x92\n", "\xD0\x92\n") {
        no warnings 'utf8';
        printf "U+%v04X\n", decode_utf8($octets, sub { decode('CP-1252', $_[0]) });
    }
    

    Output:

    UTF-8 mixed with Latin-1 (ISO-8859-1):
    U+00D0.0020.0092.0020.0412.000A
    U+0412.000A
    
    UTF-8 mixed with CP-1252 (Windows-1252):
    U+00D0.0020.2019.0020.0412.000A
    U+0412.000A
    

    Unicode::UTF8 is written in C/XS and only invokes the fallback callback when it encounters an ill-formed UTF-8 sequence.

  • 2020-12-01 19:29

    Recently I came across files containing a severe mix of UTF-8, CP1252, and text that had been encoded as UTF-8, misinterpreted as CP1252, encoded as UTF-8 again, misinterpreted as CP1252 again, and so forth.

    I wrote the code below, which worked well for me. It looks for typical UTF-8 byte sequences, even when some of the bytes appear not as raw bytes but as the Unicode character that the equivalent CP1252 byte decodes to.

    my %cp1252Encoding = (
    # replacing the unicode code with the original CP1252 code
    # see e.g. http://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html
    "\x{20ac}" => "\x80",
    "\x{201a}" => "\x82",
    "\x{0192}" => "\x83",
    "\x{201e}" => "\x84",
    "\x{2026}" => "\x85",
    "\x{2020}" => "\x86",
    "\x{2021}" => "\x87",
    "\x{02c6}" => "\x88",
    "\x{2030}" => "\x89",
    "\x{0160}" => "\x8a",
    "\x{2039}" => "\x8b",
    "\x{0152}" => "\x8c",
    "\x{017d}" => "\x8e",
    
    "\x{2018}" => "\x91",
    "\x{2019}" => "\x92",
    "\x{201c}" => "\x93",
    "\x{201d}" => "\x94",
    "\x{2022}" => "\x95",
    "\x{2013}" => "\x96",
    "\x{2014}" => "\x97",
    "\x{02dc}" => "\x98",
    "\x{2122}" => "\x99",
    "\x{0161}" => "\x9a",
    "\x{203a}" => "\x9b",
    "\x{0153}" => "\x9c",
    "\x{017e}" => "\x9e",
    "\x{0178}" => "\x9f",
    );
    my $re = join "|", keys %cp1252Encoding;
    $re = qr/$re/;
    my %cp1252Decoding = reverse %cp1252Encoding;
    my $cp1252Characters = join "|", keys %cp1252Decoding;
    
    sub decodeUtf8
    {
        my ($str) = @_;
    
        $str =~ s/$re/ $cp1252Encoding{$&} /eg;
        utf8::decode($str);
        return $str;
    }
    
    sub fixString
    {
        my ($str) = @_;
    
        my $r = qr/[\x80-\xBF]|$re/;
    
        my $current;
        do {
            $current = $str;
    
            # If this matches, the string is likely double-encoded UTF-8. Try to decode
            $str =~ s/[\xF0-\xF7]$r$r$r|[\xE0-\xEF]$r$r|[\xC0-\xDF]$r/ decodeUtf8($&) /eg;
    
        } while ($str ne $current);
    
        # decodes any possible left-over cp1252 codes to Unicode
        $str =~ s/$cp1252Characters/ $cp1252Decoding{$&} /eg;
        return $str;
    }
    

    This has limitations similar to those of ikegami's answer, except that they also apply to UTF-8 encoded strings.
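
    To make the damage described at the top of this answer concrete, here is a minimal sketch of how one round of double encoding arises and how it is undone. It uses only the core Encode module; the variable names are illustrative and not part of the code above. Note how the byte 0x82 turns into U+201A, which is why the mapping table is needed.

    ```perl
    use strict;
    use warnings;
    use Encode qw( encode decode );

    my $original = "\x{20AC}";                      # EURO SIGN
    my $utf8     = encode("UTF-8", $original);      # bytes E2 82 AC

    # Misread the UTF-8 bytes as cp1252: each byte becomes its own character.
    my $mangled = decode("cp1252", $utf8);
    printf "U+%v04X\n", $mangled;                   # U+00E2.201A.00AC

    # Writing that text out as UTF-8 yields the double-encoded bytes.
    my $double = encode("UTF-8", $mangled);
    printf "%vX\n", $double;                        # C3.A2.E2.80.9A.C2.AC

    # One round of repair: decode, map back to cp1252 bytes, decode as UTF-8.
    my $repaired = decode("UTF-8", encode("cp1252", decode("UTF-8", $double)));
    printf "U+%v04X\n", $repaired;                  # U+20AC
    ```

    Each further round of damage needs one more round of repair, which is presumably why fixString loops until the string stops changing.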

  • 2020-12-01 19:40

    Yes!

    Obviously, it's better to fix the program creating the file, but that's not always possible. What follows are two solutions.

    A line can contain a mix of encodings

    Encoding::FixLatin provides a function named fix_latin which decodes text that consists of a mix of UTF-8, iso-8859-1, cp1252 and US-ASCII.

    $ perl -e'
       use Encoding::FixLatin qw( fix_latin );
       $bytes = "\xD0 \x92 \xD0\x92\n";
       $text = fix_latin($bytes);
       printf("U+%v04X\n", $text);
    '
    U+00D0.0020.2019.0020.0412.000A
    

    Heuristics are employed, but they are fairly reliable. Only the following cases will fail:

    • One of
      [ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß]
      encoded using iso-8859-1 or cp1252, followed by one of
      [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿]
      encoded using iso-8859-1 or cp1252.

    • One of
      [àáâãäåæçèéêëìíîï]
      encoded using iso-8859-1 or cp1252, followed by two of
      [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿]
      encoded using iso-8859-1 or cp1252.

    • One of
      [ðñòóôõö÷]
      encoded using iso-8859-1 or cp1252, followed by three of
      [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿]
      encoded using iso-8859-1 or cp1252.
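
    The failing cases above can be read as byte patterns: a lead byte followed by the right number of bytes in the range 0x80-0xBF. The following is a rough sketch that reports such spans in a byte string. The name ambiguous_spans is hypothetical, and the pattern simply mirrors the lists above; it is not taken from Encoding::FixLatin's source.

    ```perl
    use strict;
    use warnings;

    # Byte patterns mirroring the three cases listed above.
    my $looks_like_utf8 = qr/
          [\xC0-\xDF] [\x80-\xBF]        # lead byte plus one continuation byte
        | [\xE0-\xEF] [\x80-\xBF]{2}     # lead byte plus two continuation bytes
        | [\xF0-\xF7] [\x80-\xBF]{3}     # lead byte plus three continuation bytes
    /x;

    # Hypothetical helper: hex dump of every span that a UTF-8-first decoder
    # would take for a multi-byte character.
    sub ambiguous_spans {
        my ($bytes) = @_;
        my @spans;
        while ($bytes =~ /($looks_like_utf8)/g) {
            push @spans, sprintf("%vX", $1);
        }
        return @spans;
    }

    # "café au lait" in iso-8859-1: \xE9 is not followed by 0x80-0xBF, so it is safe.
    print join(",", ambiguous_spans("caf\xE9 au lait")), "\n";   # prints an empty line

    # "Ã©" in iso-8859-1 is \xC3\xA9, which is also the UTF-8 encoding of "é".
    print join(",", ambiguous_spans("\xC3\xA9")), "\n";          # C3.A9
    ```

    A span reported here is not necessarily wrong, since genuine UTF-8 matches too. It only marks the places where the bytes alone cannot settle the question.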

    The same result can be produced using core module Encode, though I imagine this is a fair bit slower than Encoding::FixLatin with Encoding::FixLatin::XS installed.

    $ perl -e'
       use Encode qw( decode_utf8 encode_utf8 decode );
       $bytes = "\xD0 \x92 \xD0\x92\n";
       $text = decode_utf8($bytes, sub { encode_utf8(decode("cp1252", chr($_[0]))) });
       printf("U+%v04X\n", $text);
    '
    U+00D0.0020.2019.0020.0412.000A
    

    Each line only uses one encoding

    fix_latin works at the character level. If it's known that each line is entirely encoded using one of UTF-8, iso-8859-1, cp1252 or US-ASCII, you can make the process even more reliable by checking whether the line is valid UTF-8.

    $ perl -e'
       use Encode qw( decode );
       for $bytes ("\xD0 \x92 \xD0\x92\n", "\xD0\x92\n") {
          if (!eval {
             $text = decode("UTF-8", $bytes, Encode::FB_CROAK|Encode::LEAVE_SRC);
             1  # No exception
          }) {
             $text = decode("cp1252", $bytes);
          }
    
          printf("U+%v04X\n", $text);
       }
    '
    U+00D0.0020.2019.0020.00D0.2019.000A
    U+0412.000A
    

    Heuristics are employed, but they are very reliable. They will only fail if all of the following are true for a given line:

    • The line is encoded using iso-8859-1 or cp1252,

    • At least one of
      [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷]
      is present in the line,

    • All instances of
      [ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß]
      are always followed by exactly one of
      [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],

    • All instances of
      [àáâãäåæçèéêëìíîï]
      are always followed by exactly two of
      [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],

    • All instances of
      [ðñòóôõö÷]
      are always followed by exactly three of
      [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],

    • None of
      [øùúûüýþÿ]
      are present in the line, and

    • None of
      [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿]
      are present in the line except where previously mentioned.
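
    As an illustration of these conditions, the sketch below feeds two cp1252 lines through the same UTF-8-then-cp1252 check. The first line meets every condition and is misdetected as UTF-8. The second contains a stray "ü" (0xFC), so the whole line is correctly treated as cp1252. The sample strings are made up for the demonstration.

    ```perl
    use strict;
    use warnings;
    use Encode qw( decode );

    # Both lines are meant as cp1252 text: "Ã©" and "Ã© ü".
    for my $bytes ("\xC3\xA9\n", "\xC3\xA9 \xFC\n") {
        my $text = eval {
            decode("UTF-8", $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        };
        # decode() died, so the line is not valid UTF-8: fall back to cp1252.
        $text = decode("cp1252", $bytes) if !defined $text;

        printf "U+%v04X\n", $text;
    }
    # Prints:
    #   U+00E9.000A                      (misdetected: read as "é")
    #   U+00C3.00A9.0020.00FC.000A       (correct: "Ã© ü")
    ```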


    Notes:

    • Encoding::FixLatin installs a command-line tool, fix_latin, to convert files, and it would be trivial to write a similar tool using the second approach.
    • fix_latin (both the function and the command-line tool) can be sped up by installing Encoding::FixLatin::XS.
    • The same approach can be used for mixes of UTF-8 with other single-byte encodings. The reliability should be similar, but it can vary.
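
    A converter built on the second approach might look roughly like the sketch below. It is only an outline: decode_line is a hypothetical helper name, and in-memory filehandles stand in for the real input and output files so that the example is self-contained.

    ```perl
    use strict;
    use warnings;
    use Encode qw( decode );

    # Hypothetical helper: strict UTF-8 first, cp1252 as the fallback.
    sub decode_line {
        my ($bytes) = @_;
        my $text = eval {
            decode("UTF-8", $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        };
        return defined $text ? $text : decode("cp1252", $bytes);
    }

    # In-memory stand-ins for the files: one UTF-8 line, one cp1252 line.
    my $input  = "\xD0\x92\n" . "caf\xE9\n";
    my $output = "";

    open(my $in,  "<:raw",             \$input)  or die $!;
    open(my $out, ">:encoding(UTF-8)", \$output) or die $!;
    print {$out} decode_line($_) while <$in>;
    close($out);

    printf "%v02X\n", $output;   # D0.92.0A.63.61.66.C3.A9.0A
    ```

    For a real tool, the two open calls would take file names (or STDIN and STDOUT with binmode) instead of scalar references.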