How to detect latin1 and UTF-8?

Posted by 我只是一个虾纸丫 on 2019-11-29 12:11:05

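UTF-8 places strict constraints on bytes 0x80..0xFF: each lead byte must be followed by the right number of continuation bytes. As a result, it's very unlikely that text encoded using iso-8859-1 would be valid UTF-8 unless it decodes identically using both encodings[1].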

As such, the solution is to try decoding it using UTF-8. If that fails, decode it using iso-8859-1 instead. Since decoding iso-8859-1 in Perl maps each byte to the code point with the same number, it leaves the string unchanged (a no-op), so I'll be skipping that step.

  • utf8:: implementation:

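    # utf8::decode works in place. If the input isn't valid UTF-8,
    # it returns false and leaves the string as it was.
    my $decoded_text = $utf8_or_latin1;
    utf8::decode($decoded_text);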
    
  • Encode:: implementation:

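    use Encode qw( decode_utf8 );
    
    # FB_CROAK makes decode_utf8 die on malformed input;
    # LEAVE_SRC stops it from modifying $utf8_or_latin1.
    # The // (defined-or) operator requires Perl 5.10 or later.
    my $decoded_text =
       eval { decode_utf8($utf8_or_latin1, Encode::FB_CROAK|Encode::LEAVE_SRC) }
          // $utf8_or_latin1;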
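
To see which branch was taken, the utf8:: approach above can be wrapped in a small helper. This is only a sketch: the name `decode_utf8_or_latin1` and the returned encoding label are illustrative, not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the decoded text plus the encoding it guessed.
sub decode_utf8_or_latin1 {
    my ($bytes) = @_;
    my $text = $bytes;
    # utf8::decode() returns true if $text was well-formed UTF-8
    # (plain ASCII counts as well-formed too).
    return ($text, 'UTF-8') if utf8::decode($text);
    # Otherwise treat the bytes as iso-8859-1, which needs no conversion.
    return ($bytes, 'iso-8859-1');
}

my ($t1, $e1) = decode_utf8_or_latin1("caf\xC3\xA9");  # "café" as UTF-8 bytes
my ($t2, $e2) = decode_utf8_or_latin1("caf\xE9");      # "café" as iso-8859-1 bytes

printf "%s, %d chars\n", $e1, length $t1;   # UTF-8, 4 chars
printf "%s, %d chars\n", $e2, length $t2;   # iso-8859-1, 4 chars
```

Both inputs end up as the same four-character string, which is the point of the fallback.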
    

Now, you say you want UTF-8. UTF-8 is obtained from encoding decoded text.

  • utf8:: implementation:

    my $utf8 = $decoded_text;
    utf8::encode($utf8);
    
  • Encode:: implementation:

    use Encode qw( encode_utf8 );
    
    my $utf8 = encode_utf8($decoded_text);
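
Putting the two steps together, a minimal sketch of a function that accepts either encoding and always returns UTF-8 bytes might look like this. The name `to_utf8_bytes` is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: accepts UTF-8 or iso-8859-1 bytes, returns UTF-8 bytes.
sub to_utf8_bytes {
    my ($bytes) = @_;
    my $text = $bytes;
    utf8::decode($text);   # if this fails, $text is unchanged, i.e. the
                           # iso-8859-1 bytes, which are already the characters
    utf8::encode($text);   # characters -> UTF-8 bytes
    return $text;
}

for my $input ("caf\xC3\xA9", "caf\xE9") {
    my $utf8 = to_utf8_bytes($input);
    # Both inputs print: 63 61 66 C3 A9
    print join(' ', map { sprintf('%02X', ord($_)) } split //, $utf8), "\n";
}
```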
    

Notes

  1. Assuming the text is either valid UTF-8 or valid iso-8859-1, my solution would only guess wrong if all of the following are true:

    • The text is encoded using iso-8859-1 (as opposed to UTF-8),
    • At least one of [
      <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
      <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
      <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿
      ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
      àáâãäåæçèéêëìíîïðñòóôõö÷
      ] is present,
    • All instances of [ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß] are followed by one of [
      <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
      <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
      <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
    • All instances of [àáâãäåæçèéêëìíîï] are followed by two of [
      <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
      <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
      <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
    • All instances of [ðñòóôõö÷] are followed by three of [
      <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
      <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
      <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
    • None of [øùúûüýþÿ] are present, and
    • None of [
      <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
      <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
      <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿
      ] are present except where previously mentioned.

    (<80>..<9F> are unassigned in ISO/IEC 8859-1 itself; the IANA charset iso-8859-1, which is what Perl decodes, maps them to the C1 control characters.)

    In other words, that code is very reliable.
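
For illustration, here is a small sketch of what a wrong guess looks like. The byte strings are invented for the example: the iso-8859-1 text "Ã©" happens to be the same two bytes as "é" in UTF-8, so it meets every condition in the list, while "Ãa" does not.

```perl
use strict;
use warnings;

# iso-8859-1 "Ã©" is bytes C3 A9, which is also well-formed UTF-8 for "é".
my $ambiguous = "\xC3\xA9";
my $copy1     = $ambiguous;
my $ok1       = utf8::decode($copy1) ? 1 : 0;
printf "C3 A9: valid UTF-8? %d, decoded length %d\n", $ok1, length $copy1;
# C3 A9: valid UTF-8? 1, decoded length 1

# iso-8859-1 "Ãa" is bytes C3 61. C3 is not followed by a byte in 80..BF,
# so this is not well-formed UTF-8 and the iso-8859-1 fallback applies.
my $plain = "\xC3a";
my $copy2 = $plain;
my $ok2   = utf8::decode($copy2) ? 1 : 0;
printf "C3 61: valid UTF-8? %d, decoded length %d\n", $ok2, length $copy2;
# C3 61: valid UTF-8? 0, decoded length 2
```

Real iso-8859-1 text rarely has every accented capital followed by a symbol such as © or ¿, which is why the misdetection is unlikely in practice.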
