Detect if a string was double-encoded in UTF-8

遇见更好的自我 2021-01-18 05:19

I need to process a large list of short strings (mostly in Russian, but any other language is possible, including random garbage from a cat walking on the keyboard).

Som

3 Answers
  •  离开以前
    2021-01-18 05:39

    Here's a PHP algorithm that worked for me.

    It's better to fix your data, but if you can't, here's a trick:

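    // Undo one layer of UTF-8 encoding (UTF-8 -> ISO-8859-1 bytes).
    $decoded = utf8_decode( $value );
    if ( mb_detect_encoding( $decoded ) === 'UTF-8' ) {
        // The decoded bytes still look like UTF-8: double encoded, or bad encoding
        $value = $decoded;
    }

    // Let the ForceUTF8 library normalize whatever is left to UTF-8.
    $value = \ForceUTF8\Encoding::toUTF8( $value );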
    

    The library I'm using is: https://github.com/neitanod/forceutf8/
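
The check above relies on utf8_decode(), which is deprecated as of PHP 8.2. A minimal sketch of the same idea using only mbstring might look like the following. The helper name fix_double_utf8() is hypothetical (it is not part of ForceUTF8), and the sketch assumes the second encoding pass treated the original bytes as ISO-8859-1; a Windows-1252 variant of the damage would need a different mapping.

```php
<?php
// Sketch only: undo one layer of double encoding when the result is valid UTF-8.
// fix_double_utf8() is a hypothetical helper, not part of the ForceUTF8 library.
function fix_double_utf8(string $value): string
{
    // Pure ASCII or invalid UTF-8 cannot be double encoded in this sense.
    if (!preg_match('/[^\x00-\x7F]/', $value) || !mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    // Every code point must fit in one byte; otherwise the string cannot be
    // the ISO-8859-1 reading of a UTF-8 byte sequence.
    if (preg_match('/[^\x{00}-\x{FF}]/u', $value)) {
        return $value;
    }
    // Same conversion utf8_decode() performs, without the deprecated call.
    $decoded = mb_convert_encoding($value, 'ISO-8859-1', 'UTF-8');
    // If the decoded bytes are themselves valid UTF-8, treat the input as double encoded.
    return mb_check_encoding($decoded, 'UTF-8') ? $decoded : $value;
}

// Simulate the damage: read UTF-8 bytes as ISO-8859-1 and encode them again.
$double = mb_convert_encoding('Привет', 'UTF-8', 'ISO-8859-1');
var_dump(fix_double_utf8($double) === 'Привет');
```

Like the original trick, this is a heuristic: a string that legitimately consists of characters such as "Ã©" is indistinguishable from a double-encoded "é" and would be rewritten.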
