Detect encoding and make everything UTF-8

暗喜 2020-11-22 03:03

I'm reading out lots of texts from various RSS feeds and inserting them into my database.

Of course, there are several different character encodings used in the feeds…

24 answers
  • 2020-11-22 03:05

    If you apply utf8_encode() to an already UTF-8 string, it will return garbled UTF-8 output.
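
    A quick way to see the effect, as a minimal sketch: it uses mb_convert_encoding() from ISO-8859-1 to UTF-8, which performs the same conversion as utf8_encode() (utf8_encode() itself is deprecated as of PHP 8.2). The sample bytes are only an illustration.

    ```php
    <?php
    // "é" already encoded as UTF-8 (bytes C3 A9).
    $utf8 = "\xC3\xA9";

    // Treat every byte as an ISO-8859-1 character and encode it again,
    // which is what utf8_encode() does.
    $twice = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

    echo bin2hex($utf8), "\n";   // c3a9
    echo bin2hex($twice), "\n";  // c383c2a9 (shown as "Ã©" by a UTF-8 viewer)
    ```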

    I made a function that addresses all these issues. It's called Encoding::toUTF8().

    You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.

    I did it because a service was giving me a feed of data all messed up, mixing UTF-8 and Latin1 in the same string.

    Usage:

    require_once('Encoding.php');
    use \ForceUTF8\Encoding;  // It's namespaced now.
    
    $utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);
    
    $latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);
    

    Download:

    https://github.com/neitanod/forceutf8

    I've included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled.

    Usage:

    require_once('Encoding.php');
    use \ForceUTF8\Encoding;  // It's namespaced now.
    
    $utf8_string = Encoding::fixUTF8($garbled_utf8_string);
    

    Examples:

    echo Encoding::fixUTF8("Fédération Camerounaise de Football");
    echo Encoding::fixUTF8("Fédération Camerounaise de Football");
    echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football");
    echo Encoding::fixUTF8("Fédération Camerounaise de Football");
    

    will output:

    Fédération Camerounaise de Football
    Fédération Camerounaise de Football
    Fédération Camerounaise de Football
    Fédération Camerounaise de Football
    

    I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().

  • 2020-11-22 03:06

    Your encoding looks like you encoded into UTF-8 twice; that is, from some other encoding, into UTF-8, and again into UTF-8. As if you had ISO 8859-1, converted from ISO 8859-1 to UTF-8, and treated the new string as ISO 8859-1 for another conversion into UTF-8.

    Here's some pseudocode of what you did:

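    $inputstring = getFromUser();                                   // raw bytes in $current_encoding
    $utf8string = iconv($current_encoding, 'utf-8', $inputstring);  // correct: the string is now UTF-8
    $flawedstring = iconv($current_encoding, 'utf-8', $utf8string); // flaw: UTF-8 bytes read as $current_encoding again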
    

    You should try:

    1. Detect the encoding using mb_detect_encoding() or whatever you like to use.
    2. If it's UTF-8, convert into ISO 8859-1, and repeat step 1.
    3. Finally, convert back into UTF-8.

    That is presuming that in the "middle" conversion you used ISO 8859-1. If you used Windows-1252, then convert into Windows-1252 (latin1). The original source encoding is not important; the one you used in the flawed second conversion is.
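
    The steps above could be sketched roughly as follows, assuming the flawed middle conversion treated the text as ISO 8859-1. The helper name undoDoubleUtf8() and the round limit are hypothetical. Instead of converting all the way down and back up, this variant stops as soon as one more decoding step would no longer yield valid UTF-8, which comes to the same result.

    ```php
    <?php
    // Hypothetical helper: peel off extra "ISO-8859-1 -> UTF-8" layers.
    function undoDoubleUtf8(string $s, int $maxRounds = 5): string
    {
        for ($i = 0; $i < $maxRounds; $i++) {
            if (!mb_check_encoding($s, 'UTF-8')) {
                break; // not UTF-8 at all, nothing to peel off
            }
            // Undo one layer: UTF-8 -> ISO-8859-1 bytes.
            $decoded = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');

            // Stop when nothing changed (pure ASCII), when the result is no
            // longer UTF-8 (this was the last layer), or when the step was
            // lossy (characters outside ISO-8859-1 were replaced).
            if ($decoded === $s
                || !mb_check_encoding($decoded, 'UTF-8')
                || mb_convert_encoding($decoded, 'UTF-8', 'ISO-8859-1') !== $s) {
                break;
            }
            $s = $decoded;
        }
        return $s;
    }

    echo bin2hex(undoDoubleUtf8("\xC3\x83\xC2\xA9")), "\n"; // c3a9, i.e. "é"
    ```

    Text that legitimately contains a sequence such as "Ã©" would be rewritten as well, so this only makes sense when double encoding is known to have happened.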

    This is my guess at what happened; there's very little else you could have done to get four bytes in place of one extended ASCII byte.

    The German language also uses ISO 8859-2 and Windows-1250 (Latin-2).

  • 2020-11-22 03:06

    @harpax that worked for me. In my case, this is good enough:

    if (isUTF8($str)) {
        echo $str;
    } else {
        echo iconv("ISO-8859-1", "UTF-8//TRANSLIT", $str);
    }
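
    isUTF8() is not a PHP built-in, and its definition is not shown here. A minimal stand-in, assuming the mbstring extension is available, could look like this; toUtf8FromLatin1() is a hypothetical wrapper around the snippet above.

    ```php
    <?php
    // Hypothetical stand-in for the isUTF8() helper used above.
    function isUTF8(string $str): bool
    {
        // True only if the whole string is well-formed UTF-8 (ASCII included).
        return mb_check_encoding($str, 'UTF-8');
    }

    // Hypothetical wrapper: pass valid UTF-8 through, otherwise assume ISO-8859-1.
    function toUtf8FromLatin1(string $str): string
    {
        if (isUTF8($str)) {
            return $str;
        }
        $converted = iconv('ISO-8859-1', 'UTF-8//TRANSLIT', $str);
        // Fall back to mbstring if iconv cannot do the conversion.
        return $converted !== false
            ? $converted
            : mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
    }

    echo toUtf8FromLatin1("\xE9t\xE9"), "\n"; // "été", converted from ISO-8859-1 bytes
    ```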
    
  • 2020-11-22 03:07

    This cheatsheet lists some common caveats related to UTF-8 handling in PHP: http://developer.loftdigital.com/blog/php-utf-8-cheatsheet

    This function, which detects multibyte characters in a string, might also prove helpful (source):

    
    function detectUTF8($string)
    {
        return preg_match('%(?:
            [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
            |\xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
            |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
            |\xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
            |\xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
            |[\xF1-\xF3][\x80-\xBF]{3}         # planes 4-15
            |\xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
            )+%xs', 
        $string);
    }
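
    A short usage sketch, with the function repeated so that it runs on its own. The sample strings are made up for illustration. It shows that detectUTF8() reports whether at least one well-formed multibyte sequence occurs, not whether the whole string is valid UTF-8; mb_check_encoding() is used here only for contrast.

    ```php
    <?php
    // Copy of detectUTF8() from above so the example is self-contained.
    function detectUTF8($string)
    {
        return preg_match('%(?:
            [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
            |\xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
            |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
            |\xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
            |\xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
            |[\xF1-\xF3][\x80-\xBF]{3}         # planes 4-15
            |\xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
            )+%xs',
        $string);
    }

    var_dump(detectUTF8("plain ASCII"));       // int(0): no multibyte sequence
    var_dump(detectUTF8("\xC3\xA9t\xC3\xA9")); // int(1): "été" in UTF-8
    var_dump(detectUTF8("\xE9t\xE9"));         // int(0): "été" in ISO-8859-1
    var_dump(detectUTF8("\xE9 \xC3\xA9"));     // int(1): mixed string still matches
    var_dump(mb_check_encoding("\xE9 \xC3\xA9", 'UTF-8')); // bool(false)
    ```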
    

  • 2020-11-22 03:07

    I have been looking for solutions to encoding problems for ages, and this page is probably the conclusion of years of searching! I tested some of the suggestions mentioned here, and here are my notes:

    This is my test string:

    this is a "wròng wrìtten" string bùt I nèed to pù 'sòme' special chàrs to see thèm, convertèd by fùnctìon!! & that's it!

    I do an INSERT to save this string in a database, in a field whose collation is set to utf8_general_ci.

    The character set of my page is UTF-8.

    If I do an INSERT just like that, I end up with some characters in my database that probably come from Mars...

    So I need to convert them into some "sane" UTF-8. I tried utf8_encode(), but alien characters were still invading my database...

    So I tried the forceUTF8 function posted in answer number 8, but the string saved in the database looks like this:

    this is a "wròng wrìtten" string bùt I nèed to pù 'sòme' special chà rs to see thèm, convertèd by fùnctìon!! & that's it!

    So, by collecting information from this page and merging it with information from other pages, I solved my problem with this:

    $finallyIDidIt = mb_convert_encoding(
      $string,
      mysql_client_encoding($resourceID),
      mb_detect_encoding($string)
    );
    

    Now in my database I have my string with correct encoding.

    NOTE: The only thing to take care of is mysql_client_encoding()! You need to be connected to the database, because this function expects a resource ID as a parameter.

    But well, I just do that re-encoding before my INSERT so for me it is not a problem.
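
    Note that mysql_client_encoding() belongs to the old mysql extension, which was removed in PHP 7; with mysqli, mysqli_character_set_name() plays the same role. MySQL charset names such as utf8mb4 are MySQL-specific, and mbstring may not recognize them, so a small mapping helps. A rough sketch under those assumptions follows; the helper name, the mapping table, and the ISO-8859-1 fallback are all hypothetical, and mb_check_encoding() stands in for mb_detect_encoding() to keep the result deterministic.

    ```php
    <?php
    // Hypothetical helper: convert $string to the charset of the MySQL connection.
    function toClientEncoding(string $string, string $mysqlCharset): string
    {
        // Hypothetical mapping from MySQL charset names to mbstring names.
        $map = [
            'utf8'    => 'UTF-8',
            'utf8mb3' => 'UTF-8',
            'utf8mb4' => 'UTF-8',
            'latin1'  => 'Windows-1252', // MySQL's latin1 is cp1252
        ];
        $target = $map[$mysqlCharset] ?? null;
        if ($target === null) {
            throw new InvalidArgumentException("Unmapped charset: $mysqlCharset");
        }
        // Assumption: the input is either valid UTF-8 or ISO-8859-1.
        $from = mb_check_encoding($string, 'UTF-8') ? 'UTF-8' : 'ISO-8859-1';
        return mb_convert_encoding($string, $target, $from);
    }

    // With a live connection:
    // toClientEncoding($string, mysqli_character_set_name($link));
    echo bin2hex(toClientEncoding("\xE9", 'utf8mb4')), "\n"; // c3a9
    ```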

  • 2020-11-22 03:08

    After sorting out your PHP scripts, don't forget to tell MySQL what charset you are passing and would like to receive.

    Example: set character set utf8

    Passing utf8 data to a latin1 table in a latin1 I/O session gives those nasty bird's feet. I see this every other day in osCommerce shops. Read back and forth, the data might seem right, but phpMyAdmin will show the truth. By telling MySQL what charset you are passing, you let it handle the conversion of the data for you.
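
    With PDO, one way to declare the connection charset is the charset parameter of the MySQL DSN; with mysqli, mysqli_set_charset() does the same job. A small sketch, where the helper name, host, and database name are placeholders, and utf8mb4 is assumed because MySQL's utf8 is limited to three bytes per character:

    ```php
    <?php
    // Hypothetical helper: build a PDO MySQL DSN that declares the connection charset.
    function buildMysqlDsn(string $host, string $dbname, string $charset = 'utf8mb4'): string
    {
        return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
    }

    $dsn = buildMysqlDsn('localhost', 'feeds');
    echo $dsn, "\n"; // mysql:host=localhost;dbname=feeds;charset=utf8mb4

    // With real credentials: $pdo = new PDO($dsn, $user, $password);
    ```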

    How to recover existing scrambled MySQL data is another thread to discuss. :)
