Detect encoding and make everything UTF-8

前端未结

关注

 24  2399

暗喜

I\'m reading out lots of texts from various RSS feeds and inserting them into my database.

Of course, there are several different character encodings used in the fee

相关标签:

24条回答

感情败类

2020-11-22 03:05
If you apply utf8_encode() to an already UTF-8 string, it will return garbled UTF-8 output.

I made a function that addresses all this issues. It´s called Encoding::toUTF8().

You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.

I did it because a service was giving me a feed of data all messed up, mixing UTF-8 and Latin1 in the same string.

Usage:
```
require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);
```
Download:

https://github.com/neitanod/forceutf8

I've included another function, Encoding::fixUFT8(), which will fix every UTF-8 string that looks garbled.

Usage:
```
require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);
```
Examples:
```
echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂÃÂ©dÃÂÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dération Camerounaise de Football");
```
will output:
```
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
```
I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().
0 讨论(0)
发布评论:

提交评论
- 加载中...
没有蜡笔的小新

2020-11-22 03:06
Your encoding looks like you encoded into UTF-8 twice; that is, from some other encoding, into UTF-8, and again into UTF-8. As if you had ISO 8859-1, converted from ISO 8859-1 to UTF-8, and treated the new string as ISO 8859-1 for another conversion into UTF-8.

Here's some pseudocode of what you did:
```
$inputstring = getFromUser();
$utf8string = iconv($current_encoding, 'utf-8', $inputstring);
$flawedstring = iconv($current_encoding, 'utf-8', $utf8string);
```
You should try:
1. detect encoding using mb_detect_encoding() or whatever you like to use
2. if it's UTF-8, convert into ISO 8859-1, and repeat step 1
3. finally, convert back into UTF-8
That is presuming that in the "middle" conversion you used ISO 8859-1. If you used Windows-1252, then convert into Windows-1252 (latin1). The original source encoding is not important; the one you used in flawed, second conversion is.

This is my guess at what happened; there's very little else you could have done to get four bytes in place of one extended ASCII byte.

The German language also uses ISO 8859-2 and Windows-1250 (Latin-2).
0 讨论(0)
发布评论:

提交评论
- 加载中...
旧巷少年郎

2020-11-22 03:06
@harpax that worked for me. In my case, this is good enough:
```
if (isUTF8($str)) { 
    echo $str; 
}
else
{
    echo iconv("ISO-8859-1", "UTF-8//TRANSLIT", $str);
}
```
0 讨论(0)
发布评论:

提交评论
- 加载中...

生来不讨喜

2020-11-22 03:07

This cheatsheet lists some common caveats related to UTF-8 handling in PHP: http://developer.loftdigital.com/blog/php-utf-8-cheatsheet

This function detecting multibyte characters in a string might also prove helpful (source):


function detectUTF8($string)
{
    return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}         # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
        )+%xs', 
    $string);
}

0 讨论(0)

一向

2020-11-22 03:07
I was checking for solutions to encoding since ages, and this page is probably the conclusion of years of search! I tested some of the suggestions you mentioned and here's my notes:

This is my test string:

this is a "wròng wrìtten" string bùt I nèed to pù 'sòme' special chàrs to see thèm, convertèd by fùnctìon!! & that's it!

I do an INSERT to save this string on a database in a field that is set as utf8_general_ci

The character set of my page is UTF-8.

If I do an INSERT just like that, in my database, I have some characters probably coming from Mars...

So I need to convert them into some "sane" UTF-8. I tried utf8_encode(), but still aliens chars were invading my database...

So I tried to use the function forceUTF8 posted on number 8, but in the database the string saved looks like this:

this is a "wrÃ²ng wrÃ¬tten" string bÃ¹t I nÃ¨ed to pÃ¹ 'sÃ²me' special chÃ rs to see thÃ¨m, convertÃ¨d by fÃ¹nctÃ¬on!! & that's it!

So collecting some more information on this page and merging them with other information on other pages I solved my problem with this solution:
```
$finallyIDidIt = mb_convert_encoding(
  $string,
  mysql_client_encoding($resourceID),
  mb_detect_encoding($string)
);
```
Now in my database I have my string with correct encoding.

NOTE: Only note to take care of is in function mysql_client_encoding! You need to be connected to the database, because this function wants a resource ID as a parameter.

But well, I just do that re-encoding before my INSERT so for me it is not a problem.
0 讨论(0)
发布评论:

提交评论
- 加载中...
孤城傲影

2020-11-22 03:08

After sorting out your php scripts, don't forget to tell mysql what charset you are passing and would like to recceive.

Example: set character set utf8

Passing utf8 data to a latin1 table in a latin1 I/O session gives those nasty birdfeets. I see this every other day in oscommerce shops. Back and fourth it might seem right. But phpmyadmin will show the truth. By telling mysql what charset you are passing it will handle the conversion of mysql data for you.

How to recover existing scrambled mysql data is another thread to discuss. :)

0 讨论(0)
发布评论:

提交评论
- 加载中...

1 2 3 4 下一页