Detect encoding and make everything UTF-8

前端未结

关注

 24  2403

暗喜

I\'m reading out lots of texts from various RSS feeds and inserting them into my database.

Of course, there are several different character encodings used in the fee

相关标签:

24条回答

梦谈多话

2020-11-22 03:09
A really nice way to implement an isUTF8-function can be found on php.net:
```
function isUTF8($string) {
    return (utf8_encode(utf8_decode($string)) == $string);
}
```
0 讨论(0)
发布评论:

提交评论
- 加载中...
醉酒成梦

2020-11-22 03:10
You need to test the character set on input since responses can come coded with different encodings.

I force all content been sent into UTF-8 by doing detection and translation using the following function:
```
function fixRequestCharset()
{
  $ref = array(&$_GET, &$_POST, &$_REQUEST);
  foreach ($ref as &$var)
  {
    foreach ($var as $key => $val)
    {
      $encoding = mb_detect_encoding($var[$key], mb_detect_order(), true);
      if (!$encoding)
        continue;
      if (strcasecmp($encoding, 'UTF-8') != 0)
      {
        $encoding = iconv($encoding, 'UTF-8', $var[$key]);
        if ($encoding === false)
          continue;
        $var[$key] = $encoding;
      }
    }
  }
}
```
That routine will turn all PHP variables that come from the remote host into UTF-8.

Or ignore the value if the encoding could not be detected or converted.

You can customize it to your needs.

Just invoke it before using the variables.
0 讨论(0)
发布评论:

提交评论
- 加载中...

深忆病人

2020-11-22 03:10

I know this is an older question, but I figure a useful answer never hurts. I was having issues with my encoding between a desktop application, SQLite, and GET/POST variables. Some would be in UTF-8, some would be in ASCII, and basically everything would get screwed up when foreign characters got involved.

Here is my solution. It scrubs your GET/POST/REQUEST (I omitted cookies, but you could add them if desired) on each page load before processing. It works well in a header. PHP will throw warnings if it can't detect the source encoding automatically, so these warnings are suppressed with @'s.

//Convert everything in our vars to UTF-8 for playing nice with the database...
//Use some auto detection here to help us not double-encode...
//Suppress possible warnings with @'s for when encoding cannot be detected
try
{
    $process = array(&$_GET, &$_POST, &$_REQUEST);
    while (list($key, $val) = each($process)) {
        foreach ($val as $k => $v) {
            unset($process[$key][$k]);
            if (is_array($v)) {
                $process[$key][@mb_convert_encoding($k,'UTF-8','auto')] = $v;
                $process[] = &$process[$key][@mb_convert_encoding($k,'UTF-8','auto')];
            } else {
                $process[$key][@mb_convert_encoding($k,'UTF-8','auto')] = @mb_convert_encoding($v,'UTF-8','auto');
            }
        }
    }
    unset($process);
}
catch(Exception $ex){}

0 讨论(0)

情话喂你

2020-11-22 03:11

Detecting the encoding is hard.

mb_detect_encoding works by guessing, based on a number of candidates that you pass it. In some encodings, certain byte-sequences are invalid, an therefore it can distinguish between various candidates. Unfortunately, there are a lot of encodings, where the same bytes are valid (but different). In these cases, there is no way to determine the encoding; You can implement your own logic to make guesses in these cases. For example, data coming from a Japanese site might be more likely to have a Japanese encoding.

As long as you only deal with Western European languages, the three major encodings to consider are utf-8, iso-8859-1 and cp-1252. Since these are defaults for many platforms, they are also the most likely to be reported wrongly about. Eg. if people use different encodings, they are likely to be frank about it, since else their software would break very often. Therefore, a good strategy is to trust the provider, unless the encoding is reported as one of those three. You should still doublecheck that it is indeed valid, using mb_check_encoding (note that valid is not the same as being - the same input may be valid for many encodings). If it is one of those, you can then use mb_detect_encoding to distinguish between them. Luckily that is fairly deterministic; You just need to use the proper detect-sequence, which is UTF-8,ISO-8859-1,WINDOWS-1252.

Once you've detected the encoding you need to convert it to your internal representation (UTF-8 is the only sane choice). The function utf8_encode transforms ISO-8859-1 to UTF-8, so it can only used for that particular input type. For other encodings, use mb_convert_encoding.

0 讨论(0)
发布评论:

提交评论
- 加载中...
孤独总比滥情好

2020-11-22 03:11

Working out the character encoding of RSS feeds seems to be complicated. Even normal web pages often omit, or lie about, their encoding.

So you could try to use the correct way to detect the encoding and then fall back to some form of auto-detection (guessing).

0 讨论(0)
发布评论:

提交评论
- 加载中...
天命终不由人

2020-11-22 03:15
The interesting thing about mb_detect_encoding and mb_convert_encoding is that the order of the encodings you suggest does matter:
```
// $input is actually UTF-8

mb_detect_encoding($input, "UTF-8", "ISO-8859-9, UTF-8");
// ISO-8859-9 (WRONG!)

mb_detect_encoding($input, "UTF-8", "UTF-8, ISO-8859-9");
// UTF-8 (OK)
```
So you might want to use a specific order when specifying expected encodings. Still, keep in mind that this is not foolproof.
0 讨论(0)
发布评论:

提交评论
- 加载中...