I'm reading lots of texts from various RSS feeds and inserting them into my database.
Of course, there are several different character encodings used in the feeds.
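A simple way to implement an isUTF8 function can be found on php.net:
// Round-trips the string through ISO-8859-1 and compares the result.
// Caveat: valid UTF-8 containing characters outside ISO-8859-1 (e.g. the euro sign)
// does not survive the round trip, so this reports false for it.
// utf8_encode() and utf8_decode() are deprecated as of PHP 8.2.
function isUTF8($string) {
    return (utf8_encode(utf8_decode($string)) == $string);
}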
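If mbstring is available, the same check can be sketched without the round trip. This is only a minimal sketch: the helper name isValidUtf8 is made up for illustration, and it relies on mb_check_encoding() from the mbstring extension. Keep in mind that it only tells you the bytes are well-formed UTF-8, not that the text was actually meant to be UTF-8.

```php
<?php
// Hypothetical helper name; mb_check_encoding() comes from the mbstring extension.
function isValidUtf8(string $string): bool
{
    // True only if $string is a well-formed UTF-8 byte sequence.
    return mb_check_encoding($string, 'UTF-8');
}

var_dump(isValidUtf8("Gr\xC3\xBC\xC3\x9Fe")); // "Grüße" encoded as UTF-8: bool(true)
var_dump(isValidUtf8("Gr\xFC\xDFe"));         // same word in ISO-8859-1: bool(false)
```

Unlike the round-trip version, this also accepts characters outside ISO-8859-1, such as the euro sign.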
You need to test the character set on input, since responses can arrive in different encodings.
I force all incoming content into UTF-8 by detecting the encoding and converting it with the following function:
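function fixRequestCharset()
{
    $ref = array(&$_GET, &$_POST, &$_REQUEST);
    foreach ($ref as &$var)
    {
        foreach ($var as $key => $val)
        {
            // Skip nested arrays (e.g. foo[]=bar); only plain strings are handled here.
            if (!is_string($val))
                continue;
            // Strict mode: returns false if no candidate encoding matches.
            $encoding = mb_detect_encoding($val, mb_detect_order(), true);
            if (!$encoding)
                continue;
            if (strcasecmp($encoding, 'UTF-8') != 0)
            {
                $converted = iconv($encoding, 'UTF-8', $val);
                if ($converted === false)
                    continue;
                $var[$key] = $converted;
            }
        }
    }
    unset($var); // break the reference left over from the outer foreach
}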
That routine converts all request variables that come from the remote host into UTF-8, and leaves a value untouched if its encoding could not be detected or converted.
You can customize it to your needs.
Just invoke it before using the variables.
I know this is an older question, but I figure a useful answer never hurts. I was having issues with my encoding between a desktop application, SQLite, and GET/POST variables. Some would be in UTF-8, some would be in ASCII, and basically everything would get screwed up when foreign characters got involved.
Here is my solution. It scrubs your GET/POST/REQUEST (I omitted cookies, but you could add them if desired) on each page load before processing. It works well in a header. PHP will throw warnings if it can't detect the source encoding automatically, so these warnings are suppressed with @'s.
//Convert everything in our vars to UTF-8 for playing nice with the database...
//Use some auto detection here to help us not double-encode...
//Suppress possible warnings with @'s for when encoding cannot be detected
//Note: each() was removed in PHP 8.0, so the queue is walked with an index instead
try
{
    $process = array(&$_GET, &$_POST, &$_REQUEST);
    for ($i = 0; $i < count($process); $i++) {
        foreach ($process[$i] as $k => $v) {
            unset($process[$i][$k]);
            $newKey = @mb_convert_encoding($k, 'UTF-8', 'auto');
            if (is_array($v)) {
                //Re-add the nested array under the converted key and queue it for processing
                $process[$i][$newKey] = $v;
                $process[] = &$process[$i][$newKey];
            } else {
                $process[$i][$newKey] = @mb_convert_encoding($v, 'UTF-8', 'auto');
            }
        }
    }
    unset($process);
}
catch(Exception $ex){}
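The same idea can also be written as a plain recursive function, which some people find easier to follow than the queue of references. This is just a sketch under one loud assumption: anything that is not valid UTF-8 is treated as ISO-8859-1. The names toUtf8String and utf8Recursive are made up for this example.

```php
<?php
// Assumption: input that is not valid UTF-8 is treated as ISO-8859-1.
function toUtf8String(string $s): string
{
    return mb_check_encoding($s, 'UTF-8')
        ? $s
        : mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

// Hypothetical helper: returns a copy with all string keys and values converted.
function utf8Recursive(array $data): array
{
    $result = [];
    foreach ($data as $key => $value) {
        $newKey = is_string($key) ? toUtf8String($key) : $key;
        if (is_array($value)) {
            $result[$newKey] = utf8Recursive($value); // descend into nested arrays
        } elseif (is_string($value)) {
            $result[$newKey] = toUtf8String($value);
        } else {
            $result[$newKey] = $value;                // leave non-strings alone
        }
    }
    return $result;
}

$_GET = utf8Recursive($_GET);
$_POST = utf8Recursive($_POST);
$_REQUEST = utf8Recursive($_REQUEST);
```

Checking for valid UTF-8 first avoids double-encoding values that are already fine, and no warnings need to be suppressed.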
Detecting the encoding is hard.
mb_detect_encoding works by guessing, based on a number of candidates that you pass it. In some encodings, certain byte sequences are invalid, and therefore it can distinguish between various candidates. Unfortunately, there are a lot of encodings where the same bytes are valid (but mean different characters). In these cases, there is no way to determine the encoding from the bytes alone; you can implement your own logic to make guesses in these cases. For example, data coming from a Japanese site might be more likely to have a Japanese encoding.
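To make the ambiguity concrete, here is a small sketch. The byte 0xA4 is valid in both ISO-8859-1 and ISO-8859-15 but decodes to a different character in each. The candidatesFor function and its language table are purely hypothetical, just one way such "own logic" could look.

```php
<?php
$bytes = "\xA4"; // a single byte with no declared encoding

// Both checks pass, so validity alone cannot pick a winner.
$validLatin1 = mb_check_encoding($bytes, 'ISO-8859-1');
$validLatin9 = mb_check_encoding($bytes, 'ISO-8859-15');

// Yet the decoded characters differ: currency sign vs. euro sign.
$asLatin1 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
$asLatin9 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-15');

// Hypothetical tie-breaker: order the candidates by what you know about the source.
function candidatesFor(string $siteLanguage): array
{
    $byLanguage = [
        'ja' => ['UTF-8', 'SJIS', 'EUC-JP'],
        'de' => ['UTF-8', 'ISO-8859-15', 'ISO-8859-1'],
    ];
    return $byLanguage[$siteLanguage] ?? ['UTF-8', 'ISO-8859-1'];
}
```

The returned list could then be passed as the candidate list to mb_detect_encoding for data from that source.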
As long as you only deal with Western European languages, the three major encodings to consider are utf-8, iso-8859-1 and cp-1252. Since these are defaults for many platforms, they are also the most likely to be reported wrongly. E.g., if people use different encodings, they are likely to be frank about it, since otherwise their software would break very often. Therefore, a good strategy is to trust the provider, unless the encoding is reported as one of those three. You should still double-check that it is indeed valid, using mb_check_encoding (note that valid is not the same as being correct; the same input may be valid for many encodings). If it is one of those three, you can then use mb_detect_encoding to distinguish between them. Luckily that is fairly deterministic; you just need to use the proper detect sequence, which is UTF-8,ISO-8859-1,WINDOWS-1252.
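That strategy could be sketched roughly like this. It is not a definitive implementation: resolveEncoding is a made-up name, $declared stands for whatever the feed claims about itself, and as a design choice of this sketch UTF-8 is tested explicitly with mb_check_encoding before falling back to mb_detect_encoding for the two single-byte candidates.

```php
<?php
// Hypothetical helper. $declared is whatever the feed claims about itself.
function resolveEncoding(string $bytes, string $declared): ?string
{
    $declared = strtoupper(trim($declared));
    $defaults = ['UTF-8', 'ISO-8859-1', 'WINDOWS-1252', 'CP1252'];

    if (!in_array($declared, $defaults, true)) {
        // An unusual declaration is probably deliberate: trust it, but verify.
        try {
            return mb_check_encoding($bytes, $declared) ? $declared : null;
        } catch (\ValueError $e) {
            return null; // PHP 8 throws for encoding names mbstring does not know
        }
    }

    // One of the usual suspects: ignore the label and look at the bytes.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }
    // Strict detection among the remaining single-byte candidates.
    $detected = mb_detect_encoding($bytes, 'ISO-8859-1,WINDOWS-1252', true);
    return $detected === false ? null : $detected;
}
```

A null result means "could not decide", and the caller has to choose a fallback or reject the input.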
Once you've detected the encoding, you need to convert the text to your internal representation (UTF-8 is the only sane choice). The function utf8_encode transforms ISO-8859-1 to UTF-8, so it can only be used for that particular input type (and it is deprecated as of PHP 8.2). For other encodings, use mb_convert_encoding.
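A minimal sketch of that conversion step, assuming the source encoding has already been determined. The wrapper name toInternalUtf8 is invented for the example; the actual work is done by mb_convert_encoding.

```php
<?php
// Hypothetical wrapper: convert from a known source encoding to UTF-8.
function toInternalUtf8(string $bytes, string $sourceEncoding): string
{
    return mb_convert_encoding($bytes, 'UTF-8', $sourceEncoding);
}

// 0x80 is the euro sign in Windows-1252.
$euro = toInternalUtf8("\x80", 'Windows-1252');

// 0xFC is "ü" in ISO-8859-1; this covers the utf8_encode() case as well.
$umlaut = toInternalUtf8("\xFC", 'ISO-8859-1');
```

Using one function for every source encoding also means there is no special case for ISO-8859-1.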
Working out the character encoding of RSS feeds seems to be complicated. Even normal web pages often omit, or lie about, their encoding.
So you could first try the correct way to detect the encoding (the charset the feed declares, e.g. in the HTTP Content-Type header or the XML declaration) and then fall back to some form of auto-detection (guessing).
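As a rough sketch of "declared first, guess second": the function name declaredEncoding and both regular expressions are my own assumptions, not a complete parser for HTTP headers or XML, and the final fallback to ISO-8859-1 is just one possible guess.

```php
<?php
// Hypothetical helper: returns the declared charset in upper case, or null if none is found.
function declaredEncoding(?string $contentType, string $body): ?string
{
    // 1. HTTP header, e.g. "application/rss+xml; charset=ISO-8859-1"
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    // 2. The encoding attribute of the XML declaration at the start of the body
    if (preg_match('/^\s*<\?xml[^>]*encoding\s*=\s*["\']([A-Za-z0-9._:-]+)["\']/i', $body, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Fallback guess when nothing is declared (assumption: UTF-8 if valid, else ISO-8859-1).
function guessEncoding(string $body): string
{
    return mb_check_encoding($body, 'UTF-8') ? 'UTF-8' : 'ISO-8859-1';
}

$body = '<rss version="2.0"></rss>';
$encoding = declaredEncoding('text/xml', $body) ?? guessEncoding($body);
```

Since feeds can lie about their encoding, the declared value is still worth validating against the actual bytes before converting.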
The interesting thing about mb_detect_encoding and mb_convert_encoding is that the order of the encodings you suggest does matter:
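// $input is actually UTF-8
// The candidate list is the second argument of mb_detect_encoding.
mb_detect_encoding($input, "ISO-8859-9, UTF-8");
// ISO-8859-9 (WRONG!)
mb_detect_encoding($input, "UTF-8, ISO-8859-9");
// UTF-8 (OK)
// Results as reported for older PHP versions; as of PHP 8.1 detection
// uses heuristics and may not follow the listed order.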
So you might want to use a specific order when specifying expected encodings. Still, keep in mind that this is not foolproof.