I\'m reading out lots of texts from various RSS feeds and inserting them into my database.
Of course, there are several different character encodings used in the fee
When you try to handle multi languages like Japanese and Korean you might get in trouble. mb_convert_encoding with 'auto' parameter doesn't work well. Setting mb_detect_order('ASCII,UTF-8,JIS,EUC-JP,SJIS,EUC-KR,UHC') doesn't help since it will detect EUC-* wrongly.
I concluded that as long as input strings comes from HTML, it should use 'charset' in a meta element. I use Simple HTML DOM Parser because it supports invalid HTML.
The below snippet extracts title element from a web page. If you would like to convert entire page, then you may want to remove some lines.
find('title', 0);
if (empty($title)) {
return null;
}
$title = $title->plaintext;
$metas = $dom->find('meta');
$charset = 'auto';
foreach ($metas as $meta) {
if (!empty($meta->charset)) { // html5
$charset = $meta->charset;
} else if (preg_match('@charset=(.+)@', $meta->content, $match)) {
$charset = $match[1];
}
}
if (!in_array(strtolower($charset), array_map('strtolower', mb_list_encodings()))) {
$charset = 'auto';
}
return mb_convert_encoding($title, 'UTF-8', $charset);
}