I'm reading out lots of texts from various RSS feeds and inserting them into my database.
Of course, there are several different character encodings used in the feeds.
When you try to handle multiple languages such as Japanese and Korean, you may run into trouble. mb_convert_encoding with the 'auto' parameter doesn't work well. Setting mb_detect_order('ASCII,UTF-8,JIS,EUC-JP,SJIS,EUC-KR,UHC') doesn't help, since it detects EUC-* wrongly.
I concluded that as long as the input strings come from HTML, you should use the 'charset' declared in a meta element. I use Simple HTML DOM Parser because it tolerates invalid HTML.
The snippet below extracts the title element from a web page. If you would like to convert the entire page, you may want to remove some lines.
<?php
require_once 'simple_html_dom.php';

echo convert_title_to_utf8(file_get_contents($argv[1])), PHP_EOL;

function convert_title_to_utf8($contents)
{
    $dom = str_get_html($contents);
    if (!$dom) {
        return null; // parsing failed, e.g. on empty input
    }
    $title = $dom->find('title', 0);
    if (empty($title)) {
        return null;
    }
    $title = $title->plaintext;

    $charset = 'auto';
    foreach ($dom->find('meta') as $meta) {
        if (!empty($meta->charset)) { // HTML5: <meta charset="...">
            $charset = trim($meta->charset);
        } elseif (preg_match('@charset=\s*([\w.:-]+)@i', (string) $meta->content, $match)) {
            // HTML4: <meta http-equiv="Content-Type" content="text/html; charset=...">
            $charset = $match[1];
        }
    }
    // Fall back to auto-detection if mbstring does not know the declared charset.
    if (!in_array(strtolower($charset), array_map('strtolower', mb_list_encodings()))) {
        $charset = 'auto';
    }
    return mb_convert_encoding($title, 'UTF-8', $charset);
}
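If you don't want to depend on a DOM parser, the same idea (trust the charset declared in a meta element) can be sketched with a single regular expression. This is only a rough sketch: extract_meta_charset is a name I made up, and the pattern assumes the declaration appears in plain ASCII inside a meta tag, which is not guaranteed for every page.

```php
<?php
// Hypothetical helper: pull the declared charset out of raw HTML bytes.
// Returns the upper-cased charset name, or null if no declaration is found.
function extract_meta_charset(string $html): ?string
{
    // Covers both <meta charset="x"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=x">.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}
```

You would still want to check the result against mb_list_encodings() before handing it to mb_convert_encoding, as the snippet above does.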
Try without 'auto'
That is:
mb_detect_encoding($text)
instead of:
mb_detect_encoding($text, 'auto')
More information can be found here: mb_detect_encoding
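A related option is the third argument of mb_detect_encoding, which turns on strict mode. Below is a minimal sketch that asks only one question, "is this valid UTF-8?", instead of guessing among many encodings. The function name looks_like_utf8 is mine, not part of mbstring.

```php
<?php
// Hypothetical helper: strict detection against a single candidate.
// In strict mode mb_detect_encoding returns 'UTF-8' or false here.
function looks_like_utf8(string $text): bool
{
    return mb_detect_encoding($text, 'UTF-8', true) === 'UTF-8';
}

var_dump(looks_like_utf8("Gr\xC3\xBC\xC3\x9Fe")); // "Grüße" as UTF-8 bytes
var_dump(looks_like_utf8("Gr\xFC\xDFe"));         // "Grüße" as ISO 8859-1 bytes
```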
A little heads up. You said that the "ß" should be displayed as "Ÿ" in your database.
This is probably because you're using a database with Latin-1 character encoding, or possibly because your PHP-MySQL connection is set up wrong. That is, PHP believes your MySQL is set to use UTF-8, so it sends data as UTF-8, but MySQL believes PHP is sending data encoded as ISO 8859-1, so it may once again try to encode the data as UTF-8, causing this kind of trouble.
Take a look at mysql_set_charset. It may help you. (The old mysql extension was removed in PHP 7; with mysqli the equivalent is mysqli_set_charset.)
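To see where the "Ÿ" comes from, here is a small sketch that reproduces the double encoding without a database. It assumes the misreading side treats the bytes as Windows-1252, which is what MySQL's latin1 is based on; in that code page the second byte of a UTF-8 "ß" (0x9F) is "Ÿ".

```php
<?php
$eszett = "\xC3\x9F"; // "ß" as UTF-8 bytes

// Misread the UTF-8 bytes as Windows-1252 and encode them to UTF-8 again.
$double = mb_convert_encoding($eszett, 'UTF-8', 'Windows-1252');
echo bin2hex($double), PHP_EOL; // c383c5b8, which displays as "ÃŸ"

// Reversing the wrong step recovers the original bytes.
$repaired = mb_convert_encoding($double, 'Windows-1252', 'UTF-8');
var_dump($repaired === $eszett);
```

The repair step only works when you know exactly which wrong conversion happened, so fixing the connection charset is the better cure.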
This version is for the German language, but you can modify $CHARSETS and $TESTCHARS.
class CharsetDetector
{
    private static $CHARSETS = array(
        "ISO_8859-1",
        "ISO_8859-15",
        "CP850"
    );
    private static $TESTCHARS = array(
        "€",
        "ä",
        "Ä",
        "ö",
        "Ö",
        "ü",
        "Ü",
        "ß"
    );

    public static function convert($string)
    {
        return self::__iconv($string, self::getCharset($string));
    }

    public static function getCharset($string)
    {
        $normalized = self::__normalize($string);
        if (!strlen($normalized)) return "UTF-8";
        $best = "UTF-8";
        $charcountbest = 0;
        foreach (self::$CHARSETS as $charset) {
            $str = self::__iconv($normalized, $charset);
            $charcount = 0;
            $stop = mb_strlen($str, "UTF-8");
            for ($idx = 0; $idx < $stop; $idx++) {
                $char = mb_substr($str, $idx, 1, "UTF-8");
                foreach (self::$TESTCHARS as $testchar) {
                    if ($char == $testchar) {
                        $charcount++;
                        break;
                    }
                }
            }
            if ($charcount > $charcountbest) {
                $charcountbest = $charcount;
                $best = $charset;
            }
        }
        return $best;
    }

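    // Keeps only the high bytes that are NOT part of a well-formed UTF-8
    // multibyte sequence; those are the bytes left in a legacy encoding.
    private static function __normalize($str)
    {
        $len = strlen($str);
        $ret = "";
        for ($i = 0; $i < $len; $i++) {
            $c = ord($str[$i]);
            if ($c < 128) {
                continue; // plain ASCII tells us nothing about the charset
            }
            if ($c > 247 || $c < 192) {
                $ret .= $str[$i]; // invalid lead byte or stray continuation byte
                continue;
            }
            $bytes = $c > 239 ? 4 : ($c > 223 ? 3 : 2);
            if ($i + $bytes > $len) {
                $ret .= $str[$i]; // sequence is cut off by the end of the string
                continue;
            }
            $valid = true;
            for ($j = 1; $j < $bytes; $j++) {
                $b = ord($str[$i + $j]);
                if ($b < 128 || $b > 191) {
                    $valid = false;
                    break;
                }
            }
            if ($valid) {
                $i += $bytes - 1; // skip the well-formed UTF-8 sequence
            } else {
                $ret .= $str[$i]; // lead byte without its continuation bytes
            }
        }
        return $ret;
    }
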
    private static function __iconv($string, $charset)
    {
        return iconv($charset, "UTF-8", $string);
    }
}
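The core idea of the class (convert with each candidate charset and keep the one that yields the most expected characters) can also be sketched as one standalone function. This is a simplified sketch, not a drop-in replacement: guess_charset and its parameters are names I invented, it skips the normalization step, and it assumes your iconv build knows the candidate charsets (CP850 may be missing on some systems).

```php
<?php
// Hypothetical helper: pick the candidate charset whose conversion to UTF-8
// contains the most occurrences of the expected characters.
function guess_charset(string $bytes, array $candidates, array $expected, string $default = 'UTF-8'): string
{
    $best = $default;
    $bestScore = 0;
    foreach ($candidates as $charset) {
        // iconv returns false (and raises a notice) on bytes it cannot convert.
        $utf8 = @iconv($charset, 'UTF-8', $bytes);
        if ($utf8 === false) {
            continue;
        }
        $score = 0;
        foreach ($expected as $char) {
            $score += substr_count($utf8, $char);
        }
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $charset;
        }
    }
    return $best;
}

// German test characters ä, ö, ü, ß written as UTF-8 byte escapes.
$umlauts = ["\xC3\xA4", "\xC3\xB6", "\xC3\xBC", "\xC3\x9F"];

echo guess_charset("M\xE4dchen", ['ISO-8859-1', 'CP850'], $umlauts), PHP_EOL;
echo guess_charset("M\x84dchen", ['ISO-8859-1', 'CP850'], $umlauts), PHP_EOL;
```

Like the class, this is a heuristic: it only works when the text actually contains some of the test characters, and ties go to the earlier candidate.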
It's simple: when you get something that's not UTF-8, you must encode that into UTF-8.
So, when you're fetching a feed that's ISO 8859-1, pass it through utf8_encode.
However, if you're fetching a UTF-8 feed, you don't need to do anything.
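A minimal sketch of that rule, under the assumption that anything that is not valid UTF-8 is ISO 8859-1 (adjust the fallback for your feeds). It uses mb_convert_encoding rather than utf8_encode, since utf8_encode is deprecated as of PHP 8.2. The name ensure_utf8 is made up for this example.

```php
<?php
// Hypothetical helper: leave valid UTF-8 untouched, convert everything else
// from an assumed fallback encoding.
function ensure_utf8(string $text, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already UTF-8, nothing to do
    }
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}

echo bin2hex(ensure_utf8("caf\xE9")), PHP_EOL;     // ISO 8859-1 input
echo bin2hex(ensure_utf8("caf\xC3\xA9")), PHP_EOL; // UTF-8 input
```

Checking first avoids the double encoding you get when utf8_encode is applied to text that is already UTF-8.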
Get the encoding from the HTTP headers and convert the content to UTF-8.
$post_url = 'http://website.domain';

// Fetch only the response headers for a URL.
function get_headers_curl($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    $r = curl_exec($ch);
    return $r === false ? '' : $r;
}

$the_header = get_headers_curl($post_url);

// Follow a single redirect, if there is one.
if (preg_match('/^Location:\s*(\S+)/mi', $the_header, $match)) {
    $the_header = get_headers_curl($match[1]);
}

// Extract the charset from the Content-Type header.
$charset = null;
if (preg_match('/charset=\s*"?([\w.:-]+)/i', $the_header, $match)) {
    $charset = strtoupper($match[1]);
}

// Assumption: the original snippet never shows how $html is fetched.
$html = file_get_contents($post_url);
if ($charset && $charset != 'UTF-8') {
    $html = iconv($charset, 'UTF-8', $html);
}
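The header-parsing step can be tried without any network access. Here is a small sketch that takes a raw header string and returns the charset from the Content-Type line. The function name charset_from_headers is hypothetical, and it only looks at the first Content-Type line it finds.

```php
<?php
// Hypothetical helper: read the charset parameter of the Content-Type header.
// Returns the upper-cased charset name, or null if none is declared.
function charset_from_headers(string $headers): ?string
{
    // /m lets ^ match at each header line; "." never crosses a newline.
    if (preg_match('/^Content-Type:.*?charset\s*=\s*"?([\w.:-]+)/mi', $headers, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

$sample = "HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=iso-8859-1\r\nContent-Length: 10\r\n\r\n";
var_dump(charset_from_headers($sample));
```

Upper-casing the name and stopping at the first non-name character avoids two problems with a plain explode: a trailing carriage return in the value and a case-sensitive comparison against 'UTF-8'.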