I\'m reading out lots of texts from various RSS feeds and inserting them into my database.
Of course, there are several different character encodings used in the fee
I find solution here http://deer.org.ua/2009/10/06/1/
class Encoding
{
/**
* http://deer.org.ua/2009/10/06/1/
* @param $string
* @return null
*/
public static function detect_encoding($string)
{
static $list = ['utf-8', 'windows-1251'];
foreach ($list as $item) {
try {
$sample = iconv($item, $item, $string);
} catch (\Exception $e) {
continue;
}
if (md5($sample) == md5($string)) {
return $item;
}
}
return null;
}
}
$content = file_get_contents($file['tmp_name']);
$encoding = Encoding::detect_encoding($content);
if ($encoding != 'utf-8') {
$result = iconv($encoding, 'utf-8', $content);
} else {
$result = $content;
}
I think that @ is bad decision, and make some changes to solution from deer.org.ua;
The most voted answer doesn't work. Here is mine and hope it helps.
function toUTF8($raw) {
try{
return mb_convert_encoding($raw, "UTF-8", "auto");
}catch(\Exception $e){
return mb_convert_encoding($raw, "UTF-8", "GBK");
}
}
Ÿ
is Mojibake for ß
. In your database, you may have hex
DF if the column is "latin1",
C39F if the column is utf8 -- OR -- it is latin1, but "double-encoded"
C383C5B8 if double-encoded into a utf8 column
You should not use any encoding/decoding functions in PHP; instead, you should set up the database and the connection to it correctly.
If MySQL is involved, see: Trouble with utf8 characters; what I see is not what I stored
php.net/mb_detect_encoding
echo mb_detect_encoding($str, "auto");
or
echo mb_detect_encoding($str, "UTF-8, ASCII, ISO-8859-1");
i really don't know what the results are, but i'd suggest you just take some of your feeds with different encodings and try if mb_detect_encoding
works or not.
update
auto is short for "ASCII,JIS,UTF-8,EUC-JP,SJIS". it returns the detected charset, which you can use to convert the string to utf-8 with iconv.
<?php
function convertToUTF8($str) {
$enc = mb_detect_encoding($str);
if ($enc && $enc != 'UTF-8') {
return iconv($enc, 'UTF-8', $str);
} else {
return $str;
}
}
?>
i haven't tested it, so no guarantee. and maybe there's a simpler way.
I had same issue with phpQuery (ISO-8859-1 instead of UTF-8) and this hack helped me:
$html = '<?xml version="1.0" encoding="UTF-8" ?>' . $html;
mb_internal_encoding('UTF-8')
, phpQuery::newDocumentHTML($html, 'utf-8')
, mbstring.internal_encoding
and other manipulations didn't take any effect.
You first have to detect what encoding has been used. As you’re parsing RSS feeds (probably via HTTP), you should read the encoding from the charset
parameter of the Content-Type HTTP header field. If it is not present, read the encoding from the encoding
attribute of the XML processing instruction. If that’s missing too, use UTF-8 as defined in the specification.
Edit Here is what I probably would do:
I’d use cURL to send and fetch the response. That allows you to set specific header fields and fetch the response header as well. After fetching the response, you have to parse the HTTP response and split it into header and body. The header should then contain the Content-Type
header field that contains the MIME type and (hopefully) the charset
parameter with the encoding/charset too. If not, we’ll analyse the XML PI for the presence of the encoding
attribute and get the encoding from there. If that’s also missing, the XML specs define to use UTF-8 as encoding.
$url = 'http://www.lr-online.de/storage/rss/rss/sport.xml';
$accept = array(
'type' => array('application/rss+xml', 'application/xml', 'application/rdf+xml', 'text/xml'),
'charset' => array_diff(mb_list_encodings(), array('pass', 'auto', 'wchar', 'byte2be', 'byte2le', 'byte4be', 'byte4le', 'BASE64', 'UUENCODE', 'HTML-ENTITIES', 'Quoted-Printable', '7bit', '8bit'))
);
$header = array(
'Accept: '.implode(', ', $accept['type']),
'Accept-Charset: '.implode(', ', $accept['charset']),
);
$encoding = null;
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HEADER, true);
curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
$response = curl_exec($curl);
if (!$response) {
// error fetching the response
} else {
$offset = strpos($response, "\r\n\r\n");
$header = substr($response, 0, $offset);
if (!$header || !preg_match('/^Content-Type:\s+([^;]+)(?:;\s*charset=(.*))?/im', $header, $match)) {
// error parsing the response
} else {
if (!in_array(strtolower($match[1]), array_map('strtolower', $accept['type']))) {
// type not accepted
}
$encoding = trim($match[2], '"\'');
}
if (!$encoding) {
$body = substr($response, $offset + 4);
if (preg_match('/^<\?xml\s+version=(?:"[^"]*"|\'[^\']*\')\s+encoding=("[^"]*"|\'[^\']*\')/s', $body, $match)) {
$encoding = trim($match[1], '"\'');
}
}
if (!$encoding) {
$encoding = 'utf-8';
} else {
if (!in_array($encoding, array_map('strtolower', $accept['charset']))) {
// encoding not accepted
}
if ($encoding != 'utf-8') {
$body = mb_convert_encoding($body, 'utf-8', $encoding);
}
}
$simpleXML = simplexml_load_string($body, null, LIBXML_NOERROR);
if (!$simpleXML) {
// parse error
} else {
echo $simpleXML->asXML();
}
}