Detect encoding and make everything UTF-8

前端未结

关注

 24  2371

暗喜

I\'m reading out lots of texts from various RSS feeds and inserting them into my database.

Of course, there are several different character encodings used in the fee

相关标签:

24条回答

小鲜肉

2020-11-22 03:15

I find solution here http://deer.org.ua/2009/10/06/1/

class Encoding
{
    /**
     * http://deer.org.ua/2009/10/06/1/
     * @param $string
     * @return null
     */
    public static function detect_encoding($string)
    {
        static $list = ['utf-8', 'windows-1251'];

        foreach ($list as $item) {
            try {
                $sample = iconv($item, $item, $string);
            } catch (\Exception $e) {
                continue;
            }
            if (md5($sample) == md5($string)) {
                return $item;
            }
        }
        return null;
    }
}

$content = file_get_contents($file['tmp_name']);
$encoding = Encoding::detect_encoding($content);
if ($encoding != 'utf-8') {
    $result = iconv($encoding, 'utf-8', $content);
} else {
    $result = $content;
}

I think that @ is bad decision, and make some changes to solution from deer.org.ua;

0 讨论(0)

盖世英雄少女心

2020-11-22 03:16

The most voted answer doesn't work. Here is mine and hope it helps.

function toUTF8($raw) {
    try{
        return mb_convert_encoding($raw, "UTF-8", "auto"); 
    }catch(\Exception $e){
        return mb_convert_encoding($raw, "UTF-8", "GBK"); 
    }
}

0 讨论(0)

栀梦

2020-11-22 03:17
ÂŸ is Mojibake for ß. In your database, you may have hex
```
DF if the column is "latin1",
C39F if the column is utf8 -- OR -- it is latin1, but "double-encoded"
C383C5B8 if double-encoded into a utf8 column
```
You should not use any encoding/decoding functions in PHP; instead, you should set up the database and the connection to it correctly.

If MySQL is involved, see: Trouble with utf8 characters; what I see is not what I stored
0 讨论(0)
发布评论:

提交评论
- 加载中...
天涯浪人

2020-11-22 03:18
php.net/mb_detect_encoding
```
echo mb_detect_encoding($str, "auto");
```
or
```
echo mb_detect_encoding($str, "UTF-8, ASCII, ISO-8859-1");
```
i really don't know what the results are, but i'd suggest you just take some of your feeds with different encodings and try if mb_detect_encoding works or not.

update
auto is short for "ASCII,JIS,UTF-8,EUC-JP,SJIS". it returns the detected charset, which you can use to convert the string to utf-8 with iconv.
```
<?php
function convertToUTF8($str) {
    $enc = mb_detect_encoding($str);

    if ($enc && $enc != 'UTF-8') {
        return iconv($enc, 'UTF-8', $str);
    } else {
        return $str;
    }
}
?>
```
i haven't tested it, so no guarantee. and maybe there's a simpler way.
0 讨论(0)
发布评论:

提交评论
- 加载中...
一生所求

2020-11-22 03:19
I had same issue with phpQuery (ISO-8859-1 instead of UTF-8) and this hack helped me:
```
$html = '<?xml version="1.0" encoding="UTF-8" ?>' . $html;
```
mb_internal_encoding('UTF-8'), phpQuery::newDocumentHTML($html, 'utf-8'), mbstring.internal_encoding and other manipulations didn't take any effect.
0 讨论(0)
发布评论:

提交评论
- 加载中...

甜味超标

2020-11-22 03:20

You first have to detect what encoding has been used. As you’re parsing RSS feeds (probably via HTTP), you should read the encoding from the charset parameter of the Content-Type HTTP header field. If it is not present, read the encoding from the encoding attribute of the XML processing instruction. If that’s missing too, use UTF-8 as defined in the specification.

Edit Here is what I probably would do:

I’d use cURL to send and fetch the response. That allows you to set specific header fields and fetch the response header as well. After fetching the response, you have to parse the HTTP response and split it into header and body. The header should then contain the Content-Type header field that contains the MIME type and (hopefully) the charset parameter with the encoding/charset too. If not, we’ll analyse the XML PI for the presence of the encoding attribute and get the encoding from there. If that’s also missing, the XML specs define to use UTF-8 as encoding.

$url = 'http://www.lr-online.de/storage/rss/rss/sport.xml';

$accept = array(
    'type' => array('application/rss+xml', 'application/xml', 'application/rdf+xml', 'text/xml'),
    'charset' => array_diff(mb_list_encodings(), array('pass', 'auto', 'wchar', 'byte2be', 'byte2le', 'byte4be', 'byte4le', 'BASE64', 'UUENCODE', 'HTML-ENTITIES', 'Quoted-Printable', '7bit', '8bit'))
);
$header = array(
    'Accept: '.implode(', ', $accept['type']),
    'Accept-Charset: '.implode(', ', $accept['charset']),
);
$encoding = null;
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HEADER, true);
curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
$response = curl_exec($curl);
if (!$response) {
    // error fetching the response
} else {
    $offset = strpos($response, "\r\n\r\n");
    $header = substr($response, 0, $offset);
    if (!$header || !preg_match('/^Content-Type:\s+([^;]+)(?:;\s*charset=(.*))?/im', $header, $match)) {
        // error parsing the response
    } else {
        if (!in_array(strtolower($match[1]), array_map('strtolower', $accept['type']))) {
            // type not accepted
        }
        $encoding = trim($match[2], '"\'');
    }
    if (!$encoding) {
        $body = substr($response, $offset + 4);
        if (preg_match('/^<\?xml\s+version=(?:"[^"]*"|\'[^\']*\')\s+encoding=("[^"]*"|\'[^\']*\')/s', $body, $match)) {
            $encoding = trim($match[1], '"\'');
        }
    }
    if (!$encoding) {
        $encoding = 'utf-8';
    } else {
        if (!in_array($encoding, array_map('strtolower', $accept['charset']))) {
            // encoding not accepted
        }
        if ($encoding != 'utf-8') {
            $body = mb_convert_encoding($body, 'utf-8', $encoding);
        }
    }
    $simpleXML = simplexml_load_string($body, null, LIBXML_NOERROR);
    if (!$simpleXML) {
        // parse error
    } else {
        echo $simpleXML->asXML();
    }
}

0 讨论(0)