Detect encoding and make everything UTF-8

暗喜 2020-11-22 03:03

I'm reading lots of text from various RSS feeds and inserting it into my database.

Of course, there are several different character encodings used in the fee

24 Answers
  • 2020-11-22 03:15

    I found a solution here: http://deer.org.ua/2009/10/06/1/

    class Encoding
    {
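        /**
         * Adapted from http://deer.org.ua/2009/10/06/1/
         *
         * @param string $string Raw bytes of unknown encoding.
         * @return string|null The first candidate encoding that round-trips, or null.
         */
        public static function detect_encoding($string)
        {
            static $list = ['utf-8', 'windows-1251'];

            foreach ($list as $item) {
                try {
                    // iconv() normally reports invalid input with a notice and returns false;
                    // the catch only fires if an error handler converts notices to exceptions.
                    $sample = iconv($item, $item, $string);
                } catch (\Exception $e) {
                    continue;
                }
                // The input is valid in $item if it survives the round trip unchanged.
                if ($sample === $string) {
                    return $item;
                }
            }
            return null;
        }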
    }
    
    $content = file_get_contents($file['tmp_name']);
    $encoding = Encoding::detect_encoding($content);
    if ($encoding !== null && $encoding !== 'utf-8') {
        $result = iconv($encoding, 'utf-8', $content);
    } else {
        // Already UTF-8, or no candidate matched: keep the bytes as they are.
        $result = $content;
    }
    

    I think using @ to suppress errors is a bad decision, so I made some changes to the solution from deer.org.ua.
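
    A related sketch, not taken from deer.org.ua: instead of round-tripping through iconv, the bytes can be validated with mb_check_encoding, and anything that is not valid UTF-8 is assumed to be in a fallback encoding. The function name toUtf8OrAssume and the Windows-1251 default are assumptions for illustration; the mbstring extension is required.

```php
<?php
// Hypothetical helper: keep valid UTF-8 as is, otherwise convert from an assumed encoding.
function toUtf8OrAssume(string $bytes, string $fallback = 'Windows-1251'): string
{
    // mb_check_encoding() returns true only if $bytes is well-formed UTF-8.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    // Not valid UTF-8: treat the input as $fallback. This is a guess, not a detection.
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}

// "\xCF\xF0\xE8\xE2\xE5\xF2" is a six-letter Russian word in Windows-1251.
echo toUtf8OrAssume("\xCF\xF0\xE8\xE2\xE5\xF2"), "\n";
```

    The fallback is only as good as the assumption behind it: almost any byte sequence is "valid" in a single-byte encoding, so validity alone cannot distinguish one from another.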

  • 2020-11-22 03:16

    The top-voted answer didn't work for me. Here is mine; I hope it helps.

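    function toUTF8($raw) {
        try {
            // "auto" lets mbstring pick the source encoding from its detection list.
            $converted = mb_convert_encoding($raw, "UTF-8", "auto");
        } catch (\Throwable $e) {
            $converted = false;
        }
        // Failed detection is usually reported as a warning plus a false return,
        // not as an exception, so check the result before falling back to GBK.
        return $converted !== false ? $converted : mb_convert_encoding($raw, "UTF-8", "GBK");
    }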
    
  • 2020-11-22 03:17

    ÃŸ is Mojibake for ß. In your database, you may have hex

    DF if the column is "latin1",
    C39F if the column is utf8 -- OR -- it is latin1, but "double-encoded"
    C383C5B8 if double-encoded into a utf8 column
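
    To see where these three byte patterns come from, here is a small sketch that reproduces them in PHP. It is for diagnosis only and assumes the mbstring extension; the variable names are made up for the example.

```php
<?php
// "ß" (U+00DF) encoded as UTF-8.
$utf8 = "\xC3\x9F";

// The same character stored in a latin1 column: a single byte.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// Double encoding: the two UTF-8 bytes are misread as Windows-1252
// characters and then encoded to UTF-8 a second time.
$double = mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');

echo strtoupper(bin2hex($latin1)), "\n"; // DF
echo strtoupper(bin2hex($utf8)), "\n";   // C39F
echo strtoupper(bin2hex($double)), "\n"; // C383C5B8
```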
    

    You should not use any encoding/decoding functions in PHP; instead, you should set up the database and the connection to it correctly.

    If MySQL is involved, see: Trouble with utf8 characters; what I see is not what I stored
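
    As a rough illustration of "setting up the connection correctly" with PDO and MySQL, the connection charset can be declared in the DSN. The helper name utf8Dsn, the host, the database name, and the credentials below are all hypothetical; utf8mb4 is MySQL's full UTF-8 character set.

```php
<?php
// Hypothetical helper: builds a PDO MySQL DSN that declares the connection charset.
function utf8Dsn(string $host, string $dbname): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbname);
}

echo utf8Dsn('localhost', 'feeds'), "\n";

// Usage (needs a reachable MySQL server and real credentials):
// $pdo = new PDO(utf8Dsn('localhost', 'feeds'), 'user', 'secret', [
//     PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
// ]);
```

    The table and column character sets have to match as well; the DSN only covers the client connection.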

  • 2020-11-22 03:18

    php.net/mb_detect_encoding

    echo mb_detect_encoding($str, "auto");
    

    or

    echo mb_detect_encoding($str, "UTF-8, ASCII, ISO-8859-1");
    

    I really don't know what the results will be, but I'd suggest taking some of your feeds with different encodings and checking whether mb_detect_encoding works on them.

    Update:
    auto is short for "ASCII,JIS,UTF-8,EUC-JP,SJIS". mb_detect_encoding returns the detected charset, which you can use to convert the string to UTF-8 with iconv.

    <?php
    function convertToUTF8($str) {
        $enc = mb_detect_encoding($str);
    
        if ($enc && $enc != 'UTF-8') {
            return iconv($enc, 'UTF-8', $str);
        } else {
            return $str;
        }
    }
    ?>
    

    I haven't tested it, so no guarantee. Maybe there's a simpler way.
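
    A variation worth trying, offered as a sketch rather than a tested fix: passing an explicit candidate list and enabling strict mode makes mb_detect_encoding reject candidates for which the bytes are not valid. The function name convertToUtf8Strict and the default candidate list are assumptions for illustration.

```php
<?php
// Hypothetical variant of convertToUTF8() using strict detection.
function convertToUtf8Strict(string $str, array $candidates = ['UTF-8', 'ISO-8859-1']): string
{
    // Third argument true = strict mode: a candidate is dropped if $str is invalid in it.
    $enc = mb_detect_encoding($str, $candidates, true);

    // Nothing matched, or the string is already UTF-8: return it untouched.
    if ($enc === false || $enc === 'UTF-8') {
        return $str;
    }
    return mb_convert_encoding($str, 'UTF-8', $enc);
}

// "caf\xE9" is not valid UTF-8, so only ISO-8859-1 remains as a candidate.
echo convertToUtf8Strict("caf\xE9"), "\n";
```

    Because ISO-8859-1 accepts every byte, it works as a catch-all here; put it last so it is only chosen when stricter candidates fail.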

  • 2020-11-22 03:19

    I had the same issue with phpQuery (ISO-8859-1 instead of UTF-8), and this hack helped me:

    $html = '<?xml version="1.0" encoding="UTF-8" ?>' . $html;
    

    mb_internal_encoding('UTF-8'), phpQuery::newDocumentHTML($html, 'utf-8'), mbstring.internal_encoding and other manipulations had no effect.

  • 2020-11-22 03:20

    You first have to detect what encoding has been used. As you’re parsing RSS feeds (probably via HTTP), you should read the encoding from the charset parameter of the Content-Type HTTP header field. If it is not present, read the encoding from the encoding attribute of the XML processing instruction. If that’s missing too, use UTF-8 as defined in the specification.


    Edit: Here is what I would probably do:

    I’d use cURL to send the request and fetch the response. That allows you to set specific header fields and fetch the response header as well. After fetching the response, you have to parse the HTTP response and split it into header and body. The header should then contain the Content-Type header field with the MIME type and (hopefully) the charset parameter giving the encoding. If not, we analyse the XML PI for the encoding attribute and take the encoding from there. If that’s also missing, the XML specification defines UTF-8 as the default encoding.

    $url = 'http://www.lr-online.de/storage/rss/rss/sport.xml';
    
    $accept = array(
        'type' => array('application/rss+xml', 'application/xml', 'application/rdf+xml', 'text/xml'),
        'charset' => array_diff(mb_list_encodings(), array('pass', 'auto', 'wchar', 'byte2be', 'byte2le', 'byte4be', 'byte4le', 'BASE64', 'UUENCODE', 'HTML-ENTITIES', 'Quoted-Printable', '7bit', '8bit'))
    );
    $header = array(
        'Accept: '.implode(', ', $accept['type']),
        'Accept-Charset: '.implode(', ', $accept['charset']),
    );
    $encoding = null;
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_HEADER, true);
    curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
    $response = curl_exec($curl);
    if (!$response) {
        // error fetching the response
    } else {
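        $offset = strpos($response, "\r\n\r\n");
        $header = substr($response, 0, $offset);
        // Split off the body here so it is defined no matter where the encoding is found.
        $body = substr($response, $offset + 4);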
        if (!$header || !preg_match('/^Content-Type:\s+([^;]+)(?:;\s*charset=(.*))?/im', $header, $match)) {
            // error parsing the response
        } else {
            if (!in_array(strtolower($match[1]), array_map('strtolower', $accept['type']))) {
                // type not accepted
            }
            // The charset group is optional, and the match may end in "\r".
            $encoding = trim($match[2] ?? '', " \t\r\n\"'");
        }
        if (!$encoding) {
            $body = substr($response, $offset + 4);
            if (preg_match('/^<\?xml\s+version=(?:"[^"]*"|\'[^\']*\')\s+encoding=("[^"]*"|\'[^\']*\')/s', $body, $match)) {
                $encoding = trim($match[1], '"\'');
            }
        }
        if (!$encoding) {
            $encoding = 'utf-8';
        } else {
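            // Encoding names are case-insensitive; normalise before comparing.
            $encoding = strtolower($encoding);
            if (!in_array($encoding, array_map('strtolower', $accept['charset']))) {
                // encoding not accepted
            }
            if ($encoding != 'utf-8') {
                $body = mb_convert_encoding($body, 'utf-8', $encoding);
                // Keep the XML declaration in sync with the converted bytes.
                $body = preg_replace('/^(<\?xml[^>]*encoding=)("[^"]*"|\'[^\']*\')/', '$1"utf-8"', $body, 1);
            }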
        }
        $simpleXML = simplexml_load_string($body, null, LIBXML_NOERROR);
        if (!$simpleXML) {
            // parse error
        } else {
            echo $simpleXML->asXML();
        }
    }
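
    The lookup order described above (charset parameter of Content-Type, then the encoding in the XML prolog, then UTF-8) can also be condensed into one function that needs no network access, which makes it easier to test. This is a sketch only; the name resolveFeedEncoding and its signature are assumptions, and a leading byte order mark is not handled.

```php
<?php
// Hypothetical helper: decide which encoding a feed claims to use.
function resolveFeedEncoding(?string $contentType, string $body): string
{
    // 1. charset parameter of the Content-Type header value, if present.
    if ($contentType !== null
        && preg_match('/;\s*charset\s*=\s*"?([^";\s]+)/i', $contentType, $m)) {
        return strtolower($m[1]);
    }
    // 2. encoding attribute in the XML prolog at the very start of the body.
    if (preg_match('/^<\?xml[^>]*\bencoding\s*=\s*["\']([^"\']+)["\']/i', $body, $m)) {
        return strtolower($m[1]);
    }
    // 3. Neither is present: fall back to UTF-8.
    return 'utf-8';
}

echo resolveFeedEncoding('text/xml; charset=ISO-8859-1', '<?xml version="1.0"?><rss/>'), "\n";
echo resolveFeedEncoding('text/xml', '<?xml version="1.0" encoding="windows-1252"?><rss/>'), "\n";
echo resolveFeedEncoding(null, '<rss/>'), "\n";
```

    The result is only what the feed declares; a feed can still declare one encoding and send another, in which case validating the bytes is the remaining safeguard.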
    