Detect encoding and make everything UTF-8

暗喜 2020-11-22 03:03

I'm reading out lots of texts from various RSS feeds and inserting them into my database.

Of course, there are several different character encodings used in the feeds.

24 answers
  • 2020-11-22 03:23

    When you try to handle multiple languages such as Japanese and Korean, you might get into trouble. mb_convert_encoding with the 'auto' parameter doesn't work well. Setting mb_detect_order('ASCII,UTF-8,JIS,EUC-JP,SJIS,EUC-KR,UHC') doesn't help, since it will detect EUC-* wrongly.

    I concluded that as long as the input strings come from HTML, the code should use the 'charset' declared in a meta element. I use Simple HTML DOM Parser because it tolerates invalid HTML.

    The snippet below extracts the title element from a web page. If you would like to convert the entire page, you may want to remove some lines.

    <?php
    require_once 'simple_html_dom.php';
    
    echo convert_title_to_utf8(file_get_contents($argv[1])), PHP_EOL;
    
    function convert_title_to_utf8($contents)
    {
        $dom = str_get_html($contents);
        $title = $dom->find('title', 0);
        if (empty($title)) {
            return null;
        }
        $title = $title->plaintext;
        $metas = $dom->find('meta');
        $charset = 'auto';
        foreach ($metas as $meta) {
            if (!empty($meta->charset)) { // html5
                $charset = $meta->charset;
            } else if (preg_match('@charset=([\w-]+)@i', (string) $meta->content, $match)) { // capture only the charset name
                $charset = $match[1];
            }
        }
        if (!in_array(strtolower($charset), array_map('strtolower', mb_list_encodings()))) {
            $charset = 'auto';
        }
        return mb_convert_encoding($title, 'UTF-8', $charset);
    }
    
  • 2020-11-22 03:26

    Try without 'auto'

    That is:

    mb_detect_encoding($text)
    

    instead of:

    mb_detect_encoding($text, 'auto')
    

    More information can be found here: mb_detect_encoding
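
    A minimal sketch of what detection looks like in practice, assuming the mbstring extension is loaded. The candidate list and the sample strings are illustrative choices, not part of the original answer. Passing an explicit candidate list with strict mode switched on tends to be more predictable than relying on the default detection order:

```php
<?php
// Illustrative sample: "Grüße" encoded as ISO-8859-1 (0xFC = ü, 0xDF = ß).
$latin1 = "Gr\xFC\xDFe";
// The same word encoded as UTF-8.
$utf8 = "Gr\xC3\xBC\xC3\x9Fe";

// Candidates are tried against the bytes; strict mode (third argument)
// rejects any candidate for which the bytes are not valid.
$candidates = ['UTF-8', 'ISO-8859-1'];

$detectedLatin1 = mb_detect_encoding($latin1, $candidates, true);
$detectedUtf8   = mb_detect_encoding($utf8, $candidates, true);

echo $detectedLatin1, PHP_EOL;
echo $detectedUtf8, PHP_EOL;
```

    Because ISO-8859-1 accepts every byte sequence, it only makes sense as the last candidate; UTF-8 should come first so that valid UTF-8 input is recognized as such.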

  • 2020-11-22 03:27

    A little heads up. You said that the "ß" should be displayed as "Ÿ" in your database.

    This is probably because you're using a database with Latin-1 character encoding, or possibly because your PHP-MySQL connection is set up wrong. That is, PHP believes your MySQL is set to use UTF-8, so it sends data as UTF-8, but your MySQL believes PHP is sending data encoded as ISO 8859-1, so it may once again try to encode your data as UTF-8, causing this kind of trouble.

    Take a look at mysql_set_charset (or mysqli_set_charset, since the mysql_* functions were removed in PHP 7). It may help you.
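
    To see how this kind of mismatch garbles text, here is a small sketch, assuming the mbstring extension; the variable names are illustrative. It takes the UTF-8 bytes of "ß" and re-encodes them as if they were Windows-1252 (which is what MySQL calls latin1), and that is where the "Ÿ" comes from:

```php
<?php
// "ß" is the two bytes 0xC3 0x9F in UTF-8.
$utf8 = "\xC3\x9F";

// Misread those bytes as Windows-1252 and encode them to UTF-8 again:
// 0xC3 becomes "Ã" and 0x9F becomes "Ÿ".
$doubleEncoded = mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');

// Reversing the wrong step recovers the original character.
$repaired = mb_convert_encoding($doubleEncoded, 'Windows-1252', 'UTF-8');

echo bin2hex($doubleEncoded), PHP_EOL;
echo bin2hex($repaired), PHP_EOL;
```

    Repairing data after the fact is fragile, so it is usually better to declare the connection charset right after connecting so that both sides agree on what is being sent.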

  • 2020-11-22 03:28

    This version is for the German language, but you can modify $CHARSETS and $TESTCHARS.

    class CharsetDetector
    {
    private static $CHARSETS = array(
    "ISO_8859-1",
    "ISO_8859-15",
    "CP850"
    );
    private static $TESTCHARS = array(
    "€",
    "ä",
    "Ä",
    "ö",
    "Ö",
    "ü",
    "Ü",
    "ß"
    );
    public static function convert($string)
    {
        return self::__iconv($string, self::getCharset($string));
    }
    public static function getCharset($string)
    {
        $normalized = self::__normalize($string);
        if(!strlen($normalized))return "UTF-8";
        $best = "UTF-8";
        $charcountbest = 0;
        foreach (self::$CHARSETS as $charset) {
            $str = self::__iconv($normalized, $charset);
            $charcount = 0;
            $stop   = mb_strlen( $str, "UTF-8");
    
            for( $idx = 0; $idx < $stop; $idx++)
            {
                $char = mb_substr( $str, $idx, 1, "UTF-8");
                foreach (self::$TESTCHARS as $testchar) {
    
                    if($char == $testchar)
                    {
    
                        $charcount++;
                        break;
                    }
                }
            }
            if($charcount>$charcountbest)
            {
                $charcountbest=$charcount;
                $best=$charset;
            }
            //echo $text."<br />";
        }
        return $best;
    }
    private static function __normalize($str)
    {
        // Collect the bytes that are not part of a valid UTF-8 sequence.
        $len = strlen($str);
        $ret = "";
        for ($i = 0; $i < $len; $i++) {
            $c = ord($str[$i]);
            if ($c > 127) {
                $bytes = 1; // reset for each lead byte so no stale value is reused
                if ($c > 247) $ret .= $str[$i];
                elseif ($c > 239) $bytes = 4;
                elseif ($c > 223) $bytes = 3;
                elseif ($c > 191) $bytes = 2;
                else $ret .= $str[$i];
                if (($i + $bytes) > $len) { $ret .= substr($str, $i); break; } // truncated sequence at the end
                $ret2 = $str[$i];
                while ($bytes > 1) {
                    $i++;
                    $b = ord($str[$i]);
                    if ($b < 128 || $b > 191) { $ret .= $ret2; $ret2 = ""; $i += $bytes - 1; $bytes = 1; break; }
                    else $ret2 .= $str[$i];
                    $bytes--;
                }
            }
        }
        return $ret;
    }
    private static function __iconv($string, $charset)
    {
        return iconv ( $charset, "UTF-8" , $string );
    }
    }
    

  • 2020-11-22 03:31

    It's simple: when you get something that's not UTF-8, you must encode that into UTF-8.

    So, when you're fetching a feed that's ISO 8859-1, pass it through utf8_encode.

    However, if you're fetching a UTF-8 feed, you don't need to do anything.
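
    A minimal sketch of that rule, assuming the mbstring extension and assuming that anything which is not valid UTF-8 is ISO 8859-1. The function name to_utf8 is made up for illustration. It uses mb_convert_encoding rather than utf8_encode, which is deprecated as of PHP 8.2:

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, treat everything else as ISO-8859-1.
function to_utf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already UTF-8, nothing to do
    }
    // Assumption: the only other encoding in play is ISO-8859-1.
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

$fromLatin1 = to_utf8("Stra\xDFe");     // ISO-8859-1 bytes for "Straße"
$fromUtf8   = to_utf8("Stra\xC3\x9Fe"); // UTF-8 bytes for "Straße"

echo bin2hex($fromLatin1), PHP_EOL;
echo bin2hex($fromUtf8), PHP_EOL;
```

    The validity check is what keeps a UTF-8 feed from being encoded a second time. It cannot tell ISO 8859-1 apart from other single-byte encodings, so feeds in those would need their declared charset instead.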

  • 2020-11-22 03:31

    Get the encoding from the response headers and convert the content to UTF-8.

    $post_url='http://website.domain';
    
    /// Get headers ////////////////////////////////////////////////////////////
    function get_headers_curl($url) 
    { 
        $ch = curl_init(); 
    
        curl_setopt($ch, CURLOPT_URL,            $url); 
        curl_setopt($ch, CURLOPT_HEADER,         true); 
        curl_setopt($ch, CURLOPT_NOBODY,         true); 
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); 
        curl_setopt($ch, CURLOPT_TIMEOUT,        15); 
    
        $r = curl_exec($ch); 
        return $r; 
    }
    $the_header = get_headers_curl($post_url);
    /// check for redirect /////////////////////////////////////////////////
    if (preg_match("/Location:/i", $the_header)) {
        $arr = preg_split('/Location:/i', $the_header); // split case-insensitively, like the check above
        $location = $arr[1];
    
        $location = explode(chr(10), $location);
        $location = $location[0];
    
        $the_header = get_headers_curl(trim($location));
    }
    /// Get charset /////////////////////////////////////////////////////////////////////
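    $charset = null; // stays null when the headers carry no charset
    if (preg_match("/charset=/i", $the_header)) {
        $arr = preg_split('/charset=/i', $the_header); // split case-insensitively, like the check above
        $charset = $arr[1];
    
        $charset = explode(chr(10), $charset);
        $charset = strtoupper(trim($charset[0], " \t\r\n\"'")); // drop the trailing "\r" and any quotes
    }
    ///////////////////////////////////////////////////////////////////////////////
    // echo $charset;
    
    // $html is assumed to already hold the page body fetched from $post_url.
    if ($charset && $charset != 'UTF-8') { $html = iconv($charset, "UTF-8", $html); }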
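
    The charset value in a raw header block can carry a trailing carriage return, quotes, or mixed case, so plain string splitting is easy to trip up. A hedged sketch of a more tolerant parser follows. The function name charset_from_headers and the sample header text are made up for illustration, and it only looks at the Content-Type line of a raw header string such as the one cURL returns with CURLOPT_HEADER:

```php
<?php
// Hypothetical helper: pull the charset out of a raw HTTP header block.
// Returns null when the Content-Type line carries no charset.
function charset_from_headers(string $headers): ?string
{
    // m: ^ matches at each line start, i: header names are case-insensitive.
    $pattern = '/^Content-Type:[^\r\n]*?charset\s*=\s*"?([A-Za-z0-9._:-]+)/mi';
    if (preg_match($pattern, $headers, $match)) {
        return strtoupper($match[1]);
    }
    return null;
}

// Sample header blocks, made up for illustration.
$sample = "HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=iso-8859-1\r\nContent-Length: 10\r\n\r\n";
$charset = charset_from_headers($sample);
$missing = charset_from_headers("HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n");

var_dump($charset, $missing);
```

    A null result means the server did not declare a charset, in which case falling back to a meta element or to detection is a reasonable next step.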
    