Detect encoding and make everything UTF-8

暗喜 2020-11-22 03:03

I'm reading lots of text from various RSS feeds and inserting it into my database.

Of course, there are several different character encodings used in the fee

24 Answers
  •  -上瘾入骨i
    2020-11-22 03:23

    When you try to handle multiple languages such as Japanese and Korean, you may run into trouble. mb_convert_encoding with the 'auto' parameter doesn't work well. Setting mb_detect_order('ASCII,UTF-8,JIS,EUC-JP,SJIS,EUC-KR,UHC') doesn't help either, since it detects the EUC-* encodings incorrectly.
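One way to see why byte-level detection is unreliable here is to check which encodings accept the same bytes. The sketch below is only an illustration: the helper name `plausible_encodings` is hypothetical, and it relies solely on the standard mbstring function `mb_check_encoding`.

```php
<?php
// Hypothetical helper: return every candidate encoding in which $bytes is well-formed.
function plausible_encodings(string $bytes, array $candidates): array
{
    return array_values(array_filter(
        $candidates,
        fn (string $enc): bool => mb_check_encoding($bytes, $enc)
    ));
}

// The Korean word "한글" encoded as EUC-KR. The same four bytes also form
// two well-formed EUC-JP characters, so the bytes alone cannot settle the question.
$bytes = "\xC7\xD1\xB1\xDB";
$matches = plausible_encodings($bytes, ['ASCII', 'UTF-8', 'EUC-JP', 'EUC-KR']);
print_r($matches);
```

When more than one legacy encoding accepts the input, the detection order decides the result, which is why a declared charset is more trustworthy than a guess.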

    I concluded that as long as the input strings come from HTML, you should use the 'charset' declared in a meta element. I use Simple HTML DOM Parser because it tolerates invalid HTML.

    The snippet below extracts the title element from a web page. If you would like to convert the entire page, you may want to remove some lines.

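    <?php
    // The opening lines of this snippet were lost in extraction. The include path,
    // the function name page_title(), and the use of file_get_html() are assumptions.
    require_once 'simple_html_dom.php';

    function page_title($url)
    {
        $dom = file_get_html($url);
        if (!$dom) {
            return null;
        }
        $title = $dom->find('title', 0);
        if (empty($title)) {
            return null;
        }
        $title = $title->plaintext;
        $metas = $dom->find('meta');
        // Prefer the charset declared in a meta element; fall back to 'auto'.
        $charset = 'auto';
        foreach ($metas as $meta) {
            if (!empty($meta->charset)) { // HTML5: <meta charset="...">
                $charset = $meta->charset;
                break;
            } elseif (preg_match('@charset=\s*([\w.:-]+)@i', (string) $meta->content, $match)) {
                // HTML4: <meta http-equiv="Content-Type" content="...; charset=...">
                $charset = $match[1];
                break;
            }
        }
        // Ignore charset names that mbstring does not list.
        if (!in_array(strtolower($charset), array_map('strtolower', mb_list_encodings()))) {
            $charset = 'auto';
        }
        return mb_convert_encoding($title, 'UTF-8', $charset);
    }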
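
For the whole-page case mentioned above, a minimal sketch without Simple HTML DOM might look like the following. The helper name `html_to_utf8` and the regular expression are assumptions for illustration, not part of the original answer. Unlike the snippet above, it returns the input unchanged when no usable charset is declared instead of falling back to 'auto'.

```php
<?php
// Hypothetical helper: convert a whole HTML document to UTF-8 using its declared charset.
function html_to_utf8(string $html): string
{
    $charset = null;
    // Matches both <meta charset="..."> and the http-equiv form "...; charset=...".
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?\s*([\w.:-]+)/i', $html, $m)) {
        $charset = $m[1];
    }
    $known = array_map('strtolower', mb_list_encodings());
    if ($charset === null || !in_array(strtolower($charset), $known, true)) {
        // Assumption: with no usable declaration, leave the bytes untouched
        // rather than guess.
        return $html;
    }
    return mb_convert_encoding($html, 'UTF-8', $charset);
}

// Usage: a tiny page declared as EUC-KR whose title is "한글" in EUC-KR bytes.
$html = "<html><head><meta charset=\"EUC-KR\"><title>\xC7\xD1\xB1\xDB</title></head></html>";
$out = html_to_utf8($html);
```

Note that the meta element in the converted page still names the old charset, so you may want to rewrite or drop it before storing the result.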
    
