发表新帖

发表新帖

Detect encoding and make everything UTF-8

前端未结

关注

 24  2409

暗喜 2020-11-22 03:03

I\'m reading out lots of texts from various RSS feeds and inserting them into my database.

Of course, there are several different character encodings used in the fee

24条回答

-上瘾入骨i (楼主)

2020-11-22 03:23
When you try to handle multi languages like Japanese and Korean you might get in trouble. mb_convert_encoding with 'auto' parameter doesn't work well. Setting mb_detect_order('ASCII,UTF-8,JIS,EUC-JP,SJIS,EUC-KR,UHC') doesn't help since it will detect EUC-* wrongly.

I concluded that as long as input strings comes from HTML, it should use 'charset' in a meta element. I use Simple HTML DOM Parser because it supports invalid HTML.

The below snippet extracts title element from a web page. If you would like to convert entire page, then you may want to remove some lines.
```
find('title', 0);
    if (empty($title)) {
        return null;
    }
    $title = $title->plaintext;
    $metas = $dom->find('meta');
    $charset = 'auto';
    foreach ($metas as $meta) {
        if (!empty($meta->charset)) { // html5
            $charset = $meta->charset;
        } else if (preg_match('@charset=(.+)@', $meta->content, $match)) {
            $charset = $match[1];
        }
    }
    if (!in_array(strtolower($charset), array_map('strtolower', mb_list_encodings()))) {
        $charset = 'auto';
    }
    return mb_convert_encoding($title, 'UTF-8', $charset);
}
```
0 讨论(0)

查看其它24个回答
发布评论:

提交评论
- 加载中...

热议问题