Detect encoding and make everything UTF-8

暗喜 2020-11-22 03:03

I'm reading lots of text from various RSS feeds and inserting it into my database.

Of course, there are several different character encodings used in the fee

24 answers
  •  情话喂你
    2020-11-22 03:11

    Detecting the encoding is hard.

    mb_detect_encoding works by guessing, based on a list of candidate encodings that you pass it. In some encodings, certain byte sequences are invalid, and therefore it can rule those candidates out. Unfortunately, there are many encodings in which the same bytes are valid but stand for different characters. In these cases, there is no way to determine the encoding from the bytes alone; you can implement your own logic to make guesses. For example, data coming from a Japanese site is more likely to use a Japanese encoding.
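To illustrate the ambiguity described above, here is a minimal sketch. It assumes the mbstring extension is loaded, and the choice of the byte 0xA4 and of ISO-8859-15 as the second encoding is just an example picked for this illustration.

```php
<?php
// The single byte 0xA4 is valid in both ISO-8859-1 and ISO-8859-15,
// but it stands for a different character in each of them.
$bytes = "\xA4";

var_dump(mb_check_encoding($bytes, 'ISO-8859-1'));   // bool(true)
var_dump(mb_check_encoding($bytes, 'ISO-8859-15'));  // bool(true)
var_dump(mb_check_encoding($bytes, 'UTF-8'));        // bool(false): a lone continuation byte

// Decoding the same byte under each assumption gives different text.
$asLatin1 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');   // U+00A4 CURRENCY SIGN
$asLatin9 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-15');  // U+20AC EURO SIGN

var_dump($asLatin1 === $asLatin9);  // bool(false)
```

Validity checks can only eliminate UTF-8 here; choosing between the two single-byte encodings needs outside knowledge, such as where the feed comes from.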

    As long as you only deal with Western European languages, the three major encodings to consider are UTF-8, ISO-8859-1 and Windows-1252 (CP1252). Since these are the defaults on many platforms, they are also the ones most likely to be reported wrongly. If people use a different encoding, they are likely to be frank about it, since otherwise their software would break very often. Therefore, a good strategy is to trust the provider, unless the encoding is reported as one of those three. You should still double-check that the data is indeed valid in the reported encoding, using mb_check_encoding (note that valid is not the same as correct: the same input may be valid in many encodings). If the reported encoding is one of those three, you can then use mb_detect_encoding to distinguish between them. Luckily that is fairly deterministic; you just need to use the proper detect sequence, which is UTF-8,ISO-8859-1,WINDOWS-1252.
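The strategy above could be sketched roughly as follows. The function name resolveEncoding, the alias list, and the final fallback are assumptions made for this illustration and are not part of mbstring; the try/catch assumes PHP 8, where an unknown encoding name raises a ValueError.

```php
<?php
// Hypothetical helper (not an mbstring function): pick the encoding to assume for $bytes.
function resolveEncoding(string $bytes, ?string $declared): string
{
    // Detect sequence quoted in the answer.
    $detectOrder = ['UTF-8', 'ISO-8859-1', 'WINDOWS-1252'];
    // Declared names treated as unreliable; the aliases are an assumption of this sketch.
    $ambiguous = ['UTF-8', 'UTF8', 'ISO-8859-1', 'LATIN1', 'WINDOWS-1252', 'CP1252'];

    $name = $declared === null ? null : trim($declared);

    if ($name !== null && !in_array(strtoupper($name), $ambiguous, true)) {
        try {
            // Trust the provider, but only if the bytes are valid in that encoding.
            if (mb_check_encoding($bytes, $name)) {
                return $name;
            }
        } catch (\ValueError $e) {
            // PHP 8 throws when mbstring does not know the name; fall through to detection.
        }
    }

    // Strict mode rejects candidates in which the bytes are invalid.
    $detected = mb_detect_encoding($bytes, $detectOrder, true);

    // Arbitrary fallback chosen for this sketch.
    return $detected !== false ? $detected : 'WINDOWS-1252';
}

// Two hiragana characters encoded as Shift_JIS: the declaration is valid, so it is kept.
echo resolveEncoding("\x82\xB1\x82\xF1", 'SJIS'), "\n";
// Declared as UTF-8 but not valid UTF-8, so the declaration is overridden by detection.
echo resolveEncoding("caf\xE9", 'UTF-8'), "\n";
```

One caveat worth keeping in mind: every byte string is valid ISO-8859-1, so which of the two single-byte candidates is returned for non-UTF-8 input may depend on the PHP version.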

    Once you've detected the encoding, you need to convert the text to your internal representation (UTF-8 is the only sane choice). The function utf8_encode transforms ISO-8859-1 to UTF-8, so it can only be used for that particular input type (it is also deprecated as of PHP 8.2). For other encodings, use mb_convert_encoding.
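As a minimal sketch of the conversion step, assuming the source encoding has already been decided: the wrapper name toUtf8 is hypothetical, and the sample byte strings are made up for this illustration.

```php
<?php
// Hypothetical wrapper: convert $bytes from the encoding $from to UTF-8.
function toUtf8(string $bytes, string $from): string
{
    // Nothing to do when the input is already in the target encoding.
    if (strcasecmp($from, 'UTF-8') === 0) {
        return $bytes;
    }
    return mb_convert_encoding($bytes, 'UTF-8', $from);
}

$fromLatin1 = toUtf8("caf\xE9", 'ISO-8859-1');   // "café"
$fromCp1252 = toUtf8("\x80", 'Windows-1252');    // euro sign, U+20AC
$fromSjis   = toUtf8("\x82\xB1", 'SJIS');        // hiragana "ko", U+3053

echo bin2hex($fromLatin1), "\n";  // 636166c3a9
echo bin2hex($fromCp1252), "\n";  // e282ac
echo bin2hex($fromSjis), "\n";    // e38193
```

Unlike utf8_encode, the same call handles any source encoding that mbstring supports, so one code path covers every feed.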
