I'm reading in lots of text from various RSS feeds and inserting it into my database. Of course, there are several different character encodings used in the feeds.
Detecting the encoding is hard.
mb_detect_encoding works by guessing, based on a number of candidates that you pass it. In some encodings, certain byte sequences are invalid, and therefore it can distinguish between various candidates. Unfortunately, there are a lot of encodings where the same bytes are valid (but different). In these cases, there is no way to determine the encoding automatically; you have to implement your own logic to make a guess. For example, data coming from a Japanese site might be more likely to have a Japanese encoding.
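As a minimal sketch, here is how the strict mode of mb_detect_encoding rejects candidates with invalid byte sequences (the sample byte strings are just illustrations):

    <?php
    // Candidate encodings, tried in order.
    $candidates = ['UTF-8', 'ISO-8859-1', 'WINDOWS-1252'];

    // With $strict = true, candidates for which the input contains
    // invalid byte sequences are rejected instead of loosely guessed.
    $utf8Bytes  = "Caf\xC3\xA9";   // "Café" as UTF-8
    $latinBytes = "Caf\xE9";       // "Café" as ISO-8859-1

    var_dump(mb_detect_encoding($utf8Bytes, $candidates, true));  // "UTF-8"
    var_dump(mb_detect_encoding($latinBytes, $candidates, true)); // "ISO-8859-1"

Note that the second string is only reported as ISO-8859-1 because that candidate accepts every byte value; the same bytes would be equally valid cp1252, which is exactly the ambiguity described above.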
As long as you only deal with Western European languages, the three major encodings to consider are UTF-8, ISO-8859-1 and Windows-1252 (cp1252). Since these are the defaults for many platforms, they are also the most likely to be reported wrongly. If people use other encodings, they are likely to be frank about it, since otherwise their software would break very often. Therefore, a good strategy is to trust the provider, unless the encoding is reported as one of those three. You should still double-check that it is indeed valid, using mb_check_encoding (note that valid is not the same as correct: the same input may be valid for many encodings). If it is one of those three, you can then use mb_detect_encoding to distinguish between them. Luckily, that is fairly deterministic; you just need to use the proper detection order, which is UTF-8,ISO-8859-1,WINDOWS-1252.
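Putting that strategy together, a sketch might look like this (the function name and overall structure are mine, not a standard API):

    <?php
    // Sketch: trust the reported encoding unless it is one of the three
    // ambiguous Western defaults, in which case detect it ourselves.
    function resolve_encoding(string $text, string $reported): ?string
    {
        $ambiguous = ['UTF-8', 'ISO-8859-1', 'WINDOWS-1252'];

        if (!in_array(strtoupper($reported), $ambiguous, true)) {
            // Trust the provider, but verify the bytes are actually valid.
            return mb_check_encoding($text, $reported) ? $reported : null;
        }

        // For the three defaults, detection is deterministic as long as
        // the candidates are tried in exactly this order.
        $detected = mb_detect_encoding($text, $ambiguous, true);
        return $detected === false ? null : $detected;
    }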
Once you've detected the encoding, you need to convert it to your internal representation (UTF-8 is the only sane choice). The function utf8_encode transforms ISO-8859-1 to UTF-8, so it can only be used for that particular input type. For other encodings, use mb_convert_encoding.
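For instance, the final conversion step could look like this (the sample bytes are illustrative):

    <?php
    // Convert from the detected encoding to the internal UTF-8 form.
    $encoding = 'WINDOWS-1252';             // e.g. result of the detection above
    $text     = "Smart quotes: \x93hi\x94"; // cp1252 curly quotes

    $utf8 = mb_convert_encoding($text, 'UTF-8', $encoding);
    // $utf8 now contains the curly quotes as valid UTF-8, ready to store.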