I'm working on a web crawler that grabs data from sites all over the world and has to deal with distinct languages and encodings.
Currently I'm using the following
Rather than blindly trying to detect the encoding, you should first check whether the page you downloaded declares a character set. It may appear in the HTTP response header, for example:
Content-Type: text/html; charset=utf-8
Or in the HTML as a meta tag, for example:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Only if neither is available should you try to guess the encoding, with mb_detect_encoding() or another method.
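A minimal sketch of that order of checks (the helper name extractCharset(), its inputs, and the regexes are illustrative assumptions, not a battle-tested parser):

// Prefer the declared charset; only guess as a last resort.
// $headers is the raw response header block, $html the body (assumed inputs).
function extractCharset($headers, $html) {
    // 1. HTTP response header, e.g. "Content-Type: text/html; charset=utf-8"
    if (preg_match('/charset=([\w-]+)/i', $headers, $m)) return strtoupper($m[1]);
    // 2. HTML meta tag in the document itself
    if (preg_match('/<meta[^>]+charset=["\']?([\w-]+)/i', $html, $m)) return strtoupper($m[1]);
    // 3. Guess; mb_detect_encoding() returns false if nothing matches
    return mb_detect_encoding($html, 'UTF-8, ISO-8859-1, Windows-1251', true);
}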
It's not possible to detect the character set of a string with 100% accuracy, since some character sets are subsets of others. Set the character set explicitly whenever you can, and avoid mixing iconv and mbstring functions. I recommend using a function like the one below and supplying the $from charset whenever possible:
function convertEncoding($str, $from = 'auto', $to = 'UTF-8') {
    // mb_detect_encoding() returns false when it cannot decide;
    // fall back to ISO-8859-1 (an assumption) rather than pass false on.
    if ($from == 'auto') $from = mb_detect_encoding($str) ?: 'ISO-8859-1';
    return mb_convert_encoding($str, $to, $from);
}
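For example (the variable name and the Windows-1251 source are just an illustration):

$utf8 = convertEncoding($body, 'Windows-1251'); // explicit source charset, no guessing
$utf8 = convertEncoding($body);                 // falls back to mb_detect_encoding()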
You can try utf8_encode($str), but note that it only converts from ISO-8859-1 (Latin-1) to UTF-8 and will mangle input in any other encoding.
http://www.php.net/manual/en/function.utf8-encode.php#89789
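A minimal example of that conversion (0xE9 is é in ISO-8859-1):

$latin1 = "Caf\xE9";            // "Café" as ISO-8859-1 bytes
$utf8   = utf8_encode($latin1); // now valid UTF-8: "Café"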
Or, once the content has been converted, you can replace the Content-Type meta tag in the head of the crawled content with
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
so that the stored markup declares the encoding it now actually uses.
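A sketch of that rewrite with preg_replace(), assuming the tag appears in the common http-equiv form (real-world markup varies, so the regex is an assumption):

// Rewrite the declared charset after the body has been converted to UTF-8.
$html = preg_replace(
    '/<meta\s+http-equiv=["\']Content-Type["\'][^>]*>/i',
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />',
    $html
);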