I need to process a large list of short strings (mostly in Russian, but any other language is possible, including random garbage from a cat walking on the keyboard).
Som
Here's a PHP algorithm that worked for me.
It's better to fix your data, but if you can't, here's a trick:
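// If one decode pass still yields multibyte UTF-8, the value was probably encoded twice.
if ( mb_detect_encoding( utf8_decode( $value ) ) === 'UTF-8' ) {
    // Double encoded, or bad encoding: strip one layer
    $value = utf8_decode( $value );
}
// Convert whatever is left (Latin-1, Windows-1252, UTF-8, or a mix) to UTF-8
$value = \ForceUTF8\Encoding::toUTF8( $value );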
The library I'm using is: https://github.com/neitanod/forceutf8/
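
Note that utf8_decode() is deprecated as of PHP 8.2. If that matters to you, here's a rough sketch of the same double-encoding check using only mbstring. The function name undo_double_utf8 is something I made up for illustration; it is not part of ForceUTF8, and I'm assuming the extra layer came from treating UTF-8 bytes as ISO-8859-1 (which is what utf8_encode() does).

```php
<?php
// Hypothetical helper (not part of ForceUTF8): undo one layer of
// UTF-8 double encoding, assuming the extra layer was ISO-8859-1 -> UTF-8.
function undo_double_utf8(string $value): string
{
    // Only valid UTF-8 can be double-encoded UTF-8.
    if (!mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }

    // Map every code point U+0000..U+00FF back to a single byte.
    // Anything outside that range is replaced by a substitute character.
    $decoded = mb_convert_encoding($value, 'ISO-8859-1', 'UTF-8');

    // The decode must be lossless: re-encoding has to give the input back.
    $lossless = mb_convert_encoding($decoded, 'UTF-8', 'ISO-8859-1') === $value;

    // Pure ASCII is trivially valid UTF-8, so require at least one high byte.
    $hasHighBytes = preg_match('/[\x80-\xFF]/', $decoded) === 1;

    if ($lossless && $hasHighBytes && mb_check_encoding($decoded, 'UTF-8')) {
        return $decoded; // looked double encoded, one layer removed
    }

    return $value; // leave everything else untouched
}

// Usage: simulate double encoding, then repair it.
$double = mb_convert_encoding('Привет', 'UTF-8', 'ISO-8859-1');
$fixed  = undo_double_utf8($double);
```

This is still a heuristic, same as the original trick: a string that legitimately contains something that looks like mojibake (a literal "Ã©", say) will get "fixed" too. Strings that are not UTF-8 at all are returned unchanged, so you would still pass the result through \ForceUTF8\Encoding::toUTF8() afterwards.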