Convert CESU-8 to UTF-8 with high performance

Question

I have some raw text that is usually a valid UTF-8 string. Every now and then, however, the input turns out to be a CESU-8 string instead. It is technically possible to detect this and convert to UTF-8, but because it happens so rarely I would rather not spend a lot of CPU time on it.

Is there a fast way to detect whether a string is encoded as CESU-8 or UTF-8? I guess I could always blindly convert "UTF-8" to UTF-16LE and then back to UTF-8 using iconv(), and I would probably get the correct result every time because CESU-8 is close enough to UTF-8 for this to work. Can you suggest anything faster? (I expect the input to be CESU-8 rather than valid UTF-8 in roughly 0.01-0.1% of all strings.)

(CESU-8 is a non-standard string format in which each 16-bit surrogate of a surrogate pair is encoded as if it were a UTF-8 character. Technically, UTF-8 strings should contain the characters represented by those surrogate pairs, not the surrogates themselves.)
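To make the difference concrete, here is a minimal sketch that encodes one character outside the Basic Multilingual Plane both ways. The helper name `encodeThreeBytes` and the sample code point U+1F600 are illustrative choices, not part of the original question.

```php
<?php
// Hypothetical helper: encode one 16-bit value as a three-byte
// UTF-8-style sequence (1110xxxx 10xxxxxx 10xxxxxx).
function encodeThreeBytes(int $unit): string
{
    return chr(0xE0 | ($unit >> 12))
         . chr(0x80 | (($unit >> 6) & 0x3F))
         . chr(0x80 | ($unit & 0x3F));
}

$codepoint = 0x1F600; // sample character outside the BMP

// Split the code point into a UTF-16 surrogate pair.
$offset = $codepoint - 0x10000;
$high   = 0xD800 | ($offset >> 10);   // 0xD83D
$low    = 0xDC00 | ($offset & 0x3FF); // 0xDE00

// CESU-8: each surrogate is encoded separately, three bytes each.
$cesu8 = encodeThreeBytes($high) . encodeThreeBytes($low);

// UTF-8: the code point itself is encoded as one four-byte sequence.
$utf8 = chr(0xF0 | ($codepoint >> 18))
      . chr(0x80 | (($codepoint >> 12) & 0x3F))
      . chr(0x80 | (($codepoint >> 6) & 0x3F))
      . chr(0x80 | ($codepoint & 0x3F));

echo bin2hex($cesu8), "\n"; // eda0bdedb880
echo bin2hex($utf8), "\n";  // f09f9880
```

The same character takes six bytes in CESU-8 and four in UTF-8, and every CESU-8 surrogate starts with the byte 0xED.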

Answer 1:

Here's a more efficient version of your conversion function:

$regex = '@(\xED[\xA0-\xAF][\x80-\xBF]\xED[\xB0-\xBF][\x80-\xBF])@';
$s = preg_replace_callback($regex, function($m) {
    $in = unpack("C*", $m[0]);
    $in[2] += 1; // Effectively adds 0x10000 to the codepoint.
    return pack("C*",
        0xF0 | (($in[2] & 0x1C) >> 2),
        0x80 | (($in[2] & 0x03) << 4) | (($in[3] & 0x3C) >> 2),
        0x80 | (($in[3] & 0x03) << 4) | ($in[5] & 0x0F),
        $in[6]
    );
}, $s);

The code only converts high surrogates followed by low surrogates, and converts the two three-byte CESU-8 sequences directly into a four-byte UTF-8 sequence, i.e. from

ED       A0-AF    80-BF    ED       B0-BF    80-BF
11101101 1010aaaa 10bbbbbb 11101101 1011cccc 10dddddd

to

F0-F4    80-BF    80-BF    80-BF
11110oaa 10aabbbb 10bbcccc 10dddddd    // o is "overflow" bit
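As a sanity check of the bit layout above, the following sketch applies the same shuffle to two hand-computed inputs: U+1F600 (no overflow) and U+10FFFF (overflow bit set). The function name `pairToUtf8` is a hypothetical wrapper added for illustration.

```php
<?php
// Hypothetical wrapper around the bit shuffle from the answer;
// $pair must be exactly six bytes: a high surrogate followed by a low one.
function pairToUtf8(string $pair): string
{
    $in = unpack("C*", $pair); // 1-based array of six byte values
    $in[2] += 1;               // aaaa + 1 adds 0x10000; a carry lands in bit 4 ("o")
    return pack("C*",
        0xF0 | (($in[2] & 0x1C) >> 2),                          // 11110oaa
        0x80 | (($in[2] & 0x03) << 4) | (($in[3] & 0x3C) >> 2), // 10aabbbb
        0x80 | (($in[3] & 0x03) << 4) | ($in[5] & 0x0F),        // 10bbcccc
        $in[6]                                                  // 10dddddd
    );
}

// U+1F600: surrogates D83D DE00, CESU-8 bytes ED A0 BD ED B8 80.
echo bin2hex(pairToUtf8("\xED\xA0\xBD\xED\xB8\x80")), "\n"; // f09f9880

// U+10FFFF: surrogates DBFF DFFF, CESU-8 bytes ED AF BF ED BF BF.
echo bin2hex(pairToUtf8("\xED\xAF\xBF\xED\xBF\xBF")), "\n"; // f48fbfbf
```

In the second case `0xAF + 1` becomes `0xB0`, so the carry sets the overflow bit and the lead byte becomes `F4`.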


Answer 2:

CESU-8 strings encode each surrogate (each half of a surrogate pair) as a three-byte sequence of the form:

ED [A0..BF] [80..BF]

That is: 0xED, followed by any byte between 0xA0 and 0xBF (inclusive), followed by any byte between 0x80 and 0xBF (inclusive).

Such a byte sequence cannot appear in any valid UTF-8 string, and it is the only kind of sequence that CESU-8 allows beyond what UTF-8 allows. Checking for it should therefore reliably detect CESU-8, and may be faster than decoding the entire string.
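A minimal sketch of such a check follows. It assumes the input is otherwise well-formed UTF-8 or CESU-8, and the function name `containsEncodedSurrogate` is made up for this example. Because 0xED is only ever a lead byte, looking at the byte that follows it is enough.

```php
<?php
// Hypothetical helper: true if $s contains a UTF-8-style encoded surrogate.
function containsEncodedSurrogate(string $s): bool
{
    // Cheap pre-check: without any 0xED byte there can be no encoded surrogate.
    if (strpos($s, "\xED") === false) {
        return false;
    }
    // No /u modifier, so the pattern is matched against raw bytes.
    return preg_match('/\xED[\xA0-\xBF]/', $s) === 1;
}

var_dump(containsEncodedSurrogate("plain ASCII"));              // bool(false)
var_dump(containsEncodedSurrogate("\xF0\x9F\x98\x80"));         // bool(false), U+1F600 as UTF-8
var_dump(containsEncodedSurrogate("\xED\x95\x9C"));             // bool(false), U+D55C is valid UTF-8
var_dump(containsEncodedSurrogate("\xED\xA0\xBD\xED\xB8\x80")); // bool(true), U+1F600 as CESU-8
```

The third call shows why the second byte matters: valid characters in the range U+D000 to U+D7FF also start with 0xED, but their second byte is always between 0x80 and 0x9F.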

Answer 3:

Here is the implementation I'm currently using:

/**
 * @param string $s raw input with UTF-8 or CESU-8 encoding
 * @return string input with UTF-8 encoding
 * @license MIT
 */
protected function verifyValidUtf8($s)
{
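    # match only a high surrogate (ED A0..AF xx) directly followed by a low surrogate (ED B0..BF xx);
    # any other combination is left for the U+FFFD replacement below
    $s = preg_replace_callback('@\xED[\xA0-\xAF][\x80-\xBF]\xED[\xB0-\xBF][\x80-\xBF]@', function ($m)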
    {
        $bytes = unpack("C*", $m[0]); # always 6 bytes

        # create UCS-4 character from CESU-8 encoded surrogate pair in $bytes

        # 3 bytes CESU-8 to UNICODE high surrogate:
        $high = (($bytes[1] & 0x0F) << 12) + (($bytes[2] & 0x3F) << 6) + ($bytes[3] & 0x3F);
        # 3 bytes CESU-8 to UNICODE low surrogate:
        $low = (($bytes[4] & 0x0F) << 12) + (($bytes[5] & 0x3F) << 6) + ($bytes[6] & 0x3F);

        $codepoint = ($high & 0x03FF) << 10 | ($low & 0x03FF);
        $codepoint += 0x10000;
        return mb_convert_encoding(pack("N", $codepoint), "UTF-8", "UTF-32");
    }, $s);

    # replace any remaining unpaired surrogates with U+FFFD REPLACEMENT CHARACTER
    return preg_replace('@\xED[\xA0-\xBF][\x80-\xBF]@', "\xEF\xBF\xBD", $s);
}

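(Note that pack("N", ...) always produces big-endian bytes, independent of the CPU, so the code does not need a pack("V", ...) variant on any architecture. Writing "UTF-32BE" instead of "UTF-32" would make the expected byte order explicit.)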
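Since CESU-8 input is expected to be rare, a cheap early exit can keep the common case close to free. The following is only a sketch under that assumption: `cesu8ToUtf8` is a hypothetical standalone function, and it builds the four UTF-8 bytes by hand rather than going through mbstring.

```php
<?php
// Hypothetical standalone variant with a fast path for strings without 0xED.
function cesu8ToUtf8(string $s): string
{
    // Common case: no 0xED byte means no encoded surrogate, return unchanged.
    if (strpos($s, "\xED") === false) {
        return $s;
    }
    // Combine each high surrogate that is directly followed by a low surrogate.
    $s = preg_replace_callback(
        '@\xED([\xA0-\xAF])([\x80-\xBF])\xED([\xB0-\xBF])([\x80-\xBF])@',
        function ($m) {
            $high = ((ord($m[1]) & 0x0F) << 6) | (ord($m[2]) & 0x3F); // 10 payload bits
            $low  = ((ord($m[3]) & 0x0F) << 6) | (ord($m[4]) & 0x3F); // 10 payload bits
            $cp   = 0x10000 + (($high << 10) | $low);
            return chr(0xF0 | ($cp >> 18))
                 . chr(0x80 | (($cp >> 12) & 0x3F))
                 . chr(0x80 | (($cp >> 6) & 0x3F))
                 . chr(0x80 | ($cp & 0x3F));
        },
        $s
    );
    // Anything still matching is an unpaired surrogate: replace with U+FFFD.
    return preg_replace('@\xED[\xA0-\xBF][\x80-\xBF]@', "\xEF\xBF\xBD", $s);
}

echo bin2hex(cesu8ToUtf8("a\xED\xA0\xBD\xED\xB8\x80b")), "\n"; // 61f09f988062
echo bin2hex(cesu8ToUtf8("x\xED\xA0\xBDy")), "\n";             // 78efbfbd79
```

Whether the `strpos()` check actually pays off depends on how often ordinary text contains characters in the U+D000 to U+D7FF range (for example Hangul), so it is worth measuring on real input.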

Source: https://stackoverflow.com/questions/34151138/convert-cesu-8-to-utf-8-with-high-performance

Tags: php, performance, unicode, utf-8, cesu-8