I have a data file (an Apple plist, to be exact) that has Unicode codepoints like \\U00e8 and \\U2019. I need to turn these into valid hexadecimal HTML entities.
Here's a correct answer. It deals with the fact that those are UTF-16 code units, not code points, so it can also decode supplementary characters, which appear as surrogate pairs.
function unenc_utf16_code_units($string) {
    /* go for possible surrogate pairs first */
    $string = preg_replace_callback(
        '/\\\\U(D[89ab][0-9a-f]{2})\\\\U(D[c-f][0-9a-f]{2})/i',
        function ($matches) {
            $hi_surr = hexdec($matches[1]);
            $lo_surr = hexdec($matches[2]);
            /* each surrogate carries 10 bits of (scalar - 0x10000) */
            $scalar = 0x10000 + ((($hi_surr & 0x3FF) << 10) |
                ($lo_surr & 0x3FF));
            return "&#x" . dechex($scalar) . ";";
        }, $string);
    /* now the rest: lone code units in the Basic Multilingual Plane */
    $string = preg_replace_callback('/\\\\U([0-9a-f]{4})/i',
        function ($matches) {
            // hexdec/dechex round trip just to remove leading zeros
            return "&#x" . dechex(hexdec($matches[1])) . ";";
        }, $string);
    return $string;
}
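
To make the surrogate-pair step easier to check by hand, here is a minimal sketch of the same arithmetic on its own. The helper name `surrogates_to_scalar` is hypothetical and not part of the function above. It assumes the input really is a valid high/low surrogate pair, such as `\UD83D\UDE00`.

```php
<?php
// Hypothetical helper for illustration only: combines a UTF-16 high/low
// surrogate pair into the Unicode scalar value it encodes.
// Assumes $hi is in D800..DBFF and $lo is in DC00..DFFF (not validated here).
function surrogates_to_scalar(int $hi, int $lo): int {
    // Each surrogate carries 10 payload bits; together they hold
    // (scalar - 0x10000), so the offset is added back at the end.
    return 0x10000 + (($hi & 0x3FF) << 10) + ($lo & 0x3FF);
}

// 0xD83D & 0x3FF = 0x03D, shifted left by 10 gives 0xF400
// 0xDE00 & 0x3FF = 0x200
// 0x10000 + 0xF400 + 0x200 = 0x1F600
$scalar = surrogates_to_scalar(0xD83D, 0xDE00);
echo "&#x" . dechex($scalar) . ";\n"; // prints &#x1f600;
```

With the function above, and assuming the file contains a single literal backslash before each `U`, an input of `caf\U00e9 \UD83D\UDE00` should come out as `caf&#xe9; &#x1f600;`. The pair is consumed by the first pass, so the second pass never sees its two halves as separate code units.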