UCS2/HexEncoded characters to UTF8 in PHP

Posted by 北城以北 on 2019-12-07 14:55:58

Question


I previously asked how to get a UCS-2/hex-encoded string from UTF-8, and got some help at the following link.

UCS2/HexEncoded characters

But now I need to get the correct UTF-8 from a UCS-2/HexEncoded string in PHP.

For the following strings:

00480065006C006C006F will return 'Hello'

06450631062d0628064b06270020063906270644064500200021 will return 'مرحبًا عالم !' (Arabic)


Answer 1:


You can decode the hex representation by converting each pair of hexadecimal characters with hexdec(), packing the resulting bytes into a string, and then using mb_convert_encoding() to convert from UCS-2 to UTF-8. As I mentioned in my answer to your other question, you'll still need to be careful with the output encoding, although here you've specifically requested UTF-8, so we'll use that for the upcoming sample.

Here's a sample that converts hex-encoded UCS-2 to a native UTF-8 string. PHP versions before 5.4 don't ship with a hex2bin() function, which would make things very easy (it is built in from PHP 5.4 onward), so we'll use the one posted at the reference link at the end. I've renamed it to local_hex2bin() so that it doesn't conflict with the built-in function or with a definition in some other third-party code that you include in your project.

<?php
// Decode a hex string into raw bytes, two hex digits per byte.
function local_hex2bin($h)
{
    if (!is_string($h)) return null;
    $r = '';
    for ($a = 0; $a + 1 < strlen($h); $a += 2) {
        // Combine two hex digits into one byte.
        $r .= chr(hexdec($h[$a] . $h[$a + 1]));
    }
    return $r;
}

header('Content-Type: text/html; charset=UTF-8');
mb_http_output('UTF-8');
echo '<html><head>';
echo '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />';
echo '</head><body>';
echo 'output encoding: '.mb_http_output().'<br />';
$querystring = $_SERVER['QUERY_STRING'];
// NOTE: we could substitute one of the following:
// $querystring = '06450631062d0628064b06270020063906270644064500200021';
// $querystring = '00480065006C006C006F';
$ucs2string = local_hex2bin($querystring);
// NOTE: The source encoding could also be UTF-16 here.
// TODO: Should check byte-order-mark, if available, in case
//       16-bit-aligned bytes are reversed.
$utf8string = mb_convert_encoding($ucs2string, 'UTF-8', 'UCS-2');
// The query string is user-controlled, so escape it before echoing into HTML.
echo 'query string: '.htmlspecialchars($querystring, ENT_QUOTES, 'UTF-8').'<br />';
echo 'converted string: '.htmlspecialchars($utf8string, ENT_QUOTES, 'UTF-8').'<br />';
echo '</body></html>';
?>

Locally, I called this sample page UCS2HexToUTF8.php, and then used a querystring to set the output.

UCS2HexToUTF8.php?06450631062d0628064b06270020063906270644064500200021
--
output encoding: UTF-8
query string: 06450631062d0628064b06270020063906270644064500200021
converted string: مرحبًا عالم !

UCS2HexToUTF8.php?00480065006C006C006F
--
output encoding: UTF-8
query string: 00480065006C006C006F
converted string: Hello
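
Since the sample reads the hex string straight from the query string, it may be worth validating it before decoding. The following is only a sketch and not part of the original answer; the function name is_ucs2_hex() is hypothetical. It checks that the input contains nothing but hex digits and that its length is a multiple of four, since one UCS-2 code unit is four hex digits.

```php
<?php
// Hypothetical helper: true if $hex looks like hex-encoded UCS-2.
function is_ucs2_hex($hex)
{
    return is_string($hex)
        && $hex !== ''
        && ctype_xdigit($hex)        // only 0-9, a-f, A-F
        && strlen($hex) % 4 === 0;   // whole 16-bit code units only
}

var_dump(is_ucs2_hex('00480065006C006C006F')); // bool(true)
var_dump(is_ucs2_hex('0048006'));              // bool(false): incomplete code unit
var_dump(is_ucs2_hex('00zz'));                 // bool(false): not hex
```

A page like the sample could call this first and show an error message instead of attempting the conversion.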

Here's the link to the original source of the hex2bin() function.
PHP: bin2hex(), comment #86123 @ php.net
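
On PHP 5.4 and later, the built-in hex2bin() can stand in for the helper above. Here is a minimal sketch under that assumption; the wrapper name ucs2_hex_to_utf8() is hypothetical, and it assumes big-endian input without a byte-order mark.

```php
<?php
// Hypothetical wrapper; assumes big-endian UCS-2 with no BOM.
function ucs2_hex_to_utf8($hex)
{
    if (!is_string($hex)) {
        return null;
    }
    // hex2bin() warns and returns false on odd-length or non-hex input,
    // so the warning is silenced and the return value checked instead.
    $bytes = @hex2bin($hex);
    if ($bytes === false) {
        return null;
    }
    return mb_convert_encoding($bytes, 'UTF-8', 'UCS-2BE');
}

echo ucs2_hex_to_utf8('00480065006C006C006F'), "\n"; // Hello
```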

Also, as noted in my comments before the call to mb_convert_encoding(), you'll probably want to detect which byte order (endianness) the source uses, especially if the hex strings can come from systems that don't all share the same byte order.

Here's a link that can help you identify the byte-order marks (BOM).
Byte order mark @ Wikipedia
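
As a rough illustration of that check, the sketch below looks for a leading BOM in the decoded bytes and picks the matching mbstring encoding name. The function name ucs2_bytes_to_utf8() and its $default parameter are hypothetical, and falling back to big-endian when no BOM is present is an assumption, not something the data can confirm.

```php
<?php
// Hypothetical helper: chooses the byte order from a leading BOM, if any.
function ucs2_bytes_to_utf8($bytes, $default = 'UCS-2BE')
{
    $bom = substr($bytes, 0, 2);
    if ($bom === "\xFE\xFF") {          // big-endian BOM
        $encoding = 'UCS-2BE';
        $bytes = substr($bytes, 2);     // drop the BOM itself
    } elseif ($bom === "\xFF\xFE") {    // little-endian BOM
        $encoding = 'UCS-2LE';
        $bytes = substr($bytes, 2);
    } else {
        $encoding = $default;           // no BOM: fall back to the caller's guess
    }
    return mb_convert_encoding($bytes, 'UTF-8', $encoding);
}

echo ucs2_bytes_to_utf8(hex2bin('FEFF00480069')), "\n"; // Hi (big-endian BOM)
echo ucs2_bytes_to_utf8(hex2bin('FFFE48006900')), "\n"; // Hi (little-endian BOM)
```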




Answer 2:


A more accurate conversion of UCS-2 to UTF-8

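function ucs2_to_utf8($h)
{
    if (!is_string($h)) return null;
    $r = '';
    // Each UCS-2 code unit is four hex digits; decode one whole unit per step.
    for ($a = 0; $a + 4 <= strlen($h); $a += 4) {
        // mb_chr() (PHP 7.2+) returns the UTF-8 bytes for the code point;
        // plain chr() would keep only the low byte of anything above 0xFF.
        $r .= mb_chr(hexdec(substr($h, $a, 4)), 'UTF-8');
    }
    return $r;
}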

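The problem with the selected answer's helper is that it decodes the hex two digits at a time instead of four, so every 00 high byte becomes a NUL character. If that intermediate string is output without the mb_convert_encoding() step, the NULs show up as � in HTML attribute values such as title="" or alt="". Decoding four hex digits (one UCS-2 code unit) at a time avoids producing them.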
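
To make the NUL issue concrete, here is a small check, sketched with the built-in hex2bin() (PHP 5.4+) rather than either answer's helper. It shows that the two-digit decode alone yields raw UCS-2 bytes containing NULs, and that they disappear once mb_convert_encoding() has run.

```php
<?php
$hex  = '00480065006C006C006F';

// Two hex digits per byte: raw big-endian UCS-2, i.e. "\0H\0e\0l\0l\0o".
$raw  = hex2bin($hex);

// After conversion the string is plain UTF-8 with no NUL bytes.
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'UCS-2BE');

var_dump(strlen($raw),  strpos($raw, "\0") !== false);  // int(10) bool(true)
var_dump(strlen($utf8), strpos($utf8, "\0") !== false); // int(5)  bool(false)
```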



Source: https://stackoverflow.com/questions/2005358/ucs2-hexencoded-characters-to-utf8-in-php
