How can I use PHP's preg_replace function to convert Unicode code points to actual characters/HTML entities?

Posted by 我怕爱的太早我们不能终老 on 2020-01-13 19:15:10

Question


I want to convert a set of Unicode code points in string format to actual characters and/or HTML entities (either result is fine).

For example, if I have the following string assignment:

$str = '\u304a\u306f\u3088\u3046';

I want to use the preg_replace function to convert those Unicode code points to actual characters and/or HTML entities.

As per other Stack Overflow posts I saw for similar issues, I first attempted the following:

$str = '\u304a\u306f\u3088\u3046';
$str2 = preg_replace('/\u[0-9a-f]+/', '&#x$1;', $str);

However, whenever I attempt to do this, I get the following PHP error:

Warning: preg_replace() [function.preg-replace]: Compilation failed: PCRE does not support \L, \l, \N, \U, or \u

I tried all sorts of things like adding the u flag to the regex or changing /\u[0-9a-f]+/ to /\x{[0-9a-f]+}/, but nothing seems to work.

Also, I've looked at all sorts of other relevant pages/posts I could find on the web related to converting Unicode code points to actual characters in PHP, but either I'm missing something crucial, or something is wrong because I can't fix the issue I'm having.

Can someone please offer me a concrete solution on how to convert a string of Unicode code points to actual characters and/or a string of HTML entities?


Answer 1:


From the PHP manual:

Single and double quoted PHP strings have special meaning of backslash. Thus if \ has to be matched with a regular expression \\, then "\\\\" or '\\\\' must be used in PHP code.

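First, your regular expression uses only one backslash (\), so PCRE receives \u and reads it as an escape sequence it does not support, which is what the warning says. As explained in the PHP manual, you need to write \\\\ in PHP code to match a literal backslash (with some exceptions).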
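To see what the regex engine actually receives, it can help to print the pattern string. The following is only a small sketch; the variable names are illustrative and not part of the original answer.

```php
<?php
// In a single-quoted PHP string, \\ collapses to one backslash,
// so four backslashes in the source become two in the pattern.
$pattern = '/\\\\u([0-9a-f]+)/i';
echo $pattern, PHP_EOL;   // prints the pattern that PCRE compiles

// PCRE reads the remaining \\ as "match one literal backslash".
$sample = '\u304a';       // six characters: backslash, u, 3, 0, 4, a
$found = preg_match($pattern, $sample, $m);
var_dump($found);         // 1 when the pattern matched
echo $m[1], PHP_EOL;      // the captured hex digits
```

Three backslashes ('\\\u') happen to produce the same pattern, because \u is not an escape sequence in a single-quoted PHP string and is left untouched.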

Second, your original expression has no capturing group, so the $1 in the replacement refers to nothing. preg_replace() searches the given string for matches to the supplied pattern and replaces each whole match with the replacement string, in which $1 stands for the text captured by the first group.

The updated regular expression with proper escaping and correct capturing groups would look like:

$str2 = preg_replace('/\\\\u([0-9a-f]+)/i', '&#x$1;', $str);

Output (as rendered in a browser):

おはよう

Expression: \\\\u([0-9a-f]+)

  • \\\\ - matches a literal backslash
  • u - matches the literal u character
  • ( - beginning of the capturing group
    • [0-9a-f]+ - character class with the + quantifier -- matches a digit (0 - 9) or a letter (from a - f) one or more times
  • ) - end of capturing group
  • i modifier - used for case-insensitive matching

Replacement: &#x$1;

  • & - literal ampersand character (&)
  • # - literal pound character (#)
  • x - literal character x
  • $1 - contents of the first capturing group -- in this case, the strings of the form 304a etc.
  • ; - literal semicolon that terminates the HTML entity
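If actual characters are wanted rather than entities, one possible follow-up is to pass the result through html_entity_decode(). This is a sketch, not part of the original answer. It assumes every escape has exactly four hex digits, which is why it uses {4} instead of +, and it will also decode any other entities already present in the string.

```php
<?php
$str = '\u304a\u306f\u3088\u3046';

// Step 1: rewrite each \uXXXX sequence as a hexadecimal HTML entity.
$entities = preg_replace('/\\\\u([0-9a-f]{4})/i', '&#x$1;', $str);

// Step 2: decode the entities into UTF-8 characters.
$chars = html_entity_decode($entities, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $entities, PHP_EOL; // &#x304a;&#x306f;&#x3088;&#x3046;
echo $chars, PHP_EOL;    // おはよう
```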





Answer 2:


A page titled Escaping Unicode Characters to HTML Entities in PHP seems to tackle it with this function:

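function unicode_escape_sequences($str){
  // json_encode() escapes non-ASCII characters as \uXXXX sequences
  $working = json_encode($str);
  // Rewrite each \uXXXX escape as a hexadecimal HTML entity
  $working = preg_replace('/\\\\u([0-9a-f]{4})/i', '&#x$1;', $working);
  // json_decode() strips the surrounding quotes and any remaining JSON escapes
  return json_decode($working);
}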

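It uses json_encode to turn the non-ASCII characters of a UTF-8 string into \uXXXX escape sequences, rewrites those as HTML entities, and then uses json_decode to unwrap the result. Your string already contains the escape sequences, so for your example the preg_replace call alone would work: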

$str = '\u304a\u306f\u3088\u3046';
echo preg_replace('/\\\\u([0-9a-f]{4})/i', '&#x$1;', $str);

The output is:

&#x304a;&#x306f;&#x3088;&#x3046;

Which a browser renders as:

おはよう

Which translates to:

Good morning
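Because \uXXXX is the same escape notation that JSON uses, another option is to wrap the string in double quotes and let json_decode() do the conversion directly to characters. This is only a sketch and not something the answers above propose. It assumes the input contains no unescaped double quotes, no control characters, and no backslashes other than valid JSON escapes; otherwise json_decode() returns null.

```php
<?php
$str = '\u304a\u306f\u3088\u3046';

// Treat the text as the body of a JSON string literal.
$chars = json_decode('"' . $str . '"');

if ($chars === null) {
    // The input was not a valid JSON string body.
    echo 'could not decode', PHP_EOL;
} else {
    echo $chars, PHP_EOL; // おはよう
}
```

Unlike the entity replacement, this route also combines UTF-16 surrogate pairs into a single character.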



Source: https://stackoverflow.com/questions/20931113/how-can-i-use-phps-preg-replace-function-to-convert-unicode-code-points-to-actu
