Unicode character in PHP string

生来不讨喜 2020-11-22 14:17

This question looks embarrassingly simple, but I haven't been able to find an answer.

What is the PHP equivalent to the following C# line of code?

s         


        
8 Answers
  • 2020-11-22 14:27

    As mentioned by others, PHP 7 introduced direct support for the \u{...} Unicode code point escape syntax.

    As also mentioned by others, before PHP 7 the only way to obtain a string value from a sensible Unicode character description was to convert it from something else (e.g. JSON parsing or HTML entity decoding). But this comes at a run-time performance cost.

    However, there is one other option. You can encode the character directly in PHP with \x binary escaping. The \x escape syntax is also supported in PHP 5.

    This is especially useful if you prefer not to enter the character directly in a string through its natural form. For example, if it is an invisible control character, or other hard to detect whitespace.

    First, a proof example:

    // Unicode Character 'HAIR SPACE' (U+200A)
    $htmlEntityChar = "&#x200a;";
    $realChar = html_entity_decode($htmlEntityChar);
    $phpChar = "\xE2\x80\x8A";
    echo 'Proof: ';
    var_dump($realChar === $phpChar); // bool(true)
    

    Note that, as mentioned by Pacerier in another answer, this binary code is unique to a specific character encoding. In the above example, \xE2\x80\x8A is the binary coding for U+200A in UTF-8.
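
    To illustrate how the bytes depend on the encoding, here is a minimal sketch that re-encodes the same character with mb_convert_encoding(). It assumes the mbstring extension is installed; the variable names are only illustrative.

```php
<?php
// Assumes the mbstring extension is available.
$utf8 = "\xE2\x80\x8A"; // U+200A (HAIR SPACE) encoded as UTF-8

// Re-encode the same character in two other encodings.
$utf16be = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');
$utf32be = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');

echo bin2hex($utf8), "\n";    // e2808a
echo bin2hex($utf16be), "\n"; // 200a
echo bin2hex($utf32be), "\n"; // 0000200a
```

    The character is the same in all three cases; only the byte sequence that a \x escape would have to spell out changes.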

    The next question is, how do you get from U+200A to \xE2\x80\x8A?

    Below is a PHP script to generate the escape sequence for any character, based on either a JSON string, HTML entity, or any other method once you have it as a native string.

    function str_encode_utf8binary($str) {
        /** @author Krinkle 2018 */
        $output = '';
        foreach (str_split($str) as $octet) {
            // Format each byte as two zero-padded hex digits, for PHP \x syntax
            $output .= sprintf('\x%02x', ord($octet));
        }
        return $output;
    }
    
    function str_convert_html_to_utf8binary($str) {
        return str_encode_utf8binary(html_entity_decode($str));
    }
    function str_convert_json_to_utf8binary($str) {
        return str_encode_utf8binary(json_decode($str));
    }
    
    // Example for raw string: Unicode Character 'INFINITY' (U+221E)
    echo str_encode_utf8binary('∞') . "\n";
    // \xe2\x88\x9e
    
    // Example for HTML: Unicode Character 'HAIR SPACE' (U+200A)
    echo str_convert_html_to_utf8binary('&#x200a;') . "\n";
    // \xe2\x80\x8a
    
    // Example for JSON: Unicode Character 'HAIR SPACE' (U+200A)
    echo str_convert_json_to_utf8binary('"\u200a"') . "\n";
    // \xe2\x80\x8a
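
    If you start from the numeric code point rather than from a string, a similar result can probably be reached with mb_chr(). The sketch below assumes PHP 7.2 or later with the mbstring extension; codepoint_to_escape() is a hypothetical helper name, not part of PHP or of the script above.

```php
<?php
// Assumes PHP 7.2+ with the mbstring extension (for mb_chr()).
// codepoint_to_escape() is a hypothetical helper, not a built-in.
function codepoint_to_escape(int $codePoint): string {
    $char = mb_chr($codePoint, 'UTF-8');
    if ($char === false) {
        throw new InvalidArgumentException('Invalid code point');
    }
    $output = '';
    foreach (str_split($char) as $octet) {
        // %02x zero-pads each byte to two hex digits.
        $output .= sprintf('\x%02x', ord($octet));
    }
    return $output;
}

echo codepoint_to_escape(0x200A), "\n"; // \xe2\x80\x8a
echo codepoint_to_escape(0x221E), "\n"; // \xe2\x88\x9e
```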
    
  • 2020-11-22 14:29

    I wonder why no one has mentioned this yet, but you can do an almost equivalent version using escape sequences in double quoted strings:

    \x[0-9A-Fa-f]{1,2}

    The sequence of characters matching the regular expression is a character in hexadecimal notation.

    ASCII example:

    <?php
        echo("\x48\x65\x6C\x6C\x6F\x20\x57\x6F\x72\x6C\x64\x21");
    ?>
    

    Hello World!

    So for your case, all you need to do is write $str = "\x30\xA2"; but these are bytes, not characters. For a code point in the Basic Multilingual Plane, the byte representation of the code point coincides with its UTF-16 big-endian encoding, so we could print it out directly as such:

    <?php
        header('content-type:text/html;charset=utf-16be');
        echo("\x30\xA2");
    ?>
    

    If you are using a different encoding, you'll need to alter the bytes accordingly (mostly done with a library, though possible by hand too).

    UTF-16 little endian example:

    <?php
        header('content-type:text/html;charset=utf-16le');
        echo("\xA2\x30");
    ?>
    

    UTF-8 example:

    <?php
        header('content-type:text/html;charset=utf-8');
        echo("\xE3\x82\xA2");
    ?>
    

    There is also the pack function, but you can expect it to be slow.
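
    As a rough sketch of the pack() and library routes, the following derives the three byte sequences used above from the code point U+30A2. It assumes the mbstring extension is installed, makes no claim about speed, and only covers code points in the Basic Multilingual Plane.

```php
<?php
// Assumes the mbstring extension is available.
$codePoint = 0x30A2; // KATAKANA LETTER A

// pack('n', ...) writes an unsigned 16-bit big-endian integer, which for a
// code point in the Basic Multilingual Plane is its UTF-16BE encoding.
$utf16be = pack('n', $codePoint);

// pack('v', ...) writes the same value as a little-endian 16-bit integer.
$utf16le = pack('v', $codePoint);

// Let the library convert to UTF-8 instead of computing the bytes by hand.
$utf8 = mb_convert_encoding($utf16be, 'UTF-8', 'UTF-16BE');

echo bin2hex($utf16be), "\n"; // 30a2
echo bin2hex($utf16le), "\n"; // a230
echo bin2hex($utf8), "\n";    // e382a2
```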

  • 2020-11-22 14:32
    html_entity_decode('&#x30a8;', 0, 'UTF-8');
    

    This works too. However, the json_decode() solution is a lot faster (around 50 times).
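
    For comparison, a minimal sketch of the json_decode() approach next to html_entity_decode(). It only checks that both yield the same UTF-8 bytes; it does not measure the speed difference.

```php
<?php
// Single-quoted PHP string, so the backslash reaches json_decode() untouched.
$viaJson = json_decode('"\u30a8"');
$viaHtml = html_entity_decode('&#x30a8;', 0, 'UTF-8');

var_dump($viaJson === $viaHtml); // bool(true)
echo bin2hex($viaJson), "\n";    // e382a8
```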

  • 2020-11-22 14:36

    PHP 7.0.0 has introduced the "Unicode codepoint escape" syntax.

    It's now possible to write Unicode characters easily by using a double-quoted or a heredoc string, without calling any function.

    $unicodeChar = "\u{1000}";
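
    A short check of what the escape produces, assuming PHP 7.0 or later. The escape always emits the UTF-8 encoding of the code point, and it is not interpreted in single-quoted strings.

```php
<?php
// Requires PHP 7.0 or later.
$unicodeChar = "\u{1000}"; // MYANMAR LETTER KA

echo bin2hex($unicodeChar), "\n"; // e18080 (UTF-8 bytes)
echo strlen($unicodeChar), "\n";  // 3 (bytes, not characters)

// Single-quoted strings leave the escape as literal text.
echo strlen('\u{1000}'), "\n";    // 8
```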
    
  • 2020-11-22 14:39
    function unicode_to_textstring($str){
    
        $rawstr = pack('H*', $str);
    
        $newstr =  iconv('UTF-16BE', 'UTF-8', $rawstr);
        return $newstr;
    }
    

    $msg = '67714eac99c500200054006f006b0079006f002000530074006100740069006f006e003a0020';

    // Pass $msg (the original passed an undefined $str)
    echo unicode_to_textstring($msg);

  • 2020-11-22 14:44

    Before version 7, PHP does not know these Unicode escape sequences. But as unknown escape sequences remain unaffected, you can write your own function that converts such Unicode escape sequences:

    function unicodeString($str, $encoding=null) {
        if (is_null($encoding)) $encoding = ini_get('mbstring.internal_encoding');
        return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/u', create_function('$match', 'return mb_convert_encoding(pack("H*", $match[1]), '.var_export($encoding, true).', "UTF-16BE");'), $str);
    }
    

    Or, with an anonymous function instead of create_function (which was deprecated in PHP 7.2 and removed in PHP 8.0):

    function unicodeString($str, $encoding=null) {
        // mb_internal_encoding() replaces the deprecated mbstring.internal_encoding ini setting
        if (is_null($encoding)) $encoding = mb_internal_encoding();
        return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/u', function($match) use ($encoding) {
            return mb_convert_encoding(pack('H*', $match[1]), $encoding, 'UTF-16BE');
        }, $str);
    }
    

    Its usage:

    $str = unicodeString("\u1000");
    