How would you create a string of all UTF-8 characters?

北海茫月 2021-01-02 20:11

There are many ways to represent the 1 million-plus characters that UTF-8 can encode. Take the Latin capital "A" with macron (Ā). This is Unicode code point U+0100,

5 answers
  • 2021-01-02 20:24

    I quickly translated this from C, but it should give you the idea:

    function encodeUTF8( $inValue ) {
        $result = "";
        // Payload bits still to be written as continuation bytes.
        // Initialised here so out-of-range input cannot leave it undefined.
        $extra = 0;
    
        // Lead byte: the magnitude of the value decides the sequence length.
        if ( $inValue < 0x00000080 ) {
            $result .= chr( $inValue );
        } else if ( $inValue < 0x00000800 ) {
            $result .= chr( 0x00C0 | ( ( $inValue >> 6 ) & 0x001F ) );
            $extra = 6;
        } else if ( $inValue < 0x00010000 ) {
            $result .= chr( 0x00E0 | ( ( $inValue >> 12 ) & 0x000F ) );
            $extra = 12;
        } else if ( $inValue < 0x00200000 ) {
            $result .= chr( 0x00F0 | ( ( $inValue >> 18 ) & 0x0007 ) );
            $extra = 18;
        } else if ( $inValue < 0x04000000 ) {
            // 5- and 6-byte forms belong to the original UTF-8 design only;
            // RFC 3629 stops at U+10FFFF, i.e. at 4 bytes.
            $result .= chr( 0x00F8 | ( ( $inValue >> 24 ) & 0x0003 ) );
            $extra = 24;
        } else if ( $inValue < 0x80000000 ) {
            $result .= chr( 0x00FC | ( ( $inValue >> 30 ) & 0x0001 ) );
            $extra = 30;
        }
    
        // Continuation bytes: 10xxxxxx, six payload bits each, high bits first.
        while ( $extra > 0 ) {
            $result .= chr( 0x0080 | ( ( $inValue >> ( $extra -= 6 ) ) & 0x003F ) );
        }
    
        return $result;
    }
    

    The logic is sound, but I am not sure about the PHP, so be sure to check it over. I have never tried to use chr like this.

    There are a lot of values that you would not want to encode, like 0xD800-0xDFFF (surrogates), 0xE000-0xF8FF (private use) and 0xFFF0-0xFFFF (specials), and there are several other gaps for combining characters and reserved characters.
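
    The ranges above can be turned into a small filter. This is a minimal sketch and not part of the answer: classifyCodePoint is a hypothetical helper and its category labels are assumptions chosen for illustration. It only covers surrogates, the private use area and noncharacters, not combining or unassigned code points.

```php
<?php

// Hypothetical helper: say why a code point might be skipped when
// generating "all characters". Labels are illustrative only.
function classifyCodePoint(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF) {
        return 'out-of-range';   // beyond the Unicode code space
    }
    if ($cp >= 0xD800 && $cp <= 0xDFFF) {
        return 'surrogate';      // never valid in UTF-8
    }
    // Noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.
    if (($cp >= 0xFDD0 && $cp <= 0xFDEF) || ($cp & 0xFFFE) === 0xFFFE) {
        return 'noncharacter';
    }
    if ($cp >= 0xE000 && $cp <= 0xF8FF) {
        return 'private-use';    // encodable, but has no standard meaning
    }
    return 'ok';
}

foreach ([0x41, 0xD7FF, 0xD800, 0xE000, 0xFFFF, 0x110000] as $cp) {
    printf("U+%04X: %s\n", $cp, classifyCodePoint($cp));
}
```

    Whether private-use code points should be skipped depends on what the string is for; they encode without problems.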

  • 2021-01-02 20:28

    Of course the last one wouldn't work :) \x escape sequences are only interpreted in double-quoted strings.

    What's wrong with $char = chr(196).chr(128);? I mean, with chr($a).chr($b).
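
    A short sketch of both points, assuming the two bytes are meant to be the UTF-8 encoding of Ā (U+0100); the variable names are only for illustration.

```php
<?php

$a = 196; // 0xC4, lead byte of a two-byte sequence
$b = 128; // 0x80, continuation byte
$char = chr($a) . chr($b);

$doubleQuoted = "\xC4\x80"; // \x escapes are interpreted: 2 bytes
$singleQuoted = '\xC4\x80'; // no interpretation: 8 literal characters

echo bin2hex($char), "\n";        // c480
echo strlen($doubleQuoted), "\n"; // 2
echo strlen($singleQuoted), "\n"; // 8
```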

  • 2021-01-02 20:31

    I'm not sure you can do this programmatically, mostly because there is a difference between a Unicode code point and a character. See http://www.unicode.org/standard/where for a few examples of characters that are represented by a combination of code points.

    Some code points make no sense on their own and can only be used in conjunction with another character (think accents). See http://www.unicode.org/charts/charindex.html for a list of code points, and look at the section with all the "combining" code points.
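
    To make the code point versus character distinction concrete, here is a small sketch using the Ā from the question. It assumes PHP 7.0 or later for the "\u{...}" escape; the variable names are made up for the example.

```php
<?php

$precomposed = "\u{0100}";  // Ā as a single code point
$combined    = "A\u{0304}"; // "A" followed by COMBINING MACRON

// Both display as Ā, yet the byte sequences differ.
echo bin2hex($precomposed), "\n"; // c480
echo bin2hex($combined), "\n";    // 41cc84

// Count code points: in /u mode "." matches one code point at a time.
$countPrecomposed = preg_match_all('/./u', $precomposed);
$countCombined    = preg_match_all('/./u', $combined);
echo $countPrecomposed, " ", $countCombined, "\n"; // 1 2
```

    A string built by looping over code points contains the combining macron on its own, but never "decides" to attach it to a base letter.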

    Also, for use in testing applications, you'd need something else besides a list of possible UTF-8 code points, namely several invalid/malformed UTF-8 sequences that your app needs to be able to recover gracefully from.

    For this, take a look at Markus Kuhn's Unicode stress test.
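
    As a rough illustration of what "recover gracefully" has to cope with, the sketch below checks a few hand-picked byte strings. The samples are illustrative and not copied from the stress test file, and isValidUtf8 is a hypothetical helper that relies on preg_match returning false when the subject is not valid UTF-8 in /u mode.

```php
<?php

// Hypothetical helper: true if $bytes is well-formed UTF-8.
// In /u mode PCRE validates the subject before matching.
function isValidUtf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

$samples = [
    'valid two-byte sequence (Ā)' => "\xC4\x80",
    'truncated sequence'          => "\xC4",
    'lone continuation byte'      => "\x80",
    'overlong encoding of NUL'    => "\xC0\x80",
    'encoded surrogate U+D800'    => "\xED\xA0\x80",
];

foreach ($samples as $label => $bytes) {
    echo $label, ': ', isValidUtf8($bytes) ? 'valid' : 'invalid', "\n";
}
```

    Only the first sample should be reported as valid.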

  • 2021-01-02 20:48

    You can leverage iconv (or a few other functions) to convert a code point number to a UTF-8 string:

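    // Convert a code point number to its UTF-8 encoding via iconv.
    function unichr($i)
    {
        // pack('V') yields a 32-bit little-endian integer, i.e. one UCS-4LE unit.
        return iconv('UCS-4LE', 'UTF-8', pack('V', $i));
    }
    
    $codeunits = array();
    // Basic Multilingual Plane below the surrogates: U+0000..U+D7FF.
    for ($i = 0; $i<0xD800; $i++)
        $codeunits[] = unichr($i);
    // Rest of the BMP above the surrogates: U+E000..U+FFFE.
    for ($i = 0xE000; $i<0xFFFF; $i++)
        $codeunits[] = unichr($i);
    $all = implode($codeunits);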
    

    (I avoided the surrogate range 0xD800–0xDFFF as they aren't valid to put in UTF-8 themselves; that would be “CESU-8”.)
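
    The same idea can be wrapped in a range helper that skips the surrogates automatically. This is only a sketch: unichr is repeated from the answer so the block runs on its own, utf8Range is a hypothetical name, and it assumes the iconv extension is available.

```php
<?php

// Same function as in the answer above, repeated so this runs standalone.
function unichr($i)
{
    return iconv('UCS-4LE', 'UTF-8', pack('V', $i));
}

// Hypothetical helper: concatenate the UTF-8 encodings of $from..$to,
// skipping the surrogate range U+D800..U+DFFF.
function utf8Range($from, $to)
{
    $out = '';
    for ($i = $from; $i <= $to; $i++) {
        if ($i >= 0xD800 && $i <= 0xDFFF) {
            continue;
        }
        $out .= unichr($i);
    }
    return $out;
}

echo bin2hex(utf8Range(0xD7FF, 0xE000)), "\n"; // ed9fbfee8080
echo bin2hex(unichr(0x1F600)), "\n";           // f09f9880
```

    Calling utf8Range(0, 0x10FFFF) would cover every plane. Whether iconv accepts every remaining value (noncharacters such as U+FFFF, for example) depends on the iconv implementation, so that call is left as something to verify.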

  • 2021-01-02 20:49
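
    <?php
    
    // Encode code point $n as UTF-8; returns '' for values above 0x10FFFF.
    function chr_utf8($n, $f = 'C*') {
        if ($n < (1 << 7)) {
            return chr($n);                                   // 1 byte
        }
        if ($n < (1 << 11)) {
            return pack($f, 192 | ($n >> 6), 128 | (63 & $n)); // 2 bytes
        }
        if ($n < (1 << 16)) {                                 // 3 bytes
            return pack($f, 224 | ($n >> 12), 128 | (63 & ($n >> 6)), 128 | (63 & $n));
        }
        if ($n < 0x110000) {                                  // 4 bytes
            return pack($f, 240 | ($n >> 18), 128 | (63 & ($n >> 12)), 128 | (63 & ($n >> 6)), 128 | (63 & $n));
        }
        return '';
    }
    
    // Note: this range also encodes the surrogates 0xD800-0xDFFF, which are not valid in UTF-8.
    echo implode('', array_map('chr_utf8', range(0, 65535)));
    
    // Output a big string, you can increase the range to 1114111…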
    