What is the best way to split a string into an array of Unicode characters in PHP?

野的像风 2020-12-05 15:23

In PHP, what is the best way to split a string into an array of Unicode characters? And what if the input is not necessarily UTF-8?

I want to know whether the set of Unicode

7 answers
  • 2020-12-05 15:37

    If for some reason the regex approach isn't enough for you: I once wrote Zend_Locale_UTF8, which is abandoned but might help if you decide to do it on your own.

    In particular, have a look at the class Zend_Locale_UTF8_PHP5_String, which reads in Unicode strings and, in order to work with them, splits them up into single characters (which may obviously consist of multiple bytes).

    EDIT: I just realized that ZF's svn-browser is down, so I copied the important methods for convenience:

    /**
     * Returns the UTF-8 code sequence as an array for any given $string.
     *
     * @access protected
     * @param string|integer $string
     * @return array
     */
    protected function _decode( $string ) {
    
        $string     = (string) $string;
        $length     = strlen($string);
        $sequence   = array();
    
        for ( $i=0; $i<$length; ) {
            $bytes      = $this->_characterBytes($string, $i);
            $ord        = $this->_ord($string, $bytes, $i);
    
            if ( $ord !== false )
                $sequence[] = $ord;
    
            if ( $bytes === false )
                $i++;
            else
                $i  += $bytes;
        }
    
        return $sequence;
    
    }
    
    /**
     * Returns the UTF-8 code of a character.
     *
     * @see http://en.wikipedia.org/wiki/UTF-8#Description
     * @access protected
     * @param string $string
     * @param integer $bytes
     * @param integer $pos
     * @return integer|false
     */
    protected function _ord( &$string, $bytes = null, $pos=0 )
    {
        if ( is_null($bytes) )
            $bytes = $this->_characterBytes($string, $pos);
    
        // The whole sequence starting at $pos must lie inside the string.
        if ( strlen($string) >= $pos + $bytes ) {
    
            switch ( $bytes ) {
                case 1:
                    return ord($string[$pos]);
                    break;
    
                case 2:
                    return  ( (ord($string[$pos])   & 0x1f) << 6 ) +
                            ( (ord($string[$pos+1]) & 0x3f) );
                    break;
    
                case 3:
                    return  ( (ord($string[$pos])   & 0xf)  << 12 ) + 
                            ( (ord($string[$pos+1]) & 0x3f) << 6 ) +
                            ( (ord($string[$pos+2]) & 0x3f) );
                    break;
    
                case 4:
                    return  ( (ord($string[$pos])   & 0x7)  << 18 ) +
                            ( (ord($string[$pos+1]) & 0x3f) << 12 ) +
                            ( (ord($string[$pos+2]) & 0x3f) << 6 ) +
                            ( (ord($string[$pos+3]) & 0x3f) );
                    break;
    
                case 0:
                default:
                    return false;
            }
        }
    
        return false;
    }
    /**
     * Returns the number of bytes of the $position-th character.
     *
     * @see http://en.wikipedia.org/wiki/UTF-8#Description
     * @access protected
     * @param string $string
     * @param integer $position
     */
    protected function _characterBytes( &$string, $position = 0 ) {
        $char       = $string[$position];
        $charVal    = ord($char);
    
        if ( ($charVal & 0x80) === 0 )
            return 1;
    
        elseif ( ($charVal & 0xe0) === 0xc0 )
            return 2;
    
        elseif ( ($charVal & 0xf0) === 0xe0 )
            return 3;
    
        elseif ( ($charVal & 0xf8) === 0xf0)
            return 4;
        /*
        elseif ( ($charVal & 0xfe) === 0xf8 )
            return 5;
        */
    
        return false;
    }
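
    For readers who only need the characters themselves rather than their code points, the lead-byte test from _characterBytes() can be reused on its own. The following is a minimal sketch, not part of Zend_Locale_UTF8: the function name utf8_chars() is hypothetical, and it assumes mostly well-formed UTF-8 (it skips bytes that cannot start a sequence, but it does not validate continuation bytes).

```php
<?php
// Hypothetical helper (not part of Zend_Locale_UTF8): splits a UTF-8 string
// into characters by looking only at the lead byte of each sequence.
function utf8_chars(string $string): array
{
    $chars  = [];
    $length = strlen($string);

    for ($i = 0; $i < $length; ) {
        $lead = ord($string[$i]);

        if (($lead & 0x80) === 0x00) {
            $bytes = 1;          // 0xxxxxxx: ASCII
        } elseif (($lead & 0xe0) === 0xc0) {
            $bytes = 2;          // 110xxxxx
        } elseif (($lead & 0xf0) === 0xe0) {
            $bytes = 3;          // 1110xxxx
        } elseif (($lead & 0xf8) === 0xf0) {
            $bytes = 4;          // 11110xxx
        } else {
            $i++;                // stray continuation byte or invalid lead byte: skip it
            continue;
        }

        // substr() returns fewer bytes if the string ends mid-sequence.
        $chars[] = substr($string, $i, $bytes);
        $i += $bytes;
    }

    return $chars;
}

$chars = utf8_chars("abc 文字");   // 6 elements: 'a', 'b', 'c', ' ', '文', '字'
```

    Because continuation bytes are not checked, this is best treated as a fallback for environments without PCRE or mbstring rather than as a validator.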
    
  • 2020-12-05 15:38

    Slightly simpler than preg_match_all:

    preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY)
    

    This gives you back a 1-dimensional array of characters. No need for a matches object.
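
    The question also asks about input that is not necessarily UTF-8. The u modifier only understands UTF-8, and preg_split() returns false when the subject is not valid UTF-8, so one option is to convert first and check the result. This is a sketch under assumptions: the wrapper name split_chars() and its $encoding parameter are hypothetical, and the caller is assumed to know the real source encoding (mbstring must be available for the conversion).

```php
<?php
// Hypothetical wrapper: $encoding is whatever the caller knows about the input.
function split_chars(string $str, string $encoding = 'UTF-8'): array
{
    if ($encoding !== 'UTF-8') {
        // Normalize to UTF-8 so the /u modifier can be used.
        $str = mb_convert_encoding($str, 'UTF-8', $encoding);
    }

    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

    // With /u, preg_split() returns false if the subject is malformed UTF-8.
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return $chars;
}

$latin1 = "caf\xE9";                          // "café" encoded as ISO-8859-1
$chars  = split_chars($latin1, 'ISO-8859-1'); // four characters, returned as UTF-8
```

    The returned characters are always UTF-8; convert them back with mb_convert_encoding() if the original encoding is needed.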

  • 2020-12-05 15:41

    You could use the 'u' modifier with a PCRE regex; see Pattern Modifiers (quoting):

    u (PCRE_UTF8)

    This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.

    For instance, consider this code:

    header('Content-type: text/html; charset=UTF-8');  // So the browser doesn't make our lives harder
    $str = "abc 文字化け, efg";
    
    $results = array();
    preg_match_all('/./', $str, $results);
    var_dump($results[0]);
    

    You'll get an unusable result:

    array
      0 => string 'a' (length=1)
      1 => string 'b' (length=1)
      2 => string 'c' (length=1)
      3 => string ' ' (length=1)
      4 => string '�' (length=1)
      5 => string '�' (length=1)
      6 => string '�' (length=1)
      7 => string '�' (length=1)
      8 => string '�' (length=1)
      9 => string '�' (length=1)
      10 => string '�' (length=1)
      11 => string '�' (length=1)
      12 => string '�' (length=1)
      13 => string '�' (length=1)
      14 => string '�' (length=1)
      15 => string '�' (length=1)
      16 => string ',' (length=1)
      17 => string ' ' (length=1)
      18 => string 'e' (length=1)
      19 => string 'f' (length=1)
      20 => string 'g' (length=1)
    

    But with this code:

    header('Content-type: text/html; charset=UTF-8');  // So the browser doesn't make our lives harder
    $str = "abc 文字化け, efg";
    
    $results = array();
    preg_match_all('/./u', $str, $results);
    var_dump($results[0]);
    

    (Notice the 'u' at the end of the regex.)

    You get what you want:

    array
      0 => string 'a' (length=1)
      1 => string 'b' (length=1)
      2 => string 'c' (length=1)
      3 => string ' ' (length=1)
      4 => string '文' (length=3)
      5 => string '字' (length=3)
      6 => string '化' (length=3)
      7 => string 'け' (length=3)
      8 => string ',' (length=1)
      9 => string ' ' (length=1)
      10 => string 'e' (length=1)
      11 => string 'f' (length=1)
      12 => string 'g' (length=1)
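
    Two caveats may be worth checking before relying on '/./u'. Without the s modifier the dot does not match a newline, so line breaks are silently dropped, and a single code point is not always what a reader perceives as one character. The sketch below assumes PHP 7+ (for the \u{...} escape); the variable names are only illustrative.

```php
<?php
// "a", a newline, then "e" followed by U+0301 COMBINING ACUTE ACCENT.
$str = "a\ne\u{301}";

// '.' without the s modifier skips the newline.
preg_match_all('/./u', $str, $m);
$withoutS = $m[0];      // 3 elements: the newline is missing

// Adding s makes '.' match every code point, newline included.
preg_match_all('/./su', $str, $m);
$codePoints = $m[0];    // 4 elements

// \X matches a grapheme cluster, keeping the accent with its base letter.
preg_match_all('/\X/u', $str, $m);
$graphemes = $m[0];     // 3 elements: "a", newline, "e" + accent
```

    Which of the three is "correct" depends on whether the caller wants code points or user-perceived characters.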
    

    Hope this helps :-)

  • 2020-12-05 15:41

    I was able to write a solution using mb_*, including a trip to UTF-16 and back in a probably silly attempt to speed up string indexing:

    $japanese2 = mb_convert_encoding($japanese, "UTF-16", "UTF-8");
    $length = mb_strlen($japanese2, "UTF-16");
    for($i=0; $i<$length; $i++) {
        $char = mb_substr($japanese2, $i, 1, "UTF-16");
        $utf8 = mb_convert_encoding($char, "UTF-8", "UTF-16");
        print $utf8 . "\n";
    }
    

    I had better luck avoiding mb_internal_encoding and just specifying everything at each mb_* call. I'm sure I'll wind up using the preg solution.
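
    On PHP 7.4 or later, mbstring also provides mb_str_split(), which takes the encoding as an argument and so avoids both the loop and the UTF-16 round trip. A minimal sketch, assuming the mbstring extension is loaded and that the caller knows the input encoding (Shift_JIS is used here purely as an example of non-UTF-8 input):

```php
<?php
// Requires PHP 7.4+ with the mbstring extension.
$utf8  = "文字化け";
$chars = mb_str_split($utf8, 1, 'UTF-8');      // one element per character

// Non-UTF-8 input: split in its own encoding, then convert each piece.
$sjis      = mb_convert_encoding($utf8, 'SJIS', 'UTF-8');
$sjisChars = mb_str_split($sjis, 1, 'SJIS');   // each element is 2 bytes of Shift_JIS
$asUtf8    = array_map(
    fn($c) => mb_convert_encoding($c, 'UTF-8', 'SJIS'),
    $sjisChars
);
```

    As in the answer above, the encoding is passed explicitly on every call instead of relying on mb_internal_encoding().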

  • 2020-12-05 15:45
    function str_split_unicode($str, $l = 0) {
        if ($l > 0) {
            $ret = array();
            $len = mb_strlen($str, "UTF-8");
            for ($i = 0; $i < $len; $i += $l) {
                $ret[] = mb_substr($str, $i, $l, "UTF-8");
            }
            return $ret;
        }
        return preg_split("//u", $str, -1, PREG_SPLIT_NO_EMPTY);
    }
    var_dump(str_split_unicode("لأآأئؤة"));
    

    Output:

    array (size=7)
      0 => string 'ل' (length=2)
      1 => string 'أ' (length=2)
      2 => string 'آ' (length=2)
      3 => string 'أ' (length=2)
      4 => string 'ئ' (length=2)
      5 => string 'ؤ' (length=2)
      6 => string 'ة' (length=2)
    

    For more information: http://php.net/manual/en/function.str-split.php
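
    The second parameter is what distinguishes this function from a plain preg_split() call: a positive $l returns chunks of $l characters instead of single characters. The sketch below repeats the function from this answer so that it runs on its own; the sample string is only an illustration.

```php
<?php
// Copied from the answer above so the example is self-contained.
function str_split_unicode($str, $l = 0) {
    if ($l > 0) {
        $ret = array();
        $len = mb_strlen($str, "UTF-8");
        for ($i = 0; $i < $len; $i += $l) {
            $ret[] = mb_substr($str, $i, $l, "UTF-8");
        }
        return $ret;
    }
    return preg_split("//u", $str, -1, PREG_SPLIT_NO_EMPTY);
}

// Chunks of two characters; the last chunk may be shorter.
$pairs = str_split_unicode("文字化けabc", 2);

// Default: one element per character.
$single = str_split_unicode("文字");
```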

  • 2020-12-05 15:46

    The best way to split by length: I just adapted Laravel's str_limit() function:

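    public static function split_text($text, $limit = 100, $end = '')
    {
        // Text that already fits is returned unchanged (as a string, not an array).
        if (mb_strwidth($text, 'UTF-8') <= $limit) {
            return $text;
        }
        $res = [];
        $length = mb_strlen($text, 'UTF-8');
        // mb_strimwidth() takes its start offset in characters, not in width
        // units, so advance by the number of characters actually consumed.
        for ($i = 0; $i < $length; $i += max(1, mb_strlen($piece, 'UTF-8'))) {
            $piece = mb_strimwidth($text, $i, $limit, '', 'UTF-8');
            $res[] = rtrim($piece) . $end;
        }
        return $res;
    }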
    