explode() on Japanese string

情话喂你 2021-01-21 04:41

I have to use the explode() function on Japanese text, but it doesn't work.

Here is an example of what I have

$string = '私　は　イタリア　人　です';
$string = explode(' ', $string);
5 Answers
  • 2021-01-21 05:12

    That is for the simple reason that you do not have a regular space character here. You have an "IDEOGRAPHIC SPACE" character (U+3000), which is encoded in UTF-8 as the bytes "e3 80 80".

    If you use that as your delimiter, it will work.
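
    A minimal sketch of this fix, assuming PHP 7 or later (for the "\u{3000}" escape) and a UTF-8 encoded string. The variable names are illustrative only and do not come from the question.

    ```php
    <?php
    // U+3000 IDEOGRAPHIC SPACE; in UTF-8 this is the byte sequence e3 80 80.
    $delimiter = "\u{3000}";

    // Build the sample sentence with ideographic spaces between the words.
    $string = "私{$delimiter}は{$delimiter}イタリア{$delimiter}人{$delimiter}です";

    // explode() compares raw bytes, so a multi-byte delimiter works as long
    // as it is the exact byte sequence present in the string.
    $words = explode($delimiter, $string);

    print_r($words);
    ```

    Writing the delimiter as an escape sequence rather than pasting the character keeps the two kinds of space visibly distinct in the source file.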

  • 2021-01-21 05:16

    There are a number of characters other than simple ASCII space that can add whitespace between characters.

    You could try preg_split() with \s (whitespace characters) or \b (word boundaries) as the pattern. However, this may not be ideal, as Japanese text is almost certainly stored in a multi-byte encoding, so the pattern would need the /u modifier to be matched as UTF-8.
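
    One way the preg_split() idea might look, assuming UTF-8 input and PHP 7 or later. U+3000 is listed explicitly in the character class so the sketch does not rely on how \s is defined in a given PCRE build; that choice is an assumption of this example, not something the answer prescribes.

    ```php
    <?php
    $string = "私\u{3000}は\u{3000}イタリア\u{3000}人\u{3000}です";

    // The /u modifier makes PCRE treat both pattern and subject as UTF-8.
    // [\s\x{3000}]+ matches runs of ASCII whitespace and ideographic spaces;
    // PREG_SPLIT_NO_EMPTY drops empty pieces from leading or trailing spaces.
    $words = preg_split('/[\s\x{3000}]+/u', $string, -1, PREG_SPLIT_NO_EMPTY);

    print_r($words);
    ```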

  • 2021-01-21 05:21

    You're using the wrong space. The text uses full-width spaces (U+3000 IDEOGRAPHIC SPACE) and you're supplying a half-width space (U+0020 SPACE).
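
    If you would rather keep the half-width delimiter, one possible approach is to normalize the full-width spaces first. This sketch assumes the mbstring extension is loaded and the text is UTF-8; it is an alternative for illustration, not what the question's code does.

    ```php
    <?php
    $string = "私\u{3000}は\u{3000}イタリア\u{3000}人\u{3000}です";

    // Mode 's' converts full-width (zen-kaku) spaces, U+3000, to
    // half-width (han-kaku) spaces, U+0020. Other characters are untouched.
    $normalized = mb_convert_kana($string, 's', 'UTF-8');

    // Now a plain ASCII space works as the delimiter.
    $words = explode(' ', $normalized);

    print_r($words);
    ```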

  • 2021-01-21 05:21

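    There are two issues here.

    First of all, you don't say what your encoding is, but I suppose all Japanese encodings are multi-byte. On the other hand, the explode() function (like most regular PHP string functions) works on bytes rather than characters. That is safe in UTF-8 as long as the delimiter is the exact byte sequence found in the string, but in encodings such as Shift_JIS a byte-wise match can land in the middle of a character. There's no exact multi-byte equivalent, but mb_split() could do the trick.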

    Secondly, you are exploding by a regular space (U+0020), but your string contains a different character (U+3000).

    To sum up (and assuming you are using UTF-8):

    <?php
    
    mb_internal_encoding('UTF-8');
    mb_regex_encoding('UTF-8');
    
    $string = '私　は　イタリア　人　です';
    // The delimiter below is U+3000 IDEOGRAPHIC SPACE, not U+0020.
    print_r(mb_split('　', $string));
    

    ... or even better:

    <?php
    
    mb_internal_encoding('UTF-8');
    mb_regex_encoding('UTF-8');
    
    $string = '私　は　イタリア　人　です';
    print_r(mb_split('[[:space:]]', $string));
    
  • 2021-01-21 05:30

    Convert your string to UTF-8 first using iconv(), then pass the result to explode():

    $string = explode(" ", iconv('', 'utf-8', $string));
    