I'm trying to split a UTF-8 encoded string into an array of chars. The function that I now use used to work, but for some reason it doesn't work anymore.
There is a multibyte split function in PHP, mb_split.
I found out the é was not the character I expected. Apparently there is a difference between né and ńe. I got it working by normalizing the string first.
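For anyone hitting the same thing, here is a minimal sketch of what "normalizing first" could look like. It assumes the intl extension is loaded (that is where the Normalizer class comes from); the helper name toNfc is made up for this example and is not part of the original code.

```php
<?php
// Hypothetical helper: compose the string to NFC when intl is available,
// otherwise hand it back unchanged.
function toNfc($str)
{
    if (!class_exists('Normalizer')) {
        return $str;
    }
    $normalized = Normalizer::normalize($str, Normalizer::FORM_C);
    return $normalized === false ? $str : $normalized;
}

// "n", "e", U+0301 COMBINING ACUTE ACCENT: looks like "né" but is three code points.
$decomposed = "ne\xCC\x81";

// Splitting the raw string gives three elements: "n", "e" and the bare accent.
$rawChars = preg_split('//u', $decomposed, -1, PREG_SPLIT_NO_EMPTY);

// With intl loaded, normalizing first gives two: "n" and the precomposed "é".
$chars = preg_split('//u', toNfc($decomposed), -1, PREG_SPLIT_NO_EMPTY);
```

Whether NFC is the right form depends on where the text comes from; it is only the form that merges a base letter and its accent into one code point where such a code point exists.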
This is the best solution!:
I've found this nice solution in the PHP manual pages.
preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
It is fast:
In PHP 5.6.18 it split a 6 MB text file in a matter of seconds.
Best of all, it doesn't need multibyte (mb_) support!
Similar answer also here.
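A quick way to see what the empty pattern with the u modifier does, as a small sketch. The sample string is my own and is written as explicit bytes so the result does not depend on how the file is saved; -1 is the documented "no limit" value for the third argument.

```php
<?php
// "één" as explicit UTF-8 bytes: two 2-byte characters plus one ASCII letter.
$str = "\xC3\xA9\xC3\xA9n";

// The empty pattern matches between every pair of characters; the u modifier
// makes PCRE treat the subject as UTF-8 instead of single bytes.
$chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

$byteCount = strlen($str);   // counts bytes
$charCount = count($chars);  // counts characters
```

Here $byteCount is 5 while $charCount is 3, which is exactly the difference a plain str_split would get wrong.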
For the mb_... functions you should specify the charset encoding.
In your example code this applies to the following two lines in particular:
$strLen = mb_strlen($str, 'UTF-8');
$arr[] = mb_substr($str, $i, $len, 'UTF-8');
The full picture:
function utf8Split($str, $len = 1)
{
    $arr = array();
    $strLen = mb_strlen($str, 'UTF-8');
    // Advance by $len so that chunks do not overlap when $len > 1.
    for ($i = 0; $i < $strLen; $i += $len)
    {
        $arr[] = mb_substr($str, $i, $len, 'UTF-8');
    }
    return $arr;
}
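As a side note, PHP 7.4 and later ship mb_str_split, which does this kind of splitting natively. Below is a rough sketch that prefers it and falls back to an mb_substr loop on older versions. It assumes the mbstring extension is loaded and PHP 7.0 or later for the type declarations; the name utf8Chunks is hypothetical.

```php
<?php
// Hypothetical helper name; assumes the mbstring extension is available.
function utf8Chunks(string $str, int $len = 1): array
{
    if (function_exists('mb_str_split')) {
        // Built in since PHP 7.4.
        return mb_str_split($str, $len, 'UTF-8');
    }

    // Fallback for older versions: step by $len so chunks do not overlap.
    $chunks = [];
    $total = mb_strlen($str, 'UTF-8');
    for ($i = 0; $i < $total; $i += $len) {
        $chunks[] = mb_substr($str, $i, $len, 'UTF-8');
    }
    return $chunks;
}

// "één vraag" written with explicit UTF-8 bytes for the accented letters.
$pairs = utf8Chunks("\xC3\xA9\xC3\xA9n vraag", 2);
```

With a chunk length of 2 the nine characters come back as five pieces, the last one holding the single leftover character.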
This works because you're using UTF-8 here. However, if the input is not properly encoded, it won't work "any longer", simply because it was not designed for anything else.
You can alternatively process UTF-8 encoded strings with PCRE regular expressions; for example, this returns what you're looking for in less code:
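If you suspect the input is the problem, one cheap check is to let PCRE validate it. This is only a sketch: the function name isValidUtf8 is made up here, and mb_check_encoding($str, 'UTF-8') would be the mbstring way to do the same thing.

```php
<?php
// Hypothetical helper: true when $str is well-formed UTF-8.
function isValidUtf8($str)
{
    // With the u modifier, preg_match() returns false for a malformed subject
    // and 1 for the (always successful) empty match on a valid one.
    return preg_match('//u', $str) === 1;
}

$utf8   = "\xC3\xA9";  // "é" encoded as UTF-8
$latin1 = "\xE9";      // "é" encoded as ISO-8859-1, not valid UTF-8

$utf8Ok   = isValidUtf8($utf8);
$latin1Ok = isValidUtf8($latin1);
```

A string that fails this check will also make the preg_split calls with the u modifier return false instead of an array, so it is worth testing before splitting.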
$str = 'Zelf heb ik maar één vraag: wie ben jij?';
$chars = preg_split('/(?!^)(?=.)/u', $str);
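One thing to watch with this pattern: by default the dot does not match a newline, so no split happens in front of a line break and the newline stays attached to the character before it. A small sketch with a made-up sample string shows the difference the s modifier makes:

```php
<?php
$text = "a\nb";

// Without s: the lookahead (?=.) fails in front of "\n", so "a" and the
// newline end up in the same element.
$withoutS = preg_split('/(?!^)(?=.)/u', $text);

// With s: the dot also matches "\n", so every character gets its own element.
$withS = preg_split('/(?!^)(?=.)/us', $text);
```

For single-line input like the example sentence both variants behave the same, so this only matters once the string contains line breaks.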
Next to preg_split there is also mb_split.
mb_internal_encoding("UTF-8");
46 arrays - off 41 arrays
If you are not sure whether the mbstring function library is available, then use:
Version 1:
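function utf8_str_split($str='',$len=1){
    // The s modifier lets the dot match newlines too; without it they are dropped.
    preg_match_all("/./us", $str, $arr);
    // Group the single characters into chunks of $len and glue each chunk together.
    $arr = array_chunk($arr[0], $len);
    $arr = array_map('implode', $arr);
    return $arr;
}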
Version 2:
function utf8_str_split($str='',$len=1){
    return preg_split('/(?<=\G.{'.$len.'})/u', $str,-1,PREG_SPLIT_NO_EMPTY);
}
Both functions were tested in PHP 5.
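To illustrate the match-then-chunk idea from Version 1 on multi-line input, here is a self-contained sketch. The function name chunk_utf8 and the sample string are my own, and it uses the s modifier so that newlines are kept as characters.

```php
<?php
// Hypothetical name; same approach as Version 1 (match, chunk, glue).
function chunk_utf8($str = '', $len = 1)
{
    // Collect every character, newlines included, as one array element.
    preg_match_all('/./us', $str, $matches);
    // Group the characters into chunks of $len and join each chunk.
    return array_map('implode', array_chunk($matches[0], $len));
}

// "aébc", a newline, then "d"; the é is written as explicit UTF-8 bytes.
$chunks = chunk_utf8("a\xC3\xA9bc\nd", 2);
```

The six characters come back as three chunks of two, with the newline kept inside the last chunk rather than silently disappearing.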