Question
preg_split has an optional PREG_SPLIT_DELIM_CAPTURE flag, which also returns all delimiters in the returned array. mb_split does not.
Is there any way to split a multibyte string (not just UTF-8, but all kinds) and capture the delimiters?
I'm trying to make a multibyte-safe linebreak splitter that keeps the linebreaks, but I would prefer a more generically usable solution.
Solution
Thanks to user Casimir et Hippolyte, I built a solution and posted it on GitHub (https://github.com/vanderlee/PHP-multibyte-functions/blob/master/functions/mb_explode.php). It supports all the preg_split flags:
/**
 * A cross between mb_split and preg_split, adding the preg_split flags
 * to mb_split.
 * @param string $pattern
 * @param string $string
 * @param int $limit
 * @param int $flags
 * @return array
 */
function mb_explode($pattern, $string, $limit = -1, $flags = 0) {
    $strlen = strlen($string); // bytes!
    mb_ereg_search_init($string);
    $lengths = array();
    $position = 0;
    while (($array = mb_ereg_search_pos($pattern)) !== false) {
        // capture split
        $lengths[] = array($array[0] - $position, false, null);
        // move position
        $position = $array[0] + $array[1];
        // capture delimiter
        $regs = mb_ereg_search_getregs();
        $lengths[] = array($array[1], true, isset($regs[1]) && $regs[1]);
        // Continue on?
        if ($position >= $strlen) {
            break;
        }
    }
    // Add last bit, if not ending with split
    $lengths[] = array($strlen - $position, false, null);
    // Substrings
    $parts = array();
    $position = 0;
    $count = 1;
    foreach ($lengths as $length) {
        $is_delimiter = $length[1];
        $is_captured = $length[2];
        if ($limit > 0 && !$is_delimiter && ($length[0] || ~$flags & PREG_SPLIT_NO_EMPTY) && ++$count > $limit) {
            if ($length[0] > 0 || ~$flags & PREG_SPLIT_NO_EMPTY) {
                $parts[] = $flags & PREG_SPLIT_OFFSET_CAPTURE
                    ? array(mb_strcut($string, $position), $position)
                    : mb_strcut($string, $position);
            }
            break;
        } elseif ((!$is_delimiter || ($flags & PREG_SPLIT_DELIM_CAPTURE && $is_captured))
                && ($length[0] || ~$flags & PREG_SPLIT_NO_EMPTY)) {
            $parts[] = $flags & PREG_SPLIT_OFFSET_CAPTURE
                ? array(mb_strcut($string, $position, $length[0]), $position)
                : mb_strcut($string, $position, $length[0]);
        }
        $position += $length[0];
    }
    return $parts;
}
Answer 1:
Capturing delimiters is only possible with preg_split; no other split function offers it.
So there are three possibilities:
1) Convert your string to UTF-8, use preg_split with PREG_SPLIT_DELIM_CAPTURE, and use array_map to convert each item back to the original encoding.
This is the simplest approach; the second one is more involved. (Note that, in general, it is simpler to always work in UTF-8 than to deal with exotic encodings.)
2) Instead of a split-like function, use for example mb_ereg_search_regs to get the matched parts, and build the pattern like this:
delimiter|all_that_is_not_the_delimiter
(Note that the two branches of the alternation must be mutually exclusive, and they must be written so that no gaps are possible between results: the first part must start at the beginning of the string, the last part must end at the end of the string, and each part must be contiguous with the previous one.)
3) Use mb_split with lookarounds. By definition, lookarounds are zero-width assertions: they don't match any characters, only positions in the string. So you can use this kind of pattern, which matches the positions before or after the delimiter:
(?=delimiter)|(?<=delimiter)
(The limitation of this approach is that the subpattern in the lookbehind can't have a variable length; in other words, you can't use a quantifier inside it. It can, however, be an alternation of fixed-length subpatterns: (?<=subpat1|subpat2|subpat3))
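As a rough illustration of options 1 and 2 applied to the linebreak case from the question, here is a minimal sketch. The function names split_lines_via_utf8 and split_lines_via_mb_ereg are hypothetical (they are not part of PHP or of the linked repository), Shift_JIS is chosen only as an example encoding, and the code assumes the mbstring extension is built with mbregex support.

```php
<?php
// Sketch only: both function names below are hypothetical helpers.

// Option 1: round-trip through UTF-8 and let preg_split capture the delimiters.
function split_lines_via_utf8(string $string, string $encoding): array
{
    $utf8 = mb_convert_encoding($string, 'UTF-8', $encoding);
    // The capture group is what PREG_SPLIT_DELIM_CAPTURE returns as a delimiter.
    $parts = preg_split('/(\r\n|\r|\n)/u', $utf8, -1, PREG_SPLIT_DELIM_CAPTURE);
    // Convert every piece back to the original encoding.
    return array_map(
        function ($part) use ($encoding) {
            return mb_convert_encoding($part, $encoding, 'UTF-8');
        },
        $parts
    );
}

// Option 2: repeatedly match "delimiter | run of non-delimiter characters".
function split_lines_via_mb_ereg(string $string, string $encoding): array
{
    $previous = mb_regex_encoding();  // remember the current regex encoding
    mb_regex_encoding($encoding);     // mbregex must know the string's encoding

    $parts = array();
    // The branches are mutually exclusive and together cover every character,
    // so consecutive matches are contiguous and leave no gaps.
    mb_ereg_search_init($string, '\r\n|\r|\n|[^\r\n]+');
    while (($regs = mb_ereg_search_regs()) !== false) {
        $parts[] = $regs[0];
    }

    mb_regex_encoding($previous);     // restore the previous setting
    return $parts;
}

// Example input, converted to Shift_JIS for illustration.
$sjis = mb_convert_encoding("日本語\r\nテキスト\n終わり", 'SJIS', 'UTF-8');
$a = split_lines_via_utf8($sjis, 'SJIS');
$b = split_lines_via_mb_ereg($sjis, 'SJIS');
```

For this input both functions should return the same pieces, and joining them with implode('', ...) gives back the original string. They differ on adjacent linebreaks: preg_split emits an empty string between two consecutive delimiters unless PREG_SPLIT_NO_EMPTY is also set, whereas the matching loop never produces empty pieces.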
Source: https://stackoverflow.com/questions/30605173/php-mb-split-capturing-delimiters