preg_split in Unicode mode: PREG_SPLIT_DELIM_CAPTURE not working?


情深已故

I'm trying to use a regex to split a chunk of Chinese text into sentences. For my purposes, sentence delimiters are:

the fullwidth full stop 。(0x3002)
t


2 Answers

隐瞒了意图╮

2021-01-21 08:32

To capture each sentence together with its trailing delimiter, your regex code should look like this:

$str = "你好。你好吗？ 我是程序员，不太懂这个我问题，希望大家能够帮忙！一起加油吧！";
$arr = preg_split("/\s*([^\x{3002}\x{FF01}\x{FF1F}]+[\x{3002}\x{FF01}\x{FF1F}]\s*)/u",
                  $str, 0, PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY );
var_dump($arr);

OUTPUT:

 array(4) {
  [0]=> string(9)  "你好。"
  [1]=> string(13) "你好吗？ "
  [2]=> string(72) "我是程序员，不太懂这个我问题，希望大家能够帮忙！"
  [3]=> string(18) "一起加油吧！"
}
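
Note that the second element above still carries its trailing space, because the `\s*` sits inside the capture group. If that is unwanted, one option is to wrap the same pattern in a small helper that trims each piece. This is only a sketch: `split_sentences` is a hypothetical name, not something from the question, and the pattern is written in single quotes so PHP passes the `\x{...}` escapes to PCRE untouched.

```php
<?php
// Hypothetical helper (assumption, not part of the original answer):
// splits on 。(U+3002), ！(U+FF01) and ？(U+FF1F), keeping each delimiter
// attached to its sentence, then trims surrounding ASCII whitespace.
function split_sentences(string $text): array
{
    $pattern = '/\s*([^\x{3002}\x{FF01}\x{FF1F}]+[\x{3002}\x{FF01}\x{FF1F}]\s*)/u';

    // -1 means "no limit"; the flags go in the fourth argument.
    $parts = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    if ($parts === false) {
        return []; // preg_split() returns false on a regex error
    }

    return array_map('trim', $parts);
}

$sentences = split_sentences("你好。你好吗？ 我是程序员，不太懂这个我问题，希望大家能够帮忙！一起加油吧！");
print_r($sentences);
```

Text after the last delimiter (a final sentence with no terminator) is not matched by the pattern, so it should come back as a plain unsplit piece rather than being dropped.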


心在旅途

2021-01-21 08:44
You're missing the $limit parameter to preg_split().

array preg_split ( string $pattern , string $subject [, int $limit = -1 [, int $flags = 0 ]] )

As a result, you're passing PREG_SPLIT_DELIM_CAPTURE (2) | PREG_SPLIT_NO_EMPTY (1) = 3 as the $limit, while $flags stays at its default of 0. That's why the result stops at three elements, and why neither flag takes effect.

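Pass an explicit "no limit" value as the $limit parameter so the flags land in the fourth position, and you're in good shape. Both -1 and 0 mean "no limit". Older code often passes null here, but PHP 8.1 and later report that as deprecated because $limit is a non-nullable int.
```
preg_split($pattern, $str, -1, PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY)
```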
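
To see the difference side by side, here is a minimal sketch comparing the two calls. It reuses the sample string and pattern from the other answer in this thread as an assumption, since the question's own code is not shown here. The variable names `$wrong` and `$right` are only for illustration.

```php
<?php
$str     = "你好。你好吗？ 我是程序员，不太懂这个我问题，希望大家能够帮忙！一起加油吧！";
$pattern = '/\s*([^\x{3002}\x{FF01}\x{FF1F}]+[\x{3002}\x{FF01}\x{FF1F}]\s*)/u';

// PREG_SPLIT_NO_EMPTY is 1 and PREG_SPLIT_DELIM_CAPTURE is 2, so this is 3.
$flags = PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY;

// Mistake: the flags sit in the third position, so PHP reads them as
// $limit = 3 and leaves $flags at 0 (no delimiter capture, empties kept).
$wrong = preg_split($pattern, $str, $flags);

// Fix: -1 means "no limit", and the flags move to the fourth position.
$right = preg_split($pattern, $str, -1, $flags);

var_dump(count($wrong), count($right));
```

With the flags in the wrong slot the result is capped at three pieces, and because the delimiters are not captured those pieces are mostly what lies between matches rather than the sentences themselves.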