preg_split in unicode mode: delim_capture not working?

情深已故 2021-01-21 08:14

I'm trying to use a regex to split a chunk of Chinese text into sentences. For my purposes, sentence delimiters are:

  • the fullwidth full stop 。(0x3002)
  • t
2 answers
  •  隐瞒了意图╮
    2021-01-21 08:32

    To get each sentence back together with its delimiter, put the whole sentence (text plus delimiter) inside the capture group, so that PREG_SPLIT_DELIM_CAPTURE returns it:

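    // Sentence delimiters: U+3002 (。), U+FF01 (!), U+FF1F (?).
    // The capture group holds one whole sentence plus its delimiter; the text
    // between matches is empty and is dropped by PREG_SPLIT_NO_EMPTY.
    $str = "你好。你好吗? 我是程序员,不太懂这个我问题,希望大家能够帮忙!一起加油吧!";
    $arr = preg_split("/\s*([^\x{3002}\x{FF01}\x{FF1F}]+[\x{3002}\x{FF01}\x{FF1F}]\s*)/u",
                      $str, 0, PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY );
    var_dump($arr);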
    

    OUTPUT:

     array(4) {
      [0]=> string(9)  "你好。"
      [1]=> string(13) "你好吗? "
      [2]=> string(72) "我是程序员,不太懂这个我问题,希望大家能够帮忙!"
      [3]=> string(18) "一起加油吧!"
    }
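
An alternative to the capture-group approach is to split at the zero-width position just after a delimiter, using a lookbehind. This is only a sketch under a few assumptions: the helper name `split_sentences` is made up for illustration, and it uses the same three delimiters as the pattern above (U+3002, U+FF01, U+FF1F). Unlike the version above, it also discards whitespace that follows a delimiter.

```php
<?php
// Hypothetical helper: split text into sentences, keeping each delimiter
// attached to the sentence it ends.
function split_sentences(string $text): array
{
    // (?<=...) asserts that the previous character is a delimiter without
    // consuming it; \s* then consumes any whitespace before the next sentence.
    // The /u modifier is required for the \x{...} code points.
    $parts = preg_split(
        '/(?<=[\x{3002}\x{FF01}\x{FF1F}])\s*/u',
        $text,
        -1,
        PREG_SPLIT_NO_EMPTY
    );

    // preg_split() returns false on failure (for example, invalid UTF-8 input).
    return $parts === false ? [] : $parts;
}

print_r(split_sentences("你好。你好吗? 我是程序员!"));
```

Because the delimiter is never consumed, PREG_SPLIT_DELIM_CAPTURE is not needed here. Text after the last delimiter, if any, comes back as a final element without a delimiter.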
    
