How can I split a sentence into words and punctuation marks?

你说的曾经没有我的故事 submitted on 2019-12-22 08:17:08

Question


For example, I want to split this sentence:

I am a sentence.

Into an array with 5 parts: "I", "am", "a", "sentence", and ".".

I tried explode first and am now using preg_split, but neither gives me the result I want.

This is what I've tried:

$sentence = explode(" ", $sentence);
/*
returns array(4) {
  [0]=>
  string(1) "I"
  [1]=>
  string(2) "am"
  [2]=>
  string(1) "a"
  [3]=>
  string(9) "sentence."
}
*/

And also this:

$sentence = preg_split("/[.?!\s]/", $sentence);
/*
returns array(5) {
  [0]=>
  string(1) "I"
  [1]=>
  string(2) "am"
  [2]=>
  string(1) "a"
  [3]=>
  string(8) "sentence"
  [4]=>
  string(0) ""
}
*/

How can this be done?


Answer 1:


You can split on word boundaries:

$sentence = preg_split("/(?<=\w)\b\s*/", 'I am a sentence.');

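The regex matches at any position that is immediately preceded by a word character (the lookbehind (?<=\w)) and is also a word boundary (\b), and it consumes any whitespace that follows (\s*). Splitting there separates each word from whatever comes next, whether that is a space or a punctuation mark.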

Output:

array(5) {
  [0]=>
  string(1) "I"
  [1]=>
  string(2) "am"
  [2]=>
  string(1) "a"
  [3]=>
  string(8) "sentence"
  [4]=>
  string(1) "."
}
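As a rough illustration of where this pattern splits, the sketch below runs it on the original sentence and on a second, made-up input ("Hello, world!"). Because \s* only consumes whitespace that directly follows a word, a comma followed by a space stays attached to the next word. The tokenize() helper is hypothetical and not part of the answer: it uses preg_match_all with a token pattern, which is a different technique from splitting, and is shown only for comparison.

```php
<?php
// The pattern from the answer: split right after a word, swallowing any spaces.
$pattern = '/(?<=\w)\b\s*/';

$a = preg_split($pattern, 'I am a sentence.');
// $a is ["I", "am", "a", "sentence", "."]

$b = preg_split($pattern, 'Hello, world!');
// $b is ["Hello", ", world", "!"]: the comma and the space are not separated

// Hypothetical helper (an assumption, not from the answer): match tokens
// instead of splitting. A token is either a run of word characters or a
// single character that is neither a word character nor whitespace.
function tokenize(string $text): array
{
    preg_match_all('/\w+|[^\w\s]/', $text, $matches);
    return $matches[0];
}

$c = tokenize('Hello, world!');
// $c is ["Hello", ",", "world", "!"]
```

Without the u modifier both patterns work on bytes, so this sketch is only meant for ASCII input.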



Answer 2:


I was looking for the same solution and landed here. The accepted solution does not work with non-word characters such as apostrophes and accent marks: it splits words at them. Below is the solution that worked for me.

Here is my test sentence:

Claire’s favorite sonata for piano is Mozart’s Sonata no. 15 in C Major.

The accepted answer gave me the following results:

Array
(
    [0] => Claire
    [1] => ’s
    [2] => favorite
    [3] => sonata
    [4] => for
    [5] => piano
    [6] => is
    [7] => Mozart
    [8] => ’s
    [9] => Sonata
    [10] => no
    [11] => . 15
    [12] => in
    [13] => C
    [14] => Major
    [15] => .
)

The solution I came up with follows:

$parts = preg_split("/\s+|\b(?=[!\?\.])(?!\.\s+)/", $sentence);

It gives the following results:

Array
(
    [0] => Claire’s
    [1] => favorite
    [2] => sonata
    [3] => for
    [4] => piano
    [5] => is
    [6] => Mozart’s
    [7] => Sonata
    [8] => no.
    [9] => 15
    [10] => in
    [11] => C
    [12] => Major
    [13] => .
)
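The sketch below reruns this pattern on the test sentence and on a second, made-up input ("Stop. Go.") to show what the negative lookahead (?!\.\s+) does: a period followed by whitespace stays attached to its word, so only a period at the very end of the string becomes its own part. The variable names are illustrative, and the file is assumed to be saved as UTF-8.

```php
<?php
// Pattern from the answer, in single quotes so PHP leaves the backslashes alone.
// Split on whitespace, or at a word boundary directly before ! ? or .
// unless that character is a period followed by whitespace.
$pattern = '/\s+|\b(?=[!\?\.])(?!\.\s+)/';

$sentence = 'Claire’s favorite sonata for piano is Mozart’s Sonata no. 15 in C Major.';
$parts = preg_split($pattern, $sentence);
// 14 parts: "no." keeps its period, the final "." is split off

$two = preg_split($pattern, 'Stop. Go.');
// $two is ["Stop.", "Go", "."]: only the last period is separated
```

If every sentence-ending period should be split off, the pattern would need a different rule for telling abbreviations such as "no." apart from sentence ends.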



Answer 3:


If anyone is interested in a simple solution that ignores punctuation:

preg_split( '/[^a-zA-Z0-9]+/', 'I am a sentence' );

would split into

array(4) {
  [0]=>
  string(1) "I"
  [1]=>
  string(2) "am"
  [2]=>
  string(1) "a"
  [3]=>
  string(8) "sentence"
}
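Note that the example input above has no trailing period. As a small sketch of what happens when it does: the delimiter at the end of the string leaves an empty last element, which the PREG_SPLIT_NO_EMPTY flag of preg_split removes. The variable names here are illustrative.

```php
<?php
$pattern = '/[^a-zA-Z0-9]+/';

// The trailing "." is a delimiter, so an empty string follows it.
$withEmpty = preg_split($pattern, 'I am a sentence.');
// $withEmpty is ["I", "am", "a", "sentence", ""]

// -1 means "no limit"; PREG_SPLIT_NO_EMPTY drops empty pieces.
$clean = preg_split($pattern, 'I am a sentence.', -1, PREG_SPLIT_NO_EMPTY);
// $clean is ["I", "am", "a", "sentence"]
```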

Or an alternative solution where trailing punctuation stays attached to the adjacent word:

preg_split( '/\b[^a-zA-Z0-9]+\b/', 'I am a sentence.' );

would split into

array(4) {
  [0]=>
  string(1) "I"
  [1]=>
  string(2) "am"
  [2]=>
  string(1) "a"
  [3]=>
  string(9) "sentence."
}


Source: https://stackoverflow.com/questions/16137575/how-can-i-split-a-sentence-into-words-and-punctuation-marks
