Question
For example, I want to split this sentence:
I am a sentence.
Into an array with 5 parts: I, am, a, sentence, and .
I'm currently using preg_split after trying explode, but I can't seem to find something suitable.
This is what I've tried:
$sentence = explode(" ", $sentence);
/*
returns array(4) {
[0]=>
string(1) "I"
[1]=>
string(2) "am"
[2]=>
string(1) "a"
[3]=>
string(9) "sentence."
}
*/
And also this:
$sentence = preg_split("/[.?!\s]/", $sentence);
/*
returns array(5) {
[0]=>
string(1) "I"
[1]=>
string(2) "am"
[2]=>
string(1) "a"
[3]=>
string(8) "sentence"
[4]=>
string(0) ""
}
*/
How can this be done?
Answer 1:
You can split on word boundaries:
$sentence = preg_split("/(?<=\w)\b\s*/", 'I am a sentence.');
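The regex matches at any position that is preceded by a word character and sits on a word boundary, and it also consumes any whitespace that follows. Each match is a split point, so words and the punctuation after them end up as separate elements.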
Output:
array(5) {
[0]=>
string(1) "I"
[1]=>
string(2) "am"
[2]=>
string(1) "a"
[3]=>
string(8) "sentence"
[4]=>
string(1) "."
}
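One edge case worth guarding against: if the input ends in a word character rather than punctuation, the pattern also matches (with zero width) at the very end of the string, which can leave an empty trailing element. A minimal sketch, assuming you want such empty pieces dropped, passes the PREG_SPLIT_NO_EMPTY flag; the helper name splitWordsAndPunctuation is made up for illustration.

```php
<?php
// Hypothetical helper name; the pattern is the one from the answer above.
function splitWordsAndPunctuation(string $sentence): array
{
    // PREG_SPLIT_NO_EMPTY drops empty pieces, e.g. the one that can appear
    // when the last character of the input is a word character.
    return preg_split('/(?<=\w)\b\s*/', $sentence, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(splitWordsAndPunctuation('I am a sentence.'));
print_r(splitWordsAndPunctuation('Hello world'));
```

The third argument (-1) means "no limit", which is needed only so that the flag can be passed as the fourth argument.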
Answer 2:
I was looking for the same solution and landed here. The accepted solution does not work with non-word characters such as apostrophes and accent marks: it splits words like Claire’s in two. Below is the solution that worked for me.
Here is my test sentence:
Claire’s favorite sonata for piano is Mozart’s Sonata no. 15 in C Major.
The accepted answer gave me the following results:
Array
(
[0] => Claire
[1] => ’s
[2] => favorite
[3] => sonata
[4] => for
[5] => piano
[6] => is
[7] => Mozart
[8] => ’s
[9] => Sonata
[10] => no
[11] => . 15
[12] => in
[13] => C
[14] => Major
[15] => .
)
The solution I came up with follows:
$parts = preg_split("/\s+|\b(?=[!\?\.])(?!\.\s+)/", $sentence);
It gives the following results:
Array
(
[0] => Claire’s
[1] => favorite
[2] => sonata
[3] => for
[4] => piano
[5] => is
[6] => Mozart’s
[7] => Sonata
[8] => no.
[9] => 15
[10] => in
[11] => C
[12] => Major
[13] => .
)
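To make the pattern easier to read, here is a sketch of the same regex written with the x (extended) modifier, which lets whitespace and # comments sit inside the pattern. The ~ delimiter and the variable names are my own choices, not part of the answer above.

```php
<?php
// Same pattern as above, written with the x modifier so that whitespace
// and # comments inside the pattern are ignored by the regex engine.
$pattern = '~
      \s+          # a run of whitespace: split here and discard it
    |              # or
      \b           # a word boundary ...
      (?=[!?.])    # ... directly before !, ? or .
      (?!\.\s+)    # ... but not before a period that is followed by whitespace
~x';

$sentence = 'Claire’s favorite sonata for piano is Mozart’s Sonata no. 15 in C Major.';
print_r(preg_split($pattern, $sentence));
```

The last lookahead is what keeps "no." in one piece. It also means that any period followed by whitespace stays attached to its word, so 'Stop. Go!' comes back as "Stop.", "Go", "!".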
Answer 3:
If anyone is interested in a simple solution that ignores punctuation:
preg_split( '/[^a-zA-Z0-9]+/', 'I am a sentence' );
would split into
array(4) {
[0]=>
string(1) "I"
[1]=>
string(2) "am"
[2]=>
string(1) "a"
[3]=>
string(8) "sentence"
}
Or an alternative solution where the punctuation is kept attached to the adjacent word:
preg_split( '/\b[^a-zA-Z0-9]+\b/', 'I am a sentence.' );
would split into
array(4) {
[0]=>
string(1) "I"
[1]=>
string(2) "am"
[2]=>
string(1) "a"
[3]=>
string(9) "sentence."
}
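Note that the first example above uses an input without a final period. If the sentence does end in punctuation, the first pattern matches that punctuation too and leaves an empty string as the last element, as in the question's own preg_split attempt. A small sketch, assuming you simply want the empty piece dropped, adds the PREG_SPLIT_NO_EMPTY flag:

```php
<?php
// First pattern from this answer, applied to input that ends in a period.
// PREG_SPLIT_NO_EMPTY removes the empty element left after the final ".".
$words = preg_split('/[^a-zA-Z0-9]+/', 'I am a sentence.', -1, PREG_SPLIT_NO_EMPTY);
print_r($words);
```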
Source: https://stackoverflow.com/questions/16137575/how-can-i-split-a-sentence-into-words-and-punctuation-marks