I have random text stored in $sentences. Using regex, I want to split the text into sentences, see:
function splitSentences($text) {
$re = \
As should be expected, any sort of natural language processing is not a trivial task. The reason is that natural languages are evolutionary systems: no single person sat down and decided which ideas are good and which are not. Every rule has 20-40% exceptions. That said, the complexity of a single regex that could do your bidding would be off the charts. Still, the following solution relies mainly on regexes.
As for where these regexes came from: I translated this Ruby library, which is generated based on this paper. If you truly want to understand them, there is no alternative but to read the paper.
As far as accuracy goes, I encourage you to test it with different texts. After some experimentation, I was very pleasantly surprised.
In terms of performance, the regexes should be highly performant: each of them has either a \A or a \Z anchor, there are almost no repetition quantifiers, and where there are, no backtracking can occur. Still, regexes are regexes. You will have to do some benchmarking if you plan to use this in tight loops on huge chunks of text.
Mandatory disclaimer: excuse my rusty PHP skills. The following code might not be the most idiomatic PHP ever, but it should still be clear enough to get the point across.
function sentence_split($text) {
$before_regexes = array('/(?:(?:[\'\"„][\.!?…][\'\"”]\s)|(?:[^\.]\s[A-Z]\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\.\s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(?:\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp\.\s)|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe} ))\Z/su',
'/(?:(?:[\.\s]\p{L}{1,2}\.\s))\Z/su',
'/(?:(?:[\[\(]*\.\.\.[\]\)]* ))\Z/su',
'/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs)\.\s))\Z/su',
'/(?:(?:\b[Ee]tc\.\s))\Z/su',
'/(?:(?:[\.!?…]+\p{Pe} )|(?:[\[\(]*…[\]\)]* ))\Z/su',
'/(?:(?:\b\p{L}\.))\Z/su',
'/(?:(?:\b\p{L}\.\s))\Z/su',
'/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su',
'/(?:(?:[\"”\']\s*))\Z/su',
'/(?:(?:[\.!?…][\x{00BB}\x{2019}\x{201D}\x{203A}\"\'\p{Pe}\x{0002}]*\s)|(?:\r?\n))\Z/su',
'/(?:(?:[\.!?…][\'\"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su',
'/(?:(?:\s\p{L}[\.!?…]\s))\Z/su');
$after_regexes = array('/\A(?:)/su',
'/\A(?:[\p{N}\p{Ll}])/su',
'/\A(?:[^\p{Lu}])/su',
'/\A(?:[^\p{Lu}]|I)/su',
'/\A(?:[^\p{Lu}])/su',
'/\A(?:\p{Ll})/su',
'/\A(?:\p{L}\.)/su',
'/\A(?:\p{L}\.\s)/su',
'/\A(?:\p{N})/su',
'/\A(?:\s*\p{Ll})/su',
'/\A(?:)/su',
'/\A(?:\p{Lu}[^\p{Lu}])/su',
'/\A(?:\p{Lu}\p{Ll})/su');
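    // Parallel to the two arrays above: does rule $i mark a sentence boundary?
    $is_sentence_boundary = array(false, false, false, false, false, false, false, false, false, false, true, true, true);
    $count = 13;
    $sentences = array();
    $sentence = '';
    $before = '';
    // Look-ahead window of 10 characters. mb_substr() (mbstring extension) is used
    // instead of substr() so that multi-byte UTF-8 characters are never cut in half;
    // a /u regex fails on a subject that is not valid UTF-8.
    $after = mb_substr($text, 0, 10, 'UTF-8');
    $text = mb_substr($text, 10, null, 'UTF-8');
    while ($text !== '') {
        // The first rule whose "before" and "after" patterns both match decides.
        for ($i = 0; $i < $count; $i++) {
            if (preg_match($before_regexes[$i], $before) && preg_match($after_regexes[$i], $after)) {
                if ($is_sentence_boundary[$i]) {
                    array_push($sentences, $sentence);
                    $sentence = '';
                }
                break;
            }
        }
        // Slide the window one character to the right.
        $first_from_text = mb_substr($text, 0, 1, 'UTF-8');
        $text = mb_substr($text, 1, null, 'UTF-8');
        $first_from_after = mb_substr($after, 0, 1, 'UTF-8');
        $after = mb_substr($after, 1, null, 'UTF-8');
        $before .= $first_from_after;
        $sentence .= $first_from_after;
        $after .= $first_from_text;
    }
    // Whatever is left (including the unprocessed look-ahead window) is the last sentence.
    if ($sentence . $after !== '') {
        array_push($sentences, $sentence . $after);
    }
    return $sentences;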
}
$text = "Mr. Entertainment media properties. Fairy Tail 3.5 and Tokyo Ghoul.";
print_r(sentence_split($text));
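The core idea of the function above is a table of rule pairs: one regex anchored at the end of the text before the cursor, one anchored at the start of the text after it, and a flag saying whether a match means "sentence boundary". The first pair that matches decides. The sketch below isolates that idea with two made-up rules; the rules, `$rules` and `isBoundary` are hypothetical and are not taken from the regex lists above.

```php
<?php
// Hypothetical, heavily simplified rule table: the first matching pair decides.
$rules = [
    // [regex for the text before the cursor, regex for the text after it, is boundary?]
    ['/\b(?:Mr|Mrs|Dr)\.\s\z/u', '/\A/u', false],   // title abbreviation: never a boundary
    ['/[.!?]\s\z/u', '/\A\p{Lu}/u', true],          // punctuation, space, uppercase letter
];

function isBoundary(array $rules, string $before, string $after): bool
{
    foreach ($rules as [$beforeRe, $afterRe, $boundary]) {
        if (preg_match($beforeRe, $before) && preg_match($afterRe, $after)) {
            return $boundary;
        }
    }
    return false; // no rule matched: assume we are still inside a sentence
}

var_dump(isBoundary($rules, 'Hello, Mr. ', 'Smith is here.')); // bool(false)
var_dump(isBoundary($rules, 'He left. ', 'Then it rained.'));  // bool(true)
```

Because the exception rule comes first, "Mr. " never reaches the generic rule, which is why the order of the entries matters.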
Henrik Petterson, please read this completely, because I need to repeat a few things that were already said above.
As many people have mentioned above, adding the /u modifier makes the regex work on Unicode characters. That is true, and it works perfectly in the example below:
http://ideone.com/750lMn
<?php
function splitSentences($text) {
$re = '/# Split sentences on whitespace between them.
(?<= # Begin positive lookbehind.
[.!?] # Either an end of sentence punct,
| [.!?][\'"] # or end of sentence punct and quote.
) # End positive lookbehind.
(?<! # Begin negative lookbehind.
Mr\. # Skip either "Mr."
| Mrs\. # or "Mrs.",
| Ms\. # or "Ms.",
| Jr\. # or "Jr.",
| Dr\. # or "Dr.",
| Prof\. # or "Prof.",
| Vol\. # or "Vol.",
| A\.D\. # or "A.D.",
| B\.C\. # or "B.C.",
| Sr\. # or "Sr.",
| T\.V\.A\. # or "T.V.A.",
# or... (you get the idea).
) # End negative lookbehind.
\s+ # Split on whitespace between sentences.
/uix';
$sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
return $sentences;
}
$sentences = 'Entertainment media properties. Ã Fairy Tail and Tokyo Ghoul. Entertainment media properties. Â Fairy Tail and Tokyo Ghoul.';
$sentences = splitSentences($sentences);
print_r($sentences);
The examples you gave in the comments did not work because they have no whitespace characters between the two sentences, and your code specifically requires whitespace between sentences:
\s+ # Split on whitespace between sentences.
The example below, taken from your comments above, fails only because there is no space before Â:
http://ideone.com/m164fp
If spaces are unreliable, then you could match on a . followed by any number of spaces, followed by a capital letter.
You can match any uppercase Unicode letter using the Unicode character property \p{Lu}.
You only need to exclude abbreviations which tend to be followed by proper names (person names, company names, etc.), since those start with a capital letter.
function splitSentences($text) {
$re = '/ # Split sentences ending with a dot
.+? # Match everything before, until we find
(
$ | # the end of the string, or
\. # a dot
(?<! # Begin negative lookbehind.
Mr\. # Skip either "Mr."
| Mrs\. # or "Mrs.",
# or... (you get the idea).
) # End negative lookbehind.
"? # Optionally match a quote
\s* # Any number of whitespaces
(?= # Begin positive lookahead
\p{Lu} | # an upper case letter, or
" # a quote
)
)
/iux';
if (!preg_match_all($re, $text, $matches, PREG_PATTERN_ORDER)) {
return [];
}
$sentences = array_map('trim', $matches[0]);
return $sentences;
}
$text = "Mr. Entertainment media properties. Fairy Tail 3.5 and Tokyo Ghoul.";
$sentences = splitSentences($text);
print_r($sentences);
Note: This answer might not be accurate enough for your situation. I'm unable to judge that. It does address the problem as described above and is easily understandable.
Â is what it looks like when you print the UTF-8 encoding of U+00A0 (Non-Breaking Space) to a page/console that interprets it as Latin-1. So I think you have a non-breaking space between the sentences, not a normal space.
\s can match a non-breaking space too, but you will need to use the /u modifier to tell preg you are sending it a UTF-8-encoded string. Otherwise it, like your print command, will guess Latin-1 and see it as two characters: Â followed by a Latin-1 non-breaking space.
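A minimal sketch of that point, assuming PHP 7 or later (for the "\u{...}" escape) and a made-up input string: with the /u modifier, \s+ also splits on a non-breaking space.

```php
<?php
// Hypothetical input: the two sentences are separated by U+00A0, not by a normal space.
$text = "First sentence.\u{00A0}Second sentence.";

// With /u the subject is treated as UTF-8 and \s also matches the non-breaking space.
$parts = preg_split('/(?<=[.!?])\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($parts);
// Array
// (
//     [0] => First sentence.
//     [1] => Second sentence.
// )
```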
There is a fairly complex Unicode Text Segmentation algorithm that deals with various text boundaries, including sentence boundaries:
http://unicode.org/reports/tr29/
The best-known implementation of this algorithm is ICU.
I have found this class: http://php.net/manual/en/class.intlbreakiterator.php; however, it seemed to be available only in git, not in a mainstream release.
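In current PHP versions, IntlBreakIterator ships with the intl extension (PHP 5.5 and later). A minimal sketch of sentence splitting with it follows; the helper name `icuSentences`, the sample text and the `en_US` locale are assumptions for illustration, and the exact split points depend on the ICU rules installed on your system.

```php
<?php
// Assumption: the intl extension is installed; 'en_US' is just an example locale.
function icuSentences(string $text, string $locale = 'en_US'): array
{
    $iterator = IntlBreakIterator::createSentenceInstance($locale);
    $iterator->setText($text);

    $sentences = [];
    // getPartsIterator() yields the text between consecutive boundaries.
    foreach ($iterator->getPartsIterator() as $part) {
        $part = trim($part);
        if ($part !== '') {
            $sentences[] = $part;
        }
    }
    return $sentences;
}

// Only run the demo when the class is actually available.
if (class_exists('IntlBreakIterator')) {
    print_r(icuSentences('Entertainment media properties. Fairy Tail and Tokyo Ghoul.'));
}
```

Note that the default ICU rules know nothing about abbreviations such as "Mr.", so you may still want an exception list on top of this.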
So if you want to solve this VERY complex problem in the best way, I'd suggest to:
I believe it is impossible to build a bullet-proof sentence splitter, considering that user-generated content is not always grammatically and syntactically correct. Moreover, 100% correct results are out of reach because scraping and content-extraction tools are imperfect and may fail to deliver clean content, leaving whitespace or punctuation rubbish in the text. And finally, business now leans towards a good-enough strategy: if you manage to split the text correctly 95% of the time, it is in most cases considered a success.
Now, any sentence splitting task is an NLP task, and just one, or two, or three regexps are not enough. Rather than devising your own regex chain, I'd advise using some existing NLP libraries for that.
The following is a rough list of the rules used to split sentences.
- Each linebreak separates sentences.
- The end of the text indicates the end of a sentence if not otherwise ended through proper punctuation.
- Sentences must be at least two words long, unless a linebreak or end-of-text.
- An empty line is not a sentence.
- Each question mark or exclamation mark, or combination thereof, is considered the end of a sentence.
- A single period is considered the end of a sentence, unless...
  - It is preceded by one word, or...
  - It is followed by one word.
- A sequence of multiple periods is not considered the end of a sentence.
Usage example:
<?php
require_once 'classes/autoloader.php'; // Include the autoloader.
$text = "Hello there, Mr. Smith. What're you doing today... Smith,"
. " my friend?\n\nI hope it's good. This last sentence will"
. " cost you $2.50! Just kidding :)"; // This is the test text we're going to use
$Sentence = new Sentence; // Create a new instance
$sentences = $Sentence->split($text); // Split into array of sentences
$count = $Sentence->count($text); // Count the number of sentences
?>
Sample code:
<?php
include ('vendor/autoload.php');
use \NlpTools\Tokenizers\ClassifierBasedTokenizer;
use \NlpTools\Tokenizers\WhitespaceTokenizer;
use \NlpTools\Classifiers\ClassifierInterface;
use \NlpTools\Documents\DocumentInterface;
class EndOfSentence implements ClassifierInterface
{
public function classify(array $classes, DocumentInterface $d) {
list($token,$before,$after) = $d->getDocumentData();
$dotcnt = count(explode('.',$token))-1;
$lastdot = substr($token,-1)=='.';
if (!$lastdot) // assume that all sentences end in full stops
return 'O';
if ($dotcnt>1) // to catch some naive abbreviations U.S.A.
return 'O';
return 'EOW';
}
}
$tok = new ClassifierBasedTokenizer(
new EndOfSentence(),
new WhitespaceTokenizer()
);
$text = "We are what we repeatedly do.
Excellence, then, is not an act, but a habit.";
print_r($tok->tokenize($text));
// Array
// (
// [0] => We are what we repeatedly do.
// [1] => Excellence, then, is not an act, but a habit.
// )
IMPORTANT NOTE: Most NLP tokenization models I tested do not handle glued sentences well. However, if you add a space after a punctuation chain, sentence splitting quality improves. Just add this before sending the text to the sentence splitting function:
$txt = preg_replace('~\p{P}+~u', "$0 ", $txt); // "u" treats $txt as UTF-8 so multi-byte characters stay intact
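To see what this preprocessing does, here is a small sketch on a made-up input. The `u` modifier is used so the pattern works on UTF-8 text; the sample string and the variable name `$spaced` are assumptions.

```php
<?php
// Hypothetical sample input with "glued" sentences (no space after the punctuation).
$txt = 'Fairy Tail is out.Tokyo Ghoul too!Version 3.5 follows.';

// Append a space after every run of punctuation characters.
$spaced = preg_replace('~\p{P}+~u', '$0 ', $txt);

echo $spaced, PHP_EOL;
// Fairy Tail is out. Tokyo Ghoul too! Version 3. 5 follows. 
```

As the output shows, the replacement is blunt: it also splits decimal numbers such as 3.5 and adds a trailing space, so it trades some precision for better handling of glued sentences.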