Regex to match . (periods marking end of sentences) but not Mr. (as in Mr. Hopkins)

Submitted by 柔情痞子 on 2019-12-06 20:06:13

Question


I'm trying to parse a text file into sentences ending in periods, but names like Mr. Hopkins produce false matches on the period.

What regex matches "." but not "Mr."?

As a bonus, I'm also using ! to find the ends of sentences, so my current regex is /(!/./ and I'd love an answer that incorporates my !'s too.


Answer 1:


Use a negative lookbehind.

(?<!Mr|Mrs|Dr|Ms)\.

This will match a period only if it does not come directly after Mr, Mrs, Dr or Ms.

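<?php
   // Sample text with titles whose periods must not count as sentence ends.
   $str = "This is Mr. Someone and Mrs. Somebody. They are here to meet Dr. SomeoneElse.";
   // Replace every period that is not preceded by Mr, Mrs, Dr or Ms with a newline.
   $str = preg_replace("/(?<!Mr|Mrs|Dr|Ms)\\./", "\n", $str);
   echo($str);
?>
//outputs:
This is Mr. Someone and Mrs. Somebody
 They are here to meet Dr. SomeoneElse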
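
To cover the ! from the question as well, the period can be swapped for a character class. The following is a minimal sketch in Python rather than PHP; the names SENTENCE_END and split_sentences are made up for illustration. Python's re module only accepts fixed-width lookbehinds, so the alternation is written here as a chain of separate lookbehinds, which should behave the same way for these four titles.

```python
import re

# One lookbehind per title: Python's re rejects alternatives of different
# lengths inside a single lookbehind, so (?<!Mr|Mrs|Dr|Ms) is split up.
# [.!] matches either a period or an exclamation mark.
SENTENCE_END = re.compile(r"(?<!Mr)(?<!Mrs)(?<!Dr)(?<!Ms)[.!]")


def split_sentences(text):
    """Split text at '.' or '!' unless it directly follows a listed title."""
    parts = SENTENCE_END.split(text)
    # Drop surrounding whitespace and the empty piece after the final mark.
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    sample = ("This is Mr. Someone and Mrs. Somebody. "
              "They are here to meet Dr. SomeoneElse! Great.")
    for sentence in split_sentences(sample):
        print(sentence)
```

Note that splitting this way discards the terminating punctuation, just as the preg_replace call above does.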



Answer 2:


This can't be done with any simple mechanism. It's hopelessly ambiguous. Sentences can end with abbreviations, and in those cases they aren't written with two periods.

See Unicode TR29 (UAX #29, Unicode Text Segmentation), which describes rules for finding sentence boundaries. Also see the open source ICU library, which includes a basic implementation.
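
The ambiguity is easy to reproduce. Below is a small Python sketch, assuming a title-aware lookbehind pattern of the kind suggested in the first answer; the names TITLE_AWARE_PERIOD and count_boundaries are hypothetical. The second sample contains two sentences, the first of which ends in an abbreviation, yet the pattern reports only one boundary.

```python
import re

# A period that does not directly follow Mr, Mrs, Dr or Ms.
TITLE_AWARE_PERIOD = re.compile(r"(?<!Mr)(?<!Mrs)(?<!Dr)(?<!Ms)\.")


def count_boundaries(text):
    """Count the periods that the pattern treats as sentence ends."""
    return len(TITLE_AWARE_PERIOD.findall(text))


if __name__ == "__main__":
    # One sentence; the period after "Dr" is correctly ignored.
    print(count_boundaries("I met Dr. Hopkins."))
    # Two sentences, but the first ends in "Dr." so its boundary is missed.
    print(count_boundaries("I went to see the Dr. Then I went home."))
```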




Answer 3:


Are your sentences always followed by two spaces? If so, you could just check for that:

/\.\s{2}/

And incorporating other end-of-sentence punctuation: /[.!?]\s{2}/

You could also check for other indicators of the end of a sentence, such as whether the next word is capitalized or whether the punctuation is followed by a carriage return. At best, though, you'll only be able to make an educated guess. As pointed out above, the period is just too ambiguous.
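
Both heuristics can be sketched in a few lines of Python. The names below are hypothetical, and the patterns use a lookbehind so that the punctuation stays attached to its sentence; \s{2,} is used instead of \s{2} so that extra spaces are swallowed too. The second function shows why the capital-letter check is only a guess: it also splits after "Mr." because "Hopkins" is capitalized.

```python
import re

# End punctuation followed by at least two whitespace characters.
TWO_SPACE_BOUNDARY = re.compile(r"(?<=[.!?])\s{2,}")
# End punctuation, whitespace, then an ASCII capital letter.
CAPITAL_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_on_two_spaces(text):
    """Split where end punctuation is followed by two or more spaces."""
    return [s for s in TWO_SPACE_BOUNDARY.split(text) if s]


def split_before_capital(text):
    """Split where end punctuation is followed by a capitalized word."""
    return [s for s in CAPITAL_BOUNDARY.split(text) if s]


if __name__ == "__main__":
    print(split_on_two_spaces("I met Mr. Hopkins today.  He was late!  Why?"))
    # Over-splits after "Mr." because the next word is capitalized.
    print(split_before_capital("I met Mr. Hopkins today. He left."))
```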



Source: https://stackoverflow.com/questions/2946045/regex-to-match-periods-marking-end-of-sentences-but-not-mr-as-in-mr-hopki
