I'm after a regex (PHP/Perl compatible) to get the first sentence out of some text. I realize this could get huge if covering every case, but just after something that
/\A(.+?)[.?!] /s
matches everything up to the first of those punctuation marks that is followed by a space. That's what a sentence is, isn't it? The /s modifier is there so the dot also matches newlines.
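To make that concrete, here is a rough Python sketch of the same pattern. `re.DOTALL` stands in for the `/s` modifier, and the helper name `first_sentence` is made up for illustration. Note that the pattern insists on a space after the end mark, so a text consisting of a single sentence does not match at all.

```python
import re

# The pattern from the post: lazily capture from the start of the text up to
# the first '.', '?' or '!' that is followed by a space. re.DOTALL plays the
# role of the /s modifier, so '.' also matches newlines.
FIRST_SENTENCE = re.compile(r"\A(.+?)[.?!] ", re.DOTALL)


def first_sentence(text):
    """Return the first sentence without its end mark, or None if no match."""
    match = FIRST_SENTENCE.search(text)
    return match.group(1) if match else None


print(repr(first_sentence("Hello\nworld! How are you?")))  # 'Hello\nworld'
print(repr(first_sentence("Only one sentence.")))          # None (no trailing space)
```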
Well, /^[^.]+/
is the simplest one
If by "sentence" you mean "line", then simply match the first ^.*
from a chunk of text. By default, the dot does not match newline characters.
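A quick way to see the difference, sketched in Python (where `re.DOTALL` is the equivalent of the `/s` modifier); the sample text is invented:

```python
import re

text = "First line of text.\nSecond line."

# Without flags, '.' stops at the newline, so ^.* yields just the first line.
first_line = re.match(r"^.*", text).group()

# With re.DOTALL, '.' crosses newlines and ^.* swallows the whole text.
everything = re.match(r"^.*", text, re.DOTALL).group()

print(repr(first_line))   # 'First line of text.'
print(repr(everything))   # 'First line of text.\nSecond line.'
```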
If it's really the first sentence, do something like this: ^[^.!?]*
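For illustration, a small Python sketch of that negated-class pattern (the function name `first_sentence_simple` is hypothetical). It never fails to match, but it cuts at the very first end mark, so abbreviations such as "Mr." trip it up:

```python
import re


def first_sentence_simple(text):
    # ^[^.!?]* always matches (possibly the empty string), so .group() is safe.
    # A negated class also matches newlines, so no extra flag is needed.
    return re.match(r"^[^.!?]*", text).group()


print(repr(first_sentence_simple("It works. Mostly.")))
# 'It works'
print(repr(first_sentence_simple("So much for Mr. Regex and his matching.")))
# 'So much for Mr'
```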
This works in .NET:
/(?<=^\s*)(?!\s)("(\<'.*?'\>|.)*"|.)*?((?<='*"*)|[.?!]+|$)(?=\ \ |\n\n|$)/s
Handles American-style quotation marks (and quotes "like this 'and this.' Yes, with punctuation.") and sentences ending with multiple punctuation marks. Also ignores leading whitespace. Requires two spaces, two end-of-lines, or an end-of-file after sentences, though.
Handles the following well:
So much for Mr. Regex and his sentence matching, as he says "this sentence, isn't it wonderful? One says, 'It's almost as if this was crafted purely for example.'" This part shouldn't match, though.
I know you just want anything that works for now, but this mailing list post came up with /^[^\.]*\.\s/, and the subsequent post came up with ([\s\S]+?)\.( |\r|\n).
These patterns only match periods, but you can modify them to also match other end punctuation such as exclamation marks and question marks.
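As a sketch of that modification, here are both patterns in Python with the literal period swapped for the class `[.!?]`. The names `STRICT` and `LAZY` are just labels I picked. The example also shows where they differ: the first pattern gives up entirely if the first end mark is not followed by whitespace, while the lazy one keeps scanning.

```python
import re

# Both patterns from the posts, with the literal period replaced by a
# character class so that '!' and '?' also end a sentence.
STRICT = re.compile(r"^[^.!?]*[.!?]\s")
LAZY = re.compile(r"([\s\S]+?)[.!?]( |\r|\n)")

text = "Is pi 3.14? Roughly. Yes!"

# The '.' inside 3.14 is the first end mark and is not followed by
# whitespace, so the strict pattern finds nothing.
print(STRICT.match(text))                # None

# The lazy pattern skips past it and stops at the '?' followed by a space.
print(repr(LAZY.match(text).group(1)))   # 'Is pi 3.14'
```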
It isn't just a regex, but I wrote a Python function to do this: Separating sentences. Natural language processing is notoriously difficult, so there are cases this doesn't treat right, but it does handle some tricky cases well.