I am looking for a good .NET regular expression that I can use for parsing out individual sentences from a body of text.
It should be able to parse the following blo
Try this: @"(\S.+?[.!?])(?=\s+|$)"
string str=@"Hello world! How are you? I am fine. This is a difficult sentence because I use I.D.
Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23.";
Regex rx = new Regex(@"(\S.+?[.!?])(?=\s+|$)");
foreach (Match match in rx.Matches(str)) {
    // match.Index holds the sentence's start offset, should you need it
    Console.WriteLine(match.Value);
}
Results:
Hello world!
How are you?
I am fine.
This is a difficult sentence because I use I.D.
Newlines should also be accepted.
Numbers should not cause sentence breaks, like 1.23.
For complicated ones, of course, you will need a real parser like SharpNLP or NLTK. Mine is just a quick and dirty one.
Here is the SharpNLP info, and features:
SharpNLP is a collection of natural language processing tools written in C#. Currently it provides the following NLP tools:
I used the suggestions posted here and came up with a regex that seems to achieve what I want to do:
(?<Sentence>\S.+?(?<Terminator>[.!?]|\Z))(?=\s+|\Z)
I used Expresso to come up with:
// using System.Text.RegularExpressions;
/// <summary>
/// Regular expression built for C# on: Sun, Dec 27, 2009, 03:05:24 PM
/// Using Expresso Version: 3.0.3276, http://www.ultrapico.com
///
/// A description of the regular expression:
///
/// [Sentence]: A named capture group. [\S.+?(?<Terminator>[.!?]|\Z)]
/// \S.+?(?<Terminator>[.!?]|\Z)
/// Anything other than whitespace
/// Any character, one or more repetitions, as few as possible
/// [Terminator]: A named capture group. [[.!?]|\Z]
/// Select from 2 alternatives
/// Any character in this class: [.!?]
/// End of string or before new line at end of string
/// Match a suffix but exclude it from the capture. [\s+|\Z]
/// Select from 2 alternatives
/// Whitespace, one or more repetitions
/// End of string or before new line at end of string
///
///
/// </summary>
public static Regex regex = new Regex(
"(?<Sentence>\\S.+?(?<Terminator>[.!?]|\\Z))(?=\\s+|\\Z)",
RegexOptions.CultureInvariant
| RegexOptions.IgnorePatternWhitespace
| RegexOptions.Compiled
);
// This is the replacement string; it may only reference groups that the pattern defines
public static string regexReplace =
"$& [${Terminator}]";
//// Replace the matched text in the InputText using the replacement pattern
// string result = regex.Replace(InputText,regexReplace);
//// Split the InputText wherever the regex matches
// string[] results = regex.Split(InputText);
//// Capture the first Match, if any, in the InputText
// Match m = regex.Match(InputText);
//// Capture all Matches in the InputText
// MatchCollection ms = regex.Matches(InputText);
//// Test to see if there is a match in the InputText
// bool IsMatch = regex.IsMatch(InputText);
//// Get the names of all the named and numbered capture groups
// string[] GroupNames = regex.GetGroupNames();
//// Get the numbers of all the named and numbered capture groups
// int[] GroupNumbers = regex.GetGroupNumbers();
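To make the two named groups in that pattern more concrete, here is a minimal sketch of reading `Sentence` and `Terminator` from each match. The class name `NamedGroupDemo`, the method `Parse`, and the sample text are made up for illustration; only the pattern comes from the post above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Pattern copied from the post above; everything else here is illustrative.
    private static readonly Regex SentenceRegex = new Regex(
        @"(?<Sentence>\S.+?(?<Terminator>[.!?]|\Z))(?=\s+|\Z)");

    // Returns each matched sentence (Key) together with its terminator (Value).
    // The terminator is an empty string when the text ran out before any punctuation.
    public static List<KeyValuePair<string, string>> Parse(string text)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in SentenceRegex.Matches(text))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups["Sentence"].Value, m.Groups["Terminator"].Value));
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical sample text whose last sentence has no closing punctuation.
        foreach (var pair in Parse("Hello world! How are you? No final period"))
        {
            Console.WriteLine("[{0}] terminator: '{1}'", pair.Key, pair.Value);
        }
    }
}
```

The `\Z` alternative is what lets trailing text without closing punctuation still come back as a sentence; in that case the `Terminator` group is empty.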
Most have advised using SharpNLP, and you should probably do so unless you want your QA department to have a bug fest.
But since you are probably under some sort of pressure, here is another attempt at dealing with words like "Dr." and "X.". It will, however, fail on a sentence ending in "it."
// using System.Linq;
// using System.Text.RegularExpressions;
var input = "Hello world! How are you? I am fine. This is a difficult sentence because I use I.D. Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23. See Dr. B or Mr. FooBar for H. pylori assessment in the cardia.";
// The lookbehind rejects a break right after a 1-3 letter word followed by a period (e.g. "Dr.", "H.")
var result = new Regex(@"(\S.+?[.!?])(?=\s+|$)(?<!\s[A-Za-z]{1,3}\.)").Split(input).Where(s => !String.IsNullOrWhiteSpace(s)).ToArray();
foreach (var match in result)
{
    Console.WriteLine(match);
}
This is not really possible with regular expressions alone, unless you know exactly which "difficult" tokens you have, such as "i.d.", "Mr.", etc. For example, how many sentences is "Please show your I.D, Mr. Bond."? I'm not familiar with any C# implementations, but I've used NLTK's Punkt tokenizer. It probably would not be too hard to re-implement.
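If you do know your difficult tokens up front, one possible approach is to split naively and then glue pieces back together whenever a piece ends in a known abbreviation. The sketch below is only an illustration of that idea, not Punkt or any library's method: the class `AbbreviationAwareSplitter`, its `Split` method, and the tiny abbreviation list are all assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AbbreviationAwareSplitter
{
    // Hypothetical, deliberately tiny list; a real one would need to be far longer.
    private static readonly HashSet<string> KnownAbbreviations =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Mr.", "Mrs.", "Dr.", "e.g.", "i.e." };

    public static List<string> Split(string text)
    {
        var sentences = new List<string>();
        string pending = "";

        // Naive first pass: break at whitespace that follows '.', '?' or '!'.
        foreach (string piece in Regex.Split(text.Trim(), @"(?<=[.?!])\s+"))
        {
            if (piece.Length == 0) continue;

            // Re-join with a single space (the original whitespace is not preserved).
            pending = pending.Length == 0 ? piece : pending + " " + piece;

            // If the piece ends in a known abbreviation, the break was a false
            // positive, so keep accumulating instead of emitting a sentence.
            string lastToken = Regex.Match(piece, @"\S+$").Value;
            if (!KnownAbbreviations.Contains(lastToken))
            {
                sentences.Add(pending);
                pending = "";
            }
        }

        if (pending.Length > 0) sentences.Add(pending);
        return sentences;
    }

    public static void Main()
    {
        foreach (string s in Split("Please show your I.D, Mr. Bond. Thank you!"))
            Console.WriteLine(s);
    }
}
```

This still gets it wrong when a sentence genuinely ends with one of the listed abbreviations, and it does nothing for tokens missing from the list, such as the "H." in "H. pylori".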
It is impossible to use regexes to parse natural language. What is the end of a sentence? A period can occur in many places (e.g. e.g.). You should use a natural language parsing toolkit such as OpenNLP or NLTK. Unfortunately there are very few, if any, offerings in C#. You may therefore have to create a web service or otherwise link the toolkit into C#.
Note that it will cause problems in the future if you rely on exact whitespace, as in "I.D.". You'll soon find examples that break your regex. For example, most people put spaces after their initials.
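As a quick illustration of the spaced-initials problem, this sketch runs the simple pattern suggested earlier in the thread over a name written with spaces after each initial. The class name `InitialsDemo`, the method `Split`, and the sample sentence are assumptions chosen for the demonstration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class InitialsDemo
{
    // Collects every match of the simple sentence pattern from earlier in the thread.
    public static List<string> Split(string text)
    {
        var sentences = new List<string>();
        foreach (Match m in Regex.Matches(text, @"(\S.+?[.!?])(?=\s+|$)"))
        {
            sentences.Add(m.Value);
        }
        return sentences;
    }

    public static void Main()
    {
        // One sentence for a human reader, but the regex reports two pieces:
        // "J. R." and "R. Tolkien wrote it."
        foreach (string s in Split("J. R. R. Tolkien wrote it."))
        {
            Console.WriteLine(s);
        }
    }
}
```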
There is an excellent summary of open-source and commercial offerings on Wikipedia (http://en.wikipedia.org/wiki/Natural_language_processing_toolkits). We have used several of them. It's worth the effort.
[You use the word "train". This is normally associated with machine learning (which is one approach to NLP and has been used for sentence splitting). Indeed, the toolkits I have mentioned include machine learning. I suspect that wasn't what you meant; rather, that you would evolve your expression through heuristics. Don't!]
var str = @"Hello world! How are you? I am fine. This is a difficult sentence because I use I.D.
Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23.";
Regex.Split(str, @"(?<=[.?!])\s+").Dump();
I tested this in LINQPad.
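`Dump()` exists only inside LINQPad, so here is a rough console equivalent of the same split, wrapped in a helper so it can be reused. The names `LookbehindSplitDemo` and `SplitSentences` are made up for this sketch; the pattern is the one from the snippet above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindSplitDemo
{
    // Splits at any run of whitespace that directly follows '.', '?' or '!'.
    // The lookbehind keeps the punctuation attached to the preceding sentence.
    public static string[] SplitSentences(string text)
    {
        return Regex.Split(text, @"(?<=[.?!])\s+");
    }

    public static void Main()
    {
        string str = "Hello world! How are you? I am fine. " +
                     "This is a difficult sentence because I use I.D.\n" +
                     "Newlines should also be accepted. " +
                     "Numbers should not cause sentence breaks, like 1.23.";

        foreach (string sentence in SplitSentences(str))
        {
            Console.WriteLine(sentence);
        }
    }
}
```

On the sample text this yields the six expected sentences, because "I.D." and "1.23" contain no whitespace after their inner periods. It still breaks after abbreviations such as "Dr." whenever they are followed by a space.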