I have seen a few similar questions but I am trying to achieve this.
Given a string, str="The moon is our natural satellite, i.e. it rotates around the Earth!" I want to extract the words and store them in an array. The expected array elements would be this.
the
moon
is
our
natural
satellite
i.e.
it
rotates
around
the
earth
I tried using String.Split(' ', '\t', '\r'), but this does not work correctly. I also tried removing the periods, commas, and other punctuation marks, but I want a string like "i.e." to be kept intact. What is the best way to achieve this? I also tried using Regex.Split, to no avail.
string[] words = Regex.Split(line, @"\W+");
Would surely appreciate some nudges in the right direction.
A regex solution.
(\b[^\s]+\b)
And if you really want to fix that last "." on "i.e.", you could use this.
((\b[^\s]+\b)((?<=\.\w).)?)
Here's the code I'm using.
var input = "The moon is our natural satellite, i.e. it rotates around the Earth!";
var matches = Regex.Matches(input, @"((\b[^\s]+\b)((?<=\.\w).)?)");
foreach (var match in matches)
{
    Console.WriteLine(match);
}
Results:
The
moon
is
our
natural
satellite
i.e.
it
rotates
around
the
Earth
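Since the question asks for the words in an array, here is a minimal sketch of how the matches could be collected. ExtractWords is a hypothetical helper name, the lower-casing is only there to match the expected output in the question, and the trailing dot is escaped (\.) so that only a literal period gets appended. It is written as a top-level program, so it assumes C# 9 or later.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the matched words, lower-cased, as an array.
string[] ExtractWords(string text)
{
    // Same idea as the pattern above, but the final dot is escaped (\.)
    // so that only a literal period can be appended to a word.
    const string pattern = @"\b[^\s]+\b(?:(?<=\.\w)\.)?";

    return Regex.Matches(text, pattern)
                .Cast<Match>()
                .Select(m => m.Value.ToLowerInvariant())
                .ToArray();
}

var input = "The moon is our natural satellite, i.e. it rotates around the Earth!";
string[] words = ExtractWords(input);

// Prints: the|moon|is|our|natural|satellite|i.e.|it|rotates|around|the|earth
Console.WriteLine(string.Join("|", words));
```

Note that the lookbehind only checks for "a dot followed by one word character", so a sentence ending in something like "version 1.2." would keep its final period as well.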
I suspect the solution you're looking for is much more complex than you think. You're looking for some form of actual language analysis, or at a minimum a dictionary, so that you can determine whether a period is part of a word or ends a sentence. Have you considered the fact that it may do both?
Consider adding a dictionary of allowed "words that contain punctuation." This may be the simplest way to solve your problem.
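A rough sketch of the dictionary idea, under some assumptions: the three entries in knownWords are just examples, SplitWords and TrimPunctuation are hypothetical helper names, and a listed word always keeps its period even when it also ends the sentence. Written as a top-level program (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, deliberately tiny dictionary of words that keep their period.
var knownWords = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
    "i.e.", "e.g.", "etc."
};

// Removes punctuation characters from both ends of a token.
string TrimPunctuation(string token)
{
    int start = 0, end = token.Length;
    while (start < end && char.IsPunctuation(token[start])) start++;
    while (end > start && char.IsPunctuation(token[end - 1])) end--;
    return token.Substring(start, end - start);
}

// Splits on whitespace, strips punctuation, and restores the trailing period
// only for tokens that the dictionary lists.
string[] SplitWords(string text, ISet<string> keepAsIs)
{
    var result = new List<string>();
    var separators = new[] { ' ', '\t', '\r', '\n' };
    foreach (var token in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
    {
        var core = TrimPunctuation(token);   // "i.e." -> "i.e", "Earth!" -> "Earth"
        if (core.Length == 0) continue;      // the token was punctuation only

        var withDot = core + ".";
        bool listed = token.Contains(withDot) && keepAsIs.Contains(withDot);
        result.Add(listed ? withDot : core);
    }
    return result.ToArray();
}

var input = "The moon is our natural satellite, i.e. it rotates around the Earth!";
foreach (var word in SplitWords(input, knownWords))
{
    Console.WriteLine(word);
}
```

With the sample sentence this prints one word per line, with "satellite" and "Earth" stripped of their punctuation and "i.e." left intact. Anything not in the dictionary loses its trailing period, so the list has to be maintained by hand.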
This works for me.
var str = "The moon is our natural satellite, i.e. it rotates around the Earth!";
var a = str.Split(new char[] {' ', '\t'});
for (int i = 0; i < a.Length; i++)
{
    Console.WriteLine(" -{0}", a[i]);
}
Results:
-The
-moon
-is
-our
-natural
-satellite,
-i.e.
-it
-rotates
-around
-the
-Earth!
You could do some post-processing of the results, removing commas, semicolons, etc.
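One possible way to do that post-processing is to trim a fixed set of characters from each piece. This is only a sketch: the contents of trimChars are my assumption, and the period is left out on purpose so that "i.e." survives, which also means a sentence-ending period would survive. Written as a top-level program (C# 9 or later).

```csharp
using System;
using System.Linq;

var str = "The moon is our natural satellite, i.e. it rotates around the Earth!";

// Punctuation to strip from the ends of each piece. The period is omitted
// on purpose so that "i.e." is kept (an assumption, not a general rule).
char[] trimChars = { ',', ';', ':', '!', '?', '"', '(', ')' };

string[] cleaned = str
    .Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries)
    .Select(piece => piece.Trim(trimChars))
    .Where(piece => piece.Length > 0)   // drop pieces that were punctuation only
    .ToArray();

foreach (var word in cleaned)
{
    Console.WriteLine(" -{0}", word);
}
```

For the sample sentence this gives "satellite" and "Earth" without the comma and exclamation mark, and "i.e." unchanged.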
Regex.Matches(input, @"\b\w+\b").OfType<Match>().Select(m => m.Value)
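For comparison, here is that one-liner dropped into a small top-level program (C# 9 or later; the surrounding variable names are mine). Because \w does not match a period, "i.e." comes out as two separate matches, "i" and "e", so this pattern alone does not keep the abbreviation together.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var input = "The moon is our natural satellite, i.e. it rotates around the Earth!";

// \w+ matches runs of letters, digits and underscores only,
// so every punctuation mark acts as a separator.
string[] words = Regex.Matches(input, @"\b\w+\b")
    .OfType<Match>()
    .Select(m => m.Value)
    .ToArray();

// Prints: The moon is our natural satellite i e it rotates around the Earth
Console.WriteLine(string.Join(" ", words));
```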
Source: https://stackoverflow.com/questions/7311734/split-sentence-into-words-but-having-trouble-with-the-punctuations-in-c-sharp