Split sentence into words but having trouble with the punctuations in C#

女生的网名这么多〃 提交于 2019-11-30 08:57:20

A regex solution.

(\b[^\s]+\b)

And if you really want to fix that last . on i.e. you could use this.

((\b[^\s]+\b)((?<=\.\w).)?)

Here's the code I'm using.

  var input = "The moon is our natural satellite, i.e. it rotates around the Earth!";
  var matches = Regex.Matches(input, @"((\b[^\s]+\b)((?<=\.\w).)?)");

  foreach(var match in matches)
  {
     Console.WriteLine(match);
  }

Results:

The
moon
is
our
natural
satellite
i.e.
it
rotates
around
the
Earth

I suspect the solution you're looking for is much more complex than you think. You're looking for some form of actual language analysis, or at a minimum a dictionary, so that you can determine whether a period is part of a word or ends a sentence. Have you considered the fact that it may do both?

Consider adding a dictionary of allowed "words that contain punctuation." This may be the simplest way to solve your problem.

Cheeso

This works for me.

var str="The moon is our natural satellite, i.e. it rotates around the Earth!";
var a = str.Split(new char[] {' ', '\t'});
for (int i=0; i < a.Length; i++)
{
    Console.WriteLine(" -{0}", a[i]);
}

Results:

 -The
 -moon
 -is
 -our
 -natural
 -satellite,
 -i.e.
 -it
 -rotates
 -around
 -the
 -Earth!

you could do some post-processing of the results, removing commas and semicolons, etc.

Regex.Matches(input, @"\b\w+\b").OfType<Match>().Select(m => m.Value)
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!