RegEx help to remove noise words or stop words from string

和自甴很熟 posted on 2020-01-13 18:19:27

Question


I want to remove all noise tags from an input string of comma-separated tags. If a noise word is only part of a longer tag, that tag should remain.

This is what I have, but it is not working:

string input_string = "This,sure,about,all of our, all, values";
string stopWords = "this|is|about|after|all|also";
stopWords = string.Format(@"\s?\b(?:{0})\b\s?", stopWords);
string tags = Regex.Replace(input_string, stopWords, "", RegexOptions.IgnoreCase); 

This is what I want from the above input: ",sure,,all of our,,values"

The words "This", "about", and "all" should be replaced with "" because they are noise words. But "all of our" should remain even though it contains the noise word "all", because the comma is the tag boundary.

Can anyone give me a hand?

I had an alternative solution that puts the noise words into a dictionary and then looks up each word of the input string, but I prefer a RegEx approach.


Answer 1:


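        // Needs: using System.Linq; using System.Text.RegularExpressions;
        var input = "This,sure,about,all of our, all, values";
        // Anchored so only a whole tag matches; IgnoreCase so "This" is caught as well.
        var stopWords = new Regex("^(this|is|about|after|all|also)$", RegexOptions.IgnoreCase);
        // Drops the stop-word tags entirely: "sure,all of our, values"
        var result = String.Join(",", input.Split(',')
            .Where(x => !stopWords.IsMatch(x.Trim())));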



Answer 2:


Try

stopWords = string.Format(@"(?<=^|,)\s*(?:{0})\s*(?=$|,)", stopWords);

This uses a lookbehind (?<=) to require a preceding , or the start of the string, and a lookahead (?=) to require a trailing , or the end of the string. I've also dropped the word boundary \b because it's not needed, and replaced your optional whitespace \s? with \s* to match zero or more whitespace characters.

You could change the * back to a ? if you really do mean at most one space.
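
For reference, here is a minimal, self-contained sketch of how that pattern might be wired up. The class name StopTagRegex and the helper BlankStopTags are hypothetical names chosen for illustration, and the Regex.Escape call is an extra precaution that is not part of the pattern above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class StopTagRegex
{
    // Hypothetical helper: blanks out every tag that is exactly a stop word.
    public static string BlankStopTags(string input, params string[] stopWords)
    {
        // Escape each word so regex metacharacters in a stop word are taken literally.
        string alternation = string.Join("|", stopWords.Select(Regex.Escape));
        // Same shape as the pattern above: the tag must sit between commas or string ends.
        string pattern = string.Format(@"(?<=^|,)\s*(?:{0})\s*(?=$|,)", alternation);
        return Regex.Replace(input, pattern, "", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string tags = BlankStopTags("This,sure,about,all of our, all, values",
            "this", "is", "about", "after", "all", "also");
        Console.WriteLine(tags); // prints ",sure,,all of our,, values"
    }
}
```

Note that whitespace is only consumed when the whole tag is removed, so a kept tag such as " values" keeps its leading space.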




Answer 3:


I don't like using Regex for processing tasks, so I will offer an alternative solution and you can decide whether you want to use it.

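// Assumes stopWords is a collection of lower-case words, e.g.
// new HashSet<string> { "this", "is", "about", "after", "all", "also" }
string[] inputWords = input_string.Split(',');
string tags = "";

foreach(string s in inputWords)
{
   // Trim first so that " all" is recognised as the stop word "all"
   if(!stopWords.Contains(s.Trim().ToLowerInvariant()))
      tags += s.Trim() + ",";
}

tags = tags.TrimEnd(',');

//tags = "sure,all of our,values"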
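
If the empty slots between commas should be kept, as in the output the question asks for, the same split-and-look-up idea can be sketched as follows. The class StopTagSet and the helper BlankStopTags are hypothetical names, and using a HashSet with StringComparer.OrdinalIgnoreCase is an assumption made here to avoid lower-casing each tag.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StopTagSet
{
    // Hypothetical helper: replaces each stop-word tag with "", keeping the commas.
    public static string BlankStopTags(string input, IEnumerable<string> stopWords)
    {
        // Case-insensitive set, so no ToLowerInvariant() call is needed per tag.
        var stopSet = new HashSet<string>(stopWords, StringComparer.OrdinalIgnoreCase);
        var kept = input.Split(',')
            .Select(tag => tag.Trim())
            .Select(tag => stopSet.Contains(tag) ? "" : tag);
        return string.Join(",", kept);
    }

    public static void Main()
    {
        var stopWords = new[] { "this", "is", "about", "after", "all", "also" };
        Console.WriteLine(BlankStopTags("This,sure,about,all of our, all, values", stopWords));
        // prints ",sure,,all of our,,values"
    }
}
```

Because every tag is trimmed before the lookup, the kept tags come out without surrounding spaces.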


Source: https://stackoverflow.com/questions/6813377/regex-help-to-remove-noise-words-or-stop-words-from-string
