Question
I want to remove all noise tags from a string of input tags. The tags are separated by commas. If a noise word is only part of a longer tag, that tag should remain.
This is what I have, but it is not working:
string input_string = "This,sure,about,all of our, all, values";
string stopWords = "this|is|about|after|all|also";
stopWords = string.Format(@"\s?\b(?:{0})\b\s?", stopWords);
string tags = Regex.Replace(input_string, stopWords, "", RegexOptions.IgnoreCase);
This is what I want from the above input: ",sure,,all of our,,values"
The tags "This", "about", and "all" should be replaced with "" since they are noise words. But "all of our" should remain even though it contains the noise word "all", because the comma is the tag boundary.
Can anyone give me a hand?
I have an alternative solution that puts the noise words into a dictionary and then looks up each word of the input string, but I prefer a regex approach.
Answer 1:
var input = "This,sure,about,all of our, all, values";
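// IgnoreCase is needed so that "This" matches the stop word "this"
var stopWords = new Regex("^(this|is|about|after|all|also)$", RegexOptions.IgnoreCase);
// Keep only the tags that are not, after trimming, exactly a stop word
var result = String.Join(",", input.Split(',')
    .Where(x => !stopWords.IsMatch(x.Trim())));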
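Note that filtering with Where drops the noise tags entirely, whereas the question asked for their empty slots to be kept. A minimal sketch of that variant follows. The class name TagFilter and the method BlankNoiseTags are made up for illustration, and it assumes that trimming the whitespace around each tag is acceptable.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagFilter
{
    // Anchored with ^ and $ so only a whole tag can match; IgnoreCase so "This" counts.
    private static readonly Regex StopWords =
        new Regex("^(?:this|is|about|after|all|also)$", RegexOptions.IgnoreCase);

    // Hypothetical helper: blanks out noise tags but keeps their comma slots.
    public static string BlankNoiseTags(string input)
    {
        var tags = input.Split(',')
                        .Select(t => t.Trim())                       // assumption: surrounding spaces are not significant
                        .Select(t => StopWords.IsMatch(t) ? "" : t); // noise tag -> empty slot
        return string.Join(",", tags);
    }

    public static void Main()
    {
        Console.WriteLine(BlankNoiseTags("This,sure,about,all of our, all, values"));
        // prints ",sure,,all of our,,values"
    }
}
```

Replacing the Select that blanks tags with a Where that filters them gives back the behaviour of the snippet above.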
Answer 2:
Try
stopWords = string.Format(@"(?<=^|,)\s*(?:{0})\s*(?=$|,)", stopWords);
This uses a lookbehind, (?<=^|,), to require a preceding comma or the start of the string, and a lookahead, (?=$|,), to require a trailing comma or the end of the string. I've also dropped the word boundary \b because it's not needed, and replaced your optional whitespace \s? with \s* to match zero or more whitespace characters. You could change the * back to a ? if you really do mean at most one space.
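For reference, here is a sketch of a complete program built around the lookaround pattern above. The class and method names are illustrative only, and the call to Regex.Escape is an addition (not part of the answer) that guards against stop words containing regex metacharacters.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Hypothetical helper: removes any tag that consists solely of a stop word.
    public static string RemoveNoiseTags(string input, params string[] stopWords)
    {
        // Escape each word (an assumption for safety) and join into an alternation.
        string alternation = string.Join("|", stopWords.Select(w => Regex.Escape(w)));

        // A tag matches only if it is bounded by a comma or a string edge on both sides.
        string pattern = string.Format(@"(?<=^|,)\s*(?:{0})\s*(?=$|,)", alternation);

        return Regex.Replace(input, pattern, "", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string result = RemoveNoiseTags(
            "This,sure,about,all of our, all, values",
            "this", "is", "about", "after", "all", "also");
        Console.WriteLine(result);
        // prints ",sure,,all of our,, values"
    }
}
```

The space before "values" survives because only matched noise tags have their surrounding whitespace consumed; trim the remaining tags separately if that matters.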
Answer 3:
I don't like using Regex for processing tasks, so I will offer an alternative solution and you can decide whether you want to use it.
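// Requires: using System; using System.Collections.Generic;
// The stop words as a case-insensitive set (a '|'-separated string would only allow substring checks)
var stopWordSet = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    { "this", "is", "about", "after", "all", "also" };
string[] inputWords = input_string.Split(',');
var kept = new List<string>();
foreach (string s in inputWords)
{
    string tag = s.Trim();   // ignore whitespace around each tag, e.g. " all"
    if (!stopWordSet.Contains(tag))
        kept.Add(tag);
}
string tags = string.Join(",", kept);
// tags = "sure,all of our,values"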
Source: https://stackoverflow.com/questions/6813377/regex-help-to-remove-noise-words-or-stop-words-from-string