Regex accent insensitive?

Posted by 删除回忆录丶 on 2019-12-17 18:52:59

Question


I need a Regex in a C# program.


I have to capture a file name that has a specific structure.

I used the \w character class, but the problem is that this class doesn't match any accented characters.

So how can I do this? I don't want to list only the most common accented letters in my pattern, because in theory any accent can be put on any letter.

So I thought there might be a syntax to say we want a case-insensitive match (or a class that takes accents into account), or a Regex option that lets me be case-insensitive.

Do you know of something like this?

Thank you very much


Answer 1:


Case-insensitive matching works for me in this example:

     string input = @"âãäåæçèéêëìíîïðñòóôõøùúûüýþÿı";
     string pattern = @"\w+";
     MatchCollection matches = Regex.Matches(input, pattern, RegexOptions.IgnoreCase);
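
A quick check suggests that RegexOptions.IgnoreCase is not what makes this example match: in .NET, \w is Unicode-aware by default. The following is a minimal sketch, written with top-level statements (so it assumes C# 9 or later); the variable names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "âãäåæçèéêëìíîïðñòóôõøùúûüýþÿı";

// No options at all: \w already covers Unicode letters in .NET.
Match plain = Regex.Match(input, @"\w+");

// Same pattern with IgnoreCase, as in the example above.
Match ignoreCase = Regex.Match(input, @"\w+", RegexOptions.IgnoreCase);

Console.WriteLine(plain.Value == input);       // True: the whole string is matched
Console.WriteLine(ignoreCase.Value == input);  // True: same result
```

If \w fails to match accented letters in your program, it may be worth checking whether RegexOptions.ECMAScript is set, since that option restricts \w to ASCII.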



Answer 2:


You could simply replace the diacritics with their alphabetic (near-)equivalents, and then use your current regex.

See for example:

How do I remove diacritics (accents) from a string in .NET?

using System.Globalization;
using System.Text;

static string RemoveDiacritics(string input)
{
    // Decompose each accented character into a base letter plus combining marks.
    string normalized = input.Normalize(NormalizationForm.FormD);
    var builder = new StringBuilder();

    foreach (char ch in normalized)
    {
        // Keep everything except the combining (non-spacing) marks.
        if (CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.NonSpacingMark)
        {
            builder.Append(ch);
        }
    }

    // Recompose whatever is left.
    return builder.ToString().Normalize(NormalizationForm.FormC);
}

string s1 = "Renato Núñez David DeJesús Edwin Encarnación";
string s2 = RemoveDiacritics(s1);
// s2 = "Renato Nunez David DeJesus Edwin Encarnacion"



Answer 3:


Use \p{L} instead of the class \w.

\p{L} matches any Unicode code point in the category "letter", so it includes, for example, "äöüéè" and so on.

You can also use it inside your own character class. For example, to also allow the space and the dot: [\p{L} .]
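
As a rough illustration of the custom class above, here is a minimal sketch (top-level statements, so C# 9 or later is assumed). The file names and the second pattern are hypothetical additions: \p{L} alone does not cover combining marks, so decomposed text (a base letter followed by U+0301) needs \p{M} as well.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical file-name pattern: letters, spaces, and dots only.
var lettersOnly = new Regex(@"^[\p{L} .]+$");

// Variant that also accepts combining marks (\p{M}), so decomposed
// text such as "e" followed by U+0301 is accepted as well.
var lettersAndMarks = new Regex(@"^[\p{L}\p{M} .]+$");

string composed = "r\u00E9sum\u00E9.txt";      // é stored as one code point
string decomposed = "re\u0301sume\u0301.txt";  // e followed by a combining acute accent

Console.WriteLine(lettersOnly.IsMatch(composed));       // True
Console.WriteLine(lettersOnly.IsMatch(decomposed));     // False
Console.WriteLine(lettersAndMarks.IsMatch(decomposed)); // True
```

Whether the extra \p{M} matters depends on where the file names come from; names typed on most keyboards are usually already in composed form.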

Update:

OK, I realized that \w in .NET also includes the Unicode letters, not only the ASCII ones.

So I am not sure what you are asking. If you want to allow stuff that just looks like a letter, but isn't, then I think you will end up using \S (not a whitespace).

Maybe it helps if you show some examples.




Answer 4:


Try this:

 String pattern = @"[\p{L}\w]+"; 



Answer 5:


Can you try this and see if it works:

[\u00E9-\u00F8\w]



Answer 6:


Don't shoot me down for this, but if you're just trying to match a filename, then why not go the other way and use excluded characters?

 [^<>:"/\\|?*]
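
A minimal sketch of this exclusion approach follows (top-level statements, so C# 9 or later is assumed). The anchors, the + quantifier, and the sample names are additions for illustration; note that the backslash is written as \\ inside the class so that the backslash itself is excluded, not just the pipe.

```csharp
using System;
using System.Text.RegularExpressions;

// Characters commonly rejected in Windows file names: < > : " / \ | ? *
// In a verbatim string, "" is one double quote and \\ is a regex-escaped backslash.
var validName = new Regex(@"^[^<>:""/\\|?*]+$");

Console.WriteLine(validName.IsMatch("Encarnación 2019.txt")); // True
Console.WriteLine(validName.IsMatch("bad|name.txt"));          // False
Console.WriteLine(validName.IsMatch(@"dir\file.txt"));         // False
```

Because the class is defined by what it rejects, accented letters pass without being listed, which is the point of going "the other way".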



Answer 7:


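Did you try . (the dot)? It matches any single character except a newline character.

\w matches any word character, including the underscore. According to the cheat sheet linked below it is equivalent to "[A-Za-z0-9_]", which would explain why accented letters are excluded. (In .NET that equivalence only holds with RegexOptions.ECMAScript; by default \w also matches Unicode letters.)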

http://www.mikesdotnetting.com/Article/46/CSharp-Regular-Expressions-Cheat-Sheet
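
To see when \w really is limited to "[A-Za-z0-9_]" in .NET, the following minimal sketch compares the default behaviour with RegexOptions.ECMAScript (top-level statements, so C# 9 or later is assumed; the sample name is illustrative).

```csharp
using System;
using System.Text.RegularExpressions;

string name = "Núñez";

// Default .NET behaviour: \w is Unicode-aware.
string unicodeMatch = Regex.Match(name, @"\w+").Value;

// ECMAScript mode: \w is restricted to [a-zA-Z_0-9].
string asciiMatch = Regex.Match(name, @"\w+", RegexOptions.ECMAScript).Value;

Console.WriteLine(unicodeMatch); // Núñez
Console.WriteLine(asciiMatch);   // N
```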



Source: https://stackoverflow.com/questions/6664582/regex-accent-insensitive
