Regex Latin characters filter and non-Latin characters filter

萝らか妹 submitted on 2019-12-04 15:19:29

The four "Latin" blocks at the start of the Unicode range are (from http://www.fileformat.info/info/unicode/block/index.htm):

Basic Latin U+0000 - U+007F

Latin-1 Supplement U+0080 - U+00FF

Latin Extended-A U+0100 - U+017F

Latin Extended-B U+0180 - U+024F

So a Regex to "include" all of them would be:

Regex.Match(line.Line, @"[\u0000-\u024F]+", RegexOptions.None);

while a Regex to catch anything outside the block would be:

Regex.Match(line.Line, @"[^\u0000-\u024F]+", RegexOptions.None);
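As a rough sketch of how the two patterns can be used as filters, the helper below wraps them in a small static class. The class and method names (LatinBlockFilter, ContainsNonLatin, IsLatinOnly, FirstLatinRun) are made up for illustration, and treating an empty string as "not Latin-only" is an assumption, not something the patterns themselves dictate.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class LatinBlockFilter
{
    // One or more characters from Basic Latin through Latin Extended-B.
    private static readonly Regex LatinRun =
        new Regex(@"[\u0000-\u024F]+", RegexOptions.None);

    // One or more characters outside those four blocks.
    private static readonly Regex NonLatinRun =
        new Regex(@"[^\u0000-\u024F]+", RegexOptions.None);

    // True when the text has at least one character outside the four blocks.
    public static bool ContainsNonLatin(string text) => NonLatinRun.IsMatch(text);

    // True when the text is non-empty and every character is inside the blocks.
    // (Rejecting the empty string is an assumption made for this sketch.)
    public static bool IsLatinOnly(string text) =>
        text.Length > 0 && !NonLatinRun.IsMatch(text);

    // Returns the first run of in-block characters, or "" if there is none.
    public static string FirstLatinRun(string text) => LatinRun.Match(text).Value;
}
```

Note that FirstLatinRun also returns spaces, digits and punctuation, because those characters belong to the Basic Latin block as well.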

Note that I feel that building a regex "by block" is a little wrong, especially with the Latin blocks, because, for example, the Basic Latin block contains control characters (such as newline), letters (A-Z, a-z), digits (0-9), punctuation (.,;:...), other symbols ($@/&...) and so on.
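To make that caveat concrete, here is a minimal sketch that compares the plain block range with a stricter variant accepting only letters (\p{L}) that fall inside the same range. The lookahead-based pattern is one possible way to combine the two conditions, not the only one, and the names LatinLetterFilter, IsInBlocks and IsLatinLettersOnly are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class LatinLetterFilter
{
    // Whole string lies inside the four blocks
    // (control characters, digits and punctuation included).
    private static readonly Regex InBlocks =
        new Regex(@"\A[\u0000-\u024F]+\z");

    // Whole string consists of letters (\p{L}) that also lie inside the blocks.
    // The lookahead restricts \p{L} to the U+0000-U+024F range.
    private static readonly Regex LettersInBlocks =
        new Regex(@"\A(?:(?=[\u0000-\u024F])\p{L})+\z");

    public static bool IsInBlocks(string text) => InBlocks.IsMatch(text);

    public static bool IsLatinLettersOnly(string text) => LettersInBlocks.IsMatch(text);
}
```

With this sketch, a string such as "a1\n;$" passes IsInBlocks but fails IsLatinLettersOnly, which is exactly the difference described above.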

As for the meaning of RegexOptions.None and RegexOptions.IgnoreCase:

  • Their names are quite clear

  • You could try searching for them on MSDN

From https://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regexoptions.aspx:

RegexOptions.None: Specifies that no options are set

RegexOptions.IgnoreCase: Specifies case-insensitive matching.

The latter means that Regex.Match(line.Line, @"ABC", RegexOptions.IgnoreCase) will match ABC, Abc, abc, and so on. The option also applies to character ranges: with it, [A-Z] matches both A-Z and a-z. Note that it is probably useless in this case, because the blocks I suggested should already contain both the uppercase and the lowercase form of the letters that have both.
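The effect of the two options can be checked with a short sketch like the one below. The class and method names (IgnoreCaseDemo, MatchesAbc, IsUpperRange) are invented for this example, and the ^...$ anchors are added only so that each call tests the whole input.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of any library.
public static class IgnoreCaseDemo
{
    // Literal pattern: with IgnoreCase, "ABC" also matches "Abc", "abc", ...
    public static bool MatchesAbc(string text, RegexOptions options) =>
        Regex.IsMatch(text, @"^ABC$", options);

    // Range pattern: with IgnoreCase, [A-Z] also accepts a-z.
    public static bool IsUpperRange(string text, RegexOptions options) =>
        Regex.IsMatch(text, @"^[A-Z]+$", options);
}
```

Calling IgnoreCaseDemo.MatchesAbc("abc", RegexOptions.None) and then the same call with RegexOptions.IgnoreCase shows the difference between the two options.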
