Question
Regular expression languages define \w to include A..Z, a..z, 0..9, and _, and \b is defined as a word boundary.
How can I write a regular expression that matches all valid Spanish words, including characters such as: á, í, ó, é, ñ, etc.?
I'm using .NET.
Answer 1:
Use a Spanish locale and make your regex locale-sensitive.
Answer 2:
Your regex system should have something equivalent to Python's re.L (aka re.LOCALE) to make a regex locale-dependent, so that what counts as a word character changes with the locale, as do "word boundaries" and so on. Or are you instead asking for a way to compensate for a regex system that does not support locales, trying to force the issue anyway?
Answer 3:
This depends heavily on the language (and regex engine) you're using.
In Perl, \w matches all word characters, regardless of language or alphabet, and something like /\b(\w+)\b/ would (probably) match Spanish words as well as English words or Russian words.
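To illustrate what a Unicode-aware \w looks like in practice, here is a minimal sketch in Python 3 rather than Perl, since Python's \w on str patterns is also Unicode-aware by default. The helper name find_words and the sample sentences are made up for this example.

```python
import re

# In Python 3, \w on str patterns is Unicode-aware by default,
# so accented letters and ñ count as word characters.
WORD = re.compile(r"\b\w+\b")


def find_words(text):
    """Return every run of word characters in text (hypothetical helper)."""
    return WORD.findall(text)


print(find_words("El niño comió en el jardín"))
print(find_words("Привет, мир"))
```

Note that \w also accepts digits and underscores, so this matches "word-like tokens" rather than strictly valid Spanish words.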
In languages using PCRE, \w (and therefore probably \b) does NOT match Unicode characters. You will probably need to build your own set. I suggest something like [\wáéíóúñ] (matches all word characters, plus the accented characters you want). Note that the PCRE library has to be built with Unicode support before this will even work.
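The "build your own set" idea can be sketched as follows. This is an illustration in Python, not PCRE: the re.ASCII flag is used only to imitate an engine whose \w covers ASCII alone. The names SPANISH_EXTRA, SPANISH_WORD and spanish_words are hypothetical, and the letter list (which adds ü and the uppercase forms to the set suggested above) is an assumption about what you want to accept.

```python
import re
import unicodedata

# Spanish letters outside ASCII; NFC keeps each one as a single code point.
SPANISH_EXTRA = unicodedata.normalize("NFC", "áéíóúüñÁÉÍÓÚÜÑ")

# re.ASCII restricts \w to [a-zA-Z0-9_], imitating an engine without
# Unicode-aware \w. The extra letters are listed explicitly in the class.
SPANISH_WORD = re.compile("[\\w" + SPANISH_EXTRA + "]+", re.ASCII)


def spanish_words(text):
    """Return runs of ASCII word characters plus the listed Spanish letters."""
    return SPANISH_WORD.findall(unicodedata.normalize("NFC", text))


print(spanish_words("El pingüino comió ñoquis"))
# With an ASCII-only \w and no explicit set, the ñ splits the word:
print(re.findall(r"\w+", "niño", re.ASCII))
```

The sketch deliberately avoids \b: in an engine where \w is ASCII-only, \b would see a boundary next to ñ or an accented vowel, so matching a run of the custom class is the safer choice.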
If you're using something else, good luck. Some regex engines don't even support Unicode.
Source: https://stackoverflow.com/questions/896374/what-is-the-regular-expression-for-a-spanish-word