Perl and some other current regex engines support Unicode properties, such as the general category, in a regex. For example, in Perl you can use \p{Ll} to match an arbitrary lowercase letter.
Speaking of homegrown solutions, some time ago I wrote a small program to do just that: convert a Unicode category written as \p{...} into a range of values extracted from the Unicode specification (v5.0.0). Only categories are supported (e.g. L, Zs), and the ranges are restricted to the BMP. I'm posting it here in case someone finds it useful (although Oniguruma really seems like a better option).
Example usage:
>>> from unicode_hack import regex
>>> pattern = regex(r'^\\p{Lu}(\\p{L}|\\p{N}|_)*')
>>> print pattern.match(u'疂_1+2').group(0)
疂_1
>>>
Here's the source. There is also a JavaScript version, using the same data.
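For readers who only want the general idea, the conversion described above can be sketched with nothing but the standard library. This is not the code of unicode_hack: it is a minimal Python 3 sketch that assumes the category data comes from the interpreter's own unicodedata module (whatever Unicode version that ships with, not 5.0.0). The names category_ranges, char_class_body and expand are hypothetical helpers made up for this illustration.

```python
import functools
import re
import unicodedata

BMP_MAX = 0xFFFF  # assumption: stay inside the BMP, as the tool above does


@functools.lru_cache(maxsize=None)
def category_ranges(prefix):
    # Collect inclusive (start, end) code point ranges whose general
    # category starts with `prefix` ("L" covers Lu, Ll, Lt, Lm, Lo).
    ranges = []
    start = None
    for cp in range(BMP_MAX + 1):
        in_cat = unicodedata.category(chr(cp)).startswith(prefix)
        if in_cat and start is None:
            start = cp
        elif not in_cat and start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, BMP_MAX))
    return tuple(ranges)


def char_class_body(prefix):
    # Render the ranges as \uXXXX escapes so that characters such as
    # "]" or "-" never need special handling inside the class.
    ranges = category_ranges(prefix)
    if not ranges:
        raise ValueError('no BMP characters in category %r' % prefix)
    parts = []
    for start, end in ranges:
        if start == end:
            parts.append('\\u%04x' % start)
        else:
            parts.append('\\u%04x-\\u%04x' % (start, end))
    return ''.join(parts)


_PROPERTY = re.compile(r'\\p\{(\w+)\}')


def expand(pattern):
    # Replace each \p{X} with an explicit character class.
    # Limitation of this sketch: \p{X} must not already sit inside [...].
    return _PROPERTY.sub(
        lambda m: '[' + char_class_body(m.group(1)) + ']', pattern)


pattern = re.compile(expand(r'^\p{Lu}(\p{L}|\p{N}|_)*'))
print(pattern.match('Éa_1+2').group(0))  # Éa_1
```

Scanning all 65,536 BMP code points takes a moment, which is why the ranges are cached per category. A precomputed table, as in the program above, avoids that cost and pins the Unicode version.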