Python regex matching Unicode properties

我寻月下人不归 2020-11-22 14:49

Perl and some other current regex engines support Unicode properties, such as the category, in a regex. E.g. in Perl you can use \p{Ll} to match an arbitrary l

6 answers
  •  情歌与酒
    2020-11-22 15:12

    Speaking of homegrown solutions, some time ago I wrote a small program to do just that: it converts a Unicode category written as \p{...} into a range of values extracted from the Unicode specification (v.5.0.0). Only categories are supported (e.g. L, Zs), and it is restricted to the BMP. I'm posting it here in case someone finds it useful (although Oniguruma really seems a better option).

    Example usage:

    >>> from unicode_hack import regex
    >>> pattern = regex(r'^\\p{Lu}(\\p{L}|\\p{N}|_)*')
    >>> print(pattern.match(u'疂_1+2').group(0))
    疂_1
    

    Here's the source. There is also a JavaScript version, using the same data.
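
The idea described above, expanding a \p{...} category into explicit code point ranges, can be sketched with Python 3's standard unicodedata module. This is not the unicode_hack source: the names category_ranges, char_class and expand are hypothetical, and the ranges come from whatever Unicode version the running Python bundles rather than from v.5.0.0. Like the original, it handles only general categories and only the BMP.

```python
import re
import unicodedata


def category_ranges(prefix, limit=0x10000):
    # Collect (start, end) runs of BMP code points whose general
    # category starts with `prefix` (so "L" covers Lu, Ll, Lo, ...).
    ranges = []
    start = None
    for cp in range(limit):
        matches = unicodedata.category(chr(cp)).startswith(prefix)
        if matches and start is None:
            start = cp
        elif not matches and start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, limit - 1))
    return ranges


def char_class(prefix):
    # Render the ranges as a bracketed regex character class.
    ranges = category_ranges(prefix)
    if not ranges:
        raise ValueError("unknown or empty category: %r" % prefix)
    parts = []
    for lo, hi in ranges:
        if lo == hi:
            parts.append("\\u%04x" % lo)
        else:
            parts.append("\\u%04x-\\u%04x" % (lo, hi))
    return "[" + "".join(parts) + "]"


def expand(pattern):
    # Replace every \p{X} in the pattern with its explicit class.
    # A function replacement is used so backslashes in the result
    # are inserted literally.
    return re.sub(r"\\p\{(\w+)\}", lambda m: char_class(m.group(1)), pattern)


pattern = re.compile(expand(r"^\p{Lu}(\p{L}|\p{N}|_)*"))
print(pattern.match("Éa_1+2").group(0))  # Éa_1
```

Scanning all 65,536 BMP code points on every call is slow but keeps the sketch short; caching the result per category would be the obvious refinement.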
