Python regex matching Unicode properties

我寻月下人不归 2020-11-22 14:49

Perl and some other current regex engines support Unicode properties, such as the category, in a regex. E.g. in Perl you can use \p{Ll} to match an arbitrary l

6 answers
  • 2020-11-22 14:49

    You're right that Unicode property classes are not supported by the Python regex parser.

    If you wanted to do a nice hack that would be generally useful, you could create a preprocessor that scans a string for such class tokens (\p{M} or whatever) and replaces them with the corresponding character sets, so that, for example, \p{M} would become something like [\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F], and \P{M} would become [^\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F].

    People would thank you. :)
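
A minimal sketch of such a preprocessor, assuming Python 3 and only general-category names (M, Ll, Zs and so on). The helper names `category_ranges` and `expand_properties` are hypothetical, and the ranges are computed from the standard `unicodedata` module rather than hard-coded, so they depend on the Unicode version shipped with the interpreter. It does not handle a token that already sits inside a character class (such as `[\p{L}_]`) or one preceded by an escaped backslash.

```python
import re
import sys
import unicodedata
from functools import lru_cache


def _fmt_range(start, end):
    # Emit \UXXXXXXXX escapes, which the re module understands in str patterns.
    if start == end:
        return "\\U%08X" % start
    return "\\U%08X-\\U%08X" % (start, end)


@lru_cache(maxsize=None)
def category_ranges(prefix):
    """Build a character-class body (without brackets) for a category prefix."""
    parts = []
    start = prev = None
    # Scan every code point once; the result is cached per prefix.
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)).startswith(prefix):
            if start is None:
                start = cp
            prev = cp
        elif start is not None:
            parts.append(_fmt_range(start, prev))
            start = None
    if start is not None:
        parts.append(_fmt_range(start, prev))
    return "".join(parts)


_TOKEN = re.compile(r"\\([pP])\{(\w+)\}")


def expand_properties(pattern):
    """Replace \\p{X} and \\P{X} tokens with explicit character classes."""
    def repl(match):
        body = category_ranges(match.group(2))
        if not body:
            raise ValueError("unknown category: %s" % match.group(2))
        negate = "^" if match.group(1) == "P" else ""
        return "[" + negate + body + "]"
    return _TOKEN.sub(repl, pattern)


if __name__ == "__main__":
    compiled = re.compile(expand_properties(r"\p{Lu}\p{Ll}+"))
    print(compiled.fullmatch("Ärger") is not None)
```

The first call for each category scans the whole code point range, so it is noticeably slow; the cache makes repeated use cheap.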

  • 2020-11-22 15:11

    You can painstakingly use unicodedata on each character:

    import unicodedata
    
    def strip_accents(x):
        # Decompose to NFD, then drop nonspacing marks (general category Mn).
        return u''.join(c for c in unicodedata.normalize('NFD', x) if unicodedata.category(c) != 'Mn')
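
The same per-character approach can stand in for a pattern like \p{Ll}+ without any regex at all. This is only a sketch, assuming Python 3; `find_category_runs` is a hypothetical helper name, and it treats its second argument as a general-category prefix.

```python
import unicodedata
from itertools import groupby


def find_category_runs(text, prefix):
    """Return maximal runs of characters whose category starts with prefix."""
    runs = []
    # groupby splits the text wherever the "is in category" answer flips.
    for matches, group in groupby(
        text, key=lambda c: unicodedata.category(c).startswith(prefix)
    ):
        if matches:
            runs.append("".join(group))
    return runs


if __name__ == "__main__":
    print(find_category_runs("Hello Wörld", "Ll"))  # ['ello', 'örld']
```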
    
  • 2020-11-22 15:12

    Speaking of homegrown solutions, some time ago I wrote a small program to do just that: convert a Unicode category written as \p{...} into a range of values, extracted from the Unicode specification (v.5.0.0). Only categories are supported (e.g. L, Zs), and it is restricted to the BMP. I'm posting it here in case someone finds it useful (although Oniguruma really seems a better option).

    Example usage:

    >>> from unicode_hack import regex
    >>> pattern = regex(r'^\\p{Lu}(\\p{L}|\\p{N}|_)*')
    >>> print pattern.match(u'疂_1+2').group(0)
    疂_1
    >>>
    

    Here's the source. There is also a JavaScript version, using the same data.

  • 2020-11-22 15:13

    The regex module (an alternative to the standard re module) supports Unicode codepoint properties with the \p{} syntax.
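
A short sketch of what that looks like, assuming the third-party `regex` package is installed (for example with `pip install regex`). The wrapper `lowercase_runs` is a hypothetical name used only for illustration; the import is guarded so the snippet still loads when the package is missing.

```python
try:
    import regex  # third-party package, not part of the standard library
except ImportError:
    regex = None


def lowercase_runs(text):
    """Find runs of lowercase letters using the \\p{Ll} property class."""
    if regex is None:
        raise RuntimeError("the third-party 'regex' package is not installed")
    return regex.findall(r"\p{Ll}+", text)


if __name__ == "__main__":
    if regex is not None:
        print(lowercase_runs("Hello Wörld"))
```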

  • 2020-11-22 15:16

    Have you tried Ponyguruma, a Python binding to the Oniguruma regular expression engine? In that engine you can simply say \p{Armenian} to match Armenian characters. \p{Ll} or \p{Zs} work too.

  • 2020-11-22 15:16

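    Note that while \p{Ll} has no equivalent in Python regular expressions, \p{Zs} is roughly covered by '(?u)\s'. The (?u) flag, as the docs say, makes "\w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database", and \s matches any whitespace character. That is a superset of Zs, since it also matches tabs and line breaks. In Python 3, str patterns are Unicode-aware by default, so the flag is redundant there.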
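
To see how \s relates to the Zs category, one can compare the two per character. This is a small sketch assuming Python 3, where str patterns are Unicode-aware without (?u); `whitespace_report` is a hypothetical helper name.

```python
import re
import unicodedata


def whitespace_report(chars):
    """Return (code point, general category, matched by the whitespace class)."""
    report = []
    for ch in chars:
        matched = re.fullmatch(r"\s", ch) is not None
        report.append(("U+%04X" % ord(ch), unicodedata.category(ch), matched))
    return report


if __name__ == "__main__":
    # Space, no-break space, ideographic space, tab, zero width space.
    for row in whitespace_report(" \u00A0\u3000\t\u200B"):
        print(row)
```

The tab is matched although its category is Cc rather than Zs, while the zero width space (category Cf) is not matched.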
