python-re: How do I match an alpha character

借酒劲吻你 2020-11-30 08:42

How can I match an alpha character with a regular expression? I want a character that is in \w but is not in \d. I want it unicode compatible tha

3 Answers
  • 2020-11-30 08:58

    Your first two sentences contradict each other. "in \w but is not in \d" includes underscore. I'm assuming from your third sentence that you don't want underscore.

    Using a Venn diagram on the back of an envelope helps. Let's look at what we DON'T want:

    (1) characters that are not matched by \w (i.e. don't want anything that's not alpha, digits, or underscore) => \W
    (2) digits => \d
    (3) underscore => _

    So what we don't want is anything in the character class [\W\d_], and consequently what we do want is anything in the character class [^\W\d_].

    Here's a simple example (Python 2.6).

    >>> import re
    >>> rx = re.compile(r"[^\W\d_]+", re.UNICODE)
    >>> rx.findall(u"abc_def,k9")
    [u'abc', u'def', u'k']
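
On Python 3 the same character class should work without the u prefixes or re.UNICODE, since str patterns are Unicode-aware by default. A minimal sketch; the name LETTERS and the second sample string are illustrative and not part of the original answer:

```python
import re

# [^\W\d_] reads as "a word character that is neither a digit nor an underscore".
# str patterns are Unicode-aware by default in Python 3, so no flag is passed.
LETTERS = re.compile(r"[^\W\d_]+")

ascii_runs = LETTERS.findall("abc_def,k9")
# Precomposed accented letters, written as escapes to avoid encoding surprises.
accented_runs = LETTERS.findall("rich\u00e9_\u00f1and\u00fa42")

print(ascii_runs)
print(accented_runs)
```

The first call reproduces the Python 2.6 result above; the second shows that accented letters stay inside a run while the underscore and the digits split or end it.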
    

    Further exploration reveals a few quirks of this approach:

    >>> import unicodedata as ucd
    >>> allsorts =u"\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021"
    >>> for x in allsorts:
    ...     print repr(x), ucd.category(x), ucd.name(x)
    ...
    u'\u0473' Ll CYRILLIC SMALL LETTER FITA
    u'\u0660' Nd ARABIC-INDIC DIGIT ZERO
    u'\u06c9' Lo ARABIC LETTER KIRGHIZ YU
    u'\u24e8' So CIRCLED LATIN SMALL LETTER Y
    u'\u4e0a' Lo CJK UNIFIED IDEOGRAPH-4E0A
    u'\u3020' So POSTAL MARK FACE
    u'\u3021' Nl HANGZHOU NUMERAL ONE
    >>> rx.findall(allsorts)
    [u'\u0473', u'\u06c9', u'\u4e0a', u'\u3021']
    

    U+3021 (HANGZHOU NUMERAL ONE) is treated as numeric (hence it matches \w), but it appears that Python interprets "digit" to mean "decimal digit" (category Nd), so it doesn't match \d.

    U+24E8 (CIRCLED LATIN SMALL LETTER Y) doesn't match \w.

    All CJK ideographs are classed as "letters" and thus match \w.

    Whether any of the above 3 points are a concern or not, that approach is the best you will get out of the re module as currently released. Syntax like \p{letter} is in the future.
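
If the first quirk matters (letter-like numerals such as U+3021 slipping through), one possible workaround is to filter on the Unicode general category instead of relying on \w. This is only a sketch under that assumption; is_letter and letter_runs are hypothetical helper names, not part of re or unicodedata:

```python
import re
import unicodedata


def is_letter(ch: str) -> bool:
    """True if ch is in a Unicode letter category (Lu, Ll, Lt, Lm or Lo)."""
    return unicodedata.category(ch).startswith("L")


def letter_runs(text: str) -> list:
    """Return the maximal runs of letters in text, scanning char by char."""
    runs, current = [], []
    for ch in text:
        if is_letter(ch):
            current.append(ch)
        elif current:
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs


# Same sample characters as in the session above.
allsorts = "\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021"
regex_hits = re.findall(r"[^\W\d_]", allsorts)
category_hits = [ch for ch in allsorts if is_letter(ch)]

print([ascii(ch) for ch in regex_hits])
print([ascii(ch) for ch in category_hits])
```

The category filter drops U+3021 (category Nl), which the character class keeps. It is slower than a compiled pattern, so it is probably best kept for cases where the distinction actually matters.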

  • 2020-11-30 08:59

    You can use one of the following expressions to match a single letter:

    (?![\d_])\w
    

    or

    \w(?<![\d_])
    

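    Here I match \w, but assert that the character is not in [\d_]: the lookahead checks the character before \w consumes it, while the lookbehind checks it right after \w has consumed it.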
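
A short sketch of both expressions on Python 3, reusing the sample string from the first answer. The variable names and the repeated form (?:(?![\d_])\w)+ for whole runs are additions for illustration:

```python
import re

text = "abc_def,k9"

# Lookahead form: refuse the position if the next char is a digit or "_".
lookahead = re.compile(r"(?![\d_])\w")
# Lookbehind form: consume \w, then refuse if what was consumed is a digit or "_".
lookbehind = re.compile(r"\w(?<![\d_])")
# Wrapping the single-letter pattern in a repeated group matches whole runs.
lookahead_runs = re.compile(r"(?:(?![\d_])\w)+")

single_ahead = lookahead.findall(text)
single_behind = lookbehind.findall(text)
runs = lookahead_runs.findall(text)

print(single_ahead)
print(single_behind)
print(runs)
```

Both single-character forms select the same characters; they differ only in whether the check happens before or after \w consumes the character.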

    From the docs:

    (?!...)
    Matches if ... doesn’t match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it’s not followed by 'Asimov'.
    
    (?<!...)
    Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length and shouldn’t contain group references. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.
    
  • 2020-11-30 09:00

    What about:

    \p{L}
    

    You can use this document as a reference: Unicode Regular Expressions

    EDIT: It seems Python's re module doesn't handle Unicode property expressions such as \p{L}. Take a look at this link: Handling Accented Characters with Python Regular Expressions -- [A-Z] just isn't good enough (no longer active, link to internet archive)
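
To illustrate the edit: on current Python 3 releases, as far as I know, re.compile rejects \p{L} with re.error, so code can probe for it and fall back to the [^\W\d_] class. The third-party regex package does accept \p{L}, but it is left out here to keep the sketch stdlib-only. compile_letter_pattern is a hypothetical helper name:

```python
import re


def compile_letter_pattern():
    """Return (pattern, used_property_syntax) for runs of letters.

    Tries the \\p{L} property syntax first and falls back to the
    portable [^\\W\\d_] class if the re module rejects it.
    """
    try:
        return re.compile(r"\p{L}+"), True
    except re.error:
        return re.compile(r"[^\W\d_]+"), False


pattern, used_property_syntax = compile_letter_pattern()
words = pattern.findall("rich\u00e9 42 na\u00efve")

print(used_property_syntax)
print(words)
```

Whichever branch is taken, the digits are skipped and the two accented words come back whole.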

    Other references:

    • re.UNICODE
    • python and regular expression with unicode
    • Unicode Technical Standard #18: Unicode Regular Expressions

    For posterity, here are the examples on the blog:

    import re
    string = 'riché'
    print string
    riché
    
    richre = re.compile('([A-z]+)')
    match = richre.match(string)
    print match.groups()
    ('rich',)
    
    richre = re.compile('(\w+)',re.LOCALE)
    match = richre.match(string)
    print match.groups()
    ('rich',)
    
    richre = re.compile('([é\w]+)')
    match = richre.match(string)
    print match.groups()
    ('rich\xe9',)
    
    richre = re.compile('([\xe9\w]+)')
    match = richre.match(string)
    print match.groups()
    ('rich\xe9',)
    
    richre = re.compile('([\xe9-\xf8\w]+)')
    match = richre.match(string)
    print match.groups()
    ('rich\xe9',)
    
    string = 'richéñ'
    match = richre.match(string)
    print match.groups()
    ('rich\xe9\xf1',)
    
    richre = re.compile('([\u00E9-\u00F8\w]+)')
    print match.groups()
    ('rich\xe9\xf1',)
    
    matched = match.group(1)
    print matched
    richéñ
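
The blog examples above are Python 2 byte-string code. For comparison, a minimal Python 3 sketch: there \w already covers accented letters for str patterns, and the old ASCII-only behaviour has to be requested with re.ASCII. The variable names are illustrative:

```python
import re

# "richéñ" written with escapes so the source encoding does not matter.
string = "rich\u00e9\u00f1"

# Default for str patterns in Python 3: \w is Unicode-aware.
unicode_match = re.match(r"(\w+)", string)
# re.ASCII restricts \w to [a-zA-Z0-9_], so the match stops before the accents.
ascii_match = re.match(r"(\w+)", string, re.ASCII)

print(unicode_match.group(1))
print(ascii_match.group(1))
```

Note that \w still includes digits and underscore, so this shows the Unicode behaviour only; it does not replace the [^\W\d_] class for matching letters.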
    