How can I match an alpha character with a regular expression? I want a character that is in \w but is not in \d. I want it to be Unicode compatible.
Your first two sentences contradict each other. "in \w but is not in \d" includes underscore. I'm assuming from your third sentence that you don't want underscore.
Using a Venn diagram on the back of an envelope helps. Let's look at what we DON'T want:
(1) characters that are not matched by \w (i.e. don't want anything that's not alpha, digits, or underscore) => \W
(2) digits => \d
(3) underscore => _
So what we don't want is anything in the character class [\W\d_]
and consequently what we do want is anything in the character class [^\W\d_]
Here's a simple example (Python 2.6).
>>> import re
>>> rx = re.compile("[^\W\d_]+", re.UNICODE)
>>> rx.findall(u"abc_def,k9")
[u'abc', u'def', u'k']
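For readers on Python 3, here is a minimal sketch of the same idea. It assumes str patterns, which are Unicode-aware by default, so the re.UNICODE flag is not needed. The names LETTER_RUN and find_letter_runs are made up for illustration.

```python
import re

# Assumption: Python 3, where \w and \d are Unicode-aware for str patterns.
# "Not a non-word character, not a digit, not an underscore" leaves letters
# (plus the quirks discussed in this answer).
LETTER_RUN = re.compile(r"[^\W\d_]+")


def find_letter_runs(text):
    """Return every maximal run of characters matched by [^\\W\\d_]."""
    return LETTER_RUN.findall(text)


print(find_letter_runs("abc_def,k9"))
```

The raw string (r"...") keeps the backslashes intact, which avoids invalid-escape warnings on recent Python versions.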
Further exploration reveals a few quirks of this approach:
>>> import unicodedata as ucd
>>> allsorts =u"\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021"
>>> for x in allsorts:
... print repr(x), ucd.category(x), ucd.name(x)
...
u'\u0473' Ll CYRILLIC SMALL LETTER FITA
u'\u0660' Nd ARABIC-INDIC DIGIT ZERO
u'\u06c9' Lo ARABIC LETTER KIRGHIZ YU
u'\u24e8' So CIRCLED LATIN SMALL LETTER Y
u'\u4e0a' Lo CJK UNIFIED IDEOGRAPH-4E0A
u'\u3020' So POSTAL MARK FACE
u'\u3021' Nl HANGZHOU NUMERAL ONE
>>> rx.findall(allsorts)
[u'\u0473', u'\u06c9', u'\u4e0a', u'\u3021']
U+3021 (HANGZHOU NUMERAL ONE) is treated as numeric (hence it matches \w) but it appears that Python interprets "digit" to mean "decimal digit" (category Nd) so it doesn't match \d.
U+24E8 (CIRCLED LATIN SMALL LETTER Y) doesn't match \w.
All CJK ideographs are classed as "letters" and thus match \w.
Whether any of the above 3 points are a concern or not, that approach is the best you will get out of the re module as currently released. Syntax like \p{letter} is in the future.
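If you want to check which characters fall into those quirk cases for your own data, something like the following might help. It is only a sketch, assuming Python 3; compare_with_category and LETTERISH are hypothetical names. It puts the regex verdict next to the Unicode general category, so any row where the two flags disagree is worth a closer look.

```python
import re
import unicodedata

# Assumption: Python 3 str pattern, Unicode-aware by default.
LETTERISH = re.compile(r"[^\W\d_]")


def compare_with_category(text):
    """For each character, return (char, category, regex_says_letter, category_says_letter)."""
    rows = []
    for ch in text:
        category = unicodedata.category(ch)
        regex_says_letter = LETTERISH.fullmatch(ch) is not None
        # General categories starting with "L" (Lu, Ll, Lt, Lm, Lo) are letters.
        category_says_letter = category.startswith("L")
        rows.append((ch, category, regex_says_letter, category_says_letter))
    return rows


for row in compare_with_category("\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021"):
    print(ascii(row[0]), row[1], row[2], row[3])
```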
You can use one of the following expressions to match a single letter:
(?![\d_])\w
or
\w(?<![\d_])
Here I match \w, but check that [\d_] is not matched before/after that.
From the docs:
(?!...)
Matches if ... doesn’t match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it’s not followed by 'Asimov'.
(?<!...)
Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length and shouldn’t contain group references. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.
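A quick way to convince yourself that both lookaround forms behave like the [^\W\d_] character class is to run all three over the same input. This is a sketch assuming Python 3; the constant names and single_letters are illustrative only.

```python
import re

# Assumption: Python 3 str patterns (Unicode-aware \w and \d).
LOOKAHEAD = re.compile(r"(?![\d_])\w")    # reject digit/underscore before consuming
LOOKBEHIND = re.compile(r"\w(?<![\d_])")  # consume, then reject digit/underscore
CHAR_CLASS = re.compile(r"[^\W\d_]")      # the double-negative class, for comparison


def single_letters(pattern, text):
    """Return each single character in text that the pattern accepts."""
    return pattern.findall(text)


# Latin letters, underscore, ASCII digit, e-acute, Arabic-Indic digit zero.
sample = "a_1b\u00e9\u0660"
for pattern in (LOOKAHEAD, LOOKBEHIND, CHAR_CLASS):
    print(pattern.pattern, single_letters(pattern, sample))
```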
What about:
\p{L}
You can use this document as a reference: Unicode Regular Expressions
EDIT: It seems Python's re module doesn't support Unicode property escapes like \p{L}. Take a look at this link: Handling Accented Characters with Python Regular Expressions -- [A-Z] just isn't good enough (no longer active, link to internet archive)
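Since re has no \p{L}, one possible stand-in that avoids regular expressions altogether is str.isalpha(), which is true exactly for characters whose general category is a letter category (Lu, Ll, Lt, Lm, Lo). A rough sketch, assuming Python 3; letter_runs is a hypothetical helper name:

```python
from itertools import groupby


def letter_runs(text):
    """Split text into maximal runs of letters, using str.isalpha as the test."""
    return [
        "".join(group)
        for is_letter, group in groupby(text, key=str.isalpha)
        if is_letter
    ]


print(letter_runs("abc_def,k9"))
```

Unlike the \w-based patterns, this rejects letter-like numerals such as U+3021, because their category is Nl rather than a letter category.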
For posterity, here are the examples on the blog:
>>> import re
>>> string = 'riché'
>>> print string
riché
>>> richre = re.compile('([A-z]+)')
>>> match = richre.match(string)
>>> print match.groups()
('rich',)
>>> richre = re.compile('(\w+)',re.LOCALE)
>>> match = richre.match(string)
>>> print match.groups()
('rich',)
>>> richre = re.compile('([é\w]+)')
>>> match = richre.match(string)
>>> print match.groups()
('rich\xe9',)
>>> richre = re.compile('([\xe9\w]+)')
>>> match = richre.match(string)
>>> print match.groups()
('rich\xe9',)
>>> richre = re.compile('([\xe9-\xf8\w]+)')
>>> match = richre.match(string)
>>> print match.groups()
('rich\xe9',)
>>> string = 'richéñ'
>>> match = richre.match(string)
>>> print match.groups()
('rich\xe9\xf1',)
>>> richre = re.compile('([\u00E9-\u00F8\w]+)')
>>> print match.groups()
('rich\xe9\xf1',)
>>> matched = match.group(1)
>>> print matched
richéñ
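The blog examples above are Python 2 and work on byte strings. As a point of comparison, here is a small sketch of how the same word behaves under Python 3 (an assumption, not what the blog used), where str patterns are Unicode-aware and no hand-written ranges like \xe9-\xf8 are needed. The accented letters are written as escapes so that they are precomposed code points; first_match is an illustrative helper name.

```python
import re

# "riche" with e-acute, then n-tilde, followed by an underscore and digits.
word = "rich\u00e9\u00f1_42"


def first_match(pattern, text):
    """Return the text matched at the start of the string, or None."""
    match = re.match(pattern, text)
    return match.group() if match else None


ascii_only = first_match(r"[A-Za-z]+", word)    # stops at the first accented letter
word_chars = first_match(r"\w+", word)          # also takes underscore and digits
letters_only = first_match(r"[^\W\d_]+", word)  # accented letters, nothing else

print(ascii_only, word_chars, letters_only)
```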