Special sequences (character classes) in Python RegEx are escapes like \w or \d that match a set of characters.
In my case, I need to b
You can exclude classes using a negative lookahead assertion, such as r'(?!\d)[\w]'
to match a word character, excluding digits. For example:
>>> re.search(r'(?!\d)[\w]', '12bac')
<_sre.SRE_Match object at 0xb7779218>
>>> _.group(0)
'b'
To exclude an arbitrary set of characters, you can use the usual [...] syntax inside the lookahead assertion; for example, r'(?![0-5])[\w]' matches any word character (letter, digit or underscore) except the digits 0-5.
As with [...], the above construct matches a single character. To match multiple characters, add a repetition operator:
>>> re.search(r'((?!\d)[\w])+', '12bac15')
<_sre.SRE_Match object at 0x7f44cd2588a0>
>>> _.group(0)
'bac'
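As a further sketch of the same idea, the lookahead can hold a narrower set, and the whole construct can be wrapped in a non-capturing group before repeating it. The sample strings and variable names below are made up for illustration:

```python
import re

# Any word character except the digits 0-5 (note that \w also includes "_").
single = re.compile(r'(?![0-5])\w')
# Non-capturing group, so findall returns whole runs instead of the last character.
runs = re.compile(r'(?:(?![0-5])\w)+')

single_hits = single.findall('a1b6_9')
run_hits = runs.findall('ab12cd789')
print(single_hits)
print(run_hits)
```

The lookahead is checked before every character, so a run stops as soon as an excluded digit appears.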
You can use r"[^\W\d]", i.e. invert the union of non-word characters and digits.
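A minimal sketch of this negated-set approach, using only the standard re module. The input strings and names are illustrative assumptions:

```python
import re

# [^\W\d]: neither a non-word character nor a digit,
# i.e. a word character that is not a digit (letters and "_").
non_digit_word = re.compile(r'[^\W\d]+')

first = non_digit_word.search('12bac15').group(0)
all_runs = non_digit_word.findall('12bac15_x')
print(first)
print(all_runs)
```

Because this is an ordinary set, it takes a repetition operator directly and needs no lookahead.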
I don't think you can directly combine (boolean and) character sets in a single regex, whether one is negated or not. Otherwise you could simply have combined [^\d] and \w.
Note: the ^ has to be at the start of the set, and applies to the whole set. From the docs: "If the first character of the set is '^', all the characters that are not in the set will be matched."
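The rule quoted from the docs can be checked with a small experiment. This is only a sketch, and the sample string is arbitrary:

```python
import re

sample = 'a1^'
# Caret first: the whole set is negated, so everything except digits matches.
negated = re.findall(r'[^\d]', sample)
# Caret anywhere else: it is a literal "^", so this matches a digit or a caret.
literal = re.findall(r'[\d^]', sample)
print(negated)
print(literal)
```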
Your set [\w^\d] does not subtract anything. A caret that is not the first character of a set is a literal, so the set matches a single character that is a word character, a literal caret, or a digit. Digits are therefore still matched.
I would do it in two steps, effectively combining the regular expressions. First match a run of non-digits (inner regex), then match the alphanumeric characters within that result:
re.search(r'\w+', re.search(r'([^\d]+)', s).group(0)).group(0)
or variations to this theme.
Note that you would need to wrap this in a try: except: block, as it raises AttributeError: 'NoneType' object has no attribute 'group' if either of the two regexes fails to match. You can, of course, split this single line into several lines.
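One way to split the one-liner up, sketched here with explicit None checks instead of try/except. The function name first_word_in_non_digits is hypothetical and not part of any library:

```python
import re

def first_word_in_non_digits(s):
    """Two-step match; returns None instead of raising when a step fails."""
    # Step 1: the first run of non-digit characters.
    non_digits = re.search(r'[^\d]+', s)
    if non_digits is None:
        return None
    # Step 2: the first run of word characters inside that run.
    word = re.search(r'\w+', non_digits.group(0))
    return word.group(0) if word else None

print(first_word_in_non_digits('12bac15'))
print(first_word_in_non_digits('12345'))
```

Returning None keeps the caller in control; catching AttributeError around the original one-liner would work equally well.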
You cannot subtract character classes, no.
Your best bet is to use the newer third-party regex module (available on PyPI), which was proposed as a replacement for the current re module in Python. It supports character classes based on Unicode properties:
\p{IsAlphabetic}
This will match any character that the Unicode specification states is an alphabetic character.
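Even better, regex does support character class subtraction; it views such classes as sets and allows you to create a difference with the -- operator:
[\w--\d]
matches everything in \w except anything that also matches \d. Set operations belong to the module's version 1 behaviour, so you may need to enable it with the regex.V1 flag or an inline (?V1).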
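A hedged sketch of the set difference in use. It assumes the third-party regex package is installed (for example via pip install regex) and enables its version 1 behaviour, which is what turns on set operations. The helper name non_digit_word_runs and the standard-library fallback are illustrative additions, not part of either module:

```python
import re

try:
    import regex  # third-party package, assumed to be installed
except ImportError:
    regex = None

def non_digit_word_runs(text):
    """Return runs of word characters that are not digits."""
    if regex is not None:
        # regex.V1 enables set operations such as "--" (difference).
        return regex.findall(r'[\w--\d]+', text, flags=regex.V1)
    # Fallback with the standard library: the equivalent negated set.
    return re.findall(r'[^\W\d]+', text)

print(non_digit_word_runs('12bac15'))
```

The fallback is there only so the snippet runs without the extra dependency; both branches describe the same set of characters.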