I\'m trying to match a specific substring in one string with regular expression, like matching \"\\ue04a\"
in \"\\ue04a abc\"
. But something seems to b
Backslashes in regular expressions in Python are extremely tricky. With regular strings (single or triple quotes) there are two passes of backslash interpretation: first, Python itself interprets backslashes (so "\t"
represents a single character, a literal tab) and then the result is passed to the regular expression engine, which has its own semantics for any remaining backslashes.
Generally, using r"\t"
is strongly recommended, because this removes the Python string parsing aspect. This string, with the r
prefix, undergoes no interpretation by Python -- every character in the string simply represents itself, including backslash. So this particular example represents a string of length two, containing the literal characters backslash \
and t
.
It's not clear from your question whether the target string "\ue04a abc"
should be interpreted as a string of length five containing the Unicode character U+E04A (which is in the Private Use Area, aka PUA, meaning it doesn't have any specific standard use) followed by space, a
, b
, c
-- in which case you would use something like
m = re.match(r'[\ue000-\uf8ff]', "\ue04a abc")
to capture any single code point in the traditional Basic Multilingual Plane PUA; -- or if you want to match a literal string which begins with the two characters backslash \
and u
, followed by four hex digits:
m = re.match(r'\\u[0-9a-fA-F]{4}', r"\ue04a abc")
where the former is how Python (and hence most Python programmers) would understand your question, but both interpretations are plausible.
The above show how to match the "mystery sequence" "\ue04a"
; it should not then be hard to extend the code to match a longer string containing this sequence.
This should help.
import re
m = re.match(r'(\\ue\d+[a-z]+)', r"\ue04a abc")
if m:
print( m.group() )
Output:
\ue04a