Regular expression with backslash in Python3


I'm trying to match a specific substring in one string with a regular expression, like matching "\ue04a" in "\ue04a abc". But something seems to b

2 Answers

鱼传尺愫

2021-01-24 15:51
Backslashes in regular expressions in Python are extremely tricky. With regular strings (single or triple quotes) there are two passes of backslash interpretation: first, Python itself interprets backslashes (so "\t" represents a single character, a literal tab) and then the result is passed to the regular expression engine, which has its own semantics for any remaining backslashes.

Generally, using a raw string such as r"\t" is strongly recommended, because this removes the Python string parsing aspect. With the r prefix, Python does no escape processing: every character in the string simply represents itself, including the backslash. So this particular example is a string of length two, containing the literal characters backslash \ and t, which the regular expression engine then interprets as a tab.
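A minimal sketch of the two passes, using only the standard `re` module; the variable names are illustrative and not part of the question:

```python
import re

regular = "\t"   # Python converts \t into a single tab character
raw = r"\t"      # raw string: two characters, backslash and t

print(len(regular))
print(len(raw))

# Both patterns match a tab. In the first, the regex engine receives a
# literal tab; in the second, the regex engine itself interprets \t.
tab_via_python = re.match(regular, "\tx")
tab_via_regex = re.match(raw, "\tx")

# To match a literal backslash followed by t, the regex engine needs an
# escaped backslash, which a raw string passes through unchanged.
literal = re.match(r"\\t", r"\tx")
print(literal.group())
```

The last pattern is the one to remember: the regex engine must see two backslashes to match one, and a raw string lets you type exactly what the engine should see.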

It's not clear from your question whether the target string "\ue04a abc" should be interpreted as a string of length five containing the Unicode character U+E04A (which is in the Private Use Area, aka PUA, meaning it doesn't have any specific standard use) followed by space, a, b, c -- in which case you would use something like
```
import re
# the re module itself understands \uXXXX escapes in patterns (Python 3.3+)
m = re.match(r'[\ue000-\uf8ff]', "\ue04a abc")
```
to capture any single code point in the traditional Basic Multilingual Plane PUA, or whether you want to match a literal string which begins with the two characters backslash \ and u, followed by four hex digits:
```
m = re.match(r'\\u[0-9a-fA-F]{4}', r"\ue04a abc")
```
where the former is how Python (and hence most Python programmers) would understand your question, but both interpretations are plausible.

The above shows how to match the "mystery sequence" "\ue04a"; it should not then be hard to extend the code to match a longer string containing this sequence.
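As a rough sketch of such an extension, here is how the full text "\ue04a abc" could be matched under each interpretation. The variable names are made up for illustration, and `re.escape` is just one convenient way to build the literal pattern:

```python
import re

# Interpretation 1: the text contains the real character U+E04A.
text_char = "\ue04a abc"               # 5 characters
m_char = re.search(r"\ue04a abc", text_char)

# Interpretation 2: the text contains a literal backslash, then "ue04a".
text_literal = r"\ue04a abc"           # 10 characters
m_literal = re.search(r"\\ue04a abc", text_literal)

# re.escape builds a pattern that matches the given text literally,
# so the backslash does not have to be doubled by hand.
m_escaped = re.search(re.escape(text_literal), text_literal)

# The two interpretations do not overlap: the literal-backslash pattern
# finds nothing in the text that holds the real character.
m_cross = re.search(r"\\ue04a abc", text_char)

print(len(text_char), len(text_literal))
print(m_char is not None, m_literal is not None, m_cross is None)
```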

没有蜡笔的小新

2021-01-24 15:54
This should help.
```
import re
# \\u matches a literal backslash and "u"; the rest matches "e", digits, then lowercase letters
m = re.match(r'(\\ue\d+[a-z]+)', r"\ue04a abc")
if m:
    print(m.group())
```
Output:
```
\ue04a
```
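Note that this pattern is fitted to this particular input. A small check, assuming the same literal-backslash interpretation, compares it with a pattern that accepts any four hex digits; the sample inputs are hypothetical:

```python
import re

specific = re.compile(r"\\ue\d+[a-z]+")       # the pattern used above
general = re.compile(r"\\u[0-9a-fA-F]{4}")    # \u followed by any four hex digits

# Hypothetical inputs, each starting with a literal backslash.
samples = [r"\ue04a abc", r"\uE04A abc", r"\u00e9 abc"]

# For each sample, record whether each pattern matches at the start.
results = {s: (bool(specific.match(s)), bool(general.match(s))) for s in samples}
for sample, (hit_specific, hit_general) in results.items():
    print(sample, hit_specific, hit_general)
```

Only the first sample satisfies the specific pattern, because it requires a lowercase "e" directly after the "u"; the four-hex-digit pattern accepts all three.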