NOTE: I'm not parsing lots of HTML or generic HTML with regex. I know that's bad.
TL;DR:
I have strings like
You are missing something, namely the r prefix:
r = re.compile(r"\\.") # Backslash followed by any character
Both Python and re attach meaning to \; Python collapses your doubled backslash into a single one when it parses the string literal, so by the time the value reaches re.compile(), re sees \., meaning a literal full stop:
>>> print("""\\.""")
\.
By using r'' you tell Python not to interpret escape codes, so now re is given a string with \\., meaning a literal backslash followed by any character:
>>> print(r"""\\.""")
\\.
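To make the difference concrete, here is a small sketch that compares the two literals and what each one matches as a pattern. The sample text is an assumption chosen for illustration and does not come from the question.

```python
import re

plain = "\\."   # Python collapses the doubled backslash: 2 characters
raw = r"\\."    # raw literal keeps both backslashes: 3 characters
print(len(plain), len(raw))  # 2 3

# Sample text (an assumption for illustration): a.b\c
sample = "a.b\\c"
print(re.findall(plain, sample))  # ['.']    literal full stop
print(re.findall(raw, sample))    # ['\\c']  backslash plus next character
```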
Demo:
>>> import re
>>> s = "test \\* \\! test * !! **"
>>> r = re.compile(r"\\.") # Backslash followed by any character
>>> r.sub("-", s)
'test - - test * !! **'
The rule of thumb is: when defining regular expressions, use r'' raw string literals. That saves you from having to double-escape everything that has meaning to both Python and regular expression syntax.
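As a rough illustration of that rule, the sketch below builds the same pattern both ways. Without the raw prefix, four backslashes are needed in the source to get two into the regex engine. The variable names are placeholders.

```python
import re

# Non-raw literal: each backslash is doubled once for Python,
# and the pair is doubled again for the regex engine.
plain = re.compile("\\\\.")
# Raw literal: only the regex-level escaping is needed.
raw = re.compile(r"\\.")

s = "test \\* \\! test * !! **"
print(plain.pattern == raw.pattern)          # True
print(plain.sub("-", s) == raw.sub("-", s))  # True
```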
Next, you want to replace the 'escaped' character; use groups for that. re.sub() lets you reference groups in the replacement value:
r = re.compile(r"\\(.)") # Note the parentheses; that's a capturing group
r.sub(r'\1', s) # \1 means: replace with value of first capturing group
Now the output is:
>>> r = re.compile(r"\\(.)") # Note the parentheses; that's a capturing group
>>> r.sub(r'\1', s)
'test * ! test * !! **'
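If you need this in more than one place, the capturing-group substitution can be wrapped in a small helper. This is only a sketch: the name unescape is hypothetical, and it assumes every backslash in the input is meant as an escape. Note that . does not match a newline by default, so a backslash directly before a line break is left untouched.

```python
import re

# Hypothetical helper; the pattern is the capturing-group one from above.
_ESCAPED = re.compile(r"\\(.)")

def unescape(text):
    """Replace each backslash-escaped character with the character itself."""
    return _ESCAPED.sub(r"\1", text)

print(unescape("test \\* \\! test * !! **"))  # test * ! test * !! **
# An escaped backslash collapses to a single backslash.
print(unescape(r"a\\b"))  # a\b
```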