I want to use a regex to replace dashes that appear between letters with a space. For example, ab-cd should become ab cd.
The following matc
Use references to capturing groups:
>>> import re
>>> original_term = 'ab-cd'
>>> re.sub(r"([A-Za-z])\-([A-Za-z])", r"\1 \2", original_term)
'ab cd'
This assumes, of course, that you can't just do original_term.replace('-', ' ') for whatever reason. Perhaps your text uses hyphens where it should use en dashes or something.
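To make the difference concrete, here is a small sketch comparing the two approaches. The sample string is made up for illustration, and the pattern uses [A-Za-z] on the assumption that only ASCII letters count as letters.

```python
import re

# Hypothetical sample: dashes at the edges and between digits should stay.
sample = "-ab-cd- 3-4"

# str.replace touches every dash, wherever it is.
replaced_everywhere = sample.replace("-", " ")

# The capturing-group regex only touches a dash that sits between two letters.
replaced_between_letters = re.sub(r"([A-Za-z])-([A-Za-z])", r"\1 \2", sample)

print(repr(replaced_everywhere))       # ' ab cd  3 4'
print(repr(replaced_between_letters))  # '-ab cd- 3-4'
```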
You need to use look-arounds:
new_term = re.sub(r"(?<=[A-Za-z])-(?=[A-Za-z])", " ", original_term)
Or capturing groups:
new_term = re.sub(r"([A-Za-z])-(?=[A-Za-z])", r"\1 ", original_term)
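One practical difference between these patterns and a version that captures a letter on both sides of the dash shows up when single letters are chained together. A rough sketch, using a made-up input:

```python
import re

text = "a-b-c"  # illustrative input, not from the question

# Capturing both neighbours consumes the letter after the dash, so that letter
# cannot also serve as the letter before the next dash.
both_captured = re.sub(r"([A-Za-z])-([A-Za-z])", r"\1 \2", text)

# Look-arounds are zero-width: only the dash itself is consumed.
lookarounds = re.sub(r"(?<=[A-Za-z])-(?=[A-Za-z])", " ", text)

# Capturing only the left neighbour and looking ahead behaves the same way.
group_plus_lookahead = re.sub(r"([A-Za-z])-(?=[A-Za-z])", r"\1 ", text)

print(both_captured)         # a b-c
print(lookarounds)           # a b c
print(group_plus_lookahead)  # a b c
```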
Note that [A-z] also matches some non-letters (namely [, \, ], ^, _, and `), so I suggest replacing it with [A-Z] and using a case-insensitive modifier (?i).
Note that you do not have to escape a hyphen outside a character class.
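Both notes can be checked quickly. The inputs below are invented for the check, and the sketch only considers ASCII letters.

```python
import re

# '_' (0x5F) lies between 'Z' (0x5A) and 'a' (0x61), so [A-z] accepts it.
loose = re.sub(r"(?<=[A-z])-(?=[A-z])", " ", "a_-_b")
strict = re.sub(r"(?<=[A-Za-z])-(?=[A-Za-z])", " ", "a_-_b")
print(loose)   # a_ _b
print(strict)  # a_-_b

# [A-Z] plus the inline (?i) flag covers both cases of ASCII letters.
insensitive = re.sub(r"(?i)(?<=[A-Z])-(?=[A-Z])", " ", "ab-CD")
print(insensitive)  # ab CD

# Outside a character class, '-' and '\-' match the same thing.
unescaped = re.sub(r"b-c", "b c", "ab-cd")
escaped = re.sub(r"b\-c", "b c", "ab-cd")
print(unescaped == escaped)  # True
```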
You need to capture the characters before and after the - in groups and use them in the replacement, i.e.:
import re
subject = "ab-cd"
# Pass the flag by keyword so it cannot be mistaken for the positional count argument.
subject = re.sub(r"([a-z])\-([a-z])", r"\1 \2", subject, flags=re.IGNORECASE)
print(subject)
# ab cd
DEMO
http://ideone.com/LAYQWT
REGEX EXPLANATION
([a-z])\-([a-z])
Match the regex below and capture its match into backreference number 1 «([a-z])»
Match a single character in the range between “a” and “z” (case-insensitively, because of re.IGNORECASE) «[a-z]»
Match the character “-” literally «\-»
Match the regex below and capture its match into backreference number 2 «([a-z])»
Match a single character in the range between “a” and “z” (case-insensitively, because of re.IGNORECASE) «[a-z]»
\1 \2
Insert the text that was last matched by capturing group number 1 «\1»
Insert the character “ ” literally « »
Insert the text that was last matched by capturing group number 2 «\2»
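The same replacement can also be written with a function instead of the \1 \2 template, which may make the role of each group easier to see. This is only a sketch; join_with_space is a name invented here.

```python
import re

def join_with_space(match):
    # group(1) is the letter before the dash, group(2) the letter after it.
    return match.group(1) + " " + match.group(2)

result = re.sub(r"([a-z])-([a-z])", join_with_space, "ab-cd", flags=re.IGNORECASE)
print(result)  # ab cd
```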
re.sub() always replaces the whole matched sequence with the replacement.
One way to replace only the dash is to use lookahead and lookbehind assertions. They check the surrounding characters but are not part of the matched sequence.
new_term = re.sub(r"(?<=[A-Za-z])\-(?=[A-Za-z])", " ", original_term)
The syntax is explained in the Python documentation for the re module.
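To see that the assertions are not part of the match, one can inspect the match object directly. A minimal sketch, assuming ASCII letters only:

```python
import re

term = "ab-cd"

with_groups = re.search(r"([A-Za-z])-([A-Za-z])", term)
with_assertions = re.search(r"(?<=[A-Za-z])-(?=[A-Za-z])", term)

# The capturing version matches the dash plus one letter on each side.
print(with_groups.group(0), with_groups.span())          # b-c (1, 4)
# The look-around version matches the dash alone.
print(with_assertions.group(0), with_assertions.span())  # - (2, 3)
```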