I am performing the following operations on lists of words. I read lines in from a Project Gutenberg text file, split each line on spaces, perform general punctuation substituti
I think this can benefit from lookahead or lookbehind references. The Python reference is https://docs.python.org/3/library/re.html, and one generic regex site I often reference is https://www.regular-expressions.info/lookaround.html.
Your data:
words = ["don't",
"'George",
"ma'am",
"end.'",
"didn't.'",
"'Won't",]
And now I'll define a tuple with regular expressions and their replacements.
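In [230]: import re
from functools import reduce
apo = (
    # apostrophe between two letters (don't, ma'am)
    (re.compile("(?<=[A-Za-z])'(?=[A-Za-z])"), ""),
    # apostrophe with no letter before it and a letter after it ('George)
    (re.compile("(?<![A-Za-z])'(?=[A-Za-z])"), ""),
    # apostrophe after a letter or period, with no letter after it (end.')
    (re.compile("(?<=[.A-Za-z])'(?![A-Za-z])"), ""),
    # period after a letter, with no letter after it (end.)
    (re.compile("(?<=[A-Za-z])\\.(?![A-Za-z])"), ""),
)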
In [231]: words = ["don't",
"'George",
"ma'am",
"end.'",
"didn't.'",
"'Won't",]
In [232]: reduce(lambda w2,x: [ x[0].sub(x[1], w) for w in w2], apo, words)
Out[232]:
['dont',
'George',
'maam',
'end',
'didnt',
'Wont']
Here's what's going on with the regexes:
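(?<=[A-Za-z]) is a lookbehind: match only (without consuming) if the preceding character is a letter.
(?=[A-Za-z]) is a lookahead: match only (still without consuming) if the following character is a letter.
(?<![A-Za-z]) is a negative lookbehind: if a letter precedes the position, it will not match.
(?![A-Za-z]) is a negative lookahead: if a letter follows the position, it will not match.
Note that I added a . check within the third pattern's lookbehind, (?<=[.A-Za-z]), so that an apostrophe following a period (as in end.') is also removed. The order within apo matters, because an earlier substitution may already have replaced a character (such as .) with the empty string by the time a later pattern runs.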
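To see the lookarounds in isolation, here is a minimal sketch that checks which apostrophe pattern fires on which kind of word. The names inner, leading and trailing are made up for illustration, the patterns are written as raw strings, and the negative-lookbehind pattern is spelled out in full as an assumption about what the second entry of apo looks like.

```python
import re

# Hypothetical names; "leading" is an assumed spelling of the
# negative-lookbehind entry in apo.
inner = re.compile(r"(?<=[A-Za-z])'(?=[A-Za-z])")      # letter ' letter
leading = re.compile(r"(?<![A-Za-z])'(?=[A-Za-z])")    # no letter before, letter after
trailing = re.compile(r"(?<=[.A-Za-z])'(?![A-Za-z])")  # letter or '.' before, no letter after

for word in ["don't", "'George", "end.'"]:
    hits = {
        name: bool(pattern.search(word))
        for name, pattern in (("inner", inner), ("leading", leading), ("trailing", trailing))
    }
    print(word, hits)

# Lookarounds are zero-width: the match itself is only the apostrophe.
m = inner.search("don't")
print(repr(m.group()), m.span())
```

Each sample word triggers exactly one of the three patterns, which is why the tuple needs a separate entry for inner, leading and trailing apostrophes.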
This was operating on single words, but should work with sentences as well.
In [233]: onelong = """
don't
'George
ma'am
end.'
didn't.'
'Won't
"""
In [235]: print(
reduce(lambda sentence,x: x[0].sub(x[1], sentence), apo, onelong)
)
dont
George
maam
end
didnt
Wont
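As a rough check of the sentence case, here is a small sketch on a made-up one-line sentence (the sentence is mine, not from your file, and the second pattern is again my assumed spelling of the negative-lookbehind entry).

```python
import re
from functools import reduce

apo = (
    (re.compile(r"(?<=[A-Za-z])'(?=[A-Za-z])"), ""),
    (re.compile(r"(?<![A-Za-z])'(?=[A-Za-z])"), ""),   # assumed negative-lookbehind entry
    (re.compile(r"(?<=[.A-Za-z])'(?![A-Za-z])"), ""),
    (re.compile(r"(?<=[A-Za-z])\.(?![A-Za-z])"), ""),
)

# Hypothetical sample sentence with inner, leading and trailing apostrophes.
sentence = "'I won't go, ma'am.' George didn't."
cleaned = reduce(lambda text, rule: rule[0].sub(rule[1], text), apo, sentence)
print(cleaned)  # I wont go, maam George didnt
```

Note that the last pattern also strips sentence-ending periods that follow a letter, so sentence boundaries disappear from the output. If you need them, you would probably want to drop or narrow that entry.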
(The use of reduce is to facilitate applying a regex's .sub on the words/strings and then keep that output for the next regex's .sub, etc.)
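If reduce feels opaque, the same chaining can be sketched as a plain loop. The helper name strip_marks is hypothetical, and the second pattern is my assumed spelling of the negative-lookbehind entry; the loop should give the same result as the reduce call on the sample words.

```python
import re
from functools import reduce

apo = (
    (re.compile(r"(?<=[A-Za-z])'(?=[A-Za-z])"), ""),
    (re.compile(r"(?<![A-Za-z])'(?=[A-Za-z])"), ""),   # assumed negative-lookbehind entry
    (re.compile(r"(?<=[.A-Za-z])'(?![A-Za-z])"), ""),
    (re.compile(r"(?<=[A-Za-z])\.(?![A-Za-z])"), ""),
)

def strip_marks(text, rules=apo):
    """Apply each (pattern, replacement) pair in order, feeding each result to the next."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text

words = ["don't", "'George", "ma'am", "end.'", "didn't.'", "'Won't"]

looped = [strip_marks(w) for w in words]
reduced = reduce(lambda w2, x: [x[0].sub(x[1], w) for w in w2], apo, words)
print(looped)
print(looped == reduced)
```

The loop version makes the ordering explicit: each pattern sees the text as the previous pattern left it.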