Python Replace Single Quotes Except Apostrophes

大兔子大兔子 posted on 2019-12-04 05:03:02

Question


I am performing the following operations on lists of words. I read lines in from a Project Gutenberg text file, split each line on spaces, perform general punctuation substitution, and then print each word and punctuation tag on its own line for further processing later. I am unsure how to replace every single quote with a tag while leaving all apostrophes untouched. My current method is to use a compiled regex:

apo = re.compile("[A-Za-z]'[A-Za-z]")

and perform the following operation:

if "'" in word and not apo.search(word):
    word = word.replace("'","\n<singlequote>")

but this ignores cases where a single quote is used around a word that also contains an apostrophe. It also does not tell me whether the single quote abuts the start of a word or the end of a word.
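To make the limitation concrete, here is a minimal sketch of the check described above, wrapped in a helper. The function name `tag_quotes_naive` is only for illustration and is not part of the original code.

```python
import re

# The pattern from the question: a quote with a letter on both sides.
apo = re.compile("[A-Za-z]'[A-Za-z]")

def tag_quotes_naive(word):
    """Tag quotes only when the word contains no apostrophe at all."""
    if "'" in word and not apo.search(word):
        word = word.replace("'", "\n<singlequote>")
    return word

# The quote in 'George is tagged, but 'Won't and didn't.' come back
# unchanged because the apostrophe blocks the whole replacement.
for word in ["'George", "'Won't", "didn't.'"]:
    print(repr(tag_quotes_naive(word)))
```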

Example input:

don't
'George
ma'am
end.'
didn't.'
'Won't

Example output (after processing and printing to file):

don't
<opensingle>
George
ma'am
end
<period>
<closesingle>
didn't
<period>
<closesingle>
<opensingle>
Won't

I do have a further question in relation to this task: since distinguishing <opensingle> from <closesingle> seems rather difficult, would it be wiser to perform substitutions like

word = word.replace('.','\n<period>')
word = word.replace(',','\n<comma>')

after performing the replacement operation?


Answer 1:


To replace a starting or ending ' properly, what you really need is a regex. To match them, use:

  • ^' for starting ' (opensingle),
  • '$ for ending ' (closesingle).

Unfortunately, the str.replace method does not support regexes, so you should use re.sub instead.

Below is an example program (Python 3) that prints your desired output:

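import re

# Sample input: the words from the question, separated by spaces.
text = "don't 'George ma'am end.' didn't.' 'Won't"
words = text.split(" ")
for word in words:
    # A quote at the very start of a word opens a quotation.
    word = re.sub(r"^'", '<opensingle>\n', word)
    # A quote at the very end of a word closes one.
    word = re.sub(r"'$", '\n<closesingle>', word)
    # Other punctuation is replaced afterwards, each tag on its own line.
    word = word.replace('.', '\n<period>')
    word = word.replace(',', '\n<comma>')
    print(word)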
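If the tagged pieces are needed as a list for later processing rather than printed, the same substitutions can be wrapped in functions. This is only a sketch: the names `tag_word` and `tag_line` are made up for illustration, and it assumes words are separated by whitespace.

```python
import re

def tag_word(word):
    """Split one whitespace-separated word into its word and tag pieces."""
    word = re.sub(r"^'", "<opensingle>\n", word)   # quote at the very start
    word = re.sub(r"'$", "\n<closesingle>", word)  # quote at the very end
    word = word.replace(".", "\n<period>")
    word = word.replace(",", "\n<comma>")
    # Drop empty pieces, e.g. when the word is a lone "."
    return [piece for piece in word.split("\n") if piece]

def tag_line(line):
    """Tag every word of a line and return one flat list of pieces."""
    pieces = []
    for word in line.split():
        pieces.extend(tag_word(word))
    return pieces

print("\n".join(tag_line("don't 'George ma'am end.' didn't.' 'Won't")))
```

One caveat with anchoring on ^ and $: a word that ends in a possessive apostrophe, such as dogs', would also be tagged as <closesingle>.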



Answer 2:


I suggest working smart here: use nltk or another NLP toolkit instead.

Tokenize words like this:

import nltk
sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good."""
tokens = nltk.word_tokenize(sentence)

You may not like the fact that contractions like don't are separated. Actually, this is expected behavior. See Issue 401.

However, TweetTokenizer can help with that:

from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()
tknzr.tokenize("The code didn't work!")

If it gets more involved a RegexpTokenizer could be helpful:

from nltk.tokenize import RegexpTokenizer
s = "Good muffins cost $3.88\nin New York.  Please don't buy me\njust one of them."
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
tokenizer.tokenize(s)

Then it should be much easier to annotate the tokenized words correctly.
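As a rough sketch of that annotation step, the block below tokenizes first and tags afterwards. It does not use nltk: the `TOKEN` pattern and the `annotate` helper are assumptions made for illustration, built on the standard `re` module so the example runs on its own. It treats a quote as opening when it sits at the start of the text or right after whitespace, and as closing otherwise.

```python
import re

# Hypothetical tokenizer pattern: a word that may contain internal
# apostrophes (don't, ma'am), or any other single non-space character.
TOKEN = re.compile(r"\w+(?:'\w+)*|\S")
TAGS = {".": "<period>", ",": "<comma>"}

def annotate(text):
    """Tokenize text, then replace punctuation tokens with tags."""
    out = []
    for m in TOKEN.finditer(text):
        tok = m.group()
        if tok == "'":
            # A standalone quote opens if nothing but whitespace precedes it.
            opens = m.start() == 0 or text[m.start() - 1].isspace()
            out.append("<opensingle>" if opens else "<closesingle>")
        else:
            out.append(TAGS.get(tok, tok))
    return out

print("\n".join(annotate("don't 'George ma'am end.' didn't.' 'Won't")))
```

Because contractions are kept whole by the tokenizer, the only quotes left as separate tokens are real quotation marks, which is what makes the tagging step simple.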

Further references:

  • http://www.nltk.org/api/nltk.tokenize.html
  • http://www.nltk.org/_modules/nltk/tokenize/regexp.html



Answer 3:


I think this can benefit from lookahead and lookbehind assertions. The Python reference is https://docs.python.org/3/library/re.html, and one generic regex site I often reference is https://www.regular-expressions.info/lookaround.html.

Your data:

words = ["don't",
         "'George",
         "ma'am",
         "end.'",
         "didn't.'",
         "'Won't",]

And now I'll define a tuple with regular expressions and their replacements.

In [230]: import re
     ...: from functools import reduce  # reduce is not a builtin in Python 3
     ...: apo = (
     ...:     (re.compile("(?<=[A-Za-z])'(?=[A-Za-z])"), "<apostrophe>",),
     ...:     (re.compile("(?<![A-Za-z])'(?=[A-Za-z])"), "<opensingle>",),
     ...:     (re.compile("(?<=[.A-Za-z])'(?![A-Za-z])"), "<closesingle>", ),
     ...:     (re.compile("(?<=[A-Za-z])\\.(?![A-Za-z])"), "<period>",),
     ...: )

In [231]: words = ["don't",
     ...:          "'George",
     ...:          "ma'am",
     ...:          "end.'",
     ...:          "didn't.'",
     ...:          "'Won't",]

In [232]: reduce(lambda w2,x: [ x[0].sub(x[1], w) for w in w2], apo, words)
Out[232]: 
['don<apostrophe>t',
 '<opensingle>George',
 'ma<apostrophe>am',
 'end<period><closesingle>',
 'didn<apostrophe>t<period><closesingle>',
 '<opensingle>Won<apostrophe>t']

Here's what's going on with the regexes:

  1. (?<=[A-Za-z]) is a lookbehind: match only if the preceding character is a letter, without consuming that letter.
  2. (?=[A-Za-z]) is a lookahead: match only if the following character is a letter, again without consuming it.
  3. (?<![A-Za-z]) is a negative lookbehind: the match fails if the preceding character is a letter.
  4. (?![A-Za-z]) is a negative lookahead: the match fails if the following character is a letter.
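A small sketch to check these four pieces in isolation. The variable names are just for illustration; the patterns are the ones from apo above.

```python
import re

apostrophe = re.compile(r"(?<=[A-Za-z])'(?=[A-Za-z])")   # letter on both sides
opening = re.compile(r"(?<![A-Za-z])'(?=[A-Za-z])")      # no letter before, letter after
closing = re.compile(r"(?<=[.A-Za-z])'(?![A-Za-z])")     # letter or . before, no letter after

# Show which pattern fires for each kind of quote.
for word in ["don't", "'George", "end.'"]:
    print(word,
          bool(apostrophe.search(word)),
          bool(opening.search(word)),
          bool(closing.search(word)))

# Lookarounds are zero-width: the match itself is only the quote character.
m = apostrophe.search("ma'am")
print(repr(m.group()), m.span())
```

Since the neighbouring letters are never consumed, substituting the match replaces only the quote and leaves the rest of the word intact.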

Note that I added a . to the lookbehind for <closesingle>, and that the order within apo matters: the <closesingle> rule looks for a literal . before the quote, so it has to run before . is replaced with <period>.
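The effect of the ordering can be seen by applying just the last two rules to end.' in both orders. This is a minimal sketch; the variable names are illustrative only.

```python
import re

closing = re.compile(r"(?<=[.A-Za-z])'(?![A-Za-z])")
period = re.compile(r"(?<=[A-Za-z])\.(?![A-Za-z])")

word = "end.'"

# Order used in apo: tag the closing quote first, then the period.
quote_first = period.sub("<period>", closing.sub("<closesingle>", word))

# Reversed order: once the . has become <period>, the quote is preceded
# by ">" and the lookbehind in the closing rule no longer matches.
period_first = closing.sub("<closesingle>", period.sub("<period>", word))

print(quote_first)   # end<period><closesingle>
print(period_first)  # end<period>'
```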

This was operating on single words, but should work with sentences as well.

In [233]: onelong = """
     ...: don't
     ...: 'George
     ...: ma'am
     ...: end.'
     ...: didn't.'
     ...: 'Won't
     ...: """

In [235]: print(
     ...:     reduce(lambda sentence,x: x[0].sub(x[1], sentence), apo, onelong)
     ...: )

don<apostrophe>t
<opensingle>George
ma<apostrophe>am
end<period><closesingle>
didn<apostrophe>t<period><closesingle>
<opensingle>Won<apostrophe>t

(The use of reduce is to facilitate applying a regex's .sub on the words/strings and then keep that output for the next regex's .sub, etc.)
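For readers who find reduce hard to follow, here is a sketch of the same chaining written as a plain loop. The helper names `tag_with_reduce` and `tag_with_loop` are made up for illustration; apo is repeated so the block runs on its own.

```python
import re
from functools import reduce

apo = (
    (re.compile(r"(?<=[A-Za-z])'(?=[A-Za-z])"), "<apostrophe>"),
    (re.compile(r"(?<![A-Za-z])'(?=[A-Za-z])"), "<opensingle>"),
    (re.compile(r"(?<=[.A-Za-z])'(?![A-Za-z])"), "<closesingle>"),
    (re.compile(r"(?<=[A-Za-z])\.(?![A-Za-z])"), "<period>"),
)

def tag_with_reduce(text):
    # Each step feeds the previous result into the next rule's .sub.
    return reduce(lambda s, rule: rule[0].sub(rule[1], s), apo, text)

def tag_with_loop(text):
    # The same chaining, spelled out: rules are applied in apo's order.
    for pattern, tag in apo:
        text = pattern.sub(tag, text)
    return text

print(tag_with_loop("didn't.'"))
```

Both helpers return the same string for any input, so the choice between them is a matter of taste.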



Source: https://stackoverflow.com/questions/50777729/python-replace-single-quotes-except-apostrophes
