Python Replace Single Quotes Except Apostrophes

没有蜡笔的小新 2021-01-23 02:11

I am performing the following operations on lists of words. I read lines in from a Project Gutenberg text file, split each line on spaces, and perform general punctuation substitutions…

3 Answers
  • 2021-01-23 02:54

    What you really need to properly replace a starting or ending ' is a regex. To match them you should use:

    • ^' for starting ' (opensingle),
    • '$ for ending ' (closesingle).

    Unfortunately, the str.replace method does not support regexes, so you should use re.sub instead.

    Below is an example program (Python 3) that prints your desired output:

    import re

    text = "don't 'George ma'am end.' didn't.' 'Won't"
    words = text.split(" ")
    for word in words:
        # A leading ' is an opening quote and a trailing ' is a closing quote;
        # a ' inside the word is an apostrophe and is left untouched.
        word = re.sub(r"^'", '<opensingle>\n', word)
        word = re.sub(r"'$", '\n<closesingle>', word)
        word = word.replace('.', '\n<period>')
        word = word.replace(',', '\n<comma>')
        print(word)
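
    For reference, tracing the snippet on that sample string, the printed output should look roughly like this (each tag lands on its own line because of the embedded newlines):

    don't
    <opensingle>
    George
    ma'am
    end
    <period>
    <closesingle>
    didn't
    <period>
    <closesingle>
    <opensingle>
    Won't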
    
  • 2021-01-23 02:57

    I think this can benefit from lookahead and lookbehind assertions. The Python reference is https://docs.python.org/3/library/re.html, and a general regex site I often consult is https://www.regular-expressions.info/lookaround.html.

    Your data:

    words = ["don't",
             "'George",
             "ma'am",
             "end.'",
             "didn't.'",
             "'Won't",]
    

    And now I'll define a tuple of regular expressions and their replacements. (The session below assumes import re and from functools import reduce have already been run.)

    In [230]: apo = (
        (re.compile("(?<=[A-Za-z])'(?=[A-Za-z])"), "<apostrophe>",),
        (re.compile("(?<![A-Za-z])'(?=[A-Za-z])"), "<opensingle>",),
        (re.compile("(?<=[.A-Za-z])'(?![A-Za-z])"), "<closesingle>", ),
        (re.compile("(?<=[A-Za-z])\\.(?![A-Za-z])"), "<period>",),
    )
    In [231]: words = ["don't",
             "'George",
             "ma'am",
             "end.'",
             "didn't.'",
             "'Won't",]
    In [232]: reduce(lambda w2,x: [ x[0].sub(x[1], w) for w in w2], apo, words)
    Out[232]: 
    ['don<apostrophe>t',
     '<opensingle>George',
     'ma<apostrophe>am',
     'end<period><closesingle>',
     'didn<apostrophe>t<period><closesingle>',
     '<opensingle>Won<apostrophe>t']
    

    Here's what's going on with the regexes:

    1. (?<=[A-Za-z]) is a lookbehind: match (but do not consume) only if the preceding character is a letter.
    2. (?=[A-Za-z]) is a lookahead (also non-consuming): match only if the following character is a letter.
    3. (?<![A-Za-z]) is a negative lookbehind: do not match if the preceding character is a letter.
    4. (?![A-Za-z]) is a negative lookahead: do not match if the following character is a letter (a short demonstration follows this list).
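
    For instance, applying just the first pattern to a small sample string of my own shows how the lookarounds keep apostrophes apart from opening quotes:

    import re
    # The ' inside "don't" sits between two letters, so it becomes <apostrophe>;
    # the ' before George fails the lookbehind and is left for <opensingle>.
    print(re.sub(r"(?<=[A-Za-z])'(?=[A-Za-z])", "<apostrophe>", "don't 'George"))
    # don<apostrophe>t 'George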

    Note that I added a . to the lookbehind for <closesingle>, and the order within apo matters: the <closesingle> pattern must run before . is replaced with <period>, or the lookbehind would no longer see the . before the closing quote.

    This was operating on single words, but should work with sentences as well.

    In [233]: onelong = """
    don't
    'George
    ma'am
    end.'
    didn't.'
    'Won't
    """
    In [235]: print(
        reduce(lambda sentence,x: x[0].sub(x[1], sentence), apo, onelong)
    )
    
    don<apostrophe>t
    <opensingle>George
    ma<apostrophe>am
    end<period><closesingle>
    didn<apostrophe>t<period><closesingle>
    <opensingle>Won<apostrophe>t
    

    (reduce is used to apply each regex's .sub in turn, feeding the output of one substitution into the next; an equivalent plain loop is sketched below.)
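
    If the reduce call feels opaque, here is a rough equivalent written as a plain loop, reusing the apo and onelong names defined above:

    result = onelong
    for pattern, replacement in apo:
        # Apply each compiled pattern in order, feeding the previous result forward.
        result = pattern.sub(replacement, result)
    print(result)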

  • 2021-01-23 03:16

    I suggest working smart here: use NLTK or another NLP toolkit instead.

    Tokenize words like this:

    import nltk  # note: word_tokenize may require a one-time nltk.download('punkt')
    sentence = """At eight o'clock on Thursday morning
    Arthur didn't feel very good."""
    tokens = nltk.word_tokenize(sentence)
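    # Note that word_tokenize splits the contraction: "didn't" comes back
    # as the two tokens 'did' and "n't".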
    

    You may not like the fact that contractions like don't are separated (into did and n't). Actually, this is expected behavior; see Issue 401.

    However, TweetTokenizer can help with that:

    from nltk.tokenize import TweetTokenizer

    tknzr = TweetTokenizer()
    tknzr.tokenize("The code didn't work!")  # keeps "didn't" as a single token
    

    If it gets more involved, a RegexpTokenizer could be helpful:

    from nltk.tokenize import RegexpTokenizer

    s = "Good muffins cost $3.88\nin New York.  Please don't buy me\njust one of them."
    tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')  # raw string avoids escape warnings
    tokenizer.tokenize(s)
    

    Then it should be much easier to annotate the tokenized words correctly; a rough sketch of that step follows.
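
    As a minimal sketch of that last step (my own illustration, not from the answer, assuming NLTK is installed), map each punctuation token to the desired tag and keep everything else, contractions included, as-is:

    from nltk.tokenize import TweetTokenizer

    # Illustrative tag table based on the question's annotations; extend as needed.
    tags = {".": "<period>", ",": "<comma>", "'": "<closesingle>"}

    tknzr = TweetTokenizer()
    tokens = tknzr.tokenize("Arthur didn't feel very good.")
    # Tag punctuation tokens; word tokens (including "didn't") pass through unchanged.
    print([tags.get(tok, tok) for tok in tokens])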

    Further references:

    • http://www.nltk.org/api/nltk.tokenize.html
    • http://www.nltk.org/_modules/nltk/tokenize/regexp.html