Python Replace Single Quotes Except Apostrophes

没有蜡笔的小新 2021-01-23 02:11

I am performing the following operations on lists of words. I read lines in from a Project Gutenberg text file, split each line on spaces, perform general punctuation substituti

3 answers
  •  鱼传尺愫
    2021-01-23 03:16

    I suggest working smart here: use nltk or another NLP toolkit instead of hand-rolled substitutions.

    Tokenize words like this:

    import nltk
    sentence = """At eight o'clock on Thursday morning
    Arthur didn't feel very good."""
    tokens = nltk.word_tokenize(sentence)
    

    You may not like that contractions such as don't are split into separate tokens. This is expected behavior; see Issue 401.

    However, TweetTokenizer can help with that:

    from nltk.tokenize import TweetTokenizer
    tknzr = TweetTokenizer()
    tknzr.tokenize("The code didn't work!")
    

    If it gets more involved, a RegexpTokenizer could be helpful:

    from nltk.tokenize import RegexpTokenizer
    s = "Good muffins cost $3.88\nin New York.  Please don't buy me\njust one of them."
    tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')  # raw string keeps the backslashes literal
    tokenizer.tokenize(s)
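
    Note that the pattern above still splits don't into don and 't. As a rough sketch of one way around that, the variant below lets a word contain internal apostrophes and emits other punctuation as separate tokens. The pattern and the tokenize helper are illustrative assumptions, not part of nltk; the standard-library re.findall stands in for the tokenizer here, and the same pattern string could presumably be passed to RegexpTokenizer instead.

```python
import re

# Hypothetical pattern (an assumption for illustration):
#   \w+(?:'\w+)*  a word, optionally with internal apostrophes (don't, o'clock)
#   \$[\d.]+      a dollar amount such as $3.88
#   [^\w\s]+      any other run of punctuation, including stray single quotes
CONTRACTION_AWARE = r"\w+(?:'\w+)*|\$[\d.]+|[^\w\s]+"


def tokenize(text):
    """Return all non-overlapping pattern matches, left to right."""
    return re.findall(CONTRACTION_AWARE, text)


tokens = tokenize("Please don't buy me 'just one' of them.")
print(tokens)

# Quote marks now appear as standalone tokens, so they are easy to drop
# while apostrophes inside words are left alone.
without_quotes = [tok for tok in tokens if tok != "'"]
print(without_quotes)
```

    A trailing possessive apostrophe (as in dogs') would be treated as a quote mark by this sketch, so it may need a special case.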
    

    Then it should be much easier to annotate the tokenized words correctly.
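
    For the original goal of removing single quotes while keeping apostrophes, a minimal sketch on an already tokenized line might look like this. It assumes tokens come from a plain split on spaces, as described in the question; strip_edge_quotes is a hypothetical helper name, not an nltk function.

```python
def strip_edge_quotes(tokens):
    """Remove single quotes at token edges; keep apostrophes inside words."""
    cleaned = (tok.strip("'") for tok in tokens)
    # Drop tokens that consisted only of quote characters.
    return [tok for tok in cleaned if tok]


words = "He said 'I don't know' and left".split()
print(strip_edge_quotes(words))
```

    Words that legitimately begin or end with an apostrophe, such as 'tis or dogs', would lose it under this approach, so an exception list may be needed for such cases.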

    Further references:

    • http://www.nltk.org/api/nltk.tokenize.html
    • http://www.nltk.org/_modules/nltk/tokenize/regexp.html
