I am performing the following operations on lists of words. I read lines in from a Project Gutenberg text file, split each line on spaces, perform general punctuation substitutions, and then annotate each word.
I suggest working smart here: use nltk (or another NLP toolkit) instead of rolling your own splitting and punctuation handling.
Tokenize words like this:
import nltk
sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good."""
tokens = nltk.word_tokenize(sentence)
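If you haven't used the tokenizers before you may need nltk.download('punkt') once. On the sentence above, word_tokenize should give roughly the following (sketched from memory; exact tokens can vary slightly between NLTK versions):

print(tokens)
# ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']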
You may not like the fact that contractions like didn't are split into did and n't. This is actually expected behavior; see Issue 401.
However, TweetTokenizer can help with that:
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()
tknzr.tokenize("The code didn't work!")
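Here the contraction stays in one piece; the result should look roughly like this (again from memory, so verify locally):

# ['The', 'code', "didn't", 'work', '!']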
If it gets more involved, a RegexpTokenizer could be helpful:
from nltk.tokenize import RegexpTokenizer
s = "Good muffins cost $3.88\nin New York. Please don't buy me\njust one of them."
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
tokenizer.tokenize(s)
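With that pattern, word characters, dollar amounts, and any remaining non-space runs each become their own token; the output should look roughly like:

# ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.', 'Please', 'don', "'t", 'buy', 'me', 'just', 'one', 'of', 'them', '.']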
Then it should be much easier to annotate the tokenized words correctly.
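If by annotation you mean something like part-of-speech tags (an assumption on my part; adapt this to whatever annotation scheme you actually need), a minimal sketch would be:

import nltk
# nltk.download('averaged_perceptron_tagger')  # may be needed once
tokens = nltk.word_tokenize("Arthur didn't feel very good.")
tagged = nltk.pos_tag(tokens)  # list of (token, tag) pairs, e.g. ('Arthur', 'NNP')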