How to get rid of punctuation using NLTK tokenizer?
I'm just starting to use NLTK and I don't quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation, but I need only the words. How can I get rid of the punctuation? Also, word_tokenize doesn't work with multiple sentences: dots are added to the last word.

Answer (rmalouf):

Take a look at the other tokenizing options that NLTK provides. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else:

    from nltk.tokenize import RegexpTokenizer
    tokenizer = RegexpTokenizer
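
The snippet above breaks off before the tokenizer's pattern argument, so the exact pattern the answer used is not known. As a minimal sketch of the idea it describes, assuming the pattern `\w+` (runs of letters, digits, and underscores), something like the following should work. The names `WORD_PATTERN` and `words_only` are made up for illustration, and the `re` fallback is only there so the sketch still runs where NLTK is not installed.

```python
import re

try:
    from nltk.tokenize import RegexpTokenizer
except ImportError:  # NLTK not installed; use the standard-library fallback below
    RegexpTokenizer = None

# Assumed pattern: the original snippet is cut off before its argument.
WORD_PATTERN = r"\w+"


def words_only(text):
    """Return runs of word characters from text, dropping punctuation."""
    if RegexpTokenizer is not None:
        # Every match of the pattern becomes one token.
        return RegexpTokenizer(WORD_PATTERN).tokenize(text)
    # Fallback: apply the same pattern with the re module.
    return re.findall(WORD_PATTERN, text)


print(words_only("Eighty-seven miles to go, yet. Onward!"))
# ['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
```

Note that with this pattern hyphenated words and contractions are split too ("don't" becomes "don" and "t"), so a different pattern may suit text where those matter. Because the tokenizer works on the raw string, it handles multiple sentences without needing a separate sentence-splitting step.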