I would like to preprocess a corpus of documents using Python in the same way that I can in R. For example, given an initial corpus, corpus, I would like to end up
CountVectorizer and TfidfVectorizer can be customized as described in the docs. In particular, you'll want to write a custom tokenizer, which is a function that takes a document and returns a list of terms. Using NLTK:
import re

import nltk.corpus
import nltk.stem

def smart_tokenizer(doc):
    # lowercase, split on word characters, stem, and drop English stopwords
    doc = doc.lower()
    terms = re.findall(r'\w+', doc, re.UNICODE)
    stemmer = nltk.stem.PorterStemmer()
    stopwords = set(nltk.corpus.stopwords.words('english'))
    return [stemmer.stem(term) for term in terms if term not in stopwords]
Demo:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> v = CountVectorizer(tokenizer=smart_tokenizer)
>>> v.fit_transform([doc]).toarray()
array([[1, 1, 1, 2, 1, 1, 1, 1, 1]])
>>> from pprint import pprint
>>> pprint(v.vocabulary_)
{u'amaz': 0,
 u'appl': 1,
 u'best': 2,
 u'ear': 3,
 u'ever': 4,
 u'headphon': 5,
 u'pod': 6,
 u'sound': 7,
 u've': 8}
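The same tokenizer= argument works for TfidfVectorizer, so a minimal sketch along these lines (reusing smart_tokenizer and the same doc as above) should give tf-idf weights instead of raw counts:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(tokenizer=smart_tokenizer)
X = tfidf.fit_transform([doc])   # sparse matrix of tf-idf weights
print(tfidf.vocabulary_)         # term-to-column mapping, built by the same tokenizer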
(The example I linked to actually uses a class to cache the lemmatizer, but a function works too.)
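For completeness, a class-based version along those lines might look like the sketch below; StemmerTokenizer is just an illustrative name, not the one from the linked example, and it caches a stemmer rather than a lemmatizer to match the function above:

import re

import nltk.corpus
import nltk.stem

class StemmerTokenizer(object):
    def __init__(self):
        # build the stemmer and stopword set once, so repeated calls reuse them
        self.stemmer = nltk.stem.PorterStemmer()
        self.stopwords = set(nltk.corpus.stopwords.words('english'))

    def __call__(self, doc):
        terms = re.findall(r'\w+', doc.lower(), re.UNICODE)
        return [self.stemmer.stem(t) for t in terms if t not in self.stopwords]

An instance is passed the same way: CountVectorizer(tokenizer=StemmerTokenizer()).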