I am trying to create a term density matrix from a pandas dataframe, so I can rate terms appearing in the dataframe. I also want to be able to keep the 'spatial' aspect of my data, i.e. which row each term occurs in.
herrfz provides a way to handle this, but I just wanted to point out that building a term density data structure with a Python set is counterproductive, since a set is a collection of unique objects. You won't be able to capture the count of each word, only the presence of a word in a given row:
return set(nltk.wordpunct_tokenize(strin)).difference(sw)  # set() collapses duplicate tokens, so counts are lost
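If what you actually want are per-word counts, collections.Counter keeps them while a set throws them away (a minimal illustration; the example sentence is made up):

import nltk
from collections import Counter

tokens = nltk.wordpunct_tokenize('the egg, the fried egg')

set(tokens)      # {'the', 'egg', ',', 'fried'} -- duplicates collapse, counts lost
Counter(tokens)  # Counter({'the': 2, 'egg': 2, ',': 1, 'fried': 1})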
In order to strip out the stopwords you could do something like
tokens_stripped = [token for token in tokens
                   if token not in stopwords]
after tokenization.
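Here is a self-contained version of that (a sketch assuming the NLTK stopword corpus is installed, which needs nltk.download('stopwords') once; the example sentence is made up):

import nltk
from nltk.corpus import stopwords

sw = set(stopwords.words('english'))  # use a set: membership tests are O(1)
tokens = nltk.wordpunct_tokenize('two boiled eggs and a fried egg')
tokens_stripped = [token for token in tokens if token not in sw]
# ['two', 'boiled', 'eggs', 'fried', 'egg']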
You can use scikit-learn's CountVectorizer:
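(For the session below I assume a df along these lines; the exact titles are my reconstruction from the output shown further down, not taken from the question:)

import pandas as pd

df = pd.DataFrame({'title': ['delicious boiled egg',
                             'fried egg',
                             'split orange',
                             'something else']})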
In [14]: from sklearn.feature_extraction.text import CountVectorizer
In [15]: countvec = CountVectorizer()
In [16]: countvec.fit_transform(df.title)
Out[16]:
<4x8 sparse matrix of type '<type 'numpy.int64'>'
    with 9 stored elements in Compressed Sparse Column format>
It returns the term-document matrix in a sparse representation, because such matrices are usually huge and, well, sparse.
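You can see the saving directly on the returned scipy.sparse matrix (a small illustration; the numbers match the example above):

X = countvec.fit_transform(df.title)
X.shape      # (4, 8) -> 32 cells in the dense matrix
X.nnz        # 9      -> only 9 non-zero entries actually stored
X.toarray()  # densify only when the matrix is small enough to fit in memory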
For your particular example, I guess converting it back to a DataFrame would still work:
In [17]: pd.DataFrame(countvec.fit_transform(df.title).toarray(), columns=countvec.get_feature_names())
Out[17]:
   boiled  delicious  egg  else  fried  orange  something  split
0       1          1    1     0      0       0          0      0
1       0          0    1     0      1       0          0      0
2       0          0    0     0      0       1          0      1
3       0          0    0     1      0       0          1      0

[4 rows x 8 columns]
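One caveat if you are on a recent scikit-learn: get_feature_names was deprecated in 1.0 and removed in 1.2 in favour of get_feature_names_out, so the last line becomes:

pd.DataFrame(countvec.fit_transform(df.title).toarray(),
             columns=countvec.get_feature_names_out())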