I am trying to create a term density matrix from a pandas dataframe, so I can rate terms appearing in the dataframe. I also want to be able to keep the 'spatial' aspect of my data.
herrfz provides a way to handle this, but I just wanted to point out that building a term density data structure on top of a Python set is counterproductive, since a set is a collection of unique objects: you can only capture whether a word is present in a given row, not how many times it appears.
    return set(nltk.wordpunct_tokenize(string)).difference(sw)
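To see why the set is the problem, here is a toy illustration (hand-rolled tokens, not the code above):

```python
# A set collapses repeated words, so per-word counts are lost.
tokens = ["cat", "dog", "cat", "cat"]
unique = set(tokens)

# "cat" appeared three times, but the set only records its presence.
print(unique)  # {'cat', 'dog'} (order may vary)
```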
To strip out the stopwords you could do something like
tokens_stripped = [token for token in tokens
if token not in stopwords]
after tokenization.
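Putting both points together, collections.Counter keeps the per-word counts that a set throws away. A minimal sketch, where the whitespace tokenizer and the tiny stopword list are toy stand-ins for nltk.wordpunct_tokenize and nltk's stopword corpus:

```python
from collections import Counter

# Toy stopword list for illustration; in practice use nltk's stopword corpus.
stopwords = {"the", "a", "is"}

def term_counts(text):
    # Lowercase, tokenize on whitespace, drop stopwords, then count.
    tokens = text.lower().split()
    stripped = [token for token in tokens if token not in stopwords]
    return Counter(stripped)

counts = term_counts("The cat is the best cat")
print(counts)  # Counter({'cat': 2, 'best': 1})
```

Each row of your dataframe then gets its own Counter, so both the counts and the row-by-row ('spatial') structure are preserved.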