A bag of words can be defined as a matrix in which each row represents a document and each column represents an individual token. Note that the sequential order of the text is not preserved. Building a "Bag of Words" involves 3 steps:
- tokenizing
- counting
- normalizing
Limitations to keep in mind:
1. Cannot capture phrases or multi-word expressions (word n-grams, sketched after the example below, partially mitigate this)
2. Sensitive to misspellings; this can be worked around with a spell corrector or a character-level representation (see the character n-gram sketch after the example below)
Example of building a bag of words with scikit-learn's CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
data_corpus = ["John likes to watch movies. Mary likes movies too.",
               "John also likes to watch football games."]

# Learn the vocabulary and build the document-term count matrix
X = vectorizer.fit_transform(data_corpus)
print(X.toarray())
print(vectorizer.get_feature_names_out())  # use get_feature_names() on older scikit-learn versions
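
The example above covers the tokenizing and counting steps. A minimal sketch of the normalizing step, assuming the count matrix X produced above, using scikit-learn's TfidfTransformer to reweight the raw counts:

from sklearn.feature_extraction.text import TfidfTransformer

# Apply tf-idf weighting and L2 normalization to the raw count matrix X
tfidf = TfidfTransformer()
X_normalized = tfidf.fit_transform(X)
print(X_normalized.toarray())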
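
To partially address limitation 1, CountVectorizer can also extract word n-grams, so short phrases become features in their own right. A sketch reusing data_corpus from above:

from sklearn.feature_extraction.text import CountVectorizer

# Word uni- and bi-grams; phrases like "watch movies" and "football games"
# now appear in the vocabulary
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
X_bigrams = bigram_vectorizer.fit_transform(data_corpus)
print(bigram_vectorizer.get_feature_names_out())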
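
For limitation 2, a character-level representation can be built with the same API; the analyzer and ngram_range values below are illustrative choices, not the only reasonable ones:

from sklearn.feature_extraction.text import CountVectorizer

# Character n-grams restricted to word boundaries; a misspelling such as
# "moviess" still shares most of its character n-grams with "movies"
char_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(3, 5))
X_chars = char_vectorizer.fit_transform(data_corpus)
print(X_chars.shape)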