A bag of words can be defined as a matrix in which each row represents a document and each column represents an individual token. Note that the sequential order of the text is not preserved. Building a "Bag of Words" involves 3 steps:
- tokenizing
- counting
- normalizing
Limitations to keep in mind:
1. Cannot capture phrases or multi-word expressions (word n-grams, sketched after the example below, partially mitigate this)
2. Sensitive to misspellings; this can be worked around with a spell corrector or a character-level representation (see the character n-gram sketch after the example below)
Example of building a bag of words with scikit-learn's CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
data_corpus = ["John likes to watch movies. Mary likes movies too.",
               "John also likes to watch football games."]

# Learn the vocabulary and build the document-term count matrix
X = vectorizer.fit_transform(data_corpus)
print(X.toarray())
print(vectorizer.get_feature_names_out())  # use get_feature_names() on older scikit-learn versions
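
The example above covers the tokenizing and counting steps. A minimal sketch of the normalizing step, assuming the count matrix X produced above, using scikit-learn's TfidfTransformer to reweight the raw counts:

from sklearn.feature_extraction.text import TfidfTransformer

# Apply tf-idf weighting and L2 normalization to the raw count matrix X
tfidf = TfidfTransformer()
X_normalized = tfidf.fit_transform(X)
print(X_normalized.toarray())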
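
To partially address limitation 1, CountVectorizer can also extract word n-grams, so short phrases become features in their own right. A sketch reusing data_corpus from above:

from sklearn.feature_extraction.text import CountVectorizer

# Word uni- and bi-grams; phrases like "watch movies" and "football games"
# now appear in the vocabulary
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
X_bigrams = bigram_vectorizer.fit_transform(data_corpus)
print(bigram_vectorizer.get_feature_names_out())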
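
For limitation 2, a character-level representation can be built with the same API; the analyzer and ngram_range values below are illustrative choices, not the only reasonable ones:

from sklearn.feature_extraction.text import CountVectorizer

# Character n-grams restricted to word boundaries; a misspelling such as
# "moviess" still shares most of its character n-grams with "movies"
char_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(3, 5))
X_chars = char_vectorizer.fit_transform(data_corpus)
print(X_chars.shape)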