I\'m looking for an effective way to construct a Term Document Matrix in Python that can be used together with extra data.
I have some text data with a few other att
The benefit of using a pandas
DataFrame
would be to apply the nltk
functionality to each row
like so:
word_file = "/usr/share/dict/words"
words = open(word_file).read().splitlines()[10:50]
random_word_list = [[' '.join(np.random.choice(words, size=1000, replace=True))] for i in range(50)]
df = pd.DataFrame(random_word_list, columns=['text'])
df.head()
text
0 Aaru Aaronic abandonable abandonedly abaction ...
1 abampere abampere abacus aback abalone abactor...
2 abaisance abalienate abandonedly abaff abacina...
3 Ababdeh abalone abac abaiser abandonable abact...
4 abandonable abandon aba abaiser abaft Abama ab...
len(df)
50
txt = df.text.apply(word_tokenize)
txt.head()
0 [Aaru, Aaronic, abandonable, abandonedly, abac...
1 [abampere, abampere, abacus, aback, abalone, a...
2 [abaisance, abalienate, abandonedly, abaff, ab...
3 [Ababdeh, abalone, abac, abaiser, abandonable,...
4 [abandonable, abandon, aba, abaiser, abaft, Ab...
txt.apply(len)
0 1000
1 1000
2 1000
3 1000
4 1000
....
44 1000
45 1000
46 1000
47 1000
48 1000
49 1000
Name: text, dtype: int64
As a result, you get the .count()
for each row
entry:
txt = txt.apply(lambda x: nltk.Text(x).count('abac'))
txt.head()
0 27
1 24
2 17
3 25
4 32
You can then sum the result using:
txt.sum()
1239