Uni-grams
topic1 -scuba,water,vapor,diving
topic2 -dioxide,plants,green,carbon
Given I have a dict called docs
, containing lists of words from documents, I can turn it into an array of words + bigrams (or also trigrams etc.) using nltk.util.ngrams or your own function like this:
from nltk.util import ngrams
for doc in docs:
docs[doc] = docs[doc] + ["_".join(w) for w in ngrams(docs[doc], 2)]
Then you pass the values of this dict to the LDA model as a corpus. Bigrams joined by underscores are thus treated as single tokens.