Is it possible to compute Document similarity for every document in an LDA corpus?

假如想象 submitted on 2020-03-21 05:12:29


I am going through this Notebook about LDA and Document Similarity:

In this Notebook, the document similarity is computed for a small set of documents; however, I want to compute the similarity for the whole corpus.

Instead of using test_df like in the Notebook:

new_bow = dictionary.doc2bow(test_df.iloc[random_article_index,7])
new_doc_distribution = np.array([tup[1] for tup in lda.get_document_topics(bow=new_bow)])

I want to use train_df:

new_bow= [id2word.doc2bow(doc) for doc in train_df['tokenized']]
new_doc_distribution = np.array([[tup[1] for tup in lst] for lst in model.get_document_topics(bow=new_bow)])

However, this does not work. My assumption is that it's not possible because the lists used to create the NumPy array (the tup[1] values for each document) are not all the same length, so a proper array, which is needed to compute the Jensen-Shannon divergence, cannot be created.
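To make the suspected problem concrete, here is a minimal sketch using only NumPy. It assumes the model returns, for each document, a list of (topic_id, probability) pairs that omits low-probability topics, which is why the lists differ in length. The data in sparse_docs is made up, and to_dense and jensen_shannon are hypothetical helpers, not functions from the Notebook or from gensim. get_document_topics also takes a minimum_probability argument that may reduce the filtering, but whether it yields every topic for every document is an assumption I have not verified.

```python
import numpy as np

def to_dense(sparse_docs, num_topics):
    """Hypothetical helper: expand ragged (topic_id, prob) lists into a
    dense (n_docs, num_topics) array, filling missing topics with 0."""
    dense = np.zeros((len(sparse_docs), num_topics))
    for row, doc in enumerate(sparse_docs):
        for topic_id, prob in doc:
            dense[row, topic_id] = prob
    return dense

def jensen_shannon(p, Q, eps=1e-12):
    """Jensen-Shannon divergence between one distribution p (1-D) and
    every row of Q (2-D). eps avoids log(0) for empty topics."""
    p = (p + eps) / (p + eps).sum()
    Q = (Q + eps) / (Q + eps).sum(axis=1, keepdims=True)
    M = 0.5 * (p + Q)
    kl_pm = np.sum(p * np.log(p / M), axis=1)
    kl_qm = np.sum(Q * np.log(Q / M), axis=1)
    return 0.5 * (kl_pm + kl_qm)

# Made-up stand-in for model.get_document_topics(bow=new_bow):
# each document lists only some of the 3 topics, so lengths differ.
sparse_docs = [
    [(0, 0.9), (2, 0.1)],
    [(1, 1.0)],
    [(0, 0.5), (1, 0.25), (2, 0.25)],
]

doc_topic = to_dense(sparse_docs, num_topics=3)

# Row i holds the divergence of document i to every document in the corpus.
distances = np.vstack([jensen_shannon(row, doc_topic) for row in doc_topic])
```

With the real model, sparse_docs would be replaced by the per-document output and num_topics by the number of topics the model was trained with. Note that the full distances matrix has one entry per pair of documents, so for a large train_df it may be more practical to compute one row at a time.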

Can somebody more experienced than me tell me if what I am trying is possible?

