Is it possible to compute Document similarity for every document in an LDA corpus?

Submitted by 假如想象 on 2020-03-21 05:12:29

Question


I am going through this Notebook about LDA and Document Similarity:

https://www.kaggle.com/ktattan/lda-and-document-similarity

In this notebook, document similarity is computed for a small set of documents; however, I want to compute the similarity for the whole corpus.

Instead of using test_df like in the Notebook:

# topic distribution for a single, randomly chosen article
new_bow = dictionary.doc2bow(test_df.iloc[random_article_index, 7])
new_doc_distribution = np.array([tup[1] for tup in lda.get_document_topics(bow=new_bow)])

I want to use train_df:

# attempt: topic distributions for every document in the training set
new_bow = [id2word.doc2bow(doc) for doc in train_df['tokenized']]
new_doc_distribution = np.array([[tup[1] for tup in lst] for lst in model.get_document_topics(bow=new_bow)])

However, this does not work. My assumption is that it fails because the per-document lists used to build the numpy array (the tup[1] values) are not all the same length: get_document_topics drops topics below a probability threshold, so each document can return a different number of topics. A ragged list like that cannot be turned into the proper 2D array needed to compute the Jensen-Shannon divergence.
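For illustration, here is a minimal sketch of what I assume would be needed: pass minimum_probability=0 so gensim does not drop low-probability topics, and fill a zero-initialised (n_documents × n_topics) matrix so every row has the same length. model, id2word and train_df are the objects from the notebook; the jensen_shannon helper mirrors the one the notebook uses.

import numpy as np
from scipy.stats import entropy

# bag-of-words representation of every training document
corpus_bow = [id2word.doc2bow(doc) for doc in train_df['tokenized']]

# dense (n_documents, n_topics) matrix of topic distributions;
# any topic gensim still omits simply stays at probability 0
doc_topic_dist = np.zeros((len(corpus_bow), model.num_topics))
for i, bow in enumerate(corpus_bow):
    for topic_id, prob in model.get_document_topics(bow, minimum_probability=0.0):
        doc_topic_dist[i, topic_id] = prob

def jensen_shannon(query, matrix):
    """Jensen-Shannon distance between one distribution and every row of matrix."""
    p = query[None, :].T  # shape (n_topics, 1)
    q = matrix.T          # shape (n_topics, n_docs)
    m = 0.5 * (p + q)
    return np.sqrt(0.5 * (entropy(p, m) + entropy(q, m)))

# e.g. distances from document 0 to every document in the corpus
distances = jensen_shannon(doc_topic_dist[0], doc_topic_dist)

Filling a zero-initialised matrix row by row, rather than building the array from the ragged lists directly, is what sidesteps the unequal-length problem described above.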

Can somebody more experienced than me tell me if what I am trying is possible?

Source: https://stackoverflow.com/questions/60383368/is-it-possible-to-compute-document-similarity-for-every-document-in-an-lda-corpu
