Question
I am going through this Notebook about LDA and Document Similarity:
https://www.kaggle.com/ktattan/lda-and-document-similarity
In this Notebook, document similarity is computed for a small set of documents; however, I want to compute the similarity for the whole corpus.
Instead of using test_df like in the Notebook:
new_bow = dictionary.doc2bow(test_df.iloc[random_article_index,7])
new_doc_distribution = np.array([tup[1] for tup in lda.get_document_topics(bow=new_bow)])
I want to use train_df:
new_bow = [id2word.doc2bow(doc) for doc in train_df['tokenized']]
new_doc_distribution = np.array([[tup[1] for tup in lst] for lst in model.get_document_topics(bow=new_bow)])
However, this does not work. My assumption is that it fails because the lists used to build the numpy array (the tup[1] values) are not all of the same length, presumably because get_document_topics drops topics whose probability falls below a threshold. So it is not possible to create a proper rectangular array, which is needed to compute the Jensen-Shannon divergence.
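To illustrate what I think I need: a fixed-width document-topic matrix where omitted topics simply keep probability 0, so every row has the same length. A minimal sketch, assuming model is a trained gensim LdaModel and corpus is the bag-of-words list built above (the helper name doc_topic_matrix is mine):

import numpy as np

# Assumption: `model` is a trained gensim LdaModel and `corpus` is a list of
# bag-of-words documents, e.g. [id2word.doc2bow(doc) for doc in train_df['tokenized']].
def doc_topic_matrix(model, corpus):
    # Dense (n_docs, num_topics) array; topics gensim omits for a document stay 0,
    # so every row has length num_topics regardless of how sparse the output is.
    mat = np.zeros((len(corpus), model.num_topics))
    for i, doc_topics in enumerate(model.get_document_topics(corpus, minimum_probability=0.0)):
        for topic_id, prob in doc_topics:
            mat[i, topic_id] = prob
    return mat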
Can somebody more experienced than me tell me if what I am trying is possible?
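For the similarity step itself, this is roughly what I have in mind, sketched with SciPy's built-in Jensen-Shannon distance (scipy.spatial.distance.jensenshannon) instead of the notebook's hand-rolled function; doc_topic is the dense matrix from the sketch above and most_similar is a name I made up:

from scipy.spatial.distance import jensenshannon
import numpy as np

def most_similar(doc_topic, query_index, top_n=10):
    # Jensen-Shannon distance between one query document and every document.
    query = doc_topic[query_index]
    distances = np.array([jensenshannon(query, row) for row in doc_topic])
    distances[query_index] = np.inf  # do not return the query document itself
    return np.argsort(distances)[:top_n]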
Source: https://stackoverflow.com/questions/60383368/is-it-possible-to-compute-document-similarity-for-every-document-in-an-lda-corpu