pyLDAvis visualization of pyspark generated LDA model

旧时难觅i · 2021-02-19 10:55

Does anyone have an example of visualizing an LDA model trained with the PySpark library (specifically using pyLDAvis)? I've seen a lot of examples for GenSim and other libraries, but not for PySpark.

2 Answers
  •  傲寒
     2021-02-19 11:47

    I haven't used pyLDAvis with PySpark's LDA, but here is an example of how to call the generic pyLDAvis.prepare for scikit-learn, without the dedicated pyLDAvis.sklearn.prepare helper.

    Here is a link to the source code of pyLDAvis.prepare: https://github.com/bmabey/pyLDAvis/blob/master/pyLDAvis/_prepare.py

    def prepare(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency):
        """Transforms the topic model distributions and related corpus data into
        the data structures needed for the visualization.

        Parameters
        ----------
        topic_term_dists : array-like, shape (n_topics, n_terms)
            Matrix of topic-term probabilities. Where n_terms is len(vocab).
        doc_topic_dists : array-like, shape (n_docs, n_topics)
            Matrix of document-topic probabilities.
        doc_lengths : array-like, shape n_docs
            The length of each document, i.e. the number of words in each document.
            The order of the numbers should be consistent with the ordering of the
            docs in doc_topic_dists.
        vocab : array-like, shape n_terms
            List of all the words in the corpus used to train the model.
        term_frequency : array-like, shape n_terms
            The count of each particular term over the entire corpus. The ordering
            of these counts should correspond with `vocab` and `topic_term_dists`.
        """
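    To make the expected shapes concrete, here is a minimal toy call with made-up numbers (not a real model) that satisfies the constraints above:

    import numpy as np
    import pyLDAvis

    # Toy numbers only: 3 topics, 4 terms, 2 documents.
    topic_term_dists = np.array([[0.50, 0.25, 0.15, 0.10],
                                 [0.10, 0.55, 0.25, 0.10],
                                 [0.10, 0.10, 0.20, 0.60]])  # (n_topics, n_terms), rows sum to 1
    doc_topic_dists = np.array([[0.70, 0.20, 0.10],
                                [0.15, 0.25, 0.60]])         # (n_docs, n_topics), rows sum to 1
    doc_lengths = [12, 8]                                    # words per document
    vocab = ["apple", "banana", "cherry", "durian"]          # len(vocab) == n_terms
    term_frequency = [7, 5, 4, 4]                            # corpus-wide counts, aligned with vocab
    panel = pyLDAvis.prepare(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency)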

    Example for sklearn.decomposition.LatentDirichletAllocation:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    import pyLDAvis

    # `data` is an iterable of raw text documents
    tfidf_vectorizer = TfidfVectorizer(max_df=0.95)
    tfidf = tfidf_vectorizer.fit_transform(data)
    lda = LatentDirichletAllocation(n_components=10)
    lda.fit(tfidf)

    # Normalize rows so each topic and each document is a probability distribution
    topic_term_dists = lda.components_ / lda.components_.sum(axis=1)[:, None]
    lda_doc_topic_dists = lda.transform(tfidf)
    doc_topic_dists = lda_doc_topic_dists / lda_doc_topic_dists.sum(axis=1)[:, None]

    # With a TfidfVectorizer these are tf-idf sums rather than raw counts;
    # use a CountVectorizer if you need exact word counts.
    doc_lengths = tfidf.sum(axis=1).getA1()
    term_frequency = tfidf.sum(axis=0).getA1()

    vocab = tfidf_vectorizer.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
    lda_pyldavis = pyLDAvis.prepare(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency)
    pyLDAvis.display(lda_pyldavis)
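
    For the PySpark case the question actually asks about, the same five inputs can be assembled from a CountVectorizer + pyspark.ml.clustering.LDA pipeline. The sketch below is illustrative and untested: it assumes a DataFrame `tokens_df` with a tokenized column "tokens", uses the default "features"/"topicDistribution" column names, and collects the corpus to the driver, so it is only practical when the corpus fits in memory.

    import numpy as np
    import pyLDAvis
    from pyspark.ml.feature import CountVectorizer
    from pyspark.ml.clustering import LDA

    # `tokens_df` is assumed to have a column "tokens" with tokenized documents
    cv_model = CountVectorizer(inputCol="tokens", outputCol="features").fit(tokens_df)
    counts_df = cv_model.transform(tokens_df)

    lda_model = LDA(k=10, featuresCol="features").fit(counts_df)
    transformed = lda_model.transform(counts_df)

    # topicsMatrix() is vocabSize x k; transpose and normalize rows -> (n_topics, n_terms)
    topic_term = lda_model.topicsMatrix().toArray().T
    topic_term_dists = topic_term / topic_term.sum(axis=1)[:, None]

    # Collect per-document vectors to the driver (small/medium corpora only)
    rows = transformed.select("features", "topicDistribution").collect()
    doc_topic_dists = np.array([r["topicDistribution"].toArray() for r in rows])
    counts = np.array([r["features"].toArray() for r in rows])

    doc_lengths = counts.sum(axis=1)        # words per document
    term_frequency = counts.sum(axis=0)     # corpus-wide term counts
    vocab = cv_model.vocabulary

    panel = pyLDAvis.prepare(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency)
    pyLDAvis.display(panel)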
    
