Question
Below I created a full reproducible example to compute the topic model for a given DataFrame.
import numpy as np
import pandas as pd

data = pd.DataFrame({'Body': ['Here goes one example sentence that is generic',
                              'My car drives really fast and I have no brakes',
                              'Your car is slow and needs no brakes',
                              'Your and my vehicle are both not as fast as the airplane']})

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(lowercase=True, analyzer='word')
data_vectorized = vectorizer.fit_transform(data.Body)

lda_model = LatentDirichletAllocation(n_components=4,
                                      learning_method='online',
                                      random_state=0,
                                      verbose=1)
lda_topic_matrix = lda_model.fit_transform(data_vectorized)
Question: Is it possible to filter documents by topic? If so, can a document have multiple topic tags, or is a threshold needed?
In the end, I would like to tag every document with "1" if it has a high loading on topic 2 and topic 3, and with "0" otherwise.
Answer 1:
lda_topic_matrix contains, for each document, its probability distribution over the topics/tags. In plain terms, each row sums to 1, and the value at each index is the probability that the document belongs to that topic. So every document carries all topic tags, each to a different degree. With 4 topics, a document that has all tags equally will have a row in lda_topic_matrix similar to [0.25, 0.25, 0.25, 0.25]. The row of a document with only a single topic ("0") will look something like [0.97, 0.01, 0.01, 0.01], and a document with two topics ("1" and "2") will have a distribution like [0.01, 0.54, 0.44, 0.01].
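To make the row interpretation concrete, here is a minimal sketch that uses the three illustrative rows above as a hand-written matrix (it is not real output from lda_model). The names example_matrix, row_sums and topics_by_rank are made up for this example.

```python
import numpy as np

# Illustrative rows copied from the examples above, not real model output.
example_matrix = np.array([
    [0.25, 0.25, 0.25, 0.25],  # all four topics equally likely
    [0.97, 0.01, 0.01, 0.01],  # dominated by topic 0
    [0.01, 0.54, 0.44, 0.01],  # mixture of topics 1 and 2
])

# Each row is a probability distribution over topics, so it sums to 1.
row_sums = example_matrix.sum(axis=1)

# Topic indices per document, ordered from most to least probable.
# kind="stable" keeps tied topics in their original index order.
topics_by_rank = np.argsort(-example_matrix, axis=1, kind="stable")

print(row_sums)
print(topics_by_rank)
```

The same two checks can be run on lda_topic_matrix itself to see how strongly each document loads on each topic.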
So the simplest approach is to select the topic with the highest probability and check whether it is 2 or 3:
main_topic_of_document = np.argmax(lda_topic_matrix, axis=1)
tagged = ((main_topic_of_document==2) | (main_topic_of_document==3)).astype(np.int64)
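If a document should be able to carry several topic tags, an alternative is to compare the probabilities against a cutoff instead of taking only the top topic. The sketch below is one possible way to do that. The helper tag_by_threshold, the cutoff of 0.3, and the hand-written example_matrix are all assumptions for illustration, and a suitable cutoff would depend on the data.

```python
import numpy as np

def tag_by_threshold(topic_matrix, topics=(2, 3), threshold=0.3, require_all=False):
    """Return 1 for each document whose chosen topics reach `threshold`, else 0.

    Hypothetical helper: with require_all=False a single topic above the
    cutoff is enough; with require_all=True every listed topic must reach it.
    """
    above = np.asarray(topic_matrix)[:, list(topics)] >= threshold
    hits = above.all(axis=1) if require_all else above.any(axis=1)
    return hits.astype(np.int64)

# Hand-written document-topic rows for illustration, not real model output.
example_matrix = np.array([
    [0.25, 0.25, 0.25, 0.25],
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.54, 0.44, 0.01],
    [0.02, 0.03, 0.45, 0.50],
])

tags_any = tag_by_threshold(example_matrix)                    # topic 2 OR topic 3
tags_all = tag_by_threshold(example_matrix, require_all=True)  # topic 2 AND topic 3
print(tags_any, tags_all)
```

Passing lda_topic_matrix instead of example_matrix applies the same rule to the fitted model. require_all=True matches the "topic 2 and topic 3" wording of the question, while the default matches the "2 or 3" check used with argmax.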
This article provides a good explanation of the inner mechanics of LDA.
Source: https://stackoverflow.com/questions/51448833/topicmodel-how-to-query-documents-by-topic-model-topic