Gensim LDA for text classification

问题

I post my question here because there are already some answers on how to use scikit methods with gensim like scikit vectorizers with gensim or this but I haven't seen the whole pipeline to be used for text classification. I will try to explain a little bit my situation

I want to use gensim LDA implemented methods in order to proceed further to text classification. I have one dataset which is consisted from three parts(train(25K), test(25K) and unlabeled data(50K)). What I am trying to do is to learn the latent topics space using the unlabeled data and then transform the train and test set into this learned latent topic space. I am currently using the Scikit Learn implemented methods in order to extract the BoW representation. Later, I am transforming to the required inputs for the LDA implementation and at the end I am transforming the train and test set into the extracted latent topic space. Finally, I am going back to csr matrices in order to fit a classifier and to obtain the accuracy. Although, it seems to me that everything is fine, the performance of the classifier is almost 0%. I am attaching part of the code in order to get some additional help or if there is something obvious that I am currently missing.

#bow representations for the three sets unlabelled, train and test
vectorizer = CountVectorizer(max_features=3000,stop_words='english')


corpus_tfidf_unsuper = vectorizer.fit_transform(train_data_unsupervised[:,2])
corpus_tfidf_train = vectorizer.transform(train_ds[:,2])
corpus_tfidf_test= vectorizer.transform(test_ds[:,2])

#transform to gensim acceptable objects
vocab = vectorizer.get_feature_names()
id2word_unsuper=dict([(i, s) for i, s in enumerate(vocab)])
corpus_vect_gensim_unsuper = matutils.Sparse2Corpus(corpus_tfidf_unsuper.T)
corpus_vect_gensim_train = matutils.Sparse2Corpus(corpus_tfidf_train.T)
corpus_vect_gensim_test = matutils.Sparse2Corpus(corpus_tfidf_test.T)

#fit the model to the unlabelled data
lda = models.LdaModel(corpus_vect_gensim_unsuper, 
                  id2word = id2word_unsuper, 
                  num_topics = 10,
                  passes=1)
#transform the train and test set to the latent topic space
docTopicProbMat_train = lda[corpus_vect_gensim_train]
docTopicProbMat_test = lda[corpus_vect_gensim_test]
#transform to csr matrices
train_lda=matutils.corpus2csc(docTopicProbMat_train)
test_lda=matutils.corpus2csc(docTopicProbMat_test)
#fit the classifier and print the accuracy
clf =LogisticRegression()    
clf.fit(train_lda.transpose(), np.array(train_ds[:,0]).astype(int))     
ypred = clf.predict(test_lda.transpose())
print accuracy_score(test_ds[:,0].astype(int), ypred)

This is my first post, so if there are potential remarks, please do not hesitate to inform me.

来源：https://stackoverflow.com/questions/31742630/gensim-lda-for-text-classification

标签

python

scikit-learn

lda

topic-modeling

gensim