Topic modelling, but with known topics?

Why not simply use a supervised topic model? Jonathan Chang's lda package in R has an slda function that is quite nice. There is also a very helpful demo: just install the package and run demo(slda).

I have a similar problem, and just thought I'd add the solutions I'm going with for completeness's sake.

  • I also have a set of documents (PDF documents anywhere from 1 to 200 pages), though mine are regular English text data.
  • A set of known topics (mine include subtopics, but I won't address that here). Unlike the original question, I may want multiple topic labels per document.
  • Words (standard English, though named entities and acronyms are included in my corpus).

LDA-esque approach: GuidedLDA

GuidedLDA lets you seed words for your LDA categories. If you have n topics for your final decisions, you just create your GuidedLDA model with n seed topics, each of which contains the keywords that make up its topic name. E.g., I want to cluster into the known topics "biochemistry" and "physics", so I seed my GuidedLDA with d = {0: ['biochemistry'], 1: ['physics']}. You can incorporate other guiding words if you can identify them; in any case, the GuidedLDA implementation I'm using (the Python version) makes it relatively easy to inspect the top n words for a given topic. You can run GuidedLDA once with only basic seed words, then use the top-words output to find more words to add to each topic. Those top words are also potentially helpful for the other approach I mention below.
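For concreteness, here is a minimal sketch of the seeding step, assuming the guidedlda Python package, an already-built document-term count matrix X, and a word2id vocabulary mapping (X and word2id are placeholder names for inputs you'd build yourself, e.g. with a CountVectorizer):

```python
import numpy as np
import guidedlda

# Assumed inputs (not shown here): X is a document-term count matrix and
# word2id maps each vocabulary word to its column index in X.
seed_topic_list = [["biochemistry"], ["physics"]]  # one list of seed words per known topic

# guidedlda expects seed_topics as a dict mapping vocabulary index -> topic index.
seed_topics = {}
for topic_id, seed_words in enumerate(seed_topic_list):
    for word in seed_words:
        if word in word2id:
            seed_topics[word2id[word]] = topic_id

model = guidedlda.GuidedLDA(n_topics=2, n_iter=100, random_state=7, refresh=20)
model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)

# Inspect the top n words per topic; these are candidates for extra seed words
# on a second run, or for the topic vectors in the ~KNN approach below.
vocab = sorted(word2id, key=word2id.get)
n_top_words = 10
for topic_id, word_dist in enumerate(model.topic_word_):
    top_words = [vocab[i] for i in np.argsort(word_dist)[-n_top_words:][::-1]]
    print(topic_id, top_words)
```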

Non-LDA-esque approach: ~KNN

What I've ended up doing is using a word embedding model (word2vec has been superior to the alternatives in my case) to create a "topic vector" for every topic based on the words that make up the topic/subtopic. E.g., I have a category Biochemistry with a subcategory Molecular Biology; the most basic topic vector is just the word2vec vectors for Biochemistry, Molecular, and Biology averaged together.

For every document I want to label, I turn it into a "document vector" using the same dimension and embedding model as the topic vectors. Simply averaging all the word2vec vectors in the document, after a bit of preprocessing such as removing stopwords, has been the best solution for me so far. Then I just find the k closest topic vectors to the input document vector. Note that there is some room to hand-tune this by changing the words that make up the topic vectors; one way to identify further keywords is the GuidedLDA model mentioned earlier.
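A minimal sketch of this topic-vector / nearest-topic idea, assuming a pretrained word2vec model loaded with gensim (the embeddings file name, the topic word lists, and doc_tokens, a pre-tokenized and stopword-filtered document, are all placeholders):

```python
import numpy as np
from gensim.models import KeyedVectors

# Placeholder path: any word2vec-format embeddings file works here.
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def average_vector(words, wv):
    """Average the word2vec vectors of the words that are in the vocabulary."""
    vecs = [wv[w] for w in words if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

# Topic vectors: average the embeddings of the words naming each topic/subtopic.
topics = {
    "biochemistry/molecular_biology": ["biochemistry", "molecular", "biology"],
    "physics": ["physics"],
}
topic_vecs = {name: average_vector(words, wv) for name, words in topics.items()}

def top_k_topics(doc_tokens, k=2):
    """Return the k topic names whose vectors are closest (cosine) to the document vector."""
    doc_vec = average_vector(doc_tokens, wv)
    sims = {
        name: np.dot(doc_vec, tv) / (np.linalg.norm(doc_vec) * np.linalg.norm(tv) + 1e-10)
        for name, tv in topic_vecs.items()
    }
    return sorted(sims, key=sims.get, reverse=True)[:k]
```

In this sketch, doc_tokens would be whatever tokens you extract from a PDF after preprocessing, and the topic word lists can be extended by hand or with the top words that GuidedLDA surfaces.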

I would note that when I tested these two solutions on a different corpus with labeled data (which I used only to evaluate accuracy and the like), this ~KNN approach proved better than the GuidedLDA approach.
