lda

Feature Engineering

眉间皱痕 Submitted on 2019-12-17 08:35:56
Last week I took part in my school's data mining competition. Broadly speaking, machine learning tasks that still require human intervention come down to two problems: (1) how to turn raw data into well-formed input, and (2) how to extract the regularities in that input. The answer to the first problem is feature engineering; the answer to the second is machine learning. Compared with machine learning algorithms, feature engineering looks unglamorous, yet it is extremely important: it is the most time-consuming, laborious, and tedious part of a machine learning pipeline, and also the most indispensable. Earlier practitioners have summarized this work well; standing on the shoulders of giants, I pay my respects to their efforts. The main content is reproduced from http://www.cnblogs.com/jasonfreak/p/5448385.html, with additions and modifications on top of that article; it is still being updated.

Feature Engineering

1. What is feature engineering?

A saying circulates in industry: data and features determine the upper bound of machine learning, and models and algorithms merely approach that bound. So what exactly is feature engineering? Data is a carrier of information, but raw data contains a great deal of noise and expresses its information inefficiently. The goal of feature engineering is therefore to re-express that information, through a series of engineering steps, in a more efficient encoding (features). Information represented as features suffers little loss, and the regularities contained in the raw data are preserved. In addition, the new encoding should minimize the influence of the uncertain elements in the raw data (white noise, anomalous records, missing values, and so on). As summarized by those who came before
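To make "re-encoding information as features" concrete, here is a minimal sketch, assuming scikit-learn is available (my own illustration, not from the original article): a numeric column is standardized and a categorical column is one-hot encoded, yielding a compact feature matrix.

```python
# Minimal feature-engineering sketch (illustrative, not from the original
# article): standardize a numeric column and one-hot encode a categorical one.
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder

ages = np.array([[23.0], [35.0], [61.0]])               # raw numeric column
cities = np.array([["london"], ["paris"], ["london"]])  # raw categorical column

age_features = StandardScaler().fit_transform(ages)              # zero mean, unit variance
city_features = OneHotEncoder().fit_transform(cities).toarray()  # 0/1 indicator columns

features = np.hstack([age_features, city_features])  # feature matrix fed to a model
print(features.shape)  # (3, 3): 1 scaled column + 2 city indicators
```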

Remove empty documents from DocumentTermMatrix in R topicmodels?

廉价感情. Submitted on 2019-12-17 08:24:23
Question: I am doing topic modelling using the topicmodels package in R. I am creating a Corpus object, doing some basic preprocessing, and then creating a DocumentTermMatrix:

corpus <- Corpus(VectorSource(vec), readerControl=list(language="en"))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
...snip removing several custom lists of
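The usual fix on the R side is to compute the row sums of the DocumentTermMatrix and keep only the rows that are greater than zero before calling LDA(). As a hedged Python/gensim analogue of the same idea (my own sketch, not the accepted answer), one can filter out documents whose bag-of-words vector is empty after preprocessing:

```python
from gensim import corpora

# Hypothetical analogue (not the R answer itself): drop documents that end up
# empty after preprocessing, since LDA cannot be fitted on empty rows.
docs = [["topic", "model"], [], ["lda", "gensim", "lda"]]  # toy tokenized docs
dictionary = corpora.Dictionary(docs)
bows = [dictionary.doc2bow(doc) for doc in docs]

# Keep only non-empty documents, remembering the original ids.
kept = [(i, bow) for i, bow in enumerate(bows) if bow]
ids, corpus = zip(*kept)
print(ids)  # (0, 2) -- document 1 was empty and has been dropped
```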

Spark MLlib LDA, how to infer the topic distribution of a new unseen document?

生来就可爱ヽ(ⅴ<●) Submitted on 2019-12-17 06:10:00
Question: I am interested in applying LDA topic modelling using Spark MLlib. I have checked the code and the explanations here, but I couldn't find how to then use the model to find the topic distribution of a new, unseen document.

Answer 1: As of Spark 1.5 this functionality has not been implemented for the DistributedLDAModel. What you need to do is convert your model to a LocalLDAModel using the toLocal method and then call the topicDistributions(documents: RDD[(Long, Vector)]) method, where
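For Python users, the newer DataFrame-based API (pyspark.ml) exposes this directly: the fitted model's transform() appends a topicDistribution column to any dataset, including unseen documents. A minimal sketch with made-up toy data:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.getOrCreate()
train = spark.createDataFrame([(0, ["a", "b", "a"]), (1, ["c", "d"])], ["id", "tokens"])

# Build count vectors and fit an LDA model with the DataFrame API.
cv = CountVectorizer(inputCol="tokens", outputCol="features").fit(train)
model = LDA(k=2, maxIter=10).fit(cv.transform(train))

# transform() scores any dataset, including documents never seen in training,
# appending a "topicDistribution" vector column.
new_docs = spark.createDataFrame([(99, ["a", "d"])], ["id", "tokens"])
model.transform(cv.transform(new_docs)).select("id", "topicDistribution").show(truncate=False)
```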

Preparing data for LDA in spark

半世苍凉 Submitted on 2019-12-14 03:44:37
Question: I'm working on implementing a Spark LDA model (via the Scala API), and am having trouble with the necessary formatting steps for my data. My raw data (stored in a text file) is in the following format, essentially a list of tokens and the documents they correspond to. A simplified example:

doc   XXXXX   term   XXXXX
1     x       'a'    x
1     x       'a'    x
1     x       'b'    x
2     x       'b'    x
2     x       'd'    x
...

where the XXXXX columns are garbage data I don't care about. I realize this is an atypical way of storing corpus data, but it's
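One way to get from that row-per-token layout to the count vectors LDA expects is to group the tokens by document and let CountVectorizer do the counting. A hedged pyspark sketch (the question targets the Scala API, but the DataFrame operations are the same; the data and names here are toy stand-ins):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.getOrCreate()

# Toy (doc, term) rows standing in for the parsed text file, junk columns dropped.
rows = [(1, "a"), (1, "a"), (1, "b"), (2, "b"), (2, "d")]
df = spark.createDataFrame(rows, ["doc", "term"])

# Collapse one-row-per-token into one token list per document.
docs = df.groupBy("doc").agg(F.collect_list("term").alias("tokens"))

# CountVectorizer turns the token lists into the count vectors LDA expects.
vectorized = CountVectorizer(inputCol="tokens", outputCol="features").fit(docs).transform(docs)
lda_model = LDA(k=2, maxIter=10).fit(vectorized)
```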

Term weighting for original LDA in gensim

六眼飞鱼酱① Submitted on 2019-12-13 05:02:12
Question: I am using the gensim library to apply LDA to a set of documents. With gensim I can apply LDA to a corpus whatever the term weights are: binary, tf, tf-idf... My question is: what term weighting should be used for the original LDA? If I have understood correctly, the weights should be term frequencies, but I am not sure.

Answer 1: It should be a corpus represented as a "bag of words". Or, yes, lists of term counts. The correct format is that of the corpus defined in the first tutorial
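Concretely, the format the answer describes is gensim's plain doc2bow output: raw term counts per document, with no binary or tf-idf transformation in between. A minimal sketch:

```python
from gensim import corpora, models

# The weighting the answer describes: plain per-document term counts (doc2bow),
# with no binary or tf-idf transformation applied before LdaModel.
texts = [["human", "interface", "computer"],
         ["survey", "user", "computer", "computer"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]  # [(term_id, count), ...] per doc

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)
print(corpus[1])  # raw counts, e.g. "computer" occurs twice in document 1
```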

Implementing Topic Model with Python (numpy)

≡放荡痞女 Submitted on 2019-12-12 08:54:11
Question: Recently, I implemented Gibbs sampling for the LDA topic model in Python using numpy, taking as a reference some code from a website. In each iteration of Gibbs sampling, we remove one (current) word, sample a new topic for that word according to a posterior conditional probability distribution inferred from the LDA model, and update the word-topic counts, as follows:

for m, doc in enumerate(docs):   # m: doc id
    for n, t in enumerate(doc):  # n: id of word inside document, t: id of the word globally
        #
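Since the snippet breaks off inside the loop, here is a hedged, self-contained sketch of the collapsed Gibbs update it describes; the count arrays (n_dt: doc-topic, n_wt: word-topic, n_t: per-topic totals) and the assignment structure z are my own names, not the original code's:

```python
import numpy as np

# Hedged sketch of one sweep of collapsed Gibbs sampling for LDA.
#   docs: list of documents, each a list of global word ids
#   z:    current topic assignment for every word position
#   n_dt: doc-topic counts, shape (D, K)
#   n_wt: word-topic counts, shape (V, K)
#   n_t:  total words per topic, shape (K,)
def gibbs_sweep(docs, z, n_dt, n_wt, n_t, alpha, beta):
    V = n_wt.shape[0]
    for m, doc in enumerate(docs):   # m: doc id
        for n, t in enumerate(doc):  # n: position in doc, t: global word id
            k = z[m][n]
            # Remove the current word's assignment from all counts.
            n_dt[m, k] -= 1; n_wt[t, k] -= 1; n_t[k] -= 1
            # Conditional posterior over topics, up to normalization.
            p = (n_wt[t] + beta) * (n_dt[m] + alpha) / (n_t + V * beta)
            k = np.random.choice(len(p), p=p / p.sum())
            # Put the word back under its (possibly new) topic.
            z[m][n] = k
            n_dt[m, k] += 1; n_wt[t, k] += 1; n_t[k] += 1
```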

Memory efficient LDA training using gensim library

社会主义新天地 Submitted on 2019-12-12 02:58:02
Question: Today I started writing a script which trains LDA models on large corpora (minimum 30M sentences) using the gensim library. Here is the current code that I am using:

import logging
from gensim import corpora, models, similarities, matutils

def train_model(fname):
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    dictionary = corpora.Dictionary(line.lower().split() for line in open(fname))
    print "DOC2BOW"
    corpus = [dictionary.doc2bow(line.lower().split()) for line
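The list comprehension above materializes the whole bag-of-words corpus in RAM, which is what hurts at 30M sentences. gensim only needs an iterable it can re-iterate, so the corpus can be streamed from disk instead; a hedged sketch (the file name and filter thresholds are placeholders):

```python
from gensim import corpora, models

# Hedged sketch ("corpus.txt" is a placeholder): stream the bag-of-words
# vectors from disk instead of building them all in a Python list. gensim only
# needs an object it can iterate over repeatedly, one pass per file read.
class StreamedCorpus(object):
    def __init__(self, fname, dictionary):
        self.fname = fname
        self.dictionary = dictionary

    def __iter__(self):
        with open(self.fname) as f:
            for line in f:
                yield self.dictionary.doc2bow(line.lower().split())

dictionary = corpora.Dictionary(line.lower().split() for line in open("corpus.txt"))
dictionary.filter_extremes(no_below=5, no_above=0.5)  # optional vocabulary pruning
corpus = StreamedCorpus("corpus.txt", dictionary)
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=100)
```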

Different results of LDA using R(topicmodels)

╄→гoц情女王★ Submitted on 2019-12-12 02:52:19
Question: I am using R topicmodels to train an LDA model on a small corpus, but I find that every time I run the same code I get different results (different topics and different topic terms). My question is: why do the same settings and the same corpus produce a different result every time, and what should I do to stabilize the result? Here is my code:

library(tm)
library(topicmodels)
cname <- file.path(".", "corpus", "train")
docs <- Corpus(DirSource(cname))
toSpace <- content_transformer(function(x, pattern
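The variation comes from the randomized initialization of the estimation procedure; in topicmodels it can be pinned via the seed entry of the control list passed to LDA(). The same issue arises in Python, where, as a hedged gensim example (not the R answer itself), random_state plays that role:

```python
from gensim import corpora, models

# Hedged gensim analogue (not the R answer): pin the RNG so repeated runs on
# the same corpus produce identical topics.
texts = [["apple", "banana"], ["banana", "cherry", "cherry"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=42)
```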

gensim.LDAMulticore throwing exception:

放肆的年华 Submitted on 2019-12-11 17:24:31
Question: I am running LdaMulticore from the Python gensim library, and the script cannot seem to create more than one thread. Here is the error:

Traceback (most recent call last):
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 97, in worker
    initializer(*initargs)
  File "/usr/lib64/python2
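The traceback is cut off before the actual exception, so the root cause is not visible here; one common pitfall is that LdaMulticore spawns worker processes via multiprocessing, so the training call should only be reachable from a guarded entry point. A hedged sketch of that arrangement, with toy data:

```python
from gensim import corpora, models

# Hedged sketch: LdaMulticore forks worker processes through multiprocessing,
# so the training call should sit behind the standard __main__ guard; the
# workers argument sets the number of extra worker processes.
def main():
    texts = [["alpha", "beta"], ["beta", "gamma", "gamma"]]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]
    lda = models.LdaMulticore(corpus, id2word=dictionary, num_topics=2, workers=3)
    print(lda.print_topics())

if __name__ == "__main__":
    main()
```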

DocumentTermMatrix() returns 0 terms in tm package

爱⌒轻易说出口 Submitted on 2019-12-11 14:51:15
Question: I have an object like this:

str(apps)
chr [1:17517] "35 44 33 40 33 40 44 38 33 37 37" ...

In each row, the numbers are separated by spaces.

corpus <- Corpus(VectorSource(apps))
dtm <- DocumentTermMatrix(corpus)
str(dtm)
List of 6
 $ i       : int(0)
 $ j       : int(0)
 $ v       : num(0)
 $ nrow    : int 17517
 $ ncol    : int 0
 $ dimnames:List of 2
  ..$ Docs : chr [1:17517] "1" "2" "3" "4" ...
  ..$ Terms: NULL
 - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
 - attr(*, "weighting")= chr [1:2] "term
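In tm, a common culprit for this is the default minimum term length of DocumentTermMatrix (wordLengths = c(3, Inf)), which silently discards two-character tokens such as "35"; passing control = list(wordLengths = c(1, Inf)) keeps them. The same class of surprise exists in Python tokenizers; a hedged scikit-learn sketch that makes the token pattern explicit so short numeric tokens survive:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hedged Python analogue (scikit-learn, not tm): make the token pattern
# explicit so short numeric tokens are not dropped by a length threshold.
apps = ["35 44 33 40 33 40 44 38 33 37 37", "12 35"]
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")  # keep 1+ character tokens
dtm = vectorizer.fit_transform(apps)

# get_feature_names_out() requires scikit-learn >= 1.0 (get_feature_names() before that).
print(vectorizer.get_feature_names_out())  # ['12' '33' '35' '37' '38' '40' '44']
print(dtm.shape)                           # (2, 7)
```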