lda

StepLDA without Cross Validation

自作多情 submitted on 2019-12-11 10:06:06
Question: I would like to select variables on the basis of the training error, so I set method in trainControl to "none". However, if I run the function below twice, I get two different errors (correctness rates). In this example the difference is negligible, but I would not have expected any difference at all. Does anybody know where this difference comes from? library(caret) c_1 <- trainControl(method = "none") maxvar <- (4) direction <- "forward" tune_1 <- data.frame

AttributeError: 'module' object has no attribute '__version__'

旧巷老猫 submitted on 2019-12-11 05:54:02
Question: I have installed the lda library (using pip). I have a very simple test script (the next two lines): import lda; print lda.datasets.load_reuters() But I keep getting the error AttributeError: 'module' object has no attribute 'datasets'. In fact, I get this error every time I access any attribute or function under lda! Answer 1: Do you have a module named lda.py or lda.pyc in the current directory? If so, your import statement is finding that module instead of the "real" lda module. Source: https://stackoverflow.com
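One way to check the shadowing explanation given in the answer is to ask Python which file an import would actually load. The following is a minimal sketch, written for Python 3 (the question itself uses Python 2 syntax); the helper name `module_origin` is hypothetical and introduced only for illustration.

```python
import importlib.util


def module_origin(name):
    """Return the file path that `import name` would load, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # If this prints a path in the current directory rather than in
    # site-packages, a local lda.py / lda.pyc is shadowing the installed package.
    print(module_origin("lda"))
```

After a successful import, inspecting `lda.__file__` gives the same information. If the path points at a local file, renaming that file (and deleting any stale `.pyc`) should let the installed package be found.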

How to implement Latent Dirichlet Allocation in regression analysis

╄→гoц情女王★ submitted on 2019-12-11 05:26:13
Question: I have a dataset consisting of hotel reviews, ratings, and other features such as traveler type and the word count of the review. I want to perform topic modeling (LDA) and use the topics derived from the reviews, together with the other features, to identify which features most affect the ratings (with rating as the dependent variable). If I want to use linear regression for this, does that mean I would have to label each review with the derived topics? Is there a way to do this in R or will I have to

PyMC3 how to implement latent dirichlet allocation?

半腔热情 submitted on 2019-12-11 02:19:24
Question: I am trying to implement LDA using PyMC3. However, when defining the last part of the model, in which words are sampled based on their topics, I keep getting the error: TypeError: list indices must be integers, not TensorVariable. How can I resolve this? The code is as follows: ## Data Preparation K = 2 # number of topics N = 4 # number of words D = 3 # number of documents import numpy as np data = np.array([[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0]]) Wd = [len(doc) for doc in data] # length

Should I use a TF-IDF corpus or just the plain corpus to infer documents using LDA?

删除回忆录丶 submitted on 2019-12-10 14:33:22
Question: I am wondering whether a TF-IDF corpus or the plain bag-of-words corpus should be used when inferring topics for documents with LDA in gensim. Here is an example: from gensim import corpora, models import numpy.random numpy.random.seed(10) doc0 = [(0, 1), (1, 1)] doc1 = [(0,1)] doc2 = [(0, 1), (1, 1)] doc3 = [(0, 3), (1, 1)] corpus = [doc0,doc1,doc2,doc3] dictionary = corpora.Dictionary(corpus) tfidf = models.TfidfModel(corpus) corpus_tfidf = tfidf[corpus] corpus_tfidf.save('x.corpus_tfidf')
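To make the difference between the two inputs concrete, here is a minimal pure-Python sketch that applies a simple TF-IDF weighting to the four documents from the question. The weighting used here (count times log2(N / document frequency), followed by L2 normalisation, with zero weights dropped) is an assumption chosen for illustration; it is not claimed to reproduce gensim's TfidfModel output exactly. The function name `tfidf_sketch` is hypothetical.

```python
import math


def tfidf_sketch(corpus):
    """Apply a simple TF-IDF weighting to a bag-of-words corpus.

    corpus: list of documents, each a list of (term_id, count) pairs.
    Weight = count * log2(N / document_frequency), then L2-normalised.
    Terms whose weight is zero are dropped.
    """
    n_docs = len(corpus)

    # Count in how many documents each term id occurs.
    doc_freq = {}
    for doc in corpus:
        for term_id, _ in doc:
            doc_freq[term_id] = doc_freq.get(term_id, 0) + 1

    weighted = []
    for doc in corpus:
        vec = [(t, c * math.log2(n_docs / doc_freq[t])) for t, c in doc]
        vec = [(t, w) for t, w in vec if w > 0.0]
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(t, w / norm) for t, w in vec])
    return weighted


if __name__ == "__main__":
    # Bag-of-words corpus copied from the question.
    corpus = [[(0, 1), (1, 1)], [(0, 1)], [(0, 1), (1, 1)], [(0, 3), (1, 1)]]
    for bow, weighted in zip(corpus, tfidf_sketch(corpus)):
        print(bow, "->", weighted)
```

Two things stand out under this weighting: the integer counts become real-valued weights, and term 0, which occurs in every document, receives weight zero and disappears. LDA's generative model is defined over word counts, so the two corpora present quite different inputs to the model.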

How do I get perplexity and log likelihood in Spark LDA? [closed]

試著忘記壹切 submitted on 2019-12-10 12:01:56
Question: (Closed as off-topic; not currently accepting answers.) I'm trying to get the perplexity and log likelihood of a Spark LDA model (with Spark 2.1). The code below does not work (the methods logLikelihood and logPerplexity are not found), although I can save the model. from pyspark.mllib.clustering import LDA from pyspark.mllib.linalg import Vectors # construct corpus # run LDA

How to use Topic Model (LDA) output to match and retrieve new, same-topic documents

北慕城南 submitted on 2019-12-10 11:51:54
Question: I am using an LDA model on a corpus to learn the topics covered in it. I am using the gensim package (e.g., gensim.models.ldamodel.LdaModel), but can easily use other implementations of LDA if necessary. My question is: what is the most efficient way to use the parameterized model and/or topic words or topic IDs to find and retrieve new documents that contain the topic? Concretely, I want to scrape a media API to find new articles (out-of-sample documents) that relate to my topics contained in my original

Cosine Similarity and LDA topics

余生长醉 submitted on 2019-12-10 11:06:39
Question: I want to compute the cosine similarity between LDA topics. The gensim function matutils.cossim can do this, but I don't know which parameters (vectors) to pass to it. Here is a code snippet: import numpy as np import lda from sklearn.feature_extraction.text import CountVectorizer cvectorizer = CountVectorizer(min_df=4, max_features=10000, stop_words='english') cvz = cvectorizer.fit_transform(tweet_texts_processed) n_topics = 8 n_iter = 500 lda_model = lda.LDA(n_topics=n_topics, n
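As a rough illustration of what comparing topics by cosine similarity involves, the sketch below treats each topic as a vector of word probabilities over the vocabulary and compares every pair. It assumes the topics are available as rows of a topics-by-vocabulary matrix (the `lda` package exposes one as `topic_word_` after fitting); the toy matrix and the helper names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def pairwise_topic_similarity(topic_word):
    """Return a K x K matrix of cosine similarities between topic rows."""
    k = len(topic_word)
    return [
        [cosine_similarity(topic_word[i], topic_word[j]) for j in range(k)]
        for i in range(k)
    ]


if __name__ == "__main__":
    # Hypothetical 3-topic, 3-word distribution; each row sums to 1.
    toy_topic_word = [
        [0.5, 0.5, 0.0],
        [0.0, 0.5, 0.5],
        [0.5, 0.5, 0.0],
    ]
    for row in pairwise_topic_similarity(toy_topic_word):
        print(row)
```

With a fitted model, the toy matrix would be replaced by the model's topic-word rows converted to lists. Sparse formats would need conversion to dense vectors (or a sparse-aware similarity function) first.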

From TF-IDF to LDA clustering in spark, pyspark

和自甴很熟 submitted on 2019-12-10 09:36:54
Question: I am trying to cluster tweets stored in the format key,listofwords. My first step has been to extract TF-IDF values for the list of words using a DataFrame: dbURL = "hdfs://pathtodir" file = sc.textFile(dbURL) #Define data frame schema fields = [StructField('key',StringType(),False),StructField('content',StringType(),False)] schema = StructType(fields) #Data in format <key>,<listofwords> file_temp = file.map(lambda l : l.split(",")) file_df = sqlContext.createDataFrame(file_temp, schema)

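Machine Learning (10): Dimensionality Reduction (PCA and LDA)

会有一股神秘感。 submitted on 2019-12-10 08:29:16
Machine Learning (10): Dimensionality Reduction (covering PCA and LDA). Background: In research and applications across many fields, one typically needs to observe data with many variables, collect a large amount of it, and then analyze it to find patterns. Large multivariate datasets undoubtedly provide rich information for research and applications, but they also increase the workload of data collection. More importantly, in many cases the variables may be correlated with one another, which makes the analysis more complex. Analyzing each indicator separately treats it in isolation and cannot make full use of the information in the data, while blindly reducing the number of indicators loses much useful information and leads to wrong conclusions. A sound method is therefore needed that reduces the number of indicators to analyze while losing as little of the information in the original indicators as possible, so that the collected data can be analyzed comprehensively. Because the variables are correlated to some degree, closely related variables can be combined into as few new variables as possible, with the new variables pairwise uncorrelated; a small number of composite indicators can then each represent the different kinds of information contained in the original variables. Principal component analysis and factor analysis are dimensionality reduction algorithms of this kind. Introduction: Dimensionality reduction is a preprocessing method for high-dimensional feature data. It keeps the most important features of high-dimensional data and removes noise and unimportant features, thereby speeding up data processing. In real production and applications, dimensionality reduction can save a great deal of time and cost within an acceptable range of information loss, and it has become a very widely used data preprocessing method. Dimensionality reduction has the following advantages: it makes the dataset easier to use; it lowers the computational cost of algorithms; it removes noise; it makes the results easier to understand. PCA PCA concept: PCA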
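The excerpt is cut off before its own explanation of PCA, so the following is only a generic sketch of the idea described above: correlated variables are replaced by a single new variable that captures most of their variance. It assumes small dense data given as a list of rows and uses power iteration on the covariance matrix to find the first principal component; power iteration is a simplification chosen to keep the example dependency-free, not necessarily the method the article goes on to describe. The function name is hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, projections) for the first principal component.

    rows: list of equal-length numeric lists (samples x features).
    direction: unit vector along which the centered data varies most.
    projections: coordinate of each centered sample along that direction.
    """
    n = len(rows)
    d = len(rows[0])

    # Center each feature at zero mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

    # Power iteration: repeatedly multiply by the covariance matrix and
    # renormalise. This converges to the dominant eigenvector unless the
    # start vector happens to be orthogonal to it.
    vec = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break  # data has no variance; keep the current vector
        vec = [x / norm for x in nxt]

    projections = [sum(c[j] * vec[j] for j in range(d)) for c in centered]
    return vec, projections


if __name__ == "__main__":
    # Two perfectly correlated variables (y = 2x): one component suffices.
    data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
    direction, reduced = first_principal_component(data)
    print(direction)
    print(reduced)
```

In this toy dataset the two variables carry the same information, so the one-dimensional projections lose nothing. With real data, further components would be extracted and the number kept would be chosen from the variance each one explains.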