lda

Preparing data for LDA in spark

不想你离开。 Submitted on 2019-12-06 10:44:15
I'm working on implementing a Spark LDA model (via the Scala API), and am having trouble with the necessary formatting steps for my data. My raw data (stored in a text file) is in the following format, essentially a list of tokens and the documents they correspond to. A simplified example:

doc XXXXX term XXXXX
1 x 'a' x
1 x 'a' x
1 x 'b' x
2 x 'b' x
2 x 'd' x
...

Where the XXXXX columns are garbage data I don't care about. I realize this is an atypical way of storing corpus data, but it's what I have. As I hope is clear from the example, there's one line per token in the raw data (so if a
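A minimal sketch of one way to do this in pyspark (the question itself targets the Scala API, so treat this as an illustration of the steps rather than the asker's code); the file name and the column positions are assumptions:

from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.clustering import LDA

sc = SparkContext(appName="lda-prep")

# Each raw line looks like: doc XXXXX term XXXXX ; keep column 0 (doc id) and column 2 (term).
pairs = (sc.textFile("tokens.txt")
           .map(lambda line: line.split())
           .map(lambda cols: (int(cols[0]), cols[2])))

# Vocabulary: term -> integer index.
vocab = pairs.values().distinct().zipWithIndex().collectAsMap()
vocab_size = len(vocab)

# Count (doc, term) occurrences and assemble one sparse count vector per document.
def to_vector(term_counts):
    entries = sorted((vocab[t], float(c)) for t, c in term_counts)
    return Vectors.sparse(vocab_size, entries)

corpus = (pairs.map(lambda dt: (dt, 1))
               .reduceByKey(lambda a, b: a + b)                  # ((doc, term), count)
               .map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))    # (doc, (term, count))
               .groupByKey()
               .map(lambda kv: [kv[0], to_vector(kv[1])])        # [doc_id, SparseVector]
               .cache())

lda_model = LDA.train(corpus, k=10)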

How to evaluate the best K for LDA using Mallet?

给你一囗甜甜゛ Submitted on 2019-12-06 09:55:04
Question: I am using the Mallet API to extract topics from Twitter data, and the topics I have already extracted seem good. But I am having trouble estimating K. For example, I varied K from 10 to 100, so I have extracted different numbers of topics from the data, and now I would like to estimate which K is best. There are some algorithms I know of, such as perplexity, empirical likelihood, marginal likelihood (harmonic mean method), and silhouette. I found a method model.estimate() which may be used to
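The excerpt cuts off before any answer, but a common approach is to sweep K and score each model with held-out perplexity. Below is a minimal sketch using scikit-learn's LatentDirichletAllocation rather than Mallet (a swapped-in library, for illustration only); the document-term count matrix X is assumed to exist:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

# Sweep K and keep the value with the lowest held-out perplexity.
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)  # X: doc-term counts

scores = {}
for k in range(10, 101, 10):
    lda = LatentDirichletAllocation(n_components=k, max_iter=20, random_state=0)
    lda.fit(X_train)
    scores[k] = lda.perplexity(X_test)  # lower is better

best_k = min(scores, key=scores.get)
print(best_k, scores[best_k])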

Machine Learning: Data Cleaning and Feature Selection

孤街醉人 Submitted on 2019-12-06 07:01:23
Data cleaning and feature selection

Data cleaning - the cleaning process:

Data preprocessing: choose the data-processing tools (a database, the relevant Python packages); inspect the data's metadata and overall characteristics.

Cleaning anomalous samples: fix records whose format or content is wrong; fix logically inconsistent data (deduplicate, remove or replace unreasonable values, remove or reconstruct unreliable field values); drop data that is not needed (back up the original data before doing so); fix data that fails cross-source consistency checks, which mostly comes up when merging multiple data sources.

Sampling: handling imbalanced data (over-sampling, under-sampling, the SMOTE algorithm); sample weighting.

Class imbalance: in practice the data distribution is often uneven and shows a "long-tail" phenomenon, i.e. the vast majority of the data falls in one range or belongs to one class, while only a small fraction falls in another range or class. Applying machine learning directly to such data does not work well, so the data has to be transformed first.

Long-tail effect, solution 1: weight the loss function so that misclassifying a minority-class sample costs more than misclassifying a majority-class sample. When a minority-class sample is predicted wrongly it produces a relatively large loss, which pushes the model parameters toward classifying the minority class correctly. In sklearn this can be done through the class_weight parameter (a minimal sketch follows below).

Solution 2: down-sampling/under-sampling: randomly draw samples from the majority class to reduce the number of majority-class samples until the data is balanced. Ensemble under-sampling: plain under-sampling loses information
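A minimal sketch of the two imbalance strategies described above, using scikit-learn; the toy data here is an assumption for illustration only:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.05).astype(int)   # roughly 5% minority class

# Solution 1: weight the loss so minority-class mistakes cost more.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Solution 2: random under-sampling of the majority class until the classes balance.
minority_idx = np.flatnonzero(y == 1)
majority_idx = rng.choice(np.flatnonzero(y == 0), size=len(minority_idx), replace=False)
keep = np.concatenate([minority_idx, majority_idx])
clf_under = LogisticRegression().fit(X[keep], y[keep])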

Cosine Similarity and LDA topics

大憨熊 Submitted on 2019-12-06 04:24:29
I want to compute cosine similarity between LDA topics. In fact, the gensim function matutils.cossim can do it, but I don't know which parameter (vector) I should pass to this function. Here is a snippet of the code:

import numpy as np
import lda
from sklearn.feature_extraction.text import CountVectorizer

cvectorizer = CountVectorizer(min_df=4, max_features=10000, stop_words='english')
cvz = cvectorizer.fit_transform(tweet_texts_processed)
n_topics = 8
n_iter = 500
lda_model = lda.LDA(n_topics=n_topics, n_iter=n_iter)
X_topics = lda_model.fit_transform(cvz)
n_top_words = 6
topic_summaries = []
topic_word =
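A minimal sketch of one way to finish this, assuming the lda.LDA model above has been fitted: the lda package exposes the topic-word distribution as lda_model.topic_word_ (one row per topic), and cosine similarity between those rows can be computed directly with scikit-learn, while gensim's matutils.cossim instead wants each topic converted to a sparse list of (term_id, weight) pairs.

from sklearn.metrics.pairwise import cosine_similarity
from gensim import matutils

topic_word = lda_model.topic_word_            # shape: (n_topics, n_vocab)

# Option 1: pairwise cosine similarity of the dense topic rows.
sims = cosine_similarity(topic_word)          # sims[i, j] = similarity of topics i and j

# Option 2: the gensim route - cossim expects sparse (id, weight) lists per topic.
def to_sparse(row):
    return [(i, float(w)) for i, w in enumerate(row) if w > 0]

sim_0_1 = matutils.cossim(to_sparse(topic_word[0]), to_sparse(topic_word[1]))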

Spark LDA woes - prediction and OOM questions

安稳与你 Submitted on 2019-12-05 21:43:28
I'm evaluating Spark 1.6.0 to build and predict against large (millions of docs, millions of features, thousands of topics) LDA models, something I can accomplish pretty easily with Yahoo! LDA. Starting small, following the Java examples, I built a 100K doc/600K feature/250 topic/100 iteration model using the Distributed model/EM optimizer. The model built fine and the resulting topics were coherent. I then wrote a wrapper around the new single-document prediction routine (SPARK-10809; which I cherry-picked into a custom Spark 1.6.0-based distribution) to get topics for new, unseen documents
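For orientation, a minimal sketch of the same build-then-predict flow in the later pyspark DataFrame API (the question uses Spark 1.6's RDD API plus a cherry-picked patch, so this is only an illustration of the shape of the workflow); the DataFrames, column names and sizes are assumptions:

from pyspark.sql import SparkSession
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("lda-em").getOrCreate()

# train_df / new_df are assumed DataFrames with a "features" column of term-count vectors.
lda = LDA(k=250, maxIter=100, optimizer="em", featuresCol="features")
model = lda.fit(train_df)

# transform() appends a "topicDistribution" column holding each document's topic mixture.
predicted = model.transform(new_df)
predicted.select("topicDistribution").show(truncate=False)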

From TF-IDF to LDA clustering in spark, pyspark

那年仲夏 Submitted on 2019-12-05 16:57:48
I am trying to cluster tweets stored in the format key,listofwords. My first step has been to extract TF-IDF values for the list of words using a dataframe, with

dbURL = "hdfs://pathtodir"
file = sc.textFile(dbURL)

# Define data frame schema
fields = [StructField('key', StringType(), False), StructField('content', StringType(), False)]
schema = StructType(fields)

# Data in format <key>,<listofwords>
file_temp = file.map(lambda l: l.split(","))
file_df = sqlContext.createDataFrame(file_temp, schema)

# Extract TF-IDF, from https://spark.apache.org/docs/1.5.2/ml-features.html
tokenizer = Tokenizer(inputCol=
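The snippet cuts off at the tokenizer; a minimal sketch of how the rest of the pipeline typically continues, with column names and parameters assumed. Note that Spark's LDA is normally fed raw term counts rather than TF-IDF weights, which is why a CountVectorizer stands in for the TF-IDF step here:

from pyspark.ml.feature import Tokenizer, CountVectorizer
from pyspark.mllib.clustering import LDA

tokenizer = Tokenizer(inputCol="content", outputCol="words")
words_df = tokenizer.transform(file_df)

# LDA models raw term counts, so CountVectorizer is used instead of TF-IDF weights.
cv = CountVectorizer(inputCol="words", outputCol="features", vocabSize=10000, minDF=2.0)
counts_df = cv.fit(words_df).transform(words_df)

# mllib's LDA wants an RDD of [doc_id, count_vector].
# (On Spark 2.x the ml vectors need converting with pyspark.mllib.linalg.Vectors.fromML.)
corpus = (counts_df.select("features").rdd
          .map(lambda row: row.features)
          .zipWithIndex()
          .map(lambda x: [x[1], x[0]])
          .cache())

lda_model = LDA.train(corpus, k=10, maxIterations=50)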

Latent Dirichlet allocation (LDA) in Spark - replicate model

廉价感情. Submitted on 2019-12-04 23:43:23
Question: I want to save the LDA model from the pyspark ml-clustering package and apply the model to the training & test data-set after saving. However, results diverge despite setting a seed. My code is the following:

1) Import packages

from pyspark.ml.clustering import LocalLDAModel, DistributedLDAModel
from pyspark.ml.feature import CountVectorizer, IDF

2) Preparing the dataset

countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete", outputCol="raw_features", vocabSize
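The excerpt stops at the CountVectorizer. A minimal sketch of the save-and-reload part of the workflow the question describes (column names, paths and parameter values are assumptions; the online optimizer produces a LocalLDAModel, which is why that class does the loading):

from pyspark.ml.clustering import LDA, LocalLDAModel

# Fit LDA with a fixed seed, persist it, reload it, and re-apply it.
lda = LDA(k=20, maxIter=50, seed=123, optimizer="online", featuresCol="features")
model = lda.fit(train_df)

model.save("hdfs:///models/lda_model")                   # path is an assumption
reloaded = LocalLDAModel.load("hdfs:///models/lda_model")

# Re-apply the saved model; the question reports the outputs still diverge despite the seed.
out_train = reloaded.transform(train_df)
out_test = reloaded.transform(test_df)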

The accuracy of LDA prediction for new documents with Spark

寵の児 Submitted on 2019-12-04 21:12:38
I'm working with Spark MLlib, and am now doing something with LDA. But when I use the code provided by Spark (see below) to predict a document that was used in training the model, the predicted document-topics result is at opposite poles from the trained document-topics result. I don't know what caused this. Asking for help; here is my code below:

train:

lda.run(corpus)

the corpus is an RDD like this:

RDD[(Long, Vector)]

the Vector contains the vocabulary, the indices of the words, and the word counts.

predict:

def predict(documents: RDD[(Long, Vector)], ldaModel: LDAModel): Array[(Long, Vector)] = { var

Understanding Spark MLlib LDA input format

雨燕双飞 Submitted on 2019-12-04 19:14:16
I am trying to implement LDA using Spark MLlib, but I am having difficulty understanding the input format. I was able to run its sample implementation, which takes input from a file containing only numbers, as shown:

1 2 6 0 2 3 1 1 0 0 3
1 3 0 1 3 0 0 2 0 0 1
1 4 1 0 0 4 9 0 1 2 0
2 1 0 3 0 0 5 0 2 3 9
3 1 1 9 3 0 2 0 0 1 3
4 2 0 3 4 5 1 1 1 4 0
2 1 0 3 0 0 5 0 2 2 9
1 1 1 9 2 1 2 0 0 1 3
4 4 0 3 4 2 1 3 0 0 0
2 8 2 0 3 0 2 0 2 7 2
1 1 1 9 0 2 2 0 0 3 3
4 1 0 0 4 5 1 3 0 1 0

I followed http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda I understand the
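For context, a minimal sketch of the Python version of the example from the linked docs: each line of the file is one document, and each of the 11 whitespace-separated numbers is the count of the corresponding vocabulary word in that document. The file path is the sample data shipped with the Spark distribution.

from pyspark import SparkContext
from pyspark.mllib.clustering import LDA
from pyspark.mllib.linalg import Vectors

sc = SparkContext(appName="lda-input-format")

# Each row of the file is a document's term-count vector (one column per vocabulary word).
data = sc.textFile("data/mllib/sample_lda_data.txt")
parsed = data.map(lambda line: Vectors.dense([float(x) for x in line.strip().split(' ')]))

# Index documents with unique IDs to form the (doc_id, term_counts) corpus LDA expects.
corpus = parsed.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()

lda_model = LDA.train(corpus, k=3)

# topicsMatrix() is (vocab_size x k): column j holds the term weights of topic j.
print(lda_model.topicsMatrix())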

[Natural Language Processing] Using LDA for topic analysis of the Hillary emails

我与影子孤独终老i Submitted on 2019-12-04 18:00:33
First, read the data set and drop the rows whose ExtractedBodyText column in the csv is empty:

import pandas as pd
import re
import os

dir_path = os.path.dirname(os.path.abspath(__file__))
data_path = dir_path + "/Database/HillaryEmails.csv"
df = pd.read_csv(data_path)
df = df[['Id', 'ExtractedBodyText']].dropna()

Not every word in these emails is meaningful, so some noise has to be removed first:

def clean_email_text(text):
    text = text.replace('\n', " ")  # newlines are not needed
    text = re.sub(r"-", " ", text)  # split words joined by "-" (e.g. july-edu ==> july edu)
    text = re.sub(r"\d+/\d+/\d+", "", text)  # dates are meaningless to the topic model
    text = re.sub(r"[0-2]?[0-9]:[0-6][0-9]", "", text)  # times are meaningless
    text = re.sub(r"[\w]+@[\.\w]+", "", text)  # email addresses are meaningless
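The excerpt cuts off inside the cleaning function, but the article is heading toward fitting a topic model on the cleaned bodies. A minimal sketch of that step with gensim, assuming clean_email_text has been completed to return the cleaned string; the tokenization and parameter choices here are assumptions, not the article's actual code:

from gensim import corpora, models

# Tokenize the cleaned email bodies and fit a gensim LDA model.
docs = [clean_email_text(t).lower().split() for t in df['ExtractedBodyText']]

dictionary = corpora.Dictionary(docs)               # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words per email

lda = models.LdaModel(corpus, num_topics=20, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics(num_topics=5, num_words=6):
    print(topic_id, words)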