From TF-IDF to LDA clustering in Spark (PySpark)

Submitted anonymously (unverified) on 2019-12-03 00:59:01

Question:

I am trying to cluster tweets stored in the format key,listofwords.

My first step has been to extract TF-IDF values for the lists of words using a DataFrame:

```python
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

dbURL = "hdfs://pathtodir"
file = sc.textFile(dbURL)

# Define data frame schema; data is in format <key>,<listofwords>
fields = [StructField('key', StringType(), False),
          StructField('content', StringType(), False)]
schema = StructType(fields)
file_temp = file.map(lambda l: l.split(","))
file_df = sqlContext.createDataFrame(file_temp, schema)

# Extract TF-IDF, from https://spark.apache.org/docs/1.5.2/ml-features.html
tokenizer = Tokenizer(inputCol='content', outputCol='words')
wordsData = tokenizer.transform(file_df)
hashingTF = HashingTF(inputCol='words', outputCol='rawFeatures', numFeatures=1000)
featurizedData = hashingTF.transform(wordsData)
idf = IDF(inputCol='rawFeatures', outputCol='features')
idfModel = idf.fit(featurizedData)
rescaled_data = idfModel.transform(featurizedData)
```
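For reference, what the HashingTF and IDF stages compute can be sketched in plain Python. This is a simplified illustration, not Spark's implementation: Python's built-in hash stands in for Spark's hash function (so bucket indices will differ), while the IDF formula log((m + 1) / (df + 1)) matches the one Spark's IDF uses.

```python
import math

def hashing_tf(words, num_features=1000):
    # Hash each word into one of num_features buckets and count occurrences.
    # Spark uses its own hash function, so the bucket indices differ from these.
    tf = [0.0] * num_features
    for w in words:
        tf[hash(w) % num_features] += 1.0
    return tf

def idf_weights(docs_tf, num_features=1000):
    # Spark-style IDF per feature index: log((m + 1) / (df + 1)),
    # where m is the number of documents and df the document frequency.
    m = len(docs_tf)
    df = [sum(1 for tf in docs_tf if tf[i] > 0) for i in range(num_features)]
    return [math.log((m + 1) / (df_i + 1)) for df_i in df]

docs = [["spark", "lda", "tweets"], ["spark", "tfidf"], ["tweets", "lda", "lda"]]
docs_tf = [hashing_tf(d) for d in docs]
idf = idf_weights(docs_tf)
docs_tfidf = [[tf_i * idf_i for tf_i, idf_i in zip(tf, idf)] for tf in docs_tf]
```

Rare terms get a larger IDF weight than terms occurring in every document, which is what rescales the raw counts into the `features` column above.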

Following the suggestion from Preparing data for LDA in Spark, I tried to reformat the output into what I expect LDA to accept as input. Based on this example, I started with:

```python
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol='key', outputCol='KeyIndex')
indexed_data = (indexer.fit(rescaled_data).transform(rescaled_data)
                .drop('key').drop('content').drop('words').drop('rawFeatures'))
```
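What StringIndexer does here is just assign each distinct key a numeric index, most frequent first. A rough plain-Python equivalent of that mapping (tie-breaking may differ from Spark's):

```python
from collections import Counter

keys = ["u1", "u2", "u1", "u3", "u1", "u2"]  # hypothetical 'key' column values

# Most frequent key -> index 0.0, next -> 1.0, ... (ties broken here by key order)
freq = Counter(keys)
ordered = sorted(freq, key=lambda k: (-freq[k], k))
key_index = {k: float(i) for i, k in enumerate(ordered)}
indexed = [key_index[k] for k in keys]  # -> [0.0, 1.0, 0.0, 2.0, 0.0, 1.0]
```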

But now I cannot find a good way to turn my DataFrame into the format proposed in the previous example or in this example.

I would be very grateful if someone could point me to the correct place to look at or could correct me if my approach is wrong.

I assumed that extracting TF-IDF vectors from a series of documents and clustering them would be a fairly classical task, but I cannot find an easy way to do it.

Answer 1:

LDA expects an (id, features) pair as input, so assuming that KeyIndex serves as an ID:

```python
from pyspark.mllib.clustering import LDA
from pyspark.sql.functions import col

k = ...  # number of clusters
corpus = indexed_data.select(col("KeyIndex").cast("long"), "features").map(list)
model = LDA.train(corpus, k=k)
```


Answer 2:

LDA does not take the TF-IDF matrix as input. It only takes the TF (term-count) matrix. For example:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.clustering import LDA

tokenizer = Tokenizer(inputCol="hashTagDocument", outputCol="words")

stopWordsRemover = StopWordsRemover(inputCol="words", outputCol="filtered",
                                    stopWords=stopwords)

vectorizer = CountVectorizer(inputCol="filtered", outputCol="features",
                             vocabSize=40000, minDF=5)

lda = LDA(k=..., featuresCol="features")  # k: number of topics

pipeline = Pipeline(stages=[tokenizer, stopWordsRemover, vectorizer, lda])
pipelineModel = pipeline.fit(corpus)

pipelineModel.stages
```
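To see why the distinction matters: LDA's generative model is defined over raw word counts, so the `features` column fed to it should hold the integer term counts that CountVectorizer produces, not fractional TF-IDF weights. A toy illustration in plain Python (hypothetical corpus, no Spark):

```python
import math

docs = [["big", "data", "spark"], ["big", "big", "tweets"], ["spark", "tweets", "lda"]]
vocab = sorted({w for d in docs for w in d})  # CountVectorizer builds such a vocabulary

# TF: raw counts per document -- this is the input LDA expects
tf = [[d.count(w) for w in vocab] for d in docs]

# TF-IDF: fractional weights -- useful for distance-based clustering, not for LDA
m = len(docs)
df = [sum(1 for d in docs if w in d) for w in vocab]
tfidf = [[c * math.log((m + 1) / (df_j + 1)) for c, df_j in zip(row, df)] for row in tf]
```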

