I am trying to cluster tweets stored in the format key,listofwords.
My first step has been to extract TF-IDF values for the lists of words, using a DataFrame:
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

dbURL = "hdfs://pathtodir"
file = sc.textFile(dbURL)
# Define data frame schema
fields = [StructField('key', StringType(), False),
          StructField('content', StringType(), False)]
schema = StructType(fields)
# Data in format <key>,<listofwords>
# Split only on the first comma so the whole word list stays in one field
file_temp = file.map(lambda l: l.split(",", 1))
file_df = sqlContext.createDataFrame(file_temp, schema)
# Extract TF-IDF, following https://spark.apache.org/docs/1.5.2/ml-features.html
tokenizer = Tokenizer(inputCol='content', outputCol='words')
wordsData = tokenizer.transform(file_df)
hashingTF = HashingTF(inputCol='words', outputCol='rawFeatures', numFeatures=1000)
featurizedData = hashingTF.transform(wordsData)
idf = IDF(inputCol='rawFeatures', outputCol='features')
idfModel = idf.fit(featurizedData)
rescaled_data = idfModel.transform(featurizedData)
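To check that this step worked, the features column can be inspected; a quick sanity check using the names defined above:

rescaled_data.select('key', 'features').show(5)
# Each row should hold a SparseVector of length numFeatures=1000
# containing the TF-IDF weights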
Following the suggestion from "Preparing data for LDA in Spark", I tried to reformat the output into what I expect LDA to accept as input. Based on that example, I started with:
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol='key', outputCol='KeyIndex')
indexed_data = (indexer.fit(rescaled_data).transform(rescaled_data)
                .drop('key').drop('content').drop('words').drop('rawFeatures'))
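At this point only the numeric ID and the feature vector should remain; a minimal check with the names above:

indexed_data.printSchema()
# Expected: two columns, 'features' (vector) and 'KeyIndex' (double)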
But now I cannot find a good way to turn my DataFrame into the format proposed in the previous example, or in this example.
I would be very grateful if someone could point me to the right place to look, or could correct my approach if it is wrong.
I supposed that extracting TF-IDF vectors from a series of documents and clustering them would be a fairly classical thing to do, but I fail to find an easy way to do it.
LDA expects an RDD of (id, features) pairs as input, so assuming that KeyIndex serves as an ID:
from pyspark.sql.functions import col
from pyspark.mllib.clustering import LDA

k = ...  # number of topics

# Convert each row to an [id, features] pair; on Spark 2.x use .rdd.map(list)
corpus = indexed_data.select(col("KeyIndex").cast("long"), "features").map(list)
model = LDA.train(corpus, k=k)
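The learned topics can then be inspected; a minimal sketch adapted from the Spark MLlib docs, assuming the model trained above (each column of topicsMatrix() is a distribution over the hashed vocabulary):

topics = model.topicsMatrix()  # vocabSize x k matrix of term weights
for topic in range(k):
    print("Topic " + str(topic) + ":")
    # weights of the first 10 hashed terms for this topic
    for term in range(10):
        print(" " + str(topics[term][topic]))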
LDA does not take the TF-IDF matrix as input; it only takes the term-frequency (TF) matrix. For example:
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.clustering import LDA

# corpus: a DataFrame with a string column 'hashTagDocument'
# stopwords: a Python list of words to filter out
tokenizer = Tokenizer(inputCol="hashTagDocument", outputCol="words")
stopWordsRemover = StopWordsRemover(inputCol="words", outputCol="filtered",
                                    stopWords=stopwords)
vectorizer = CountVectorizer(inputCol="filtered", outputCol="features",
                             vocabSize=40000, minDF=5)
lda = LDA(k=10, maxIter=10)  # placeholder values; reads the default 'features' column

pipeline = Pipeline(stages=[tokenizer, stopWordsRemover, vectorizer, lda])
pipelineModel = pipeline.fit(corpus)
pipelineModel.stages
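To get the per-document topic mixtures and the top terms per topic out of the fitted pipeline, something like the following should work; a sketch assuming the corpus DataFrame above (describeTopics reports term indices into the CountVectorizer vocabulary):

# The fitted LDA model is the last pipeline stage
ldaModel = pipelineModel.stages[-1]

# transform() adds a 'topicDistribution' vector column per document
pipelineModel.transform(corpus).select("topicDistribution").show(5)

# Top terms per topic, as indices into the vectorizer's vocabulary
ldaModel.describeTopics(maxTermsPerTopic=10).show()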
Source: https://stackoverflow.com/questions/35583702/from-tf-idf-to-lda-clustering-in-spark-pyspark