I have data that arrives from Kafka through a DStream. I want to perform feature extraction on it in order to obtain some keywords.
I do not want to wait for all the data to arrive (it is intended to be a continuous stream that potentially never ends), so I hope to perform the extraction in chunks; it doesn't matter to me if the accuracy suffers a bit.
So far I have put together something like this:
def extractKeywords(stream: DStream[Data]): Unit = {
  val spark: SparkSession = SparkSession.builder.getOrCreate

  // tokenize each record, compute TF-IDF features per micro-batch, then attach keywords
  val streamWithWords: DStream[(Data, Seq[String])] = stream map extractWordsFromData
  val streamWithFeatures: DStream[(Data, Array[String])] = streamWithWords transform extractFeatures(spark) _
  val streamWithKeywords: DStream[DataWithKeywords] = streamWithFeatures map addKeywordsToData

  streamWithKeywords.print()
}
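(For completeness, the helper types and functions used above are omitted here; they are roughly of this shape -- simplified placeholders, not the real implementations:)

case class Data(id: String, text: String)
case class DataWithKeywords(data: Data, keywords: Seq[String])

def extractWordsFromData(data: Data): (Data, Seq[String]) =
  (data, data.text.toLowerCase.split("\\s+").toSeq)   // naive whitespace tokenizer, placeholder only

def addKeywordsToData(pair: (Data, Array[String])): DataWithKeywords =
  DataWithKeywords(pair._1, pair._2.toSeq)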
def extractFeatures(spark: SparkSession)
                   (rdd: RDD[(Data, Seq[String])]): RDD[(Data, Array[String])] = {
  val df = spark.createDataFrame(rdd).toDF("data", "words")

  // term frequencies via the hashing trick
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(numOfFeatures)
  val rawFeatures = hashingTF.transform(df)

  // fit an IDF model on this micro-batch only and rescale the raw term frequencies
  val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
  val idfModel = idf.fit(rawFeatures)
  val rescaledData = idfModel.transform(rawFeatures)

  import spark.implicits._
  rescaledData.select("data", "features").as[(Data, Array[String])].rdd
}
However, I received java.lang.IllegalStateException: Haven't seen any document yet.
I am not surprised, as I was just trying to piece things together, and I understand that since I am not waiting for any data to arrive, the generated model might be empty when I try to use it on the data.
What would be the right approach for this problem?
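(One stopgap I tried, just to show where the exception comes from, is to skip fitting when a micro-batch is empty -- a sketch only, and it still re-fits the IDF model on every batch:)

def extractFeaturesSafely(spark: SparkSession)
                         (rdd: RDD[(Data, Seq[String])]): RDD[(Data, Array[String])] =
  if (rdd.isEmpty()) rdd.sparkContext.emptyRDD[(Data, Array[String])]  // nothing arrived, nothing to fit on
  else extractFeatures(spark)(rdd)                                     // fit and apply IDF as before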
I used the advice from the comments and split the procedure into two runs:
one that calculates the IDF model and saves it to a file:
def trainFeatures(idfModelFile: File, rdd: RDD[(String, Seq[String])]) = {
  val session: SparkSession = SparkSession.builder.getOrCreate
  val wordsDf = session.createDataFrame(rdd).toDF("data", "words")
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
  val featurizedDf = hashingTF.transform(wordsDf)
  val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
  val idfModel = idf.fit(featurizedDf)
  idfModel.write.save(idfModelFile.getAbsolutePath)
}
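The training run is kicked off once on a static, historical corpus; for example (the file path and tokenization below are placeholders, not the real ones):

val spark = SparkSession.builder.getOrCreate
val trainingRdd =
  spark.sparkContext
    .textFile("historical-documents.txt")                        // placeholder path
    .map(line => (line, line.toLowerCase.split("\\s+").toSeq))   // naive tokenization, also a placeholder
trainFeatures(new File("idf-model"), trainingRdd)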
one that reads the IDF model from the file and simply runs it on all incoming data:
val idfModel = IDFModel.load(idfModelFile.getAbsolutePath)

val documentDf = spark.createDataFrame(rdd).toDF("update", "document")
val tokenizer = new Tokenizer().setInputCol("document").setOutputCol("words")
val wordsDf = tokenizer.transform(documentDf)
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val featurizedDf = hashingTF.transform(wordsDf)
val extractor = idfModel.setInputCol("rawFeatures").setOutputCol("features")
val featuresDf = extractor.transform(featurizedDf)
featuresDf.select("update", "features")
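Put back together with the streaming job, the second run looks roughly like this sketch (my assumptions: the stream now carries (update, document) string pairs, and the function name is made up; the model is loaded once on the driver and reused in every micro-batch via transform):

def runExtraction(spark: SparkSession, idfModelFile: File, stream: DStream[(String, String)]): Unit = {
  // load the pre-trained IDF model once, outside the streaming loop
  val idfModel = IDFModel.load(idfModelFile.getAbsolutePath)
  val tokenizer = new Tokenizer().setInputCol("document").setOutputCol("words")
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
  val extractor = idfModel.setInputCol("rawFeatures").setOutputCol("features")

  stream.transform { rdd =>
    val documentDf = spark.createDataFrame(rdd).toDF("update", "document")
    val featurizedDf = hashingTF.transform(tokenizer.transform(documentDf))
    extractor.transform(featurizedDf).select("update", "features").rdd
  }.print()
}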
Source: https://stackoverflow.com/questions/40996430/how-to-use-feature-extraction-with-dstream-in-apache-spark