I'm trying to use Apache Spark for document classification.
For example, I have two classes (C and J).
Train data is:
C, Chinese Beijing Chi
Spark can do this in a very simple way. The key steps are: 1) use HashingTF to get the term frequencies, and 2) convert the data into the LabeledPoint form that the Bayes model needs.
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.{Row, SQLContext}

def testBayesClassifier(hiveCnt: SQLContext): Unit = {
  val trainData = hiveCnt.createDataFrame(Seq((0, "aa bb aa cc"), (1, "aa dd ee")))
    .toDF("category", "text")
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val wordsData = tokenizer.transform(trainData)
  val hashTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(20)
  val featureData = hashTF.transform(wordsData) // key step 1: term frequencies
  val trainDataRdd = featureData.select("category", "features").map {
    case Row(label: Int, features: Vector) => // key step 2: LabeledPoint form
      LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
  }
  // train the model
  val model = NaiveBayes.train(trainDataRdd, lambda = 1.0, modelType = "multinomial")
  // apply the same transformations to the test data
  val testData = hiveCnt.createDataFrame(Seq((-1, "aa bb"), (-1, "cc ee ff")))
    .toDF("category", "text")
  val testWordData = tokenizer.transform(testData)
  val testFeatureData = hashTF.transform(testWordData)
  val testDataRdd = testFeatureData.select("category", "features").map {
    case Row(label: Int, features: Vector) =>
      LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
  }
  val testPredictionAndLabel = testDataRdd.map(p => (model.predict(p.features), p.label))
}
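To inspect the results, you could add something like the following at the end of the method, before the closing brace. Note the -1 labels above are just placeholders, so the accuracy figure is only meaningful if you substitute real test labels:

  // print each prediction next to its (placeholder) label
  testPredictionAndLabel.collect().foreach { case (prediction, label) =>
    println(s"predicted: $prediction, label: $label")
  }
  // with real test labels, accuracy is the fraction of matching pairs
  val accuracy = testPredictionAndLabel.filter { case (p, l) => p == l }.count.toDouble /
    testDataRdd.count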
You can use MLlib's naive Bayes classifier for this. A sample example is given at this link: http://spark.apache.org/docs/latest/mllib-naive-bayes.html
There are many classification methods (logistic regression, SVMs, neural networks, LDA, QDA, ...); you can either implement your own or use MLlib's classification methods (logistic regression and SVM are actually implemented in MLlib).
What you need to do is transform your features into a vector and your labels into doubles.
For examples, your dataset will look like:
1, (2,1,0,0,0,0)
1, (2,0,1,0,0,0)
0, (1,0,0,1,0,0)
0, (1,0,0,0,1,1)
And your test vector:
(3,0,0,0,1,1)
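With MLlib that transformation is just a few lines. A minimal sketch, assuming an existing SparkContext sc (the feature counts simply mirror the example rows above):

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// labels as doubles, features as vectors
val train = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(2.0, 1.0, 0.0, 0.0, 0.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(2.0, 0.0, 1.0, 0.0, 0.0, 0.0)),
  LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0, 1.0, 0.0, 0.0)),
  LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0, 0.0, 1.0, 1.0))
))
val model = NaiveBayes.train(train, lambda = 1.0)
// classify the test vector
val prediction = model.predict(Vectors.dense(3.0, 0.0, 0.0, 0.0, 1.0, 1.0))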
Hope this helps
Yes, it doesn't look like there is any simple tool to do that in Spark yet. But you can do it manually by first creating a dictionary of terms, then computing the IDF for each term, and then converting each document into a vector using the TF-IDF scores.
There is a post at http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/ that explains how to do it (with some code as well).
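For reference, here is a rough sketch of the manual pipeline that answer describes (dictionary, per-term IDF, then TF-IDF vectors). It assumes docs is an RDD[Seq[String]] of already-tokenized documents; all names here are illustrative:

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

def tfIdfVectors(docs: RDD[Seq[String]]): RDD[Vector] = {
  val numDocs = docs.count()
  // dictionary: term -> vector index
  val dict = docs.flatMap(identity).distinct().collect().zipWithIndex.toMap
  // document frequency per term, then IDF
  val df = docs.flatMap(_.distinct).map((_, 1)).reduceByKey(_ + _).collectAsMap()
  val idf = df.map { case (term, freq) => term -> math.log(numDocs.toDouble / freq) }
  docs.map { doc =>
    // term frequencies within this document
    val tf = doc.groupBy(identity).map { case (term, occ) => term -> occ.size }
    val entries = tf.toSeq.map { case (term, count) => (dict(term), count * idf(term)) }
    Vectors.sparse(dict.size, entries)
  }
}

Newer Spark releases also ship org.apache.spark.mllib.feature.{HashingTF, IDF}, which cover the same steps without hand-rolling the dictionary.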