Apache Spark Naive Bayes based Text Classification

问题

im trying to use Apache Spark for document classification.

For example i have two types of Class (C and J)

Train data is :

C, Chinese Beijing Chinese
C, Chinese Chinese Shanghai
C, Chinese Macao
J, Tokyo Japan Chinese

And test data is : Chinese Chinese Chinese Tokyo Japan // What is ist J or C ?

How i can train and predict as above datas. I did Naive Bayes text classification with Apache Mahout, however no with Apache Spark.

How can i do this with Apache Spark?

回答1:

Yes, it doesn't look like there is any simple tool to do that in Spark yet. But you can do it manually by first creating a dictionary of terms. Then compute IDFs for each term and then convert each documents into vectors using the TF-IDF scores.

There is a post on http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/ that explains how to do it (with some code as well).

回答2:

Spark can do this in very simple way. The key step is: 1 use HashingTF to get the item frequency. 2 convert the data to the form of the bayes model needed.

def testBayesClassifier(hiveCnt:SQLContext){
    val trainData = hiveCnt.createDataFrame(Seq((0,"aa bb aa cc"),(1,"aa dd ee"))).toDF("category","text")
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val wordsData = tokenizer.transform(trainData)
    val hashTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(20)
    val featureData = hashTF.transform(wordsData) //key step 1
    val trainDataRdd = featureData.select("category","features").map {
    case Row(label: Int, features: Vector) =>  //key step 2
    LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
    }
    //train the model
    val model = NaiveBayes.train(trainDataRdd, lambda = 1.0, modelType = "multinomial")

    //same for the test data
    val testData = hiveCnt.createDataFrame(Seq((-1,"aa bb"),(-1,"cc ee ff"))).toDF("category","text")
    val testWordData = tokenizer.transform(testData)
    val testFeatureData = hashTF.transform(testWordData)
    val testDataRdd = testFeatureData.select("category","features").map {
    case Row(label: Int, features: Vector) =>
    LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
    }
    val testpredictionAndLabel = testDataRdd.map(p => (model.predict(p.features), p.label))

}

回答3:

You can use mlib's naive bayes classifier for this. A sample example is given in the link. http://spark.apache.org/docs/latest/mllib-naive-bayes.html

回答4:

There any many classification methods (logistic regression, SVMs, neural networks,LDA, QDA...), you can either implement yours or use MLlib classification methods (actually, there are logistic regression and SVM implemented in MLlib)

What you need to do is transform your features to a vector, and labels to doubles.

For examples, your dataset will look like:

1, (2,1,0,0,0,0)
1, (2,0,1,0,0,0)
0, (1,0,0,1,0,0)
0, (1,0,0,0,1,1)

And tour test vector:

(3,0,0,0,1,1)

Hope this helps

来源：https://stackoverflow.com/questions/24011418/apache-spark-naive-bayes-based-text-classification

标签

apache-spark

text-mining