Apache Spark Naive Bayes-based Text Classification

名媛妹妹 2021-02-03 15:47

I'm trying to use Apache Spark for document classification.

For example, I have two classes (C and J).

The training data is:

C, Chinese Beijing Chi


        
4 Answers
  • 2021-02-03 16:12

    Spark can do this in a very simple way. The key steps are: (1) use HashingTF to compute term frequencies, and (2) convert the data into the LabeledPoint form the Naive Bayes model needs.

    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.sql.{Row, SQLContext}

    def testBayesClassifier(hiveCnt: SQLContext) {
        val trainData = hiveCnt.createDataFrame(Seq((0, "aa bb aa cc"), (1, "aa dd ee"))).toDF("category", "text")
        val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
        val wordsData = tokenizer.transform(trainData)
        val hashTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(20)
        val featureData = hashTF.transform(wordsData) // key step 1: term frequencies
        val trainDataRdd = featureData.select("category", "features").map {
            case Row(label: Int, features: Vector) => // key step 2: rows -> LabeledPoint
                LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
        }

        // train the model
        val model = NaiveBayes.train(trainDataRdd, lambda = 1.0, modelType = "multinomial")

        // same pipeline for the test data (-1 is a placeholder label)
        val testData = hiveCnt.createDataFrame(Seq((-1, "aa bb"), (-1, "cc ee ff"))).toDF("category", "text")
        val testWordData = tokenizer.transform(testData)
        val testFeatureData = hashTF.transform(testWordData)
        val testDataRdd = testFeatureData.select("category", "features").map {
            case Row(label: Int, features: Vector) =>
                LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
        }

        // pair each prediction with its (placeholder) label
        val testPredictionAndLabel = testDataRdd.map(p => (model.predict(p.features), p.label))
        testPredictionAndLabel.collect().foreach(println)
    }
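
    Note that HashingTF uses the hashing trick, so with setNumFeatures(20) distinct terms can collide into the same bucket; a larger feature dimension reduces collisions at the cost of longer vectors.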

  • 2021-02-03 16:13

    You can use MLlib's Naive Bayes classifier for this. A sample example is given in the official documentation: http://spark.apache.org/docs/latest/mllib-naive-bayes.html
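
    A minimal sketch along those lines, assuming an existing SparkContext `sc` and hypothetical term-count features (labels 0.0 and 1.0 stand in for the two document classes):

        import org.apache.spark.mllib.classification.NaiveBayes
        import org.apache.spark.mllib.linalg.Vectors
        import org.apache.spark.mllib.regression.LabeledPoint

        // hypothetical training set: label 0.0 = class C, 1.0 = class J;
        // each vector holds raw term counts over a small vocabulary
        val training = sc.parallelize(Seq(
            LabeledPoint(0.0, Vectors.dense(2.0, 1.0, 0.0)),
            LabeledPoint(0.0, Vectors.dense(2.0, 0.0, 1.0)),
            LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 0.0))
        ))

        // lambda is the additive (Laplace) smoothing parameter
        val model = NaiveBayes.train(training, lambda = 1.0)

        // classify a new document's term-count vector
        val prediction = model.predict(Vectors.dense(3.0, 0.0, 0.0))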

  • 2021-02-03 16:22

    There are many classification methods (logistic regression, SVMs, neural networks, LDA, QDA, ...); you can either implement your own or use MLlib's classification methods (logistic regression and SVMs, for example, are already implemented in MLlib).

    What you need to do is transform your features into a vector and your labels into doubles, as sketched after the example data below.

    For example, your dataset would look like:

    1, (2,1,0,0,0,0)
    1, (2,0,1,0,0,0)
    0, (1,0,0,1,0,0)
    0, (1,0,0,0,1,1)
    

    And your test vector:

    (3,0,0,0,1,1)
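
    A minimal sketch of that transformation, assuming an existing SparkContext `sc` and treating the six features as raw term counts:

        import org.apache.spark.mllib.linalg.Vectors
        import org.apache.spark.mllib.regression.LabeledPoint

        // each row mirrors the example dataset above: (label, term counts)
        val data = Seq(
            (1.0, Array(2.0, 1.0, 0.0, 0.0, 0.0, 0.0)),
            (1.0, Array(2.0, 0.0, 1.0, 0.0, 0.0, 0.0)),
            (0.0, Array(1.0, 0.0, 0.0, 1.0, 0.0, 0.0)),
            (0.0, Array(1.0, 0.0, 0.0, 0.0, 1.0, 1.0))
        )

        // labels become doubles, feature arrays become MLlib vectors
        val labeled = sc.parallelize(data).map { case (label, counts) =>
            LabeledPoint(label, Vectors.dense(counts))
        }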
    

    Hope this helps

  • 2021-02-03 16:37

    It doesn't look like there is a simple built-in tool for that in Spark yet, but you can do it manually: first create a dictionary of terms, then compute the IDF for each term, and then convert each document into a vector using its TF-IDF scores.
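
    MLlib's HashingTF and IDF can stand in for the manual dictionary step (the hashing trick replaces an explicit term dictionary). A minimal sketch, assuming an existing SparkContext `sc` and a hypothetical two-document corpus:

        import org.apache.spark.mllib.feature.{HashingTF, IDF}
        import org.apache.spark.mllib.linalg.Vector
        import org.apache.spark.rdd.RDD

        // hypothetical corpus: one Seq of tokens per document
        val documents: RDD[Seq[String]] = sc.parallelize(Seq(
            "chinese beijing chinese".split(" ").toSeq,
            "chinese tokyo japan".split(" ").toSeq
        ))

        // term frequencies via the hashing trick (no explicit dictionary)
        val tf: RDD[Vector] = new HashingTF().transform(documents)
        tf.cache() // tf is used twice: once to fit IDF, once to transform

        // fit IDF on the corpus, then rescale each TF vector to TF-IDF
        val idfModel = new IDF().fit(tf)
        val tfidf: RDD[Vector] = idfModel.transform(tf)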

    There is a post at http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/ that explains how to do it (with some code as well).
