Apache Spark Naive Bayes based Text Classification

Backend · Unresolved · 4 answers · 1873 views
名媛妹妹 asked on 2021-02-03 15:47

I'm trying to use Apache Spark for document classification.

For example, I have two classes (C and J).

The training data is:

C, Chinese Beijing Chi         


        
4 Answers
  •  傲寒
    傲寒 (OP)
    2021-02-03 16:12

    Spark can do this in a very simple way. The key steps are: 1) use HashingTF to compute term frequencies; 2) convert the data into the LabeledPoint form that the Naive Bayes model needs.

    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.sql.{Row, SQLContext}

    def testBayesClassifier(hiveCnt: SQLContext): Unit = {
        val trainData = hiveCnt.createDataFrame(Seq((0, "aa bb aa cc"), (1, "aa dd ee"))).toDF("category", "text")
        val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
        val wordsData = tokenizer.transform(trainData)
        val hashTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(20)
        val featureData = hashTF.transform(wordsData) // key step 1: term frequencies
        val trainDataRdd = featureData.select("category", "features").rdd.map {
            case Row(label: Int, features: Vector) => // key step 2: LabeledPoint form
                LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
        }
        // train the model
        val model = NaiveBayes.train(trainDataRdd, lambda = 1.0, modelType = "multinomial")

        // same transformation for the test data (the -1 labels are placeholders)
        val testData = hiveCnt.createDataFrame(Seq((-1, "aa bb"), (-1, "cc ee ff"))).toDF("category", "text")
        val testWordData = tokenizer.transform(testData)
        val testFeatureData = hashTF.transform(testWordData)
        val testDataRdd = testFeatureData.select("category", "features").rdd.map {
            case Row(label: Int, features: Vector) =>
                LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
        }
        val testPredictionAndLabel = testDataRdd.map(p => (model.predict(p.features), p.label))
        testPredictionAndLabel.collect().foreach(println)
    }
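
    For intuition on what `NaiveBayes.train` computes under the hood, here is a minimal plain-Scala sketch (no Spark) of multinomial Naive Bayes with add-one (Laplace) smoothing, which is what `lambda = 1.0` corresponds to. The object name and the toy two-class data are illustrative, not taken from Spark:

    ```scala
    object NaiveBayesSketch {
      // Toy training data: (label, tokenized document)
      val train = Seq(
        ("C", Seq("chinese", "beijing", "chinese")),
        ("C", Seq("chinese", "chinese", "shanghai")),
        ("C", Seq("chinese", "macao")),
        ("J", Seq("tokyo", "japan", "chinese"))
      )

      val vocab = train.flatMap(_._2).distinct
      val labels = train.map(_._1).distinct

      // log P(label): fraction of training documents with that label
      def logPrior(label: String): Double =
        math.log(train.count(_._1 == label).toDouble / train.size)

      // log P(word | label) with add-one smoothing over the vocabulary
      def logLikelihood(word: String, label: String): Double = {
        val words = train.filter(_._1 == label).flatMap(_._2)
        math.log((words.count(_ == word) + 1.0) / (words.size + vocab.size))
      }

      // Pick the label maximizing log P(label) + sum of log P(word | label)
      def predict(doc: Seq[String]): String =
        labels.maxBy(l => logPrior(l) + doc.map(w => logLikelihood(w, l)).sum)

      def main(args: Array[String]): Unit =
        println(predict(Seq("chinese", "chinese", "chinese", "tokyo", "japan")))
    }
    ```

    Spark's multinomial `NaiveBayes` does the same per-class log-prior plus log-likelihood scoring, just over the hashed term-frequency vectors produced by `HashingTF` instead of raw token counts.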
