Naive Bayes multinomial text classifier using a DataFrame in Scala Spark

让人想犯罪 __ submitted on 2019-12-04 12:29:55

You don't need to convert featurizedData into an RDD. Apache Spark ships two machine learning libraries: ML (spark.ml), which works with DataFrames, and MLlib (spark.mllib), which works with RDDs. Since you already have a DataFrame, you can use ML directly.

To do this, rename your columns to label and features and fit the model, as shown in the NaiveBayes documentation example below (written in Python; the Scala API follows the same pattern).

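from pyspark.sql import Row
from pyspark.ml.classification import NaiveBayes
# Spark 2.x and later; on Spark 1.x import Vectors from pyspark.mllib.linalg
from pyspark.ml.linalg import Vectors

# Assumes an existing sqlContext, as provided by the pyspark shell
df = sqlContext.createDataFrame([
    Row(label=0.0, features=Vectors.dense([0.0, 0.0])),
    Row(label=0.0, features=Vectors.dense([0.0, 1.0])),
    Row(label=1.0, features=Vectors.dense([1.0, 0.0]))])
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")
model = nb.fit(df)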

As for the error you get: it occurs because you already have a SparseVector, and that class has no split method. Thinking about it more, your RDD already has almost the structure you need; you only have to convert each tuple to a LabeledPoint.

There are a few ways to improve the classifier's performance. The first that comes to mind is removing stopwords (e.g. the, a, an, to, although). The second is to count the number of distinct words in your texts and build the vectors manually: if the number of hash buckets is too low, different words can end up with the same hash, which hurts performance.
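
To make both suggestions concrete, here is a minimal pure-Python sketch (no Spark required). It is only an illustration under stated assumptions: the stopword list is a tiny made-up sample, zlib.crc32 stands in for whatever hash function Spark actually uses, and the helper names (tokenize, hashed_counts, build_vocabulary, vocab_counts) are hypothetical, not part of any Spark API.

```python
import zlib
from collections import Counter

# Tiny illustrative stopword list (an assumption, not Spark's built-in list)
STOPWORDS = {"the", "a", "an", "to", "although"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def hashed_counts(tokens, num_features):
    """Hashing trick: each word goes to bucket hash(word) mod num_features.
    crc32 is a stand-in hash chosen for this sketch."""
    counts = [0] * num_features
    for w in tokens:
        counts[zlib.crc32(w.encode("utf-8")) % num_features] += 1
    return counts

def build_vocabulary(docs):
    """Give every distinct (non-stopword) word its own index."""
    vocab = {}
    for doc in docs:
        for w in tokenize(doc):
            vocab.setdefault(w, len(vocab))
    return vocab

def vocab_counts(tokens, vocab):
    """Count vector with one slot per vocabulary word, so no collisions."""
    counts = [0] * len(vocab)
    for w, c in Counter(tokens).items():
        if w in vocab:
            counts[vocab[w]] = c
    return counts

docs = ["the cat sat on a mat", "a dog chased the cat to the park"]
vocab = build_vocabulary(docs)
tokens = tokenize(docs[1])

# With only 2 buckets, 4 distinct words must share buckets (collisions)
print(hashed_counts(tokens, 2))
# With an explicit vocabulary, each word keeps its own dimension
print(vocab_counts(tokens, vocab))
```

With the hashing approach, raising the number of buckets well above the number of distinct words makes collisions unlikely; the explicit vocabulary avoids them entirely at the cost of an extra pass over the data.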
