I am trying to build a Naive Bayes classifier, loading the data from a database as a DataFrame that contains (label, text). Here is a sample of the data (the label is multinomial):
+-----+--------------------+
|label|             feature|
+-----+--------------------+
|    1|combusting prepar...|
|    1|adhesives for ind...|
|    1|                    |
|    1| salt for preserving|
|    1|auxiliary fluids ...|
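I have used the following transformations for tokenization, stop-word removal, n-grams, and term-frequency hashing:
import org.apache.spark.ml.feature.{HashingTF, NGram, RegexTokenizer, StopWordsRemover, Tokenizer}
val selectedData = df.select("label", "feature")
// Tokenize the text column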
val tokenizer = new Tokenizer().setInputCol("feature").setOutputCol("words")
val regexTokenizer = new RegexTokenizer().setInputCol("feature").setOutputCol("words").setPattern("\\W")
val tokenized = tokenizer.transform(selectedData)
tokenized.select("words", "label").take(3).foreach(println)
// Removing stop words
val remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
val parsedData = remover.transform(tokenized)
// N-gram
val ngram = new NGram().setInputCol("filtered").setOutputCol("ngrams")
val ngramDataFrame = ngram.transform(parsedData)
ngramDataFrame.take(3).map(_.getAs[Stream[String]]("ngrams").toList).foreach(println)
//hashing function
val hashingTF = new HashingTF().setInputCol("ngrams").setOutputCol("hash").setNumFeatures(1000)
val featurizedData = hashingTF.transform(ngramDataFrame)
Output of the transformation:
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
|label|             feature|               words|            filtered|              ngrams|                hash|
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
|    1|combusting prepar...|[combusting, prep...|[combusting, prep...|[combusting prepa...|(1000,[124,161,69...|
|    1|adhesives for ind...|[adhesives, for, ...|[adhesives, indus...|[adhesives indust...|(1000,[451,604],[...|
|    1|                    |                  []|                  []|                  []|        (1000,[],[])|
|    1| salt for preserving|[salt, for, prese...|  [salt, preserving]|   [salt preserving]|  (1000,[675],[1.0])|
|    1|auxiliary fluids ...|[auxiliary, fluid...|[auxiliary, fluid...|[auxiliary fluids...|(1000,[661,696,89...|
To build a Naive Bayes model, I need to convert the label and feature into LabeledPoint objects. I have tried the following approaches to convert the DataFrame into an RDD and create the LabeledPoints:
val rddData = featurizedData.select("label","hash").rdd
val trainData = rddData.map { line =>
val parts = line.split(',')
LabeledPoint(parts(0), parts(1))
}
val rddData = featurizedData.select("label","hash").rdd.map(r => (Try(r(0).asInstanceOf[Integer]).get.toDouble, Try(r(1).asInstanceOf[org.apache.spark.mllib.linalg.SparseVector]).get))
val trainData = rddData.map { line =>
val parts = line.split(',')
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(',').map(_.toDouble)))
}
I am getting the following error:
scala> val trainData = rddData.map { line =>
| val parts = line.split(',')
| LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(',').map(_.toDouble)))
| }
<console>:67: error: value split is not a member of (Double, org.apache.spark.mllib.linalg.SparseVector)
val parts = line.split(',')
^
<console>:68: error: not found: value Vectors
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(',').map(_.toDouble)))
Edit 1:
Following the suggestion below, I created the LabeledPoints and trained the model.
val trainData = featurizedData.select("label","features")
val trainLabel = trainData.map(line => LabeledPoint(Try(line(0).asInstanceOf[Integer]).get.toDouble, Try(line(1).asInstanceOf[org.apache.spark.mllib.linalg.SparseVector]).get))
val splits = trainLabel.randomSplit(Array(0.8, 0.2), seed = 11L)
val training = splits(0)
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 1.0, modelType = "multinomial")
val predictionAndLabels = test.map { point =>
val score = model.predict(point.features)
(score, point.label)}
I am getting low accuracy, around 40%, both with and without n-grams and with different numbers of hash features. My dataset contains 5000 rows and 45 multinomial labels. Is there any way to improve the model's performance? Thanks in advance.
You don't need to transform your featurizedData into an RDD, because Apache Spark has two machine learning libraries, ML and MLlib: ML works with DataFrames, whereas MLlib works with RDDs. Since you already have a DataFrame, you can work with ML directly.
To do this, you just need to rename your columns to (label, features) and fit your model, as shown in the NaiveBayes example below.
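from pyspark.sql import Row
from pyspark.mllib.linalg import Vectors  # on Spark 2.x and later: pyspark.ml.linalg
from pyspark.ml.classification import NaiveBayes

df = sqlContext.createDataFrame([
    Row(label=0.0, features=Vectors.dense([0.0, 0.0])),
    Row(label=0.0, features=Vectors.dense([0.0, 1.0])),
    Row(label=1.0, features=Vectors.dense([1.0, 0.0]))])
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")
model = nb.fit(df)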
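Since the question is in Scala, here is a rough Scala equivalent of the example above. It is only a sketch under some assumptions: fitNaiveBayes is a hypothetical helper name, the input DataFrame is assumed to have an integer "label" column and a vector column called "hash" (as in the question), and instead of renaming the column it points the estimator at "hash" with setFeaturesCol.

```scala
import org.apache.spark.ml.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: fits a multinomial Naive Bayes model directly on a DataFrame.
// Assumes `featurized` has an integer "label" column and a vector column "hash"
// produced by HashingTF, as in the question.
def fitNaiveBayes(featurized: DataFrame): NaiveBayesModel = {
  // Older spark.ml versions require a Double label column, so cast to be safe.
  val prepared = featurized.withColumn("label", col("label").cast("double"))

  new NaiveBayes()
    .setLabelCol("label")
    .setFeaturesCol("hash") // use the existing column instead of renaming it
    .setSmoothing(1.0)
    .setModelType("multinomial")
    .fit(prepared)
}
```

With the question's data this would be called as fitNaiveBayes(featurizedData). Calling transform on the returned model adds a "prediction" column to a DataFrame.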
The first error occurs because each element of your RDD is already a (Double, SparseVector) tuple rather than a string, and a tuple has no split method. The second error, "not found: value Vectors", means that org.apache.spark.mllib.linalg.Vectors has not been imported. So your RDD almost has the structure you need; you only have to convert each tuple to a LabeledPoint.
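If you do want to stay with the RDD-based MLlib API, the conversion can be sketched as follows. This is a minimal sketch, not the only way to do it: toLabeledPoints is a hypothetical helper name, and it assumes the RDD elements are (Double, vector) tuples as reported in the error message.

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper: each element is already a (label, vector) tuple,
// so destructure it with a pattern match instead of calling split on it.
// The type parameter lets it accept RDD[(Double, SparseVector)] as well.
def toLabeledPoints[V <: Vector](pairs: RDD[(Double, V)]): RDD[LabeledPoint] =
  pairs.map { case (label, features) => LabeledPoint(label, features) }
```

With the rddData from the question, val trainData = toLabeledPoints(rddData) should give an RDD[LabeledPoint] that can be passed to NaiveBayes.train.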
There are some techniques to improve the performance. The first that comes to mind is to remove stop words (e.g. the, a, an, to, although). The second is to count the number of distinct words in your texts and construct the vectors from that vocabulary instead of hashing: if the number of hash features is low, different words may be mapped to the same index, which hurts performance.
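To illustrate the second point, here is a small plain-Scala sketch (no Spark required) contrasting hashed indices with vocabulary indices. The documents are made-up toy data, hashIndex and countVector are hypothetical names, and String.hashCode is used only for illustration; the hash function Spark's HashingTF uses may differ depending on the version. Spark ML also has a CountVectorizer that builds a vocabulary from a DataFrame column, if your Spark version includes it.

```scala
// Toy documents, already tokenized (hypothetical data for illustration).
val docs: Seq[Seq[String]] = Seq(
  Seq("salt", "preserving"),
  Seq("adhesives", "industry"),
  Seq("salt", "industry", "salt")
)

// Hashing: the index is a hash modulo numFeatures, so distinct words can collide
// when numFeatures is small relative to the number of distinct words.
def hashIndex(word: String, numFeatures: Int): Int =
  ((word.hashCode % numFeatures) + numFeatures) % numFeatures

// Vocabulary: every distinct word gets its own index, so collisions cannot happen.
val vocabulary: Map[String, Int] = docs.flatten.distinct.zipWithIndex.toMap

// Term-count vector for one document over the vocabulary.
def countVector(doc: Seq[String]): Array[Double] = {
  val counts = Array.fill(vocabulary.size)(0.0)
  doc.foreach(word => counts(vocabulary(word)) += 1.0)
  counts
}

val vectors: Seq[Array[Double]] = docs.map(countVector)
```

The trade-off is that the vocabulary has to be built from the training data first and unseen words need separate handling, whereas hashing needs no such pass.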
Source: https://stackoverflow.com/questions/34856042/naive-bayes-multinomial-text-classifier-using-data-frame-in-scala-spark