I am trying to create an LDA model on a JSON file.
Creating a SparkSession and reading the JSON file (the path below is a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()  // assuming the usual builder pattern; my snippet was cut off here
val df = spark.read.json("data.json")             // hypothetical path
The solution turned out to be very simple; see below.
// import org.apache.spark.mllib.linalg.Vector  // old MLlib Vector (removed)
import org.apache.spark.ml.linalg.Vector        // new ML Vector
I changed:
val ldaDF = countVectors.map {
  case Row(id: String, countVector: Vector) => (id, countVector)
}

to:

val ldaDF = countVectors.map {
  case Row(docId: String, features: MLVector) =>
    (docId.toLong, Vectors.fromML(features))  // convert the new ml Vector back to an mllib Vector
}
And it worked like a charm! It is aligned with what @zero323 has written.
List of imports:
import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer, StopWordsRemover}
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.{Row, SparkSession}
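For completeness, here is how ldaDF then feeds MLlib's LDA. This is a minimal sketch using the imports above; the .rdd call, the k value, and the optimizer choice are illustrative, not from my original code:

val ldaRDD = countVectors.rdd.map {
  case Row(docId: String, features: MLVector) =>
    (docId.toLong, Vectors.fromML(features))  // MLlib's LDA expects old-style mllib Vectors
}

val lda = new LDA()
  .setK(10)                               // number of topics (illustrative)
  .setOptimizer(new OnlineLDAOptimizer())

val ldaModel = lda.run(ldaRDD)  // run() takes an RDD[(Long, Vector)]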
This has nothing to do with sparsity. Since Spark 2.0.0, ML transformers no longer generate o.a.s.mllib.linalg.VectorUDT but o.a.s.ml.linalg.VectorUDT, and rows are mapped locally to subclasses of o.a.s.ml.linalg.Vector. These are not compatible with the old MLlib API, which is moving towards deprecation in Spark 2.0.0. You can convert to the "old" vectors using Vectors.fromML:
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.ml.linalg.{Vectors => NewVectors}

// fromML handles both dense and sparse ml vectors
OldVectors.fromML(NewVectors.dense(1.0, 2.0, 3.0))
OldVectors.fromML(NewVectors.sparse(5, Seq(0 -> 1.0, 2 -> 2.0, 4 -> 3.0)))
but it makes more sense to use the ML implementation of LDA if you already use ML transformers.
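For example, a minimal sketch with the new API (org.apache.spark.ml.clustering.LDA); countVectors with a "features" column is assumed from the question, and k is illustrative:

import org.apache.spark.ml.clustering.LDA

val lda = new LDA()
  .setK(10)                    // number of topics (illustrative)
  .setMaxIter(50)
  .setFeaturesCol("features")  // ml.linalg.Vector column from CountVectorizer

val model = lda.fit(countVectors)  // works directly on the DataFrame
model.describeTopics(5).show()     // top 5 terms per topic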
For convenience you can use implicit conversions:
import scala.language.implicitConversions
object VectorConversions {
  import org.apache.spark.mllib.{linalg => mllib}
  import org.apache.spark.ml.{linalg => ml}

  implicit def toNewVector(v: mllib.Vector): ml.Vector = v.asML
  implicit def toOldVector(v: ml.Vector): mllib.Vector = mllib.Vectors.fromML(v)
}
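With these in scope, an ml.linalg.Vector is converted wherever an mllib.linalg.Vector is expected, and vice versa. A small usage sketch; the values are illustrative:

import org.apache.spark.ml.linalg.{Vectors => NewVectors}
import org.apache.spark.mllib.{linalg => mllib}
import VectorConversions._

// toOldVector kicks in because the expected type is mllib.Vector
val old: mllib.Vector = NewVectors.dense(1.0, 2.0, 3.0)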