I aim to apply a k-means clustering algorithm to a very large data set using Spark (1.3.1) MLlib. I have called the data from HDFS using a HiveContext from Spark, and would now like to cluster it and map each row's cluster assignment back to its label.
From your code, I am assuming:

- `data` is a DataFrame with three columns (`label: Double`, `x1: Double`, and `x2: Double`)
- you want `KMeans.predict` to use `x1` and `x2` in order to make a cluster assignment `closestCluster: Int`
- the result should be a DataFrame of the form (`label: Double`, `closestCluster: Int`)

Here is a simple example application with some toy data adhering to the assumed schema:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.sql.functions.{col, udf}
case class DataRow(label: Double, x1: Double, x2: Double)
val data = sqlContext.createDataFrame(sc.parallelize(Seq(
DataRow(3, 1, 2),
DataRow(5, 3, 4),
DataRow(7, 5, 6),
DataRow(6, 0, 0)
)))
// Build an RDD[Vector] from the two feature columns and cache it for training
val parsedData = data.rdd.map(s => Vectors.dense(s.getDouble(1), s.getDouble(2))).cache()
// Train k-means with 3 clusters and 20 iterations
val clusters = KMeans.train(parsedData, 3, 20)
// Wrap the model's predict call in a UDF so it can be applied to DataFrame columns
val t = udf { (x1: Double, x2: Double) => clusters.predict(Vectors.dense(x1, x2)) }
val result = data.select(col("label"), t(col("x1"), col("x2")))
The important parts are the last two lines.

1. The first creates a UDF (user-defined function) which can be directly applied to DataFrame columns (in this case, the two columns `x1` and `x2`).
2. The second selects the `label` column along with the UDF applied to the `x1` and `x2` columns. Since the UDF predicts `closestCluster`, after this `result` will be a DataFrame consisting of (`label`, `closestCluster`).
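As written, the UDF column in `result` gets an auto-generated name. If you want it to actually be called `closestCluster`, one option is to alias the column (a sketch reusing the `data` DataFrame and UDF `t` from above; the alias name and the `named` val are my choices, not part of the original code):

```scala
// Alias the UDF output so the resulting column is named "closestCluster",
// then inspect a few rows of the (label, closestCluster) DataFrame.
val named = data.select(col("label"), t(col("x1"), col("x2")).as("closestCluster"))
named.show()
```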