Spark MLlib KMeans from DataFrame, and back again

被撕碎了的回忆 2020-12-28 23:52

I aim to apply a k-means clustering algorithm to a very large data set using Spark (1.3.1) MLlib. I have loaded the data from HDFS using a HiveContext from Spark, and woul

4 answers
  •  被撕碎了的回忆
    2020-12-29 00:25

    From your code, I am assuming:

    • data is a DataFrame with three columns (label: Double, x1: Double, and x2: Double)
    • You want KMeans.predict to use x1 and x2 in order to make a cluster assignment closestCluster: Int
    • The result DataFrame should be of the form (label: Double, closestCluster: Int)

    Here is a simple example application with some toy data adhering to the assumed schema:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.sql.functions.{col, udf}
    
    // Toy data matching the assumed schema (label, x1, x2).
    case class DataRow(label: Double, x1: Double, x2: Double)
    val data = sqlContext.createDataFrame(sc.parallelize(Seq(
        DataRow(3, 1, 2),
        DataRow(5, 3, 4),
        DataRow(7, 5, 6),
        DataRow(6, 0, 0)
    )))
    
    // Build feature vectors from x1 and x2 (row positions 1 and 2), then train with k = 3 and 20 iterations.
    val parsedData = data.rdd.map(s => Vectors.dense(s.getDouble(1), s.getDouble(2))).cache()
    val clusters = KMeans.train(parsedData, 3, 20)
    // Wrap the model's predict in a UDF over (x1, x2), then select label plus the prediction as closestCluster.
    val t = udf { (x1: Double, x2: Double) => clusters.predict(Vectors.dense(x1, x2)) }
    val result = data.select(col("label"), t(col("x1"), col("x2")).as("closestCluster"))
    

    The important parts are the last two lines:

    1. Creates a UDF (user-defined function) that can be applied directly to DataFrame columns (in this case, the two columns x1 and x2).

    2. Selects the label column along with the UDF applied to the x1 and x2 columns. Since the UDF predicts closestCluster, result will be a DataFrame of the form (label, closestCluster).
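
    To make the per-row work of the UDF concrete, here is a minimal plain-Scala sketch (no Spark needed) of nearest-center assignment, which is what a k-means prediction amounts to. It is an illustration only, not MLlib's implementation: the object NearestCenterSketch, its helpers, and the values in exampleCenters are hypothetical, with the centers picked by hand rather than produced by KMeans.train.

```scala
// Plain-Scala sketch of nearest-center assignment; all names here are hypothetical.
object NearestCenterSketch {
  // Squared Euclidean distance between two equal-length points.
  def squaredDistance(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (p, q) => (p - q) * (p - q) }.sum

  // Index of the center closest to `point`; this plays the role of closestCluster.
  def predict(centers: Array[Array[Double]], point: Array[Double]): Int =
    centers.indices.minBy(i => squaredDistance(centers(i), point))

  // Assumed centers, chosen by hand for illustration (not trained).
  val exampleCenters: Array[Array[Double]] = Array(
    Array(0.0, 0.0),
    Array(2.0, 3.0),
    Array(5.0, 6.0)
  )

  // (label, x1, x2) rows mirroring the toy data above.
  val rows: Seq[(Double, Double, Double)] = Seq(
    (3.0, 1.0, 2.0),
    (5.0, 3.0, 4.0),
    (7.0, 5.0, 6.0),
    (6.0, 0.0, 0.0)
  )

  // Keep the label and replace (x1, x2) with the assigned cluster index.
  val result: Seq[(Double, Int)] = rows.map { case (label, x1, x2) =>
    (label, predict(exampleCenters, Array(x1, x2)))
  }
}
```

    The shape of result here, a sequence of (label, closestCluster) pairs, mirrors the two-column DataFrame the select produces. With a trained model the cluster indices depend on the random initialization, so the actual numbers will generally differ from this hand-built example.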
