Spark MLlib KMeans from DataFrame, and back again

被撕碎了的回忆 2020-12-28 23:52

I aim to apply a k-means clustering algorithm to a very large data set using Spark (1.3.1) MLlib. I have pulled the data from HDFS using a HiveContext in Spark, and would like to get the resulting cluster assignments back into a DataFrame.

4 answers
  • 2020-12-29 00:22

    I'm doing something similar using pySpark. I'm guessing you could translate this directly to Scala, as there is nothing Python-specific about it. myPointsWithID is my RDD with an ID for each point, where each point is represented as an array of values.

    from pyspark.mllib.clustering import KMeans

    # Get an RDD of only the vectors representing the points to be clustered
    points = myPointsWithID.map(lambda id_point: id_point[1])
    clusters = KMeans.train(points,
                            100,
                            maxIterations=100,
                            runs=50,
                            initializationMode='random')

    # For each point in the original RDD, replace the point with the
    # ID of the cluster the point belongs to.
    clustersBC = sc.broadcast(clusters)
    pointClusters = myPointsWithID.map(lambda id_point: (id_point[0], clustersBC.value.predict(id_point[1])))
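
    As noted above, this should translate directly to Scala. Here is a sketch of that translation, assuming myPointsWithID is an RDD[(Long, Vector)] (the key type is an assumption, not from the original answer):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vector

    // Assumed: myPointsWithID is an RDD[(Long, Vector)], mirroring the Python RDD above.
    val points = myPointsWithID.map { case (_, point) => point }

    // Same settings as the Python call: k = 100, maxIterations = 100, runs = 50, random init.
    val clusters = KMeans.train(points, 100, 100, 50, "random")

    // Broadcast the trained model, then pair each point's ID with its cluster.
    val clustersBC = sc.broadcast(clusters)
    val pointClusters = myPointsWithID.map { case (id, point) => (id, clustersBC.value.predict(point)) }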
    
  • 2020-12-29 00:25

    From your code, I am assuming:

    • data is a DataFrame with three columns (label: Double, x1: Double, and x2: Double)
    • You want KMeans.predict to use x1 and x2 in order to make a cluster assignment closestCluster: Int
    • The resulting DataFrame should be of the form (label: Double, closestCluster: Int)

    Here is a simple example application with some toy data adhering to the assumed schema:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.sql.functions.{col, udf}
    
    case class DataRow(label: Double, x1: Double, x2: Double)
    val data = sqlContext.createDataFrame(sc.parallelize(Seq(
        DataRow(3, 1, 2),
        DataRow(5, 3, 4),
        DataRow(7, 5, 6),
        DataRow(6, 0, 0)
    )))
    
    // Extract the two feature columns as dense vectors and train k-means (k = 3, 20 iterations).
    val parsedData = data.rdd.map(s => Vectors.dense(s.getDouble(1), s.getDouble(2))).cache()
    val clusters = KMeans.train(parsedData, 3, 20)

    // Wrap the model's predict in a UDF that can be applied to the feature columns.
    val t = udf { (x1: Double, x2: Double) => clusters.predict(Vectors.dense(x1, x2)) }
    val result = data.select(col("label"), t(col("x1"), col("x2")))
    

    The important parts are the last two lines.

    1. The first creates a UDF (user-defined function) that can be applied directly to DataFrame columns (in this case, the two columns x1 and x2).

    2. The second selects the label column along with the UDF applied to the x1 and x2 columns. Since the UDF predicts closestCluster, result is then a DataFrame of the form (label, closestCluster). A sketch for naming the new column explicitly follows below.
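
    If you want the new column to carry an explicit name rather than an auto-generated one, a small follow-up sketch (the alias closestCluster is an assumed name, matching the schema described above):

    // Alias the UDF column so the result schema reads (label, closestCluster).
    val result = data.select(col("label"), t(col("x1"), col("x2")).as("closestCluster"))
    result.show()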

  • 2020-12-29 00:29

    I understand that you want to get a DataFrame at the end. I see two possible solutions, and I'd say choosing between them is a matter of taste.

    Create column from RDD

    It's very easy to obtain pairs of ids and clusters in the form of an RDD:

    // Pair each row's id with its feature vector, and train on the vectors alone.
    val idPointRDD = data.rdd.map(s => (s.getInt(0), Vectors.dense(s.getDouble(1), s.getDouble(2)))).cache()
    val clusters = KMeans.train(idPointRDD.map(_._2), 3, 20)

    // Predict over the same RDD of vectors, then zip the cluster ids back onto the row ids.
    val clustersRDD = clusters.predict(idPointRDD.map(_._2))
    val idClusterRDD = idPointRDD.map(_._1).zip(clustersRDD)
    

    Then you create a DataFrame from it (note that toDF on an RDD requires importing the SQLContext implicits):

    import sqlContext.implicits._
    val idCluster = idClusterRDD.toDF("id", "cluster")
    

    This works because map doesn't change the order of the data in the RDD: zip requires both RDDs to have the same number of partitions and the same number of elements in each partition, and a plain map preserves exactly that, so you can safely zip the ids with the prediction results.

    Use UDF (User Defined Function)

    The second method involves using the clusters.predict method as a UDF:

    // Broadcast the model once rather than capturing it in the UDF closure directly.
    val bcClusters = sc.broadcast(clusters)
    def predict(x: Double, y: Double): Int = {
        bcClusters.value.predict(Vectors.dense(x, y))
    }
    sqlContext.udf.register("predict", predict _)
    

    Now we can use it to add predictions to the data:

    val idCluster = data.selectExpr("id", "predict(x, y) as cluster")
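
    Since the UDF is registered by name, the same thing can be done from a plain SQL query. A sketch (the temp table name points is an assumed, arbitrary choice):

    // Expose the DataFrame to SQL, then call the registered UDF by name.
    data.registerTempTable("points")
    val idClusterSql = sqlContext.sql("SELECT id, predict(x, y) AS cluster FROM points")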
    

    Keep in mind that the Spark API doesn't allow UDF deregistration, which means the closure data (here, the broadcast model reference) will be kept in memory for the lifetime of the SQLContext.

    Wrong / suboptimal solutions

    • Using clusters.predict without broadcasting

    It won't work in a distributed setup. Edit: actually it will work; I was confused by the implementation of predict for RDDs, which uses broadcast internally.

    • sc.makeRDD(clusters.predict(parsedData).toArray()).toDF()

    toArray collects all of the data on the driver, which means that in distributed mode you would be copying all the cluster ids onto a single node; the RDD-based method above avoids this.

  • 2020-12-29 00:30

    Let me know if this code works for you:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.clustering._
    import sqlContext.implicits._

    // Keep the first column (the id) alongside the two feature columns.
    val rows = data.rdd.map(r => (r.getDouble(0), r.getDouble(1), r.getDouble(2))).cache()
    val vectors = rows.map(r => Vectors.dense(r._2, r._3))
    val kMeansModel = KMeans.train(vectors, 3, 20)

    // Predict a cluster for each row and keep its id next to the assignment.
    val predictions = rows.map(r => (r._1, kMeansModel.predict(Vectors.dense(r._2, r._3))))
    val df = predictions.toDF("id", "cluster")
    df.show
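
    To get "back again" to a DataFrame with the original columns attached, you could join the predictions onto the original data. A sketch, assuming the toy schema (label, x1, x2) from the earlier answer and using the first column as the join key (both assumptions, not from the original answer):

    // Assumed: attach the cluster assignments back onto the original DataFrame.
    val joined = data.join(df, data("label") === df("id"))
                     .select(data("label"), data("x1"), data("x2"), df("cluster"))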
    