How to prepare data in LibSVM format from a DataFrame?

余生分开走 2020-12-13 07:17

I want to produce data in LibSVM format. I have already shaped my DataFrame into the desired layout, but I do not know how to convert it to LibSVM format. The format is as shown in the figure. I hope that

3 answers
  •  有刺的猬
    2020-12-13 07:58

    The problem you are facing can be split into two steps:

    • Converting your ratings (I believe) into LabeledPoint data X.
    • Saving X in libsvm format.

    1. Converting your ratings into LabeledPoint data X

    Let's consider the following raw ratings, each in the form "row,column,value":

    val rawRatings: Seq[String] = Seq("0,1,1.0", "0,3,3.0", "1,1,1.0", "1,2,0.0", "1,3,3.0", "3,3,4.0", "10,3,4.5")
    

    You can handle those raw ratings as a coordinate list matrix (COO).

    Spark implements this as CoordinateMatrix, a distributed matrix backed by an RDD of its entries, where each entry is a tuple (i: Long, j: Long, value: Double).

    Note: A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse, which is usually the case for user/item ratings.
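
    To make the coordinate-list idea concrete, here is a minimal plain-Scala sketch (no Spark) that groups (row, column, value) triples into one sparse row per row index. The names CooSketch, Entry, parse and toSparseRows are hypothetical and exist only for this illustration; the Spark code in this answer does the same thing in a distributed way.

```scala
// Plain-Scala illustration of a coordinate list (COO); not part of the Spark API.
object CooSketch {
  // One non-zero cell of the matrix: row index, column index, value.
  final case class Entry(i: Long, j: Long, value: Double)

  // Parse a "row,column,value" line such as "0,1,1.0".
  def parse(line: String): Entry = {
    val fields = line.split(",")
    Entry(fields(0).toLong, fields(1).toLong, fields(2).toDouble)
  }

  // Group the entries by row and sort each row's (column, value) pairs by column.
  def toSparseRows(entries: Seq[Entry]): Map[Long, Seq[(Long, Double)]] =
    entries.groupBy(_.i).map { case (row, es) =>
      row -> es.map(e => (e.j, e.value)).sortBy(_._1)
    }
}
```

    For the raw ratings above, CooSketch.toSparseRows(rawRatings.map(CooSketch.parse)) would map row 1 to the pairs (1, 1.0), (2, 0.0), (3, 3.0).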

    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
    import org.apache.spark.rdd.RDD
    
    val data: RDD[MatrixEntry] = 
          sc.parallelize(rawRatings).map {
                line => {
                      val fields = line.split(",")
                      val i = fields(0).toLong
                      val j = fields(1).toLong
                      val value = fields(2).toDouble
                      MatrixEntry(i, j, value)
                }
          }
    

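    Now let's convert that RDD[MatrixEntry] to a CoordinateMatrix and extract the indexed rows:

    // toDF and $"..." need spark.implicits._ in scope (spark-shell imports it automatically)
    val df = new CoordinateMatrix(data) // Convert the RDD to a CoordinateMatrix
                    .toIndexedRowMatrix().rows // Extract indexed rows (index: Long, vector: Vector)
                    .toDF("label", "features") // Row index becomes "label", row vector becomes "features"
                    .withColumn("label", $"label".cast("double")) // Assumption: the libsvm writer expects a Double label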
    

    2. Saving LabeledPoint data in libsvm format

    Since Spark 2.0, you can do that using the DataFrameWriter. Let's create a small example with some dummy LabeledPoint data (you can also use the DataFrame we created earlier):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
    val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
    
    val df = Seq(neg,pos).toDF("label","features")
    

    Unfortunately we still can't use the DataFrameWriter directly. The libsvm data source works with the new spark.ml vector types, whereas the LabeledPoint data above holds the older mllib.linalg vectors. Most pipeline components stay backward compatible for loading, but DataFrames and pipelines from Spark versions prior to 2.0 that contain vector or matrix columns may need to be migrated to the new spark.ml vector and matrix types.

    Utilities for converting DataFrame columns from mllib.linalg to ml.linalg types (and vice versa) can be found in org.apache.spark.mllib.util.MLUtils. In our case we need to do the following, for both the dummy data and the DataFrame from step 1:

    import org.apache.spark.mllib.util.MLUtils
    // convert DataFrame columns
    val convertedVecDF = MLUtils.convertVectorColumnsToML(df)
    

    Now let's save the DataFrame:

    convertedVecDF.write.format("libsvm").save("data/foo")
    

    And we can check the contents of the output files:

    $ cat data/foo/part*
    0.0 1:1.0 3:3.0
    1.0 1:1.0 2:0.0 3:3.0
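
    Each output line has the shape "label index:value index:value ...". Comparing the output with the vectors that were saved shows that the zero-based vector indices 0 and 2 appear as 1 and 3, so the text format is one-based. The plain-Scala sketch below reproduces that layout for a single row. LibSvmLine and format are hypothetical names for this illustration only, not Spark API, and this is not how Spark itself is implemented.

```scala
// Plain-Scala illustration of the LibSVM text layout; not part of the Spark API.
object LibSvmLine {
  // indices are zero-based (as in Spark vectors); the emitted text uses one-based indices.
  def format(label: Double, indices: Array[Int], values: Array[Double]): String = {
    require(indices.length == values.length, "indices and values must have the same length")
    val features = indices.zip(values).map { case (i, v) => s"${i + 1}:$v" }
    (label.toString +: features).mkString(" ")
  }
}
```

    For example, LibSvmLine.format(0.0, Array(0, 2), Array(1.0, 3.0)) yields the same text as the first output line above.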
    

    EDIT: In the current version of Spark (2.1.0) there is no need to use the mllib package. You can save LabeledPoint data in libsvm format directly, as shown below:

    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.ml.feature.LabeledPoint
    val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
    val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
    
    val df = Seq(neg,pos).toDF("label","features")
    df.write.format("libsvm").save("data/foo")
    
