How do I convert an RDD with a SparseVector Column to a DataFrame with a column as Vector

别跟我提以往 2020-12-28 19:55

I have an RDD with a tuple of values (String, SparseVector) and I want to create a DataFrame using the RDD. To get a (label, features) DataFrame, the features column needs to have the Vector type.
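
A minimal sketch of the setup being described (the sample values and the rdd name are illustrative, not from the original post):

    from pyspark.ml.linalg import SparseVector  # pyspark.mllib.linalg in Spark 1.x

    # Hypothetical (String, SparseVector) pairs matching the shape described above
    rdd = sc.parallelize([
        ("a", SparseVector(4, {1: 1.0, 3: 5.5})),
        ("b", SparseVector(4, {0: -1.0, 2: 0.5})),
    ])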

3 Answers
  • 2020-12-28 20:12

    While @zero323's answer (https://stackoverflow.com/a/32745924/1333621) makes sense, and I wish it had worked for me, the RDD underlying the DataFrame built with sqlContext.createDataFrame(temp_rdd, schema) still contained SparseVector types. I had to do the following to convert them to DenseVector types; if someone has a shorter/better way, I want to know.

    # Imports needed to run this snippet (use pyspark.mllib.linalg on Spark 1.x)
    from pyspark.ml.linalg import SparseVector, DenseVector, VectorUDT
    from pyspark.sql import Row
    from pyspark.sql.types import StructType, StructField, DoubleType
    
    temp_rdd = sc.parallelize([
        (0.0, SparseVector(4, {1: 1.0, 3: 5.5})),
        (1.0, SparseVector(4, {0: -1.0, 2: 0.5}))])
    
    schema = StructType([
        StructField("label", DoubleType(), True),
        StructField("features", VectorUDT(), True)
    ])
    
    temp_rdd.toDF(schema).printSchema()
    df_w_ftr = temp_rdd.toDF(schema)
    
    print('original conversion method: ', df_w_ftr.take(5))
    print('\n')
    # Convert each SparseVector to a DenseVector via toArray()
    temp_rdd_dense = temp_rdd.map(lambda x: Row(label=x[0], features=DenseVector(x[1].toArray())))
    print(type(temp_rdd_dense), type(temp_rdd))
    print('using map and toArray:', temp_rdd_dense.take(5))
    
    temp_rdd_dense.toDF().show()
    
    root
     |-- label: double (nullable = true)
     |-- features: vector (nullable = true)
    
    original conversion method:  [Row(label=0.0, features=SparseVector(4, {1: 1.0, 3: 5.5})), Row(label=1.0, features=SparseVector(4, {0: -1.0, 2: 0.5}))]
    
    
    <class 'pyspark.rdd.PipelinedRDD'> <class 'pyspark.rdd.RDD'>
    using map and toArray: [Row(features=DenseVector([0.0, 1.0, 0.0, 5.5]), label=0.0), Row(features=DenseVector([-1.0, 0.0, 0.5, 0.0]), label=1.0)]
    
    +------------------+-----+
    |          features|label|
    +------------------+-----+
    | [0.0,1.0,0.0,5.5]|  0.0|
    |[-1.0,0.0,0.5,0.0]|  1.0|
    +------------------+-----+
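
    One possibly shorter, DataFrame-level alternative (a sketch, assuming Spark 2.x and the pyspark.ml.linalg types; the to_dense name is just illustrative):

    from pyspark.ml.linalg import DenseVector, VectorUDT
    from pyspark.sql.functions import udf
    
    # Wrap the sparse-to-dense conversion in a UDF and apply it column-wise
    to_dense = udf(lambda v: DenseVector(v.toArray()), VectorUDT())
    df_dense = df_w_ftr.withColumn("features", to_dense("features"))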
    
  • 2020-12-28 20:19

    You have to use VectorUDT here:

    # In Spark 1.x
    # from pyspark.mllib.linalg import SparseVector, VectorUDT
    from pyspark.ml.linalg import SparseVector, VectorUDT
    from pyspark.sql.types import StructType, StructField, DoubleType
    
    temp_rdd = sc.parallelize([
        (0.0, SparseVector(4, {1: 1.0, 3: 5.5})),
        (1.0, SparseVector(4, {0: -1.0, 2: 0.5}))])
    
    schema = StructType([
        StructField("label", DoubleType(), True),
        StructField("features", VectorUDT(), True)
    ])
    
    temp_rdd.toDF(schema).printSchema()
    
    ## root
    ##  |-- label: double (nullable = true)
    ##  |-- features: vector (nullable = true)
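
    Note that the vector class and the VectorUDT must come from the same package (both pyspark.ml.linalg, or both pyspark.mllib.linalg in 1.x), otherwise the values won't match the schema at runtime. A minimal usage sketch with the same temp_rdd and schema (output shown as I would expect it, not taken from the original post):

    df = temp_rdd.toDF(schema)
    df.show(truncate=False)
    
    ## +-----+--------------------+
    ## |label|features            |
    ## +-----+--------------------+
    ## |0.0  |(4,[1,3],[1.0,5.5]) |
    ## |1.0  |(4,[0,2],[-1.0,0.5])|
    ## +-----+--------------------+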
    

    Just for completeness, the Scala equivalent:

    import org.apache.spark.sql.Row
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.types.{DoubleType, StructType}
    // In Spark 1.x
    // import org.apache.spark.mllib.linalg.{Vectors, VectorUDT}
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
    
    val schema = new StructType()
      .add("label", DoubleType)
       // In Spark 1.x
       //.add("features", new VectorUDT())
      .add("features",VectorType)
    
    val temp_rdd: RDD[Row]  = sc.parallelize(Seq(
      Row(0.0, Vectors.sparse(4, Seq((1, 1.0), (3, 5.5)))),
      Row(1.0, Vectors.sparse(4, Seq((0, -1.0), (2, 0.5))))
    ))
    
    spark.createDataFrame(temp_rdd, schema).printSchema
    
    // root
    // |-- label: double (nullable = true)
    // |-- features: vector (nullable = true)
    
  • 2020-12-28 20:23

    This is an example in Scala for Spark 2.1:

    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.DataFrame
    
    def featuresRDD2DataFrame(features: RDD[Vector]): DataFrame = {
      import sparkSession.implicits._
      // Pair each vector with a dummy label so the tuple encoder applies,
      // then drop the label column again
      val rdd: RDD[(Double, Vector)] = features.map(x => (0.0, x))
      rdd.toDF("label", "features").select("features")
    }
    

    toDF() was not recognized by the compiler when called directly on the features RDD: sparkSession.implicits provides no Encoder for a bare Vector, which is why each vector is first paired into a (Double, Vector) tuple.
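
    A rough PySpark equivalent of the same trick (a sketch; the function name and the dummy 0.0 label are just illustrative):

    def features_rdd_to_df(features):
        # Pair each vector with a dummy label so toDF can infer a schema,
        # then keep only the features column
        return features.map(lambda v: (0.0, v)).toDF(["label", "features"]).select("features")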
