Apache Spark: how to append a new column from a list/array to a Spark DataFrame

死守一世寂寞 2020-11-29 12:53

I am using the Apache Spark 2.0 DataFrame/Dataset API. I want to add a new column to my DataFrame from a List of values. The list has the same number of values as the given DataFrame.

2 Answers
  • 2020-11-29 13:06

    Adding for completeness: the fact that the input list (which lives in driver memory) has the same size as the DataFrame suggests that this is a small DataFrame to begin with, so you might consider collect()-ing it, zipping it with the list, and converting back into a DataFrame if needed:

    // requires `import spark.implicits._` in scope for .toDF
    df.collect()
      .map(_.getAs[String]("row1"))
      .zip(list).toList
      .toDF("row1", "row2")
    

    That won't be faster, but if the data is really small the overhead is negligible and the code is (arguably) clearer.
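
    To make that concrete, here is a minimal self-contained sketch of the same approach, assuming (purely for illustration) a one-column DataFrame named df with a string column row1 and a five-element list, run in spark-shell where spark and its implicits are available:

    import spark.implicits._  // for .toDF on local collections

    // hypothetical inputs: a one-column DataFrame and a list of equal length
    val df = Seq("a", "b", "c", "d", "e").toDF("row1")
    val list = List(4, 5, 10, 7, 2)

    // collect to the driver, pair each row value with its list element,
    // and build a new two-column DataFrame
    val result = df.collect()
      .map(_.getAs[String]("row1"))
      .zip(list).toList
      .toDF("row1", "row2")

    result.show()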

  • 2020-11-29 13:27

    You could do it like this:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._    
    
    // create rdd from the list
    val rdd = sc.parallelize(List(4,5,10,7,2))
    // rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at parallelize at <console>:28
    
    // zip the data frame with rdd
    val rdd_new = df.rdd.zip(rdd).map(r => Row.fromSeq(r._1.toSeq ++ Seq(r._2)))
    // rdd_new: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[33] at map at <console>:32
    
    // create a new data frame from the rdd_new with modified schema
    spark.createDataFrame(rdd_new, df.schema.add("new_col", IntegerType)).show
    +----+-------+
    |row1|new_col|
    +----+-------+
    |   a|      4|
    |   b|      5|
    |   c|     10|
    |   d|      7|
    |   e|      2|
    +----+-------+
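
    A caveat on this approach: RDD.zip assumes the two RDDs have the same number of partitions and the same number of elements per partition, which sc.parallelize does not guarantee relative to df.rdd. A more defensive sketch (assuming the same df and list as above) pairs rows and values by index and joins on it instead:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    // index rows and list values, join on the index, then restore the original order
    val indexedRows = df.rdd.zipWithIndex().map { case (row, idx) => (idx, row) }
    val indexedVals = sc.parallelize(List(4, 5, 10, 7, 2)).zipWithIndex().map(_.swap)

    val rdd_joined = indexedRows.join(indexedVals)
      .sortByKey()
      .map { case (_, (row, v)) => Row.fromSeq(row.toSeq :+ v) }

    spark.createDataFrame(rdd_joined, df.schema.add("new_col", IntegerType)).show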
    