Spark SQL: How to append new row to dataframe table (from another table)


I am using Spark SQL with DataFrames. I have an input DataFrame, and I would like to append (or insert) its rows into a larger DataFrame that has more columns. How would I do that?

2 Answers
  • 2020-12-28 21:57

    Spark DataFrames are immutable, so it is not possible to append or insert rows in place. Instead, you can add the missing columns to the input and use UNION ALL:

    import org.apache.spark.sql.functions.{lit, current_timestamp}

    output.unionAll(input.select($"*", lit(""), current_timestamp().cast("long")))
    
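    A side note beyond the original answer: since Spark 2.0, unionAll is deprecated in favor of union, which still resolves columns by position, so the appended columns must be listed in the target schema's order. A minimal sketch, assuming the five-column output schema (id, name, age, init, ts) used in the answer below:

    import org.apache.spark.sql.functions.{lit, current_timestamp}

    // union resolves columns by position: select the input's columns first,
    // then append the two missing ones in the target schema's order
    val appended = output.union(
      input.select($"*",
        lit("").as("init"),
        current_timestamp().cast("long").as("ts")))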
  • 2020-12-28 22:18

    I had a similar problem to your SQL question:

    I wanted to append a DataFrame to an existing Hive table which is also larger (has more columns). To keep your example: output is my existing table and input could be the DataFrame. My solution simply uses SQL, and for the sake of completeness I want to provide it:

    import org.apache.spark.sql.SaveMode

    val input = spark.createDataFrame(Seq(
            (10L, "Joe Doe", 34),
            (11L, "Jane Doe", 31),
            (12L, "Alice Jones", 25)
            )).toDF("id", "name", "age")

    //--> just for a running example: in my case the table already exists
    val output = spark.createDataFrame(Seq(
            (0L, "Jack Smith", 41, "yes", 1459204800L),
            (1L, "Jane Jones", 22, "no", 1459294200L),
            (2L, "Alice Smith", 31, "", 1459595700L)
            )).toDF("id", "name", "age", "init", "ts")

    output.write.mode(SaveMode.Overwrite).saveAsTable("appendTest")
    //<--

    input.createOrReplaceTempView("inputTable")

    // the two columns missing from the input (init, ts) are filled with NULL
    spark.sql("INSERT INTO TABLE appendTest SELECT id, name, age, null, null FROM inputTable")
    val df = spark.sql("SELECT * FROM appendTest")
    df.show()
    

    which outputs:

    +---+-----------+---+----+----------+
    | id|       name|age|init|        ts|
    +---+-----------+---+----+----------+
    |  0| Jack Smith| 41| yes|1459204800|
    |  1| Jane Jones| 22|  no|1459294200|
    |  2|Alice Smith| 31|    |1459595700|
    | 12|Alice Jones| 25|null|      null|
    | 11|   Jane Doe| 31|null|      null|
    | 10|    Joe Doe| 34|null|      null|
    +---+-----------+---+----+----------+
    

    If you don't know which fields are missing, you can compute a diff of the two schemas:

    val missingFields = output.schema.toSet.diff(input.schema.toSet)
    

    and then build the INSERT statement from it (pseudocode; a runnable sketch follows):

    val sqlQuery = "INSERT INTO TABLE appendTest SELECT " + commaSeparatedColumnNames + ", " + commaSeparatedNullsForEachMissingField + " FROM inputTable"
    
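    A runnable sketch of that idea (assuming the appendTest table and inputTable view from above): build the SELECT list in the target table's column order, emitting NULL for every column the input lacks:

    // columns actually present in the input DataFrame
    val inputCols = input.columns.toSet
    // walk the target schema in order, substituting NULL for missing columns
    val selectList = output.schema.fieldNames.map { name =>
      if (inputCols.contains(name)) name else s"null AS $name"
    }.mkString(", ")
    spark.sql(s"INSERT INTO TABLE appendTest SELECT $selectList FROM inputTable")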

    Hope this helps people with similar problems in the future!

    P.S.: In your special case (current timestamp plus an empty field for init) you could even use

    spark.sql("INSERT INTO TABLE appendTest SELECT id, name, age, '' as init, current_timestamp as ts FROM inputTable")
    

    which results in

    +---+-----------+---+----+----------+
    | id|       name|age|init|        ts|
    +---+-----------+---+----+----------+
    |  0| Jack Smith| 41| yes|1459204800|
    |  1| Jane Jones| 22|  no|1459294200|
    |  2|Alice Smith| 31|    |1459595700|
    | 12|Alice Jones| 25|    |1521128513|
    | 11|   Jane Doe| 31|    |1521128513|
    | 10|    Joe Doe| 34|    |1521128513|
    +---+-----------+---+----+----------+
    
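    For completeness, the same append can also be written with the DataFrame API instead of SQL. This is a sketch, not part of the original answer; it uses unix_timestamp(), which returns the current epoch seconds as a long and so matches the ts column:

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.{lit, unix_timestamp}

    // add the missing columns, then append to the existing Hive table
    input
      .withColumn("init", lit(""))
      .withColumn("ts", unix_timestamp())
      .write.mode(SaveMode.Append).saveAsTable("appendTest")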