I am using Spark SQL with dataframes. I have an input dataframe, and I would like to append (or insert) its rows to a larger dataframe that has more columns. How would I do that?
Spark DataFrames are immutable, so it is not possible to append / insert rows in place. Instead, you can add the missing columns to the smaller dataframe and use UNION ALL (unionAll, which is deprecated since Spark 2.0 in favor of union; both match columns by position):

output.unionAll(input.select($"*", lit(""), current_timestamp.cast("long")))
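For completeness, here is a minimal self-contained sketch of that approach. The schemas mirror the example further down in this answer, and the local SparkSession setup is only for illustration; union replaces the deprecated unionAll in Spark 2.x:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{current_timestamp, lit}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("appendRowsSketch")
  .getOrCreate()
import spark.implicits._

// smaller input dataframe
val input = Seq((10L, "Joe Doe", 34)).toDF("id", "name", "age")

// larger output dataframe with two extra columns
val output = Seq((0L, "Jack Smith", 41, "yes", 1459204800L))
  .toDF("id", "name", "age", "init", "ts")

// Add the two missing columns to input (empty init, current epoch seconds
// for ts), then union; columns are matched by position, so order matters.
val combined = output.union(
  input.select($"*",
    lit("").as("init"),
    current_timestamp().cast("long").as("ts"))
)
combined.show()
```

Since union matches by position rather than name, the selected columns must line up with the target schema; from Spark 2.3 onward, unionByName is an alternative that matches by column name instead.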
I had a similar problem to your SQL question: I wanted to append a dataframe to an existing Hive table that is also larger (has more columns). To keep with your example: output is my existing table and input could be the dataframe. My solution simply uses SQL, and for the sake of completeness I want to provide it:
import org.apache.spark.sql.SaveMode
var input = spark.createDataFrame(Seq(
(10L, "Joe Doe", 34),
(11L, "Jane Doe", 31),
(12L, "Alice Jones", 25)
)).toDF("id", "name", "age")
//--> just for a running example: In my case the table already exists
var output = spark.createDataFrame(Seq(
(0L, "Jack Smith", 41, "yes", 1459204800L),
(1L, "Jane Jones", 22, "no", 1459294200L),
(2L, "Alice Smith", 31, "", 1459595700L)
)).toDF("id", "name", "age", "init", "ts")
output.write.mode(SaveMode.Overwrite).saveAsTable("appendTest");
//<--
input.createOrReplaceTempView("inputTable");
spark.sql("INSERT INTO TABLE appendTest SELECT id, name, age, null, null FROM inputTable");
val df = spark.sql("SELECT * FROM appendTest")
df.show()
which outputs:
+---+-----------+---+----+----------+
| id| name|age|init| ts|
+---+-----------+---+----+----------+
| 0| Jack Smith| 41| yes|1459204800|
| 1| Jane Jones| 22| no|1459294200|
| 2|Alice Smith| 31| |1459595700|
| 12|Alice Jones| 25|null| null|
| 11| Jane Doe| 31|null| null|
| 10| Joe Doe| 34|null| null|
+---+-----------+---+----+----------+
If you have the problem that you don't know how many fields are missing, you could use a diff like
val missingFields = output.schema.toSet.diff(input.schema.toSet)
and then (in rough pseudo code, note the separating comma between the two lists)

val sqlQuery = "INSERT INTO TABLE appendTest SELECT " + commaSeparatedColumnNames + ", " + commaSeparatedNullsForEachMissingField + " FROM inputTable"
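To make that concrete, here is a small pure-Scala helper that builds the query from the column lists. The helper name buildAppendQuery is my own, and it assumes (as in the appendTest example) that the missing fields come after the shared ones in the target table's schema:

```scala
// Hypothetical helper: builds the INSERT statement when some target-table
// columns are absent from the input. Assumes the missing fields come last
// in the target schema, matching the appendTest example above.
def buildAppendQuery(table: String,
                     sourceView: String,
                     inputColumns: Seq[String],
                     missingFieldNames: Seq[String]): String = {
  // input columns first, then one "null" per missing field
  val selectList = (inputColumns ++ missingFieldNames.map(_ => "null")).mkString(", ")
  s"INSERT INTO TABLE $table SELECT $selectList FROM $sourceView"
}

// With the schemas from the example:
//   inputColumns      would be input.columns            -> id, name, age
//   missingFieldNames would be missingFields.map(_.name) -> init, ts
val sqlQuery = buildAppendQuery("appendTest", "inputTable",
  Seq("id", "name", "age"), Seq("init", "ts"))
// sqlQuery == "INSERT INTO TABLE appendTest SELECT id, name, age, null, null FROM inputTable"
```

The resulting string can then be passed to spark.sql as in the example above.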
Hope this helps people with similar problems in the future!
P.S.: In your special case (current timestamp + empty field for init) you could even use
spark.sql("INSERT INTO TABLE appendTest SELECT id, name, age, '' as init, current_timestamp as ts FROM inputTable");
which results in
+---+-----------+---+----+----------+
| id| name|age|init| ts|
+---+-----------+---+----+----------+
| 0| Jack Smith| 41| yes|1459204800|
| 1| Jane Jones| 22| no|1459294200|
| 2|Alice Smith| 31| |1459595700|
| 12|Alice Jones| 25| |1521128513|
| 11| Jane Doe| 31| |1521128513|
| 10| Joe Doe| 34| |1521128513|
+---+-----------+---+----+----------+