Update a DataFrame column with new values

暖寄归人

df1 has fields id and json; df2 has fields id and json

df1.count() => 1200; df2.count()

2 Answers

情书的邮戳

2020-12-17 05:14

If you want the rows from both DataFrames, you can union them.

import spark.implicits._

First DataFrame

val df1 = Seq(
  (1, "a"),
  (2, "b"),
  (3, "c")
).toDF("id", "value")

Second DataFrame

val df2 = Seq(
  (1, "x"), 
  (2, "y")
).toDF("id", "value")

To get a result that contains the rows of both df1 and df2, use union:

val resultDF = df1.union(df2)

resultDF.show()

Output:

+---+-----+
|id |value|
+---+-----+
|1  |a    |
|2  |b    |
|3  |c    |
|1  |x    |
|2  |y    |
+---+-----+

长情又很酷

2020-12-17 05:32

You can achieve this using one left join.

Create Example DataFrames

Using the sample data provided by @Shankar Koirala in his answer.

data1 = [
  (1, "a"),
  (2, "b"),
  (3, "c")
]
df1 = sqlCtx.createDataFrame(data1, ["id", "value"])

data2 = [
  (1, "x"), 
  (2, "y")
]

df2 = sqlCtx.createDataFrame(data2, ["id", "value"])

Do a left join

Join the two DataFrames using a left join on the id column. This keeps all of the rows in the left DataFrame. For rows in the left DataFrame that have no matching id in the right DataFrame, the right-hand value will be null.

import pyspark.sql.functions as f
df1.alias('l').join(df2.alias('r'), on='id', how='left')\
    .select(
        'id',
        f.col('l.value').alias('left_value'),
        f.col('r.value').alias('right_value')
    )\
    .show()
#+---+----------+-----------+
#| id|left_value|right_value|
#+---+----------+-----------+
#|  1|         a|          x|
#|  3|         c|       null|
#|  2|         b|          y|
#+---+----------+-----------+

Select the desired data

Unmatched ids have a null right-hand value, and we can use that fact to select the final column. Use pyspark.sql.functions.when() to take the right value if it is not null, and otherwise keep the left value.

df1.alias('l').join(df2.alias('r'), on='id', how='left')\
    .select(
        'id',
        f.when(
            ~f.isnull(f.col('r.value')),
            f.col('r.value')
        ).otherwise(f.col('l.value')).alias('value')
    )\
    .show()
#+---+-----+
#| id|value|
#+---+-----+
#|  1|    x|
#|  3|    c|
#|  2|    y|
#+---+-----+

You can sort this output if you want the ids in order.
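
To see what the join-then-select step computes without starting Spark, the logic can be modelled in plain Python. This is only an illustrative sketch: the helper name update_values and the use of a dict keyed by id are assumptions made for this example, not part of the PySpark API, and it assumes each id appears at most once on each side.

```python
def update_values(left_rows, right_rows):
    """Model of a left join on id that prefers the right-hand value.

    Hypothetical helper for illustration only; assumes ids are unique per side.
    """
    right_by_id = dict(right_rows)  # id -> replacement value
    result = []
    for row_id, left_value in left_rows:
        right_value = right_by_id.get(row_id)  # None plays the role of null
        # Same rule as when(~isnull(r.value), r.value).otherwise(l.value)
        value = right_value if right_value is not None else left_value
        result.append((row_id, value))
    return result


data1 = [(1, "a"), (2, "b"), (3, "c")]
data2 = [(1, "x"), (2, "y")]
updated = update_values(data1, data2)
print(updated)  # [(1, 'x'), (2, 'y'), (3, 'c')]
```

In PySpark itself, the same "prefer the right value unless it is null" rule can also be written as f.coalesce(f.col('r.value'), f.col('l.value')), since pyspark.sql.functions.coalesce returns its first non-null argument.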

Using pyspark-sql

You can do the same thing using a pyspark-sql query:

# registerTempTable is deprecated since Spark 2.0; on newer versions use createOrReplaceTempView
df1.registerTempTable('df1')
df2.registerTempTable('df2')

query = """SELECT l.id, 
CASE WHEN r.value IS NOT NULL THEN r.value ELSE l.value END AS value 
FROM df1 l LEFT JOIN df2 r ON l.id = r.id"""
sqlCtx.sql(query).show()  # Spark SQL accepts the multi-line query as-is
#+---+-----+
#| id|value|
#+---+-----+
#|  1|    x|
#|  3|    c|
#|  2|    y|
#+---+-----+
