df1 has fields id and json; df2 has fields id and json.
df1.count() => 1200; df2.count()
If you want the rows from both DataFrames, you can union the two DataFrames.
import spark.implicits._
// First DataFrame
val df1 = Seq(
  (1, "a"),
  (2, "b"),
  (3, "c")
).toDF("id", "value")
// Second DataFrame
val df2 = Seq(
  (1, "x"),
  (2, "y")
).toDF("id", "value")
To get a result containing the data from both df1 and df2, use union:
val resultDF = df1.union(df2)
resultDF.show()
Output:
+---+-----+
|id |value|
+---+-----+
|1  |a    |
|2  |b    |
|3  |c    |
|1  |x    |
|2  |y    |
+---+-----+
You can achieve this using one left join.
Create Example DataFrames
Using the sample data provided by @Shankar Koirala in his answer.
data1 = [
    (1, "a"),
    (2, "b"),
    (3, "c")
]
df1 = sqlCtx.createDataFrame(data1, ["id", "value"])
data2 = [
    (1, "x"),
    (2, "y")
]
df2 = sqlCtx.createDataFrame(data2, ["id", "value"])
Do a left join
Join the two DataFrames using a left join on the id column. This keeps all of the rows in the left DataFrame. For rows in the left DataFrame whose id has no match in the right DataFrame, the right-hand value will be null.
import pyspark.sql.functions as f
df1.alias('l').join(df2.alias('r'), on='id', how='left')\
    .select(
        'id',
        f.col('l.value').alias('left_value'),
        f.col('r.value').alias('right_value')
    )\
    .show()
#+---+----------+-----------+
#| id|left_value|right_value|
#+---+----------+-----------+
#|  1|         a|          x|
#|  3|         c|       null|
#|  2|         b|          y|
#+---+----------+-----------+
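For intuition, the left-join behavior described above can be modeled in plain Python. This is only a sketch of the semantics, not Spark code: the names left, right, right_by_id, and joined are made up for illustration, and it assumes each id appears at most once on the right side.

```python
# Plain-Python model of a left join on "id" (illustration only, not Spark code).
left = [(1, "a"), (2, "b"), (3, "c")]   # rows of df1
right = [(1, "x"), (2, "y")]            # rows of df2

# Index the right side by id for lookup; assumes ids are unique on the right.
right_by_id = dict(right)

# Every left row is kept; a missing match yields None (Spark displays it as null).
joined = [(i, left_value, right_by_id.get(i)) for i, left_value in left]

print(joined)  # [(1, 'a', 'x'), (2, 'b', 'y'), (3, 'c', None)]
```

The row for id 3 survives the join with None on the right, which is what makes it possible to fall back to the left value afterwards.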
Select the desired data
We will use the fact that unmatched ids have a null right-hand value to select the final column. Use pyspark.sql.functions.when() to take the right value if it is not null, and otherwise keep the left value.
df1.alias('l').join(df2.alias('r'), on='id', how='left')\
    .select(
        'id',
        f.when(
            ~f.isnull(f.col('r.value')),
            f.col('r.value')
        ).otherwise(f.col('l.value')).alias('value')
    )\
    .show()
#+---+-----+
#| id|value|
#+---+-----+
#|  1|    x|
#|  3|    c|
#|  2|    y|
#+---+-----+
You can sort this output if you want the ids in order.
Using pyspark-sql
You can do the same thing using a pyspark-sql query:
df1.registerTempTable('df1')
df2.registerTempTable('df2')
query = """SELECT l.id,
    CASE WHEN r.value IS NOT NULL THEN r.value ELSE l.value END AS value
    FROM df1 l LEFT JOIN df2 r ON l.id = r.id"""
# Replace newlines with spaces so adjacent tokens are not fused together
sqlCtx.sql(query.replace("\n", " ")).show()
#+---+-----+
#| id|value|
#+---+-----+
#|  1|    x|
#|  3|    c|
#|  2|    y|
#+---+-----+