Question
I have 2 datasets.
Example Dataset 1:
id | model | first_name | last_name
-----------------------------------------------------------
1234 | 32 | 456765 | [456700,987565]
-----------------------------------------------------------
4539 | 20 | 123211 | [893456,123456]
-----------------------------------------------------------
Sometimes one of the columns first_name or last_name is empty.
Example dataset 2:
number | matricule | name | model
----------------------------------------------------------
AA | 0009 | 456765 | 32
----------------------------------------------------------
AA | 0009 | 893456 | 32
----------------------------------------------------------
AA | 0009 | 456700 | 32
----------------------------------------------------------
AA | 0008 | 456700 | 32
----------------------------------------------------------
AA | 0008 | 987565 | 32
For a single matricule there can be several name and model values, as in the example just above.
What I need to do:
For each row of Dataset 1, I take the three columns model, first_name and last_name and check whether they match in Dataset 2, within the rows of each matricule.
I should compare:
model against model: if model (Dataset 1) exists in model (Dataset 2) ==> match
first_name against name: if first_name exists in name ==> no match. If first_name does not exist in name ==> match
last_name against name: if last_name exists in name ==> match. When last_name has two values, both must exist in name of Dataset 2 to match.
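The three rules above can be sketched in plain Python to pin down what "match" means for one Dataset 1 row and the Dataset 2 rows of one matricule. This is only an illustration, not Spark code: the function name `row_matches` and the dict layout are assumptions made for the example, and the empty first_name / last_name case is not handled.

```python
def row_matches(row, group_rows):
    """Apply the three rules to one Dataset 1 row and the Dataset 2 rows of one matricule."""
    names = {r["name"] for r in group_rows}
    models = {r["model"] for r in group_rows}
    model_ok = row["model"] in models                      # rule 1: model must exist
    first_ok = row["first_name"] not in names              # rule 2: first_name must NOT exist
    last_ok = all(v in names for v in row["last_name"])    # rule 3: every last_name must exist
    return model_ok and first_ok and last_ok

# First row of Dataset 1 (hypothetical dict representation)
row = {"id": 1234, "model": 32, "first_name": 456765, "last_name": [456700, 987565]}

# Dataset 2 rows, split by matricule
group_0009 = [
    {"number": "AA", "matricule": "0009", "name": 456765, "model": 32},
    {"number": "AA", "matricule": "0009", "name": 893456, "model": 32},
    {"number": "AA", "matricule": "0009", "name": 456700, "model": 32},
]
group_0008 = [
    {"number": "AA", "matricule": "0008", "name": 456700, "model": 32},
    {"number": "AA", "matricule": "0008", "name": 987565, "model": 32},
]

print(row_matches(row, group_0009))  # False
print(row_matches(row, group_0008))  # True
```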
Example:
Row 1 of Dataset 1 is:
id | model | first_name | last_name
------------------------------------------------------
1234 | 32 | 456765 | [456700,987565]
For matricule 0009 in dataset 2, I have:
number | matricule | name | model
----------------------------------------------------------
AA | 0009 | 456765 | 32
----------------------------------------------------------
AA | 0009 | 893456 | 32
----------------------------------------------------------
AA | 0009 | 456700 | 32
So:
first_name (456765) exists in name of Dataset 2 for matricule 0009 ==> no match
last_name: only 456700 exists ==> no match
model (32) exists in model of Dataset 2 ==> match
So I skip matricule 0009 and go on to compare the same row of Dataset 1 with the elements of matricule 0008.
For matricule 0008 in dataset 2, I have:
----------------------------------------------------------
AA | 0008 | 456700 | 32
----------------------------------------------------------
AA | 0008 | 987565 | 32
We are still on the first row of Dataset 1:
first_name (456765) does not exist in name of Dataset 2 for matricule 0008 ==> match
last_name: both values exist in name of Dataset 2 for matricule 0008 ==> match
model exists in model of Dataset 2 for matricule 0008 ==> match
When everything matches, I create a new dataset containing:
number | id | matricule
-----------------------------------
AA | 1234 | 0008
-----------------------------------
I hope this is clear. Can someone help me, please?
Answer 1:
You can use a join on the matching conditions.
First, group the second DataFrame and collect the name column into a list:
from pyspark.sql.functions import collect_list
df2 = df2.groupBy("number", "model", "matricule").agg(collect_list("name").alias("names"))
df2.show(truncate=False)
#+------+-----+---------+------------------------+
#|number|model|matricule|names |
#+------+-----+---------+------------------------+
#|AA |32 |0009 |[456765, 893456, 456700]|
#|AA |32 |0008 |[456700, 987565] |
#+------+-----+---------+------------------------+
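To see what this grouping step produces without a Spark session, the same aggregation can be sketched in plain Python. The helper `collect_names` and the tuple layout are assumptions for illustration only. Note that Python keeps insertion order here, whereas the order of elements returned by collect_list in Spark is not guaranteed.

```python
from collections import defaultdict

# Dataset 2 rows as (number, matricule, name, model) tuples
dataset2 = [
    ("AA", "0009", 456765, 32),
    ("AA", "0009", 893456, 32),
    ("AA", "0009", 456700, 32),
    ("AA", "0008", 456700, 32),
    ("AA", "0008", 987565, 32),
]

def collect_names(rows):
    """Group by (number, model, matricule) and gather every name into a list."""
    grouped = defaultdict(list)
    for number, matricule, name, model in rows:
        grouped[(number, model, matricule)].append(name)
    return dict(grouped)

names_by_group = collect_names(dataset2)
for key, names in names_by_group.items():
    print(key, names)
# ('AA', 32, '0009') [456765, 893456, 456700]
# ('AA', 32, '0008') [456700, 987565]
```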
Now, join df1 and df2. Conditions 1 and 2 are simple to check.
For the third one, you can use array_except, available from Spark 2.4+ (there should be no elements of the last_name column that are not in names, and vice versa):
from pyspark.sql.functions import col, expr, lit, size
join_condition = (col("df1.model") == col("df2.model")) \
& ~expr("array_contains(df2.names, df1.first_name)") \
& (size(expr("array_except(df2.names, df1.last_name)")) == lit(0)) \
& (size(expr("array_except(df1.last_name, df2.names)")) == lit(0))
df_result = df1.alias("df1").join(df2.alias("df2"), join_condition)
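The two array_except checks in the join condition together require that last_name and names contain the same distinct values. A plain-Python sketch of that logic, assuming a hypothetical helper `array_except` that mimics the Spark function (distinct elements of the first list that are absent from the second):

```python
def array_except(left, right):
    """Distinct elements of `left` that are not in `right` (hypothetical stand-in)."""
    seen = set()
    result = []
    for item in left:
        if item not in right and item not in seen:
            seen.add(item)
            result.append(item)
    return result

def last_name_condition(names, last_name):
    """True when neither list has a value that is missing from the other."""
    return (len(array_except(names, last_name)) == 0
            and len(array_except(last_name, names)) == 0)

last_name = [456700, 987565]
names_0009 = [456765, 893456, 456700]
names_0008 = [456700, 987565]

print(array_except(names_0009, last_name))         # [456765, 893456]
print(array_except(last_name, names_0009))         # [987565]
print(last_name_condition(names_0009, last_name))  # False
print(last_name_condition(names_0008, last_name))  # True
```

Because the check runs in both directions, a matricule whose names list holds an extra value beyond last_name would be rejected. If such a group should still count as a match, which is one possible reading of the question, only the check of last_name against names would be kept.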
Finally, select desired columns from the join result:
df_result.select("number", "id", "matricule").show(truncate=False)
#+------+----+---------+
#|number|id |matricule|
#+------+----+---------+
#|AA |1234|0008 |
#+------+----+---------+
Source: https://stackoverflow.com/questions/60190375/compare-two-datasets-in-pyspark