Question
My pyspark version is 2.1.1. I am trying to left outer join two dataframes, each with two columns, id and priority. I am creating my dataframes like this:
a = "select 123 as id, 1 as priority"
a_df = spark.sql(a)
b = "select 123 as id, 1 as priority union select 112 as uid, 1 as priority"
b_df = spark.sql(b)
c_df = a_df.join(b_df, (a_df.id==b_df.id), 'left').drop(b_df.priority)
The c_df schema comes out as DataFrame[uid: int, priority: int, uid: int, priority: int]
The drop function is not removing the column.
But if I try to do:
c_df = a_df.join(b_df, (a_df.id==b_df.id), 'left').drop(a_df.priority)
Then the priority column of a_df gets dropped.
I am not sure whether this is a version issue or something else, but it feels very weird that the drop function behaves like this.
I know a workaround is to remove the unwanted columns first and then do the join, but I am still not sure why the drop function is not working.
Thanks in advance.
Answer 1:
Duplicate column names in pyspark joins lead to unpredictable behavior, and the advice I've read is to disambiguate the names before joining. See, on Stack Overflow, "Spark Dataframe distinguish columns with duplicated name" and "Pyspark Join and then column select is showing unexpected output". I'm sorry to say I can't find why pyspark behaves the way you describe.
But the databricks documentation addresses this problem: https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html
From the Databricks documentation:
If you perform a join in Spark and don’t specify your join correctly you’ll end up with duplicate column names. This makes it harder to select those columns. This topic and notebook demonstrate how to perform a join so that you don’t have duplicated columns.
When you join, you can instead either use an alias (that's typically what I use), or pass the join columns as a list or str.
df = left.join(right, ["priority"])
Source: https://stackoverflow.com/questions/54633251/drop-function-not-working-after-left-outer-join-in-pyspark