Drop function not working after left outer join in pyspark

Submitted by 泪湿孤枕 on 2020-01-06 03:26:33

Question


My PySpark version is 2.1.1. I am trying to join two dataframes (left outer), each having two columns, id and priority. I am creating my dataframes like this:

a = "select 123 as id, 1 as priority"
a_df = spark.sql(a)

b = "select 123 as id, 1 as priority union select 112 as id, 1 as priority"
b_df = spark.sql(b)

c_df = a_df.join(b_df, (a_df.id==b_df.id), 'left').drop(b_df.priority)

The schema of c_df comes out as DataFrame[id: int, priority: int, id: int, priority: int]

The drop function is not removing the priority column of b_df.

But if I try to do:

c_df = a_df.join(b_df, (a_df.id==b_df.id), 'left').drop(a_df.priority)

Then the priority column of a_df gets dropped.

I am not sure whether this is a version issue or something else, but it feels very weird that the drop function behaves like this.

I know a workaround is to remove the unwanted columns first and then do the join. But I still don't understand why the drop function is not working.

Thanks in advance.


Answer 1:


Duplicate column names in PySpark joins can lead to unpredictable behavior, and the advice I've read is to disambiguate the names before joining. See the Stack Overflow questions "Spark Dataframe distinguish columns with duplicated name" and "Pyspark Join and then column select is showing unexpected output". I'm sorry to say I can't find out why PySpark behaves the way you describe.

But the databricks documentation addresses this problem: https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html

From the Databricks documentation:

If you perform a join in Spark and don’t specify your join correctly you’ll end up with duplicate column names. This makes it harder to select those columns. This topic and notebook demonstrate how perform a join so that you don’t have duplicated columns.

When you join, you can instead either use an alias (that's what I typically use), or pass the join columns as a list or a str.

df = left.join(right, ["priority"]) 


Source: https://stackoverflow.com/questions/54633251/drop-function-not-working-after-left-outer-join-in-pyspark
