PySpark DataFrame Join using UDF

野的像风 2020-12-18 06:24

I'm trying to create a custom join for two DataFrames (df1 and df2) in PySpark (similar to this), with code that looks like this:

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# (snippet truncated in the original; a boolean join-condition UDF
# along these lines appears to be intended)
my_join_udf = udf(lambda x, y: x == y, BooleanType())
my_join_df = df1.join(df2, my_join_udf(df1.col_a, df2.col_b))
1 Answer
  • answered 2020-12-18 07:16

    Spark 2.2+

    You have to use crossJoin or enable cross joins in the configuration (spark.sql.crossJoin.enabled):

    df1.crossJoin(df2).where(my_join_udf(df1.col_a, df2.col_b))
    
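    Alternatively, the flag can be set at runtime instead of calling crossJoin explicitly; a minimal sketch, assuming an active SparkSession named spark:

    spark.conf.set("spark.sql.crossJoin.enabled", "true")
    df1.join(df2).where(my_join_udf(df1.col_a, df2.col_b))
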

    Spark 2.0, 2.1

    The join-and-filter method shown below for Spark 1.x no longer works in Spark 2.x. See SPARK-19728.
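
    For reference, this is the call that fails in those versions (assuming df1, df2, and my_join_udf as above):

    # In Spark 2.0/2.1 this fails at analysis time with an AnalysisException
    # unless spark.sql.crossJoin.enabled is set to true; see SPARK-19728 for
    # the details of how UDF predicates interact with join planning.
    df1.join(df2).where(my_join_udf(df1.col_a, df2.col_b))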

    Spark 1.x

    Theoretically you can join and filter:

    df1.join(df2).where(my_join_udf(df1.col_a, df2.col_b))
    

    but in general you shouldn't do it at all. Any type of join that is not based on equality requires a full Cartesian product (the same as the crossJoin shown above), which is rarely acceptable (see also Why using a UDF in a SQL query leads to cartesian product?).
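
    For concreteness, a minimal end-to-end sketch for Spark 2.2+ (the column names col_a/col_b and the equality predicate are placeholders, not part of the original question):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import BooleanType

    spark = SparkSession.builder.master("local[2]").getOrCreate()

    df1 = spark.createDataFrame([(1,), (2,), (3,)], ["col_a"])
    df2 = spark.createDataFrame([(2,), (3,), (4,)], ["col_b"])

    # Any Python predicate can serve as the join condition; equality is
    # used here only for illustration.
    my_join_udf = udf(lambda x, y: x == y, BooleanType())

    # A UDF cannot act as an equi-join key, so Spark evaluates it as a
    # filter over the full Cartesian product of df1 and df2.
    df1.crossJoin(df2).where(my_join_udf(df1.col_a, df2.col_b)).show()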
