Why does Spark think this is a cross / Cartesian join

后端 未结 3 1057
感情败类
感情败类 2020-11-30 07:59

I want to join data twice as below:

rdd1 = spark.createDataFrame([(1, \'a\'), (2, \'b\'), (3, \'c\')], [\'idx\', \'val\'])
rdd2 = spark.createDataFrame([(1,          


        
相关标签:
3条回答
  • 2020-11-30 08:20

    I was also successful when persisted the dataframe before the second join.

    Something like:

    res1 = rdd1.join(rdd2, col('rdd1.idx') == col('rdd2.key1')).persist()
    
    res1.join(rdd1, on=col('res1.key2') == col('rdd1.idx'))
    
    0 讨论(0)
  • 2020-11-30 08:27

    Persisting did not work for me.

    I overcame it with aliases on DataFrames

    from pyspark.sql.functions import col
    
    df1.alias("buildings").join(df2.alias("managers"), col("managers.distinguishedName") == col("buildings.manager"))
    
    0 讨论(0)
  • 2020-11-30 08:31

    This happens because you join structures sharing the same lineage and this leads to a trivially equal condition:

    res2.explain()
    

    == Physical Plan ==
    org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
    Join Inner, ((idx#204L = key1#209L) && (key2#210L = idx#204L))
    :- Filter isnotnull(idx#204L)
    :  +- LogicalRDD [idx#204L, val#205]
    +- Filter ((isnotnull(key2#210L) && (key2#210L = key1#209L)) && isnotnull(key1#209L))
       +- LogicalRDD [key1#209L, key2#210L, val#211L]
    and
    LogicalRDD [idx#235L, val#236]
    Join condition is missing or trivial.
    Use the CROSS JOIN syntax to allow cartesian products between these relations.;
    

    In case like this you should use aliases:

    from pyspark.sql.functions import col
    
    rdd1 = spark.createDataFrame(...).alias('rdd1')
    rdd2 = spark.createDataFrame(...).alias('rdd2')
    
    res1 = rdd1.join(rdd2, col('rdd1.idx') == col('rdd2.key1')).alias('res1')
    res1.join(rdd1, on=col('res1.key2') == col('rdd1.idx')).explain()
    
    == Physical Plan ==
    *SortMergeJoin [key2#297L], [idx#360L], Inner
    :- *Sort [key2#297L ASC NULLS FIRST], false, 0
    :  +- Exchange hashpartitioning(key2#297L, 200)
    :     +- *SortMergeJoin [idx#290L], [key1#296L], Inner
    :        :- *Sort [idx#290L ASC NULLS FIRST], false, 0
    :        :  +- Exchange hashpartitioning(idx#290L, 200)
    :        :     +- *Filter isnotnull(idx#290L)
    :        :        +- Scan ExistingRDD[idx#290L,val#291]
    :        +- *Sort [key1#296L ASC NULLS FIRST], false, 0
    :           +- Exchange hashpartitioning(key1#296L, 200)
    :              +- *Filter (isnotnull(key2#297L) && isnotnull(key1#296L))
    :                 +- Scan ExistingRDD[key1#296L,key2#297L,val#298L]
    +- *Sort [idx#360L ASC NULLS FIRST], false, 0
       +- Exchange hashpartitioning(idx#360L, 200)
          +- *Filter isnotnull(idx#360L)
             +- Scan ExistingRDD[idx#360L,val#361]
    

    For details see SPARK-6459.

    0 讨论(0)
提交回复
热议问题