Spark: best approach for a look-up DataFrame to improve performance

走了就别回头了 2021-01-24 12:10

DataFrame A has millions of records; two of its columns are create_date and modified_date.

DataFrame B has 500 records, with start_date and end_date columns.

Current approach:

<

2 Answers
  • 2021-01-24 13:03

    DataFrames currently don't support direct joins like that. Spark will fully read both tables before performing the join.

    https://issues.apache.org/jira/browse/SPARK-16614

    You can use the RDD API instead, which lets you take advantage of the joinWithCassandraTable function:

    https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable

  • 2021-01-24 13:07

    As others suggested, one approach is to broadcast the smaller DataFrame. Spark can also do this automatically if you configure the parameter below.

    spark.sql.autoBroadcastJoinThreshold
    

    If a DataFrame's estimated size is below the value specified here, Spark automatically broadcasts it to every executor instead of shuffling both sides of the join. You can read more about this here.
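
To make the broadcast idea concrete, here is a plain-Python sketch (not Spark code) of what each executor effectively does once the small table has been broadcast: every row of the large dataset is checked locally against a full copy of the small table, so the large side never has to be shuffled. The join condition (create_date falling between start_date and end_date) and the period_id field are assumptions for illustration, since the question's original code was not preserved.

```python
from datetime import date

# Stand-in for the small DataFrame B; "period_id" is a hypothetical column.
ranges = [
    {"period_id": 1, "start_date": date(2020, 1, 1), "end_date": date(2020, 6, 30)},
    {"period_id": 2, "start_date": date(2020, 7, 1), "end_date": date(2020, 12, 31)},
]

# A few rows standing in for one partition of the large DataFrame A.
partition = [
    {"id": "a", "create_date": date(2020, 3, 15)},
    {"id": "b", "create_date": date(2020, 9, 1)},
    {"id": "c", "create_date": date(2021, 2, 1)},  # falls in no range
]


def lookup_partition(rows, broadcast_ranges):
    """Left-join each row against the locally held copy of the small table."""
    for row in rows:
        # Assumed condition: start_date <= create_date <= end_date.
        matches = [
            r for r in broadcast_ranges
            if r["start_date"] <= row["create_date"] <= r["end_date"]
        ]
        if not matches:
            # Keep unmatched rows, as a left join would.
            yield {**row, "period_id": None}
        for r in matches:
            yield {**row, "period_id": r["period_id"]}


result = list(lookup_partition(partition, ranges))
for row in result:
    print(row["id"], row["period_id"])
```

In Spark itself the equivalent is to wrap the small DataFrame in `broadcast()` from `pyspark.sql.functions` when joining, or to rely on `spark.sql.autoBroadcastJoinThreshold` (10 MB by default) so the optimizer picks a broadcast join on its own. With only 500 rows on the small side, either route should be cheap, but it is worth confirming the chosen plan with `explain()`.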
