Spark best approach Look-up Dataframe to improve performance

走了就别回头了 2021-01-24 12:10

DataFrame A (millions of records) has, among its columns, create_date and modified_date.

DataFrame B (500 records) has start_date and end_date columns.

Current approach:

<

2 answers
  •  滥情空心
    2021-01-24 13:03

    DataFrames currently don't support direct joins like that. Spark will fully read both tables before performing the join.

    https://issues.apache.org/jira/browse/SPARK-16614

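    Instead, you can use the RDD API and take advantage of the Spark Cassandra Connector's joinWithCassandraTable function, which fetches only the rows whose partition keys match the entries in your RDD: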

    https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable
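
    A minimal sketch of what that could look like, modelled on the example in the connector docs linked above. The keyspace "test", the table "customer_info", the CustomerID case class, the key range 1 to 500, and the connection host are all assumptions for illustration, not details from the question; the case class field name is assumed to match the table's partition key column.

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    // Hypothetical key type: the field name is assumed to match the
    // partition key column of the (assumed) Cassandra table.
    case class CustomerID(cust_id: Int)

    object JoinWithCassandraSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("join-with-cassandra-sketch")
          // Assumed address of a reachable Cassandra node.
          .set("spark.cassandra.connection.host", "127.0.0.1")
        val sc = new SparkContext(conf)

        // The small side of the join: an RDD holding only the keys of interest.
        val keys = sc.parallelize(1 to 500).map(CustomerID(_))

        // Queries Cassandra per key instead of scanning the whole table.
        // Each element is a pair of (key, matching CassandraRow).
        val joined = keys.joinWithCassandraTable("test", "customer_info")

        joined.take(10).foreach { case (key, row) => println(s"$key -> $row") }

        sc.stop()
      }
    }
    ```

    Note that this joins on the partition key, so it only helps if the values you are looking up can be expressed as keys of the Cassandra table.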
