How to left outer join two big tables effectively

Posted by 冷暖自知 on 2019-12-11 07:01:05

Question


I have two tables, table_a and table_b. table_a contains 216,646,500 rows (7,155,998,163 bytes); table_b contains 1,462,775 rows (2,096,277,141 bytes).

table_a's schema is: col_1, col_2, col_3, col_4; table_b's schema is: col_2, col_5, col_6, ... (about 10 columns)

I want to do a left outer join of the two tables on the shared key col_2, but it has been running for 16 hours and hasn't finished yet... The PySpark code is as follows:

combine_table = table_a.join(table_b, table_a.col_2 == table_b.col_2, 'left_outer').collect()

Is there any effective way to join two big tables like this?


Answer 1:


Beware of exploding joins.

Working with an open dataset, this query won't run in a reasonable time:

#standardSQL
SELECT COUNT(*)
FROM `fh-bigquery.reddit_posts.2017_06` a
JOIN `fh-bigquery.reddit_comments.2017_06` b
ON a.subreddit=b.subreddit

What if we get rid of the top 100 joining keys from each side?

#standardSQL
SELECT COUNT(*)
FROM (
  SELECT * FROM `fh-bigquery.reddit_posts.2017_06`
  WHERE subreddit NOT IN (SELECT value FROM UNNEST((
  SELECT APPROX_TOP_COUNT(subreddit, 100) s
  FROM `fh-bigquery.reddit_posts.2017_06`
)))) a
JOIN (
  SELECT * FROM `fh-bigquery.reddit_comments.2017_06`
  WHERE subreddit NOT IN (SELECT value FROM UNNEST((
  SELECT APPROX_TOP_COUNT(subreddit, 100) s
  FROM `fh-bigquery.reddit_comments.2017_06`
)))) b
ON a.subreddit=b.subreddit

This modified query ran in 70 seconds, and the result was:

90508538331

90 billion. That's an exploding join. We had 9 million rows in one table and 80 million rows in the other, and the join produced 90 billion rows, even after eliminating the top 100 keys from each side. The arithmetic is unforgiving: a key that appears m times on one side and n times on the other contributes m × n rows to the output, so a handful of heavy keys can dominate everything.
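You can run the same diagnosis on the Spark side before paying for the join. Here is a minimal PySpark sketch (mine, not part of the original answer; it assumes table_a and table_b are registered as tables, so adapt the loading to however your DataFrames are built) that predicts the join's output size from per-key counts:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed: the two DataFrames from the question, loaded however you load them.
table_a = spark.table("table_a")
table_b = spark.table("table_b")

# Per-key row counts on each side; these aggregates are tiny compared
# with the join itself.
counts_a = table_a.groupBy("col_2").agg(F.count("*").alias("n_a"))
counts_b = table_b.groupBy("col_2").agg(F.count("*").alias("n_b"))

# A key with n_a rows on the left and n_b rows on the right produces
# n_a * n_b joined rows, so the total predicts the join's output size.
(counts_a.join(counts_b, "col_2")
         .agg(F.sum(F.col("n_a") * F.col("n_b")).alias("predicted_rows"))
         .show())

If predicted_rows comes out in the billions, the join itself is the problem, not the cluster.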

In your data, look for any key that could be producing too many results and remove it before running the join (sometimes it's a default value, such as null).
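Applied to the asker's PySpark job, one possible sketch (again my own, mirroring the APPROX_TOP_COUNT trick from the SQL above; it reuses table_a and table_b from the question, and the output path is illustrative) finds the heaviest keys on the smaller side and filters them, along with nulls, out before the join:

import pyspark.sql.functions as F

# The 100 most frequent join keys on the smaller side, collected to the
# driver (only 100 values, so collect() is safe here).
hot_keys = [r["col_2"] for r in (table_b.groupBy("col_2").count()
                                 .orderBy(F.desc("count"))
                                 .limit(100)
                                 .collect())]
hot_keys = [k for k in hot_keys if k is not None]

# Drop null keys and the hot keys before joining. Null keys never match,
# but they still hash to a single partition and can skew the shuffle.
table_b_trimmed = table_b.filter(
    F.col("col_2").isNotNull() & ~F.col("col_2").isin(hot_keys))

combine_table = table_a.join(table_b_trimmed,
                             table_a.col_2 == table_b_trimmed.col_2,
                             'left_outer')

# Write the result out instead of collect()-ing ~216M rows to the driver.
combine_table.write.parquet('/tmp/combine_table')  # illustrative path

Note that, like the SQL example, this changes the result: table_a rows whose key was trimmed from table_b come back with nulls, so treat it as a diagnostic, or handle the hot keys in a separate, smaller join.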



Source: https://stackoverflow.com/questions/46509038/how-to-left-out-join-two-big-tables-effectively
