How to left outer join two big tables effectively
Question: I have two tables, table_a and table_b. table_a contains 216,646,500 rows (7,155,998,163 bytes); table_b contains 1,462,775 rows (2,096,277,141 bytes).

table_a's schema is: c_1, c_2, c_3, c_4; table_b's schema is: c_2, c_5, c_6, ... (about 10 columns).

I want to do a left outer join of the two tables on the shared key col_2, but it has been running for 16 hours and hasn't finished yet. The PySpark code is as follows:

    combine_table = table_a.join(table_b, table_a.col_2 == table_b.col_2, 'left_outer').collect()

Is there a more efficient way to do this?
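A common pattern for this situation (not from the original post, just a standard Spark approach) is to broadcast the much smaller table so the join can happen map-side, and to avoid collect(), which pulls the entire joined result onto the driver. Below is a minimal sketch assuming Spark's DataFrame API; the table-loading calls, the join-key name, and the output path are placeholders, not the asker's actual code:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("left_outer_join_example").getOrCreate()

    # table_a (~216M rows) stays distributed; table_b (~1.4M rows, ~2 GB) may be
    # small enough to ship to every executor, letting Spark perform a map-side
    # (broadcast) join instead of shuffling the large table on the join key.
    table_a = spark.table("table_a")   # placeholder: load the tables however they were originally loaded
    table_b = spark.table("table_b")

    # The schemas in the question name the key c_2 while the code uses col_2;
    # use whichever column actually exists in your tables.
    combined = table_a.join(broadcast(table_b), on="c_2", how="left_outer")

    # Avoid .collect() on ~216M joined rows: it materializes everything on the
    # driver and will typically hang or run out of memory. Write the result out
    # (or aggregate it) instead.
    combined.write.mode("overwrite").parquet("/tmp/combined_table")  # placeholder output path

Even without the broadcast hint, simply replacing collect() with a write (or an aggregation such as count()) keeps the work distributed across the cluster rather than funneling the full result through the driver.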