Question
I'm trying to join two DataFrames while retaining the row order of one of them.
From "Which operations preserve RDD order?", it seems (correct me if this is inaccurate, since I'm new to Spark) that joins do not preserve order: the data lives in different partitions, so rows "arrive" at the final DataFrame in no specified order.
How could one perform a join of two DataFrames while preserving the order of one table?
E.g.,
+------------+---------+
| col1 | col2 |
+------------+---------+
| 0 | a |
| 1 | b |
+------------+---------+
joined with
+------------+---------+
| col2 | col3 |
+------------+---------+
| b | x |
| a | y |
+------------+---------+
on col2 should give
+------------+---------+----------+
| col1 | col2 | col3 |
+------------+---------+----------+
| 0 | a | y |
| 1 | b | x |
+------------+---------+----------+
I've heard some things about using coalesce or repartition, but I'm not sure. Any suggestions/methods/insights are appreciated.
Edit: would this be analogous to having one reducer in MapReduce? If so, what would that look like in Spark?
Answer 1:
It can't. You can add a monotonically_increasing_id column to the DataFrame whose order matters before the join, then sort the joined result by that column.
Source: https://stackoverflow.com/questions/38085801/can-dataframe-joins-in-spark-preserve-order