How does Spark keep track of the splits in randomSplit?

孤独总比滥情好 2021-01-21 06:25

This question explains how Spark's random split works: "How does Sparks RDD.randomSplit actually split the RDD". However, I don't understand how Spark keeps track of what values went

1 Answer
  • 2021-01-21 06:58

    It's exactly the same as sampling an RDD.

    Assuming you have the weight array (0.6, 0.2, 0.2), Spark normalizes the weights so they sum to 1 and then generates one DataFrame for each cumulative range: [0.0, 0.6), [0.6, 0.8), [0.8, 1.0).
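To make the range construction concrete, here is a minimal sketch in plain Python (not Spark's own code). The helper name `weights_to_ranges` is hypothetical; it only illustrates turning a weight array into normalized cumulative ranges.

```python
from itertools import accumulate

def weights_to_ranges(weights):
    """Turn a weight array into half-open cumulative ranges [lo, hi)."""
    total = sum(weights)
    # Normalize the running sums so the last boundary ends at 1.0.
    bounds = [0.0] + [c / total for c in accumulate(weights)]
    return list(zip(bounds[:-1], bounds[1:]))

ranges = weights_to_ranges([0.6, 0.2, 0.2])
for lo, hi in ranges:
    print(f"[{lo:.1f}, {hi:.1f})")
```

Weights that do not sum to 1, such as (3, 1, 1), give the same three ranges because of the normalization step.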

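    When it's time to read a result DataFrame, Spark just iterates over the parent DataFrame. For each item it generates a random number; if that number falls in the specified range, it emits the item. All child DataFrames share the same random number generator (technically, different generators with the same seed), so the sequence of random numbers is deterministic. Each item therefore gets the same random number in every child, and since the ranges do not overlap, it lands in exactly one split. Spark does not need to record which values went where.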
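The idea can be simulated in plain Python. This is a sketch under simplifying assumptions, not Spark's implementation: it uses a single in-memory list as the parent and Python's `random.Random` as the generator, and `sample_range` is a hypothetical helper name.

```python
import random

def sample_range(parent, lo, hi, seed):
    """Emit the items whose random draw falls in [lo, hi)."""
    rng = random.Random(seed)  # fresh generator, same seed for every split
    out = []
    for item in parent:
        x = rng.random()  # same position in the sequence gives the same x
        if lo <= x < hi:
            out.append(item)
    return out

parent = list(range(1000))
ranges = [(0.0, 0.6), (0.6, 0.8), (0.8, 1.0)]

# Each split is computed by its own independent pass over the parent.
splits = [sample_range(parent, lo, hi, seed=42) for lo, hi in ranges]

print([len(s) for s in splits])
```

Because every pass replays the same random sequence, the three splits are disjoint and together cover the parent, even though no pass knows about the others.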

    For your last question: if you did not cache the parent DataFrame, the data for the input DataFrame will be re-fetched each time an output DataFrame is computed.
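The cost of not caching can be illustrated with a small plain-Python simulation. `CountingSource` is a hypothetical stand-in for an uncached parent that counts how often it is scanned; materializing its rows once plays the role of caching. None of this is Spark API.

```python
import random

class CountingSource:
    """Hypothetical uncached parent: every read is a full re-fetch."""
    def __init__(self, n):
        self.n = n
        self.scans = 0

    def read(self):
        self.scans += 1
        return list(range(self.n))

def sample_range(rows, lo, hi, seed):
    rng = random.Random(seed)
    # One random draw per row; keep the row if the draw is in [lo, hi).
    return [r for r in rows if lo <= rng.random() < hi]

ranges = [(0.0, 0.6), (0.6, 0.8), (0.8, 1.0)]

# Without caching: each output split re-reads the source.
uncached = CountingSource(100)
for lo, hi in ranges:
    sample_range(uncached.read(), lo, hi, seed=7)

# With caching: the source is read once and the rows are reused.
cached_src = CountingSource(100)
cached_rows = cached_src.read()
for lo, hi in ranges:
    sample_range(cached_rows, lo, hi, seed=7)

print(uncached.scans, cached_src.scans)
```

In this sketch the uncached source is scanned once per split, while the cached one is scanned only once.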
