This question explains how Spark's random split works (How does Sparks RDD.randomSplit actually split the RDD), but I don't understand how Spark keeps track of which values went to which split.
It's exactly the same as sampling an RDD.
Assuming you have the weight array (0.6, 0.2, 0.2), Spark will generate one DataFrame for each of the ranges (0.0, 0.6), (0.6, 0.8), (0.8, 1.0).
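For example, here is a minimal sketch of such a three-way split (the parent DataFrame contents, app name, and seed are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("randomSplit-sketch")   // hypothetical app name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical parent DataFrame with ids 0..999
val parent = (0 until 1000).toDF("id")

// Weights (0.6, 0.2, 0.2) map to the ranges (0.0, 0.6), (0.6, 0.8), (0.8, 1.0)
val Array(train, validation, test) = parent.randomSplit(Array(0.6, 0.2, 0.2), seed = 42L)

// Roughly 600 / 200 / 200 rows, and no row appears in more than one split
println(s"train=${train.count()} validation=${validation.count()} test=${test.count()}")
```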
When it's time to materialize a result DataFrame, Spark just goes over the parent DataFrame: for each item it generates a random number, and if that number falls in the range assigned to that child, it emits the item. All child DataFrames share the same random number generator (technically, different generators initialized with the same seed), so the sequence of random numbers is deterministic and each item ends up in exactly one of the disjoint ranges.
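To illustrate the shared-seed idea (this is plain Scala, not Spark's actual sampler): each pass over the items rebuilds a generator from the same seed, so an item is always paired with the same random number, and therefore lands in exactly one range.

```scala
import scala.util.Random

val items = ('a' to 'j').map(_.toString)
val seed  = 42L

// One pass per child "DataFrame": the generator is re-created from the same
// seed each time, so item i always draws the same random number.
def sampleRange(lower: Double, upper: Double): Seq[String] = {
  val rng = new Random(seed)
  items.filter { _ =>
    val x = rng.nextDouble()
    x >= lower && x < upper
  }
}

val first  = sampleRange(0.0, 0.6)
val second = sampleRange(0.6, 0.8)
val third  = sampleRange(0.8, 1.0)

// Every item appears in exactly one of the three results.
println(first); println(second); println(third)
```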
As for your last question: if you did not cache the parent DataFrame, its data will be re-fetched (recomputed from its lineage) each time one of the output DataFrames is computed.
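A quick sketch of that point (reusing the `spark` session from the snippet above; the numbers are illustrative): calling `cache()` on the parent before splitting means each action on a split scans the cached data instead of re-evaluating the parent's lineage, which also keeps the splits consistent if that lineage is not deterministic.

```scala
// Without cache(), counting `a` and `b` would each recompute the parent.
val cachedParent = spark.range(0, 1000000).toDF("id").cache()
val Array(a, b) = cachedParent.randomSplit(Array(0.8, 0.2), seed = 7L)

a.count()  // materializes and scans the cached parent
b.count()  // reuses the cached parent instead of recomputing it
```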