This question explains how Spark's random split works (How does Sparks RDD.randomSplit actually split the RDD), but I don't understand how Spark keeps track of which values went to which split.
It's exactly the same as sampling an RDD.
Assuming you have the weight array (0.6, 0.2, 0.2), Spark will generate one DataFrame for each of the ranges (0.0, 0.6), (0.6, 0.8), (0.8, 1.0).
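For example, here is a minimal sketch of such a three-way split (the parent DataFrame contents, app name, and seed are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("randomSplit-sketch")   // hypothetical app name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical parent DataFrame with ids 0..999
val parent = (0 until 1000).toDF("id")

// Weights (0.6, 0.2, 0.2) map to the ranges (0.0, 0.6), (0.6, 0.8), (0.8, 1.0)
val Array(train, validation, test) = parent.randomSplit(Array(0.6, 0.2, 0.2), seed = 42L)

// Roughly 600 / 200 / 200 rows, and no row appears in more than one split
println(s"train=${train.count()} validation=${validation.count()} test=${test.count()}")
```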
When it's time to materialize a result DataFrame, Spark just goes over the parent DataFrame: for each item it generates a random number, and if that number falls in the range assigned to that child, it emits the item. All child DataFrames share the same random number generator (technically, different generators initialized with the same seed), so the sequence of random numbers is deterministic and each item ends up in exactly one of the disjoint ranges.
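To illustrate the shared-seed idea (this is plain Scala, not Spark's actual sampler): each pass over the items rebuilds a generator from the same seed, so an item is always paired with the same random number, and therefore lands in exactly one range.

```scala
import scala.util.Random

val items = ('a' to 'j').map(_.toString)
val seed  = 42L

// One pass per child "DataFrame": the generator is re-created from the same
// seed each time, so item i always draws the same random number.
def sampleRange(lower: Double, upper: Double): Seq[String] = {
  val rng = new Random(seed)
  items.filter { _ =>
    val x = rng.nextDouble()
    x >= lower && x < upper
  }
}

val first  = sampleRange(0.0, 0.6)
val second = sampleRange(0.6, 0.8)
val third  = sampleRange(0.8, 1.0)

// Every item appears in exactly one of the three results.
println(first); println(second); println(third)
```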
As for your last question: if you did not cache the parent DataFrame, its data will be re-fetched (recomputed from its lineage) each time one of the output DataFrames is computed.
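A quick sketch of that point (reusing the `spark` session from the snippet above; the numbers are illustrative): calling `cache()` on the parent before splitting means each action on a split scans the cached data instead of re-evaluating the parent's lineage, which also keeps the splits consistent if that lineage is not deterministic.

```scala
// Without cache(), counting `a` and `b` would each recompute the parent.
val cachedParent = spark.range(0, 1000000).toDF("id").cache()
val Array(a, b) = cachedParent.randomSplit(Array(0.8, 0.2), seed = 7L)

a.count()  // materializes and scans the cached parent
b.count()  // reuses the cached parent instead of recomputing it
```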