It seems like KFold generates the same indices every time the object is iterated over, while ShuffleSplit generates different indices every time. Is this correct? If so, what are the use cases for one over the other?
Difference in KFold and ShuffleSplit output
KFold will divide your dataset into a prespecified number of folds, and every sample must be in one and only one fold. A fold is a subset of your dataset.
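A minimal sketch, assuming scikit-learn's model_selection API, showing that KFold partitions the indices into disjoint folds and reproduces the same splits every time you iterate (as long as shuffling is left off):

    import numpy as np
    from sklearn.model_selection import KFold

    X = np.arange(10).reshape(10, 1)  # 10 samples, indices 0..9

    kf = KFold(n_splits=5)
    for train_idx, test_idx in kf.split(X):
        print("train:", train_idx, "test:", test_idx)
    # Each index appears in exactly one test fold, and iterating again
    # yields identical splits because no randomness is involved.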
ShuffleSplit will randomly sample your entire dataset during each iteration to generate a training set and a test set. The test_size and train_size parameters control how large the test and training sets should be for each iteration. Since you are sampling from the entire dataset during each iteration, values selected during one iteration could be selected again during another iteration.
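A companion sketch under the same assumptions, showing ShuffleSplit drawing a fresh random split each iteration, with test_size and train_size controlling the sizes:

    import numpy as np
    from sklearn.model_selection import ShuffleSplit

    X = np.arange(10).reshape(10, 1)  # 10 samples, indices 0..9

    ss = ShuffleSplit(n_splits=5, train_size=0.5, test_size=0.3, random_state=0)
    for train_idx, test_idx in ss.split(X):
        print("train:", train_idx, "test:", test_idx)
    # The same index can land in the test set of several iterations, and
    # train_size + test_size need not cover the whole dataset.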
Summary: ShuffleSplit works iteratively, KFold just divides the dataset into k folds.
Difference when doing validation
In KFold, during each round you will use one fold as the test set and all the remaining folds as your training set. However, in ShuffleSplit, during each round n you should only use the training and test set from iteration n. As your dataset grows, cross-validation time increases, making ShuffleSplit a more attractive alternative. If you can train your algorithm with a certain percentage of your data as opposed to using all k-1 folds, ShuffleSplit is an attractive option; see the sketch below.
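As a rough illustration of that trade-off (the dataset, model, and split sizes here are made up for the example), you can pass a ShuffleSplit to cross_val_score and train on, say, 30% of the data per round instead of the k-1 folds KFold would require:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import ShuffleSplit, cross_val_score

    X, y = make_classification(n_samples=10000, n_features=20, random_state=0)

    cv = ShuffleSplit(n_splits=5, train_size=0.3, test_size=0.1, random_state=0)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
    print(scores)  # one score per ShuffleSplit iteration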