I'm trying to create N balanced random subsamples of my large unbalanced dataset. Is there a way to do this simply with scikit-learn / pandas or do I have to implement it m
This type of data splitting is not provided among the built-in data splitting techniques exposed in sklearn.cross_validation.
The closest match to your needs is sklearn.cross_validation.StratifiedShuffleSplit, which can generate subsamples of any size while preserving the class proportions of the whole dataset, i.e. it deliberately reproduces the same imbalance that is in your main dataset. While this is not what you are looking for, you may be able to adapt its code so that the imposed ratio is always 50/50.
(This would probably be a very good contribution to scikit-learn if you feel up to it.)
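In the meantime, a manual version is short enough to write by hand. The following is only a minimal sketch, not part of scikit-learn: the function name balanced_subsamples, its parameters, and the toy DataFrame are all made up for illustration. It assumes "balanced" means every class is downsampled, without replacement, to the size of the rarest class, and it uses only pandas and NumPy.

```python
import numpy as np
import pandas as pd


def balanced_subsamples(df, label_col, n_subsamples, random_state=None):
    """Yield n_subsamples DataFrames with the same number of rows per class.

    Hypothetical helper: each class is downsampled (without replacement)
    to the size of the rarest class in df[label_col].
    """
    rng = np.random.default_rng(random_state)
    labels = df[label_col].to_numpy()
    # Positional row indices for each distinct class label.
    groups = [np.flatnonzero(labels == value) for value in pd.unique(labels)]
    # The rarest class sets the per-class sample size.
    n_per_class = min(len(idx) for idx in groups)
    for _ in range(n_subsamples):
        picked = np.concatenate(
            [rng.choice(idx, size=n_per_class, replace=False) for idx in groups]
        )
        rng.shuffle(picked)  # avoid returning the rows grouped by class
        yield df.iloc[picked]


# Toy data (assumed for illustration): 90 rows of class 0, 10 rows of class 1.
df = pd.DataFrame({"x": range(100), "y": [0] * 90 + [1] * 10})
samples = list(balanced_subsamples(df, "y", n_subsamples=3, random_state=0))
```

Each element of samples is an independent draw, so the same majority-class row can appear in more than one subsample. If the subsamples must not overlap, the majority-class indices would have to be partitioned instead of redrawn each time.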