Scikit-learn balanced subsampling

终归单人心 2020-12-02 10:34

I'm trying to create N balanced random subsamples of my large unbalanced dataset. Is there a way to do this simply with scikit-learn / pandas or do I have to implement it myself?

13 Answers
  • 2020-12-02 11:20

    This kind of balanced splitting is not among the built-in data splitters exposed in sklearn.cross_validation (the module that became sklearn.model_selection in scikit-learn 0.18).

    The closest match to your needs is sklearn.cross_validation.StratifiedShuffleSplit, which can generate subsamples of any size while preserving the class proportions of the whole dataset, i.e. it deliberately reproduces the same imbalance that is in your main dataset. That is not what you are looking for, but you may be able to reuse its code and change the imposed ratio so that it is always 50/50.

    (This would probably be a very good contribution to scikit-learn if you feel up to it.)
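To make the 50/50 idea concrete, here is a minimal sketch that uses NumPy only. It is not part of scikit-learn: the function name `balanced_subsamples` and its parameters are hypothetical, chosen for illustration. It assumes the labels fit in memory as a 1-D array and draws, for each subsample, the same number of rows from every class without replacement.

```python
import numpy as np


def balanced_subsamples(y, n_subsamples, size_per_class=None, seed=0):
    """Yield index arrays that each hold the same number of rows per class.

    Hypothetical helper for illustration; not a scikit-learn API.
    """
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)

    # Default: take as many rows per class as the rarest class provides.
    if size_per_class is None:
        size_per_class = int(counts.min())
    if size_per_class > counts.min():
        raise ValueError("size_per_class exceeds the rarest class count")

    for _ in range(n_subsamples):
        # Draw size_per_class row indices from each class, without replacement.
        parts = [
            rng.choice(np.flatnonzero(y == c), size=size_per_class, replace=False)
            for c in classes
        ]
        idx = np.concatenate(parts)
        rng.shuffle(idx)  # mix the classes so rows are not grouped by label
        yield idx


# Toy data: 90 rows of class 0 and 10 rows of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)

subsamples = list(balanced_subsamples(y, n_subsamples=3))
for idx in subsamples:
    X_sub, y_sub = X[idx], y[idx]
    print(np.bincount(y_sub))  # per-class row counts in this subsample
```

Each subsample is drawn independently, so a row from the majority class may appear in more than one subsample. The rarest class caps the subsample size here; sampling with replacement would lift that cap at the cost of duplicated rows.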
