Stratified Train/Test-split in scikit-learn


I need to split my data into a training set (75%) and test set (25%). I currently do that with the code below:

X, Xt, userInfo, userInfo_train = sklearn.cros         


        
7 Answers
  • 2020-11-27 03:31

    Here's an example for continuous/regression data (until this issue on GitHub is resolved).

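    import numpy as np
    from sklearn.model_selection import train_test_split

    # Avoid shadowing the built-ins min() and max().
    y_min = np.amin(y)
    y_max = np.amax(y)

    # num=5 edges give 4 equal-width bins; that may be too few for larger datasets.
    bins = np.linspace(start=y_min, stop=y_max, num=5)
    # Skip the first edge (y_min): with right=True, samples equal to the
    # minimum would otherwise get bin 0 to themselves.
    y_binned = np.digitize(y, bins[1:], right=True)

    # test_size defaults to 0.25, i.e. a 75%/25% split.
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        stratify=y_binned
    )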
    
    • start is the minimum and stop is the maximum of your continuous target.
    • If you don't set right=True, the maximum value more or less ends up in a separate bin of its own, and the split will always fail because that extra bin contains too few samples.
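As a rough illustration of the two points above, the sketch below bins a synthetic continuous target both ways and then stratifies on the result. The toy data, the random seed, and the names `edges`, `labels_right` and `labels_left` are assumptions made for this example and are not part of the original answer. Only the upper edge of each bin is passed to `np.digitize`, so that the minimum does not end up alone in bin 0.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 200 samples, 3 features, continuous target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.uniform(0.0, 100.0, size=200)

# 5 edges from min to max describe 4 equal-width bins.
edges = np.linspace(start=y.min(), stop=y.max(), num=5)

# right=True: intervals are closed on the right, so the maximum
# falls into the last regular bin (labels 0..3).
labels_right = np.digitize(y, edges[1:], right=True)

# right=False: intervals are closed on the left, so the maximum
# spills into an extra bin (label 4) that holds only that sample.
labels_left = np.digitize(y, edges[1:], right=False)

print("right=True  bin counts:", np.bincount(labels_right))
print("right=False bin counts:", np.bincount(labels_left))

# Stratifying on the right=True labels gives a 75%/25% split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=labels_right, random_state=0
)
print("train/test sizes:", len(X_train), len(X_test))

# Stratifying on the right=False labels is expected to be rejected,
# because the extra bin has a single member.
try:
    train_test_split(X, y, test_size=0.25, stratify=labels_left, random_state=0)
except ValueError as err:
    print("stratifying on right=False labels failed:", err)
```

If a bin still turns out too sparse on real data, reducing `num` or switching to quantile-based edges are possible adjustments; which one suits depends on how the target is distributed.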