Difference between using train_test_split and cross_val_score in sklearn.cross_validation

抹茶落季 2021-02-04 10:40

I have a matrix with 20 columns. The last column contains the 0/1 labels.

The link to the data is here.

I am trying to run random forest on the dataset, using cross valid

2 answers
  • 2021-02-04 11:10

    The answer is what @KCzar pointed out. I just want to note that the easiest way I have found to randomize the data (shuffling X and y with the same index) is as follows:

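    import numpy as np
    p = np.random.permutation(len(X))  # one random ordering of the row indices
    X, y = X[p], y[p]  # apply the same ordering to both arrays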
    

    source: Better way to shuffle two numpy arrays in unison
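
As a quick sanity check of the shuffle above, here is a minimal sketch on made-up data. The toy arrays, the fixed seed, and the names `X_shuffled` and `y_shuffled` are assumptions for illustration, not part of the original question. It confirms that each label stays attached to its row:

```python
import numpy as np

# Hypothetical toy data: 5 rows, each label equal to its row's first feature
X = np.arange(10).reshape(5, 2)   # rows: [0,1], [2,3], [4,5], [6,7], [8,9]
y = X[:, 0].copy()

rng = np.random.RandomState(42)   # fixed seed so the shuffle is repeatable
p = rng.permutation(len(X))       # one random ordering of the row indices
X_shuffled, y_shuffled = X[p], y[p]

# Every label still matches its row after shuffling
assert np.array_equal(X_shuffled[:, 0], y_shuffled)
```

Note that indexing with `p` like this needs NumPy arrays; plain Python lists would have to be converted first. `sklearn.utils.shuffle(X, y)` should give the same kind of aligned shuffle in a single call.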

  • 2021-02-04 11:11

    When using cross_val_score, you'll frequently want to use a KFold or StratifiedKFold iterator:

    http://scikit-learn.org/0.10/modules/cross_validation.html#computing-cross-validated-metrics

    http://scikit-learn.org/0.10/modules/generated/sklearn.cross_validation.KFold.html#sklearn.cross_validation.KFold

    By default, cross_val_score will not randomize your data, which can produce odd results like this if your data isn't in random order to begin with.

    The KFold iterator has a random_state parameter, which takes effect when shuffle=True:

    http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html

    So does train_test_split, which does randomize by default:

    http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html

    Patterns like the one you described are usually the result of a lack of randomness in the train/test split.
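
To make the effect concrete, here is a minimal sketch, not the asker's actual setup. It assumes a newer scikit-learn release, where these tools live in `sklearn.model_selection` instead of `sklearn.cross_validation`. The data is synthetic (19 feature columns plus 0/1 labels, with the rows sorted by label), and the variable names are made up for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the question's data (an assumption, not the real file):
# 19 features, 0/1 labels, rows sorted so all 0s come before all 1s.
rng = np.random.RandomState(0)
y = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 19)) + y[:, None]  # class 1 is shifted by +1 in every feature

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Unshuffled folds: each fold trains on one class and tests on the other
plain_cv = KFold(n_splits=2)
plain_scores = cross_val_score(clf, X, y, cv=plain_cv)

# Shuffled folds: both classes appear in every training and test fold
shuffled_cv = KFold(n_splits=2, shuffle=True, random_state=0)
shuffled_scores = cross_val_score(clf, X, y, cv=shuffled_cv)

print("unshuffled:", plain_scores)
print("shuffled:  ", shuffled_scores)
```

On sorted data like this, the unshuffled folds never see the class they are tested on, so their scores collapse, while the shuffled folds score much higher. StratifiedKFold is another option when you also want each fold to keep the overall class ratio.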
