Difference between using train_test_split and cross_val_score in sklearn.cross_validation

I have a matrix with 20 columns. The last column contains the 0/1 labels.

The link to the data is here.

I am trying to run random forest on the dataset, using cross valid

2 Answers

清歌不尽

2021-02-04 11:10
The answer is what @KCzar pointed out. I just want to note that the easiest way I have found to randomize the data (shuffling X and y with the same index permutation) is as follows:
```
import numpy as np
# X and y are NumPy arrays with one row / one label per sample
p = np.random.permutation(len(X))
X, y = X[p], y[p]
```
Source: Better way to shuffle two numpy arrays in unison
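To check that this kind of shuffle keeps each row attached to its label, here is a minimal sketch on made-up data. The toy arrays, the fixed seed, and the `pairs_intact` name are assumptions for illustration only; they are not from the question's dataset.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed, only so the sketch is repeatable

# Toy data: the first feature equals the label, so the pairing is easy to verify
y = np.array([0, 0, 0, 1, 1, 1])
X = np.column_stack([y, np.arange(6)])

p = rng.permutation(len(X))    # one permutation of the row indices
X_shuf, y_shuf = X[p], y[p]    # apply the same permutation to both arrays

# Every shuffled row is still paired with its original label
pairs_intact = bool(np.all(X_shuf[:, 0] == y_shuf))
print(pairs_intact)
```

Indexing both arrays with the same `p` is what keeps them aligned; shuffling X and y separately would break the pairing.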
心在旅途

2021-02-04 11:11

When using cross_val_score, you'll frequently want to pass a KFold or StratifiedKFold iterator:

http://scikit-learn.org/0.10/modules/cross_validation.html#computing-cross-validated-metrics

http://scikit-learn.org/0.10/modules/generated/sklearn.cross_validation.KFold.html#sklearn.cross_validation.KFold

By default, cross_val_score will not randomize your data, which can produce odd results like this if your data isn't in random order to begin with.
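A small sketch of why row order matters, assuming labels that happen to be sorted by class. It uses `sklearn.model_selection`, where these utilities live in scikit-learn 0.18 and later, rather than the `sklearn.cross_validation` module named in this thread. The toy arrays and the `fold_classes` name are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy labels sorted by class, standing in for a dataset stored in label order
y = np.array([0] * 5 + [1] * 5)
X = np.arange(10).reshape(-1, 1)

# Without shuffling, KFold cuts the rows into contiguous blocks
fold_classes = []
for train_idx, test_idx in KFold(n_splits=2, shuffle=False).split(X):
    train_classes = np.unique(y[train_idx]).tolist()
    test_classes = np.unique(y[test_idx]).tolist()
    fold_classes.append((train_classes, test_classes))

# Each fold trains on one class and is tested on the other
print(fold_classes)
```

With folds like these, a classifier never sees the class it is evaluated on, so the cross-validated scores say little about the model itself.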

The KFold iterator has a random_state parameter, which takes effect when shuffling is enabled:

http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html
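As a rough sketch, a shuffled KFold can be passed to cross_val_score through its `cv` argument. This uses the `sklearn.model_selection` module from scikit-learn 0.18 and later instead of `sklearn.cross_validation`. The synthetic data below is a hypothetical stand-in for the question's matrix (19 feature columns plus a 0/1 label, rows sorted by label), and the hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.RandomState(0)

# Hypothetical data: 200 rows, 19 features, labels sorted by class
X = rng.normal(size=(200, 19))
y = np.array([0] * 100 + [1] * 100)
X[y == 1] += 1.0  # shift class 1 so the two classes are separable

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# shuffle=True randomizes row order before splitting; random_state makes it repeatable
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)
print(scores.mean())
```

For classification, `StratifiedKFold(n_splits=5, shuffle=True, random_state=0)` can be swapped in the same way if each fold should also preserve the class proportions.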

So does train_test_split, which does randomize by default:

http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
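For comparison, a minimal train_test_split sketch on toy data sorted by label, again assuming the `sklearn.model_selection` import path from scikit-learn 0.18 and later. The array sizes and the 25% test fraction are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data sorted by label: rows 0-9 are class 0, rows 10-19 are class 1
y = np.array([0] * 10 + [1] * 10)
X = np.arange(20).reshape(-1, 1)

# Rows are shuffled before splitting by default; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)
```

Because the rows are shuffled first, the held-out quarter is drawn from across the whole array rather than being the last contiguous block.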

Patterns like the one you described are usually the result of a lack of randomness in the train/test sets.
