I have a matrix with 20 columns. The last column contains the 0/1 labels.
The link to the data is here.
I am trying to run random forest on the dataset, using cross valid
The answer is what @KCzar pointed out. I just want to note that the easiest way I found to randomize the data (shuffling X and y with the same index permutation) is as follows:
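import numpy as np
p = np.random.permutation(len(X))  # one random ordering of the row indices
X, y = X[p], y[p]  # apply the same ordering to both arrays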
source: Better way to shuffle two numpy arrays in unison
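To convince yourself that the rows and labels stay paired, here is a minimal sketch on toy data. The arrays, the seed and the names `X_shuffled`, `y_shuffled` and `pairs_intact` are made up for illustration and are not from the question's dataset.

```python
import numpy as np

# Seeded generator so the example is reproducible (the seed is an arbitrary choice).
rng = np.random.default_rng(0)

# Toy data: row i of X is [2*i, 2*i + 1], and its label is simply i.
X = np.arange(10).reshape(5, 2)
y = np.arange(5)

# One permutation of the row indices, applied to both arrays.
p = rng.permutation(len(X))
X_shuffled, y_shuffled = X[p], y[p]

# Each shuffled row should still sit next to its original label.
pairs_intact = bool(np.all(X_shuffled[:, 0] // 2 == y_shuffled))
print(pairs_intact)
```

Indexing both arrays with the same `p` is what keeps them in unison; shuffling each array separately would break the pairing.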
When using cross_val_score, you'll frequently want to pass it a KFold or StratifiedKFold iterator:
http://scikit-learn.org/0.10/modules/cross_validation.html#computing-cross-validated-metrics
http://scikit-learn.org/0.10/modules/generated/sklearn.cross_validation.KFold.html#sklearn.cross_validation.KFold
By default, cross_val_score does not shuffle your data, which can produce odd results like this if your data isn't in random order to begin with.
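A small sketch of why ordering matters, assuming labels that happen to be sorted by class. It uses a plain KFold from sklearn.model_selection, which is where these utilities live in current releases (the links here point at the older sklearn.cross_validation module). The toy arrays and the helper `classes_per_test_fold` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy labels sorted by class: 10 zeros followed by 10 ones.
y = np.array([0] * 10 + [1] * 10)
X = np.arange(20).reshape(-1, 1)

def classes_per_test_fold(cv):
    """Return the set of classes that appear in each test fold."""
    return [set(y[test_idx].tolist()) for _, test_idx in cv.split(X)]

# Without shuffling, the folds are contiguous blocks of rows.
unshuffled = classes_per_test_fold(KFold(n_splits=2))

# With shuffling, rows are assigned to folds at random.
shuffled = classes_per_test_fold(KFold(n_splits=2, shuffle=True, random_state=0))

print(unshuffled)
print(shuffled)
```

In the unshuffled case each test fold contains a single class, so the model is trained on one class and tested on the other, which is the kind of setup that yields strange scores.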
The KFold iterator has a random_state parameter (it only takes effect when shuffle=True):
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html
So does train_test_split, which does randomize by default:
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
Patterns like the one you described are usually the result of a lack of randomness in the train/test split.
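Putting it together, something along these lines should work. This is only a sketch: the data is a randomly generated stand-in for your matrix (19 feature columns plus a 0/1 label column), the fold count, seeds and n_estimators are arbitrary choices, and the imports use sklearn.model_selection from current scikit-learn rather than the older sklearn.cross_validation module linked above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical stand-in for the question's matrix: 19 features + 1 label column.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 19))
labels = (features[:, 0] + features[:, 1] > 0).astype(int)
data = np.column_stack([features, labels])

# Split the matrix into features and the last-column labels.
X, y = data[:, :-1], data[:, -1].astype(int)

# Shuffle inside the splitter so the folds do not depend on row order,
# and stratify so each fold keeps roughly the same 0/1 ratio.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0)

scores = cross_val_score(clf, X, y, cv=cv)
print(scores.mean())
```

Passing the splitter through `cv=` means you don't have to shuffle X and y by hand first, and fixing `random_state` keeps the folds the same between runs.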