I am working with sklearn's StratifiedKFold split, and when I attempt to split using multi-class targets, I receive an error (see below). When I split using binary targets, it works n
keras.utils.to_categorical produces a one-hot encoded class vector, i.e. the multilabel-indicator mentioned in the error message. StratifiedKFold is not designed to work with such input; from the split method docs:
split(X, y, groups=None)
[...]
y : array-like, shape (n_samples,)
The target variable for supervised learning problems. Stratification is done based on the y labels.
i.e. your y must be a 1-D array of your class labels.
Essentially, what you have to do is simply invert the order of the operations: split first (using your initial y_train), and convert with to_categorical afterwards.
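As a rough sketch of that ordering (not the asker's actual code): the toy data below is made up, and the one_hot helper is a hypothetical NumPy stand-in for keras.utils.to_categorical so that the snippet runs without Keras installed. It assumes integer labels in the range 0..num_classes-1.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def one_hot(labels, num_classes):
    """Hypothetical stand-in for keras.utils.to_categorical.

    Assumes integer class labels in the range 0..num_classes-1.
    """
    return np.eye(num_classes, dtype="float32")[labels]


# Made-up toy data: 9 samples, 3 classes, 3 samples per class.
x_train = np.arange(18).reshape(9, 2)
y_train = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
fold_shapes = []
for train_index, val_index in skf.split(x_train, y_train):  # 1-D labels go in here
    x_tr, x_val = x_train[train_index], x_train[val_index]
    # One-hot encode only after the split has been made.
    y_tr = one_hot(y_train[train_index], num_classes=3)
    y_val = one_hot(y_train[val_index], num_classes=3)
    fold_shapes.append((y_tr.shape, y_val.shape))
```

The splitter only ever sees the 1-D y_train; the one-hot arrays exist solely for the model inside each fold.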
In my case, x was a 2D matrix, and y was also a 2D matrix, i.e. indeed a multi-class multi-output case. I just passed a dummy np.zeros(shape=(n, 1)) for y and x as usual (note that with a constant dummy y the folds are no longer stratified by anything). Full code example:
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [3, 7], [9, 4]])
# y = np.array([0, 0, 1, 1, 0, 1]) # <<< works
y = X # does not work if passed into `.split`
rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=36851234)
for train_index, test_index in rskf.split(X, np.zeros(shape=(X.shape[0], 1))):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
If y is already one-hot encoded, pass its argmax to split() so that it receives 1-D class labels, like this:
for i, (train_index, val_index) in enumerate(kf.split(x_train, y_train_categorical.argmax(1))):
    x_train_kf, x_val_kf = x_train[train_index], x_train[val_index]
    y_train_kf, y_val_kf = y_train[train_index], y_train[val_index]
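To see why argmax(1) works here, a small sketch with made-up labels: taking the argmax along axis 1 of a one-hot matrix gives back the original 1-D integer labels. The one-hot matrix is built with np.eye purely for illustration; it is assumed to match what to_categorical would produce for these labels.

```python
import numpy as np

# Made-up integer class labels for 5 samples, 3 classes.
y_labels = np.array([0, 2, 1, 2, 0])

# One-hot encode via NumPy indexing (illustrative stand-in for to_categorical).
y_one_hot = np.eye(3)[y_labels]

# argmax along axis 1 returns the column index of the single 1 in each row.
recovered = y_one_hot.argmax(axis=1)
```

This only recovers class labels for genuine one-hot rows; for a true multilabel matrix with several 1s per row, argmax would keep just the first.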
I bumped into the same problem and found out that you can check the type of the target with this util function:
from sklearn.utils.multiclass import type_of_target
type_of_target(y)
'multilabel-indicator'
From its docstring:
- 'binary': y contains <= 2 discrete values and is 1d or a column vector.
- 'multiclass': y contains more than two discrete values, is not a sequence of sequences, and is 1d or a column vector.
- 'multiclass-multioutput': y is a 2d array that contains more than two discrete values, is not a sequence of sequences, and both dimensions are of size > 1.
- 'multilabel-indicator': y is a label indicator matrix, an array of two dimensions with at least two columns, and at most 2 unique values.
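A quick way to get a feel for these categories is to run type_of_target on one small array per case. The arrays below are made-up examples chosen to match each description.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# One illustrative target per category; the dict key is the type we expect.
targets = {
    "binary": np.array([0, 1, 1, 0]),
    "multiclass": np.array([0, 1, 2, 1]),
    "multiclass-multioutput": np.array([[1, 2], [3, 1]]),
    "multilabel-indicator": np.array([[0, 1], [1, 1]]),
}

results = {expected: type_of_target(y) for expected, y in targets.items()}
for expected, actual in results.items():
    print(expected, "->", actual)
```

Only the first two kinds are 1-D label arrays of the sort StratifiedKFold.split expects for y.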
With LabelEncoder you can transform your classes into a 1d array of numbers (given your target labels are in a 1d array of categoricals/objects):
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(target_labels)
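For a self-contained run of the same idea, here is a sketch with hypothetical string labels (the values in target_labels are invented for illustration). LabelEncoder assigns integers according to the sorted order of the classes, and inverse_transform maps them back.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels; any 1d array of categoricals would do.
target_labels = np.array(["cat", "dog", "bird", "cat"])

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(target_labels)   # 1d integer labels
original = label_encoder.inverse_transform(y)    # back to the strings

print(label_encoder.classes_)  # classes in sorted order
print(y)
```

The resulting y is 1d, so it can be passed straight to split().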