I need to split my data into a training set (75%) and test set (25%). I currently do that with the code below:
X, Xt, userInfo, userInfo_train = sklearn.cros
Here's an example for continuous/regression data (until this issue on GitHub is resolved).
min = np.amin(y)
max = np.amax(y)
# 5 bins may be too few for larger datasets.
bins = np.linspace(start=min, stop=max, num=5)
y_binned = np.digitize(y, bins, right=True)
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
stratify=y_binned
)
start
is min and stop
is max of your continuous target.right=True
then it will more or less make your max value a separate bin and your split will always fail because too few samples will be in that extra bin.