Question
I have a dataset with 12 feature columns plus 1 binary target column and about 4000 rows. I need to split it into training (70%), validation (20%) and test (10%) sets.
The dataset is heavily imbalanced (95% class 0, 5% class 1), so I need to keep the target ratio the same in each set.
I am able to split the dataset, but I have no idea how to keep the ratio.
I am working with a subset of the Wine Quality data.
Answer 1:
If you have access to MATLAB's Statistics and Machine Learning Toolbox, you can use the cvpartition function.
From the MATLAB help on cvpartition:
c = cvpartition(group,'HoldOut',p) randomly partitions observations into a training set and a test set with stratification, using the class information in group; that is, both training and test sets have roughly the same class proportions as in group.
You can apply the function twice to get three partitions. This function preserves the original class distribution.
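Applying the function twice could look roughly like the sketch below. The synthetic winedataset matrix is only a stand-in for the real data (12 features, binary target in column 13), and the variable names c1, c2 and rest are made up for illustration. The second hold-out fraction is 1/3 because the test set is 10% of the whole, which is one third of the 30% left after the first split.

```matlab
% Hypothetical stand-in data: 4000 rows, 12 features, ~5% positives in column 13.
rng(0);
winedataset = [randn(4000, 12), double(rand(4000, 1) < 0.05)];

% First split: keep 70% for training, hold out 30% (validation + test).
c1 = cvpartition(winedataset(:, 13), 'HoldOut', 0.3);
train_set = winedataset(training(c1), :);
rest      = winedataset(test(c1), :);

% Second split: hold out 1/3 of the remaining 30%, i.e. 10% overall, as test.
c2 = cvpartition(rest(:, 13), 'HoldOut', 1/3);
valid_set = rest(training(c2), :);
test_set  = rest(test(c2), :);

% Share of class 1 in each subset; all three should be close to the overall share.
ratios = [mean(train_set(:, 13)), mean(valid_set(:, 13)), mean(test_set(:, 13))];
```

Because both calls stratify on the target column, the values in ratios should all sit near the overall 5%.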
Answer 2:
So far I have come up with this; if anyone knows a better solution, let me know. I split my dataset by the target column, then split each of the two class subsets into the first 70%, the next 20% and the last 10% of its rows, and merged the corresponding pieces together. After that, I separated features and targets.
%split in 0/1 samples
winedataset_0 = winedataset(winedataset(:, 13) == 0, :);
winedataset_1 = winedataset(winedataset(:, 13) == 1, :);
%train
%size(x, 1) counts rows; length(x) returns the largest dimension instead
split_tr_0 = round(size(winedataset_0, 1)*0.7);
split_tr_1 = round(size(winedataset_1, 1)*0.7);
train_0 = winedataset_0(1:split_tr_0,:);
train_1 = winedataset_1(1:split_tr_1,:);
train_set = vertcat(train_0, train_1);
train_set = train_set(randperm(size(train_set, 1)),:);
%valid
split_valid_0 = split_tr_0 + round(size(winedataset_0, 1)*0.2);
split_valid_1 = split_tr_1 + round(size(winedataset_1, 1)*0.2);
valid_0 = winedataset_0(split_tr_0+1:split_valid_0,:);
valid_1 = winedataset_1(split_tr_1+1:split_valid_1,:);
valid_set = vertcat(valid_0, valid_1);
valid_set = valid_set(randperm(size(valid_set, 1)),:);
%test
test_0 = winedataset_0(split_valid_0+1:end,:);
test_1 = winedataset_1(split_valid_1+1:end,:);
test_set = vertcat(test_0, test_1);
test_set = test_set(randperm(size(test_set, 1)),:);
%Split into X and y
X_train = train_set(:,1:12);
y_train = train_set(:,13);
X_valid = valid_set(:,1:12);
y_valid = valid_set(:,13);
X_test = test_set(:,1:12);
y_test = test_set(:,13);
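The same idea can be written more compactly with a loop over the classes. This is only a sketch: the data matrix is a synthetic stand-in, and the names fractions and parts are made up for illustration. Unlike the code above, it shuffles the rows within each class before slicing, which matters if the file is sorted in some way.

```matlab
% Hypothetical stand-in data: 4000 rows, 12 features, ~5% positives in column 13.
rng(1);
data = [randn(4000, 12), double(rand(4000, 1) < 0.05)];

fractions = [0.7 0.2];    % train and validation shares; the remainder is test
parts = {[], [], []};     % row indices for train, validation, test

for cls = [0 1]
    idx = find(data(:, 13) == cls);
    idx = idx(randperm(numel(idx)));           % shuffle within the class
    n_tr = round(fractions(1) * numel(idx));
    n_va = round(fractions(2) * numel(idx));
    parts{1} = [parts{1}; idx(1:n_tr)];
    parts{2} = [parts{2}; idx(n_tr+1:n_tr+n_va)];
    parts{3} = [parts{3}; idx(n_tr+n_va+1:end)];
end

train_set = data(parts{1}, :);
valid_set = data(parts{2}, :);
test_set  = data(parts{3}, :);

% Share of class 1 in each subset; all three should be close to the overall share.
ratios = [mean(train_set(:, 13)), mean(valid_set(:, 13)), mean(test_set(:, 13))];
```

Comparing ratios against mean(data(:, 13)) is a quick way to confirm that the class proportion was kept in each set.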
Source: https://stackoverflow.com/questions/36674651/matlab-split-into-train-valid-test-set-and-keep-proportion