Question
I am new to PyTorch and am trying to implement a feed-forward neural network to classify the MNIST data set. I'm having problems using cross-validation. My data has the following shapes:
x_train: torch.Size([45000, 784]) and y_train: torch.Size([45000])
I tried to use KFold from sklearn:
from sklearn.model_selection import KFold
kfold = KFold(n_splits=10)
Here is the first part of my train method where I'm dividing the data into folds:
for train_index, test_index in kfold.split(x_train, y_train):
    x_train_fold = x_train[train_index]
    x_test_fold = x_test[test_index]
    y_train_fold = y_train[train_index]
    y_test_fold = y_test[test_index]
    print(x_train_fold.shape)
    for epoch in range(epochs):
        ...
The indices for the y_train_fold variable are right, simply [ 0 1 2 ... 4497 4498 4499], but not for x_train_fold, which gets [ 4500 4501 4502 ... 44997 44998 44999]. The same goes for the test folds.
For the first iteration I want the variable x_train_fold to be the first 4500 pictures, in other words to have the shape torch.Size([4500, 784]), but it has the shape torch.Size([40500, 784]).
Any tips on how to get this right?
Answer 1:
I think you're confused!
Ignore the second dimension for a moment. When you have 45000 points and use 10-fold cross-validation, what is the size of each fold? 45000/10, i.e. 4500.
This means each of your folds will contain 4500 data points; one fold is used for testing and the remaining nine for training, i.e.
For testing: one fold => 4500 data points => size: 4500
For training: remaining folds => 45000-4500 data points => size: 45000-4500=40500
Thus, for the first iteration, the first 4500 data points (by index) will be used for testing and the rest for training. (See the image below.)
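To make the fold arithmetic above concrete, here is a minimal sketch that only looks at the indices KFold produces. It assumes a plain index array (`indices`, a made-up stand-in for the 45000 rows of x_train) and the default `shuffle=False`, so the test folds are contiguous blocks.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples, n_splits = 45000, 10
indices = np.arange(n_samples)  # hypothetical stand-in for the rows of x_train

kfold = KFold(n_splits=n_splits)  # shuffle=False by default, so test folds are contiguous

fold_sizes = []   # (train size, test size) for every fold
test_ranges = []  # (first test index, last test index) for every fold
for fold, (train_index, test_index) in enumerate(kfold.split(indices)):
    fold_sizes.append((len(train_index), len(test_index)))
    test_ranges.append((int(test_index[0]), int(test_index[-1])))
    print(f"fold {fold + 1}: test rows {test_ranges[-1][0]}..{test_ranges[-1][1]}, "
          f"train size {len(train_index)}, test size {len(test_index)}")
```

Every fold reports a test size of 4500 and a train size of 40500; only which block of rows is held out changes from one iteration to the next.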
Given that your data is x_train: torch.Size([45000, 784]) and y_train: torch.Size([45000]), this is how your code should look:
for train_index, test_index in kfold.split(x_train, y_train):
    print(train_index, test_index)
    x_train_fold = x_train[train_index]
    y_train_fold = y_train[train_index]
    x_test_fold = x_train[test_index]
    y_test_fold = y_train[test_index]
    print(x_train_fold.shape, y_train_fold.shape)
    print(x_test_fold.shape, y_test_fold.shape)
    break
[ 4500 4501 4502 ... 44997 44998 44999] [ 0 1 2 ... 4497 4498 4499]
torch.Size([40500, 784]) torch.Size([40500])
torch.Size([4500, 784]) torch.Size([4500])
So, when you say "I want the variable x_train_fold to be the first 4500 picture... shape torch.Size([4500, 784])", you're wrong. That size corresponds to x_test_fold. In the first iteration, based on 10 folds, x_train_fold will have 40500 points, so its size is supposed to be torch.Size([40500, 784]).
Answer 2:
You mixed up the indices.
x_train = x[train_index]
x_test = x[test_index]
y_train = y[train_index]
y_test = y[test_index]
x_fold = x_train[train_index]
y_fold = y_train[test_index]
It should be:
x_fold = x_train[train_index]
y_fold = y_train[train_index]
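A small sketch of why the same index array has to be used for both tensors. The toy tensors `x` and `y` and the hand-written index arrays are made up for illustration; the indices are converted with `torch.as_tensor` here only to keep the example explicit.

```python
import numpy as np
import torch

# Toy data (hypothetical): 10 samples, 3 features.
# Every feature of row i is set to i, so the correct label of a row is easy to read off.
y = torch.arange(10)
x = y.float().unsqueeze(1).repeat(1, 3)

# Hand-written index arrays standing in for what kfold.split would yield.
train_index = torch.as_tensor(np.array([2, 3, 4, 5, 6, 7, 8, 9]))
test_index = torch.as_tensor(np.array([0, 1]))

# Same index array for inputs and labels: rows and labels stay aligned.
x_fold = x[train_index]
y_fold = y[train_index]
aligned = torch.equal(x_fold[:, 0], y_fold.float())

# Mixing the index arrays gives labels that do not belong to the selected rows.
y_wrong = y[test_index]
print(aligned, x_fold.shape, y_fold.shape, y_wrong.shape)
```

With mixed indices the labels do not even have the same length as the inputs, which is the kind of mismatch described above.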
Answer 3:
I think I have it right now, but the code feels a bit messy with 3 nested loops. Is there a simpler way to do it, or is this approach okay?
Here's my code for the training with cross validation:
def train(network, epochs, save_Model = False):
    total_acc = 0
    for fold, (train_index, test_index) in enumerate(kfold.split(x_train, y_train)):
        ### Dividing data into folds
        x_train_fold = x_train[train_index]
        x_test_fold = x_train[test_index]
        y_train_fold = y_train[train_index]
        y_test_fold = y_train[test_index]
        train = torch.utils.data.TensorDataset(x_train_fold, y_train_fold)
        test = torch.utils.data.TensorDataset(x_test_fold, y_test_fold)
        train_loader = torch.utils.data.DataLoader(train, batch_size = batch_size, shuffle = False)
        test_loader = torch.utils.data.DataLoader(test, batch_size = batch_size, shuffle = False)
        for epoch in range(epochs):
            print('\nEpoch {} / {} \nFold number {} / {}'.format(epoch + 1, epochs, fold + 1 , kfold.get_n_splits()))
            correct = 0
            network.train()
            for batch_index, (x_batch, y_batch) in enumerate(train_loader):
                optimizer.zero_grad()
                out = network(x_batch)
                loss = loss_f(out, y_batch)
                loss.backward()
                optimizer.step()
                pred = torch.max(out.data, dim=1)[1]
                correct += (pred == y_batch).sum()
                if (batch_index + 1) % 32 == 0:
                    print('[{}/{} ({:.0f}%)]\tLoss: {:.6f}\t Accuracy:{:.3f}%'.format(
                        (batch_index + 1)*len(x_batch), len(train_loader.dataset),
                        100.*batch_index / len(train_loader), loss.data, float(correct*100) / float(batch_size*(batch_index+1))))
        total_acc += float(correct*100) / float(batch_size*(batch_index+1))
    total_acc = (total_acc / kfold.get_n_splits())
    print('\n\nTotal accuracy cross validation: {:.3f}%'.format(total_acc))
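One possible way to flatten the nesting, offered as a sketch rather than a definitive version: move the per-batch loop into a helper and let the fold loop only build loaders, train, and score. Note that the loop above never reads test_loader, so the accuracy it reports is measured on training batches; the sketch instead scores each fold on its held-out part and builds a fresh network per fold. The synthetic data, `make_network`, `run_epoch`, `cross_validate`, and the SGD settings are all assumptions for illustration, not part of the original code.

```python
import numpy as np
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import KFold

torch.manual_seed(0)

# Synthetic stand-in for MNIST (assumption): 200 samples, 784 features, 10 classes.
x_train = torch.randn(200, 784)
y_train = torch.randint(0, 10, (200,))

def make_network():
    # Hypothetical small feed-forward network; substitute your own model.
    return nn.Sequential(nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 10))

def run_epoch(network, loader, loss_f, optimizer=None):
    """One pass over loader; trains when an optimizer is given. Returns accuracy in %."""
    training = optimizer is not None
    network.train(training)  # train mode when training, eval mode otherwise
    correct = 0
    with torch.set_grad_enabled(training):
        for x_batch, y_batch in loader:
            out = network(x_batch)
            if training:
                optimizer.zero_grad()
                loss_f(out, y_batch).backward()
                optimizer.step()
            correct += (out.argmax(dim=1) == y_batch).sum().item()
    return 100.0 * correct / len(loader.dataset)

def cross_validate(n_splits=5, epochs=2, batch_size=32):
    kfold = KFold(n_splits=n_splits)
    loss_f = nn.CrossEntropyLoss()
    fold_acc = []
    for train_index, test_index in kfold.split(np.arange(len(x_train))):
        train_index = torch.as_tensor(train_index)
        test_index = torch.as_tensor(test_index)
        network = make_network()  # fresh weights so folds do not share training
        optimizer = torch.optim.SGD(network.parameters(), lr=0.1)
        train_loader = DataLoader(TensorDataset(x_train[train_index], y_train[train_index]),
                                  batch_size=batch_size, shuffle=True)
        test_loader = DataLoader(TensorDataset(x_train[test_index], y_train[test_index]),
                                 batch_size=batch_size)
        for _ in range(epochs):
            run_epoch(network, train_loader, loss_f, optimizer)
        # Score the fold on data the network has not been trained on.
        fold_acc.append(run_epoch(network, test_loader, loss_f))
    return fold_acc

fold_acc = cross_validate()
print('Mean held-out accuracy: {:.3f}%'.format(sum(fold_acc) / len(fold_acc)))
```

Dividing by `len(loader.dataset)` instead of `batch_size*(batch_index+1)` also keeps the accuracy exact when the last batch is smaller than batch_size.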
Source: https://stackoverflow.com/questions/58996242/cross-validation-for-mnist-dataset-with-pytorch-and-sklearn