I am trying to implement a Siamese network that takes in two images. I load these images and create two separate dataloaders.
In my loop I want to go through both dataloaders simultaneously so that I can train the network on both images.
I see you are struggling to write the right dataloader function. I would do:
from torch.utils.data import Dataset

class Siamese(Dataset):
    def __init__(self, transform=None):
        # init / load your data here
        self.transform = transform

    def __len__(self):
        # return the length of the data
        return ...

    def __getitem__(self, idx):
        # get the images and labels here;
        # the returned images must be tensors, the labels should be ints
        return img1, img2, label1, label2
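Once the template above is filled in, a single DataLoader keeps the two images of each pair aligned, even with shuffle=True. A minimal usage sketch (the batch size and loop body are only illustrative):

from torch.utils.data import DataLoader

siamese_loader = DataLoader(Siamese(transform=None), batch_size=32, shuffle=True)

for img1, img2, label1, label2 in siamese_loader:
    pass  # feed img1 and img2 to the two branches of the Siamese network here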
To complete @ManojAcharya's answer:
The error you are getting comes neither from zip() nor from DataLoader() directly. Python is trying to tell you that it couldn't find one of the data files you are asking for (cf. the FileNotFoundError in the exception trace), probably in your Dataset.
Find below a working example using DataLoader and zip together. Note that if you want to shuffle your data, it becomes difficult to keep the correspondence between the two datasets. This justifies @ManojAcharya's solution.
import torch
from torch.utils.data import DataLoader, Dataset

class DummyDataset(Dataset):
    """
    Dataset of numbers in [a, b] inclusive
    """

    def __init__(self, a=0, b=100):
        super(DummyDataset, self).__init__()
        self.a = a
        self.b = b

    def __len__(self):
        return self.b - self.a + 1

    def __getitem__(self, index):
        return index, "label_{}".format(index)
dataloaders1 = DataLoader(DummyDataset(0, 9), batch_size=2, shuffle=True)
dataloaders2 = DataLoader(DummyDataset(0, 9), batch_size=2, shuffle=True)
for i, data in enumerate(zip(dataloaders1, dataloaders2)):
    print(data)
# ([tensor([ 4, 7]), ('label_4', 'label_7')], [tensor([ 8, 5]), ('label_8', 'label_5')])
# ([tensor([ 1, 9]), ('label_1', 'label_9')], [tensor([ 6, 9]), ('label_6', 'label_9')])
# ([tensor([ 6, 5]), ('label_6', 'label_5')], [tensor([ 0, 4]), ('label_0', 'label_4')])
# ([tensor([ 8, 2]), ('label_8', 'label_2')], [tensor([ 2, 7]), ('label_2', 'label_7')])
# ([tensor([ 0, 3]), ('label_0', 'label_3')], [tensor([ 3, 1]), ('label_3', 'label_1')])
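Notice in the output above that, because both loaders use shuffle=True, the indices drawn from them in each batch no longer correspond (e.g. tensor([4, 7]) versus tensor([8, 5]) in the first batch), which is exactly the caveat mentioned above.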
If you want to iterate over two datasets simultaneously, there is no need to define your own dataset class; just use TensorDataset as below (note that TensorDataset only wraps tensors, so dataset1 and dataset2 here must be tensors with the same first dimension):
dataset = torch.utils.data.TensorDataset(dataset1, dataset2)
dataloader = DataLoader(dataset, batch_size=128, shuffle=True)

for index, (xb1, xb2) in enumerate(dataloader):
    ...
If you want the labels, or want to iterate over more than two datasets, just feed them as additional arguments to the TensorDataset after dataset2, as in the sketch below.
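For instance, a minimal sketch with made-up tensor shapes (the names and sizes are purely illustrative) that also carries labels could look like:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical data: two sets of images and their labels; all tensors must
# share the same first dimension so TensorDataset can pair them index-wise.
images1 = torch.randn(1000, 3, 32, 32)
images2 = torch.randn(1000, 3, 32, 32)
labels1 = torch.randint(0, 10, (1000,))
labels2 = torch.randint(0, 10, (1000,))

dataset = TensorDataset(images1, images2, labels1, labels2)
dataloader = DataLoader(dataset, batch_size=128, shuffle=True)

for xb1, xb2, yb1, yb2 in dataloader:
    pass  # training step goes here; the four batches stay aligned per index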
Adding on to @Aldream's solution: for the case where the datasets have different lengths, if we want to pass through them all within the same epoch, we can use cycle() from itertools, a Python standard library module. Using @Aldream's code snippet, the updated code will look like:
from torch.utils.data import DataLoader, Dataset
from itertools import cycle

class DummyDataset(Dataset):
    """
    Dataset of numbers in [a, b] inclusive
    """

    def __init__(self, a=0, b=100):
        super(DummyDataset, self).__init__()
        self.a = a
        self.b = b

    def __len__(self):
        return self.b - self.a + 1

    def __getitem__(self, index):
        return index
dataloaders1 = DataLoader(DummyDataset(0, 100), batch_size=10, shuffle=True)
dataloaders2 = DataLoader(DummyDataset(0, 200), batch_size=10, shuffle=True)
num_epochs = 10
for epoch in range(num_epochs):
    for i, data in enumerate(zip(cycle(dataloaders1), dataloaders2)):
        print(data)
With only zip(), the iterator will be exhausted when its length equals that of the smallest dataset (here 100). But with cycle(), the smallest dataset is repeated until the iterator has seen all the samples from the largest dataset (here 200).
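As a quick sanity check with the loaders defined above (a sketch; the batch counts follow from 101 and 201 items with batch_size=10):

n_plain = len(list(zip(dataloaders1, dataloaders2)))          # stops with the shorter loader: 11 batches
n_cycled = len(list(zip(cycle(dataloaders1), dataloaders2)))  # runs through the longer loader: 21 batches
print(n_plain, n_cycled)  # 11 21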
P.S. One can always argue that this approach may not be required to achieve convergence as long as one samples randomly, but with this approach the evaluation might be easier.
Further to what is already mentioned, cycle() and zip() might create a memory leak, especially when using image datasets! To solve that, instead of iterating like this:
dataloaders1 = DataLoader(DummyDataset(0, 100), batch_size=10, shuffle=True)
dataloaders2 = DataLoader(DummyDataset(0, 200), batch_size=10, shuffle=True)
num_epochs = 10
for epoch in range(num_epochs):
    for i, (data1, data2) in enumerate(zip(cycle(dataloaders1), dataloaders2)):
        do_cool_things()
you could use:
dataloaders1 = DataLoader(DummyDataset(0, 100), batch_size=10, shuffle=True)
dataloaders2 = DataLoader(DummyDataset(0, 200), batch_size=10, shuffle=True)
num_epochs = 10
for epoch in range(num_epochs):
    dataloader_iterator = iter(dataloaders1)
    for i, data1 in enumerate(dataloaders2):
        try:
            data2 = next(dataloader_iterator)
        except StopIteration:
            dataloader_iterator = iter(dataloaders1)
            data2 = next(dataloader_iterator)
        do_cool_things()
Bear in mind that if you use labels as well, you should replace data1 in this example with (inputs1, targets1) and data2 with (inputs2, targets2), as @Sajad Norouzi said.
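A minimal sketch of that labelled variant, assuming each dataset's __getitem__ returns an (input, target) pair (names are illustrative):

for epoch in range(num_epochs):
    dataloader_iterator = iter(dataloaders1)
    for i, (inputs1, targets1) in enumerate(dataloaders2):
        try:
            inputs2, targets2 = next(dataloader_iterator)
        except StopIteration:
            # dataloaders1 ran out before dataloaders2 finished: restart it
            dataloader_iterator = iter(dataloaders1)
            inputs2, targets2 = next(dataloader_iterator)
        do_cool_things()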
KUDOS to this one: https://github.com/pytorch/pytorch/issues/1917#issuecomment-433698337