Question
I am training an LSTM in order to classify time-series data into 2 classes (0 and 1). I have a huge dataset on the drive, where the class 0 and class 1 data are located in different folders. I am trying to train the LSTM batch-wise by creating a Dataset class and wrapping a DataLoader around it. I have to do pre-processing such as reshaping. Here's my code which does that:
```python
class LoadingDataset(Dataset):
    def __init__(self, data_root1, data_root2, file_name):
        self.data_root1 = data_root1   # Path to the class 1 data
        self.data_root2 = data_root2   # Path to the class 0 data
        self.fileap1 = pd.DataFrame()  # Stores class 1 data
        self.fileap0 = pd.DataFrame()  # Stores class 0 data
        self.file_name = file_name     # List of all the files at data_root1 and data_root2
        self.labs1 = None              # Will store the class 1 labels
        self.labs0 = None              # Will store the class 0 labels

    def __len__(self):
        return len(self.fileap1)

    def __getitem__(self, index):
        self.fileap1 = pd.read_csv(self.data_root1 + self.file_name[index], header=None)  # Read the csv file for class 1
        self.fileap1 = self.fileap1.iloc[1:, 1:].values.reshape(-1, WINDOW + 1, 1)         # Reshape the file for the LSTM
        self.fileap0 = pd.read_csv(self.data_root2 + self.file_name[index], header=None)  # Read the csv file for class 0
        self.fileap0 = self.fileap0.iloc[1:, 1:].values.reshape(-1, WINDOW + 1, 1)         # Reshape the file for the LSTM
        self.labs1 = np.array([1] * len(self.fileap1)).reshape(-1, 1)  # Create the 1 labels for the csv file
        self.labs0 = np.array([0] * len(self.fileap0)).reshape(-1, 1)  # Create the 0 labels for the csv file
        self.fileap1 = np.append(self.fileap1, self.fileap0, axis=0)   # Combine the class 0 and class 1 data
        self.fileap1 = torch.from_numpy(self.fileap1).float()
        self.labs1 = np.append(self.labs1, self.labs0, axis=0)         # Combine the class 0 and class 1 labels
        self.labs1 = torch.from_numpy(self.labs1).int()
        return self.fileap1, self.labs1

data_root1 = '/content/gdrive/My Drive/Data/Processed_Data/Folder1/One_'   # Location of the class 1 data
data_root2 = '/content/gdrive/My Drive/Data/Processed_Data/Folder0/Zero_'  # Location of the class 0 data

training_set = LoadingDataset(data_root1, data_root2, train_ind)  # train_ind is a list of file names to read from data_root1 and data_root2
training_generator = DataLoader(training_set, batch_size=2, num_workers=4)

for epoch in range(num_epochs):
    model.train()  # Set the model back to train mode for the next epoch once evaluation for the previous epoch is finished
    for i, (inputs, targets) in enumerate(training_generator):  # iterate over the DataLoader created above
        ...
```
I get this error when I run this code:
```
RuntimeError: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 68, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 68, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 43, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 96596 and 25060 in dimension 1 at /pytorch/aten/src/TH/generic/THTensor.cpp:711
```
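The mismatch is reproducible in isolation: as the last frames of the traceback show, `default_collate` calls `torch.stack`, which needs all samples to agree in every dimension. A minimal sketch (the two lengths come from the traceback; the window width of 51 is assumed purely for illustration):

```python
import torch

a = torch.zeros(96596, 51, 1)  # windows read from one file
b = torch.zeros(25060, 51, 1)  # windows read from a shorter file
torch.stack([a, b], 0)         # raises: sizes of tensors must match except in dimension 0
```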
My questions are:

1. Have I implemented this correctly? Is this how you pre-process and then train a dataset batch-wise?
2. The batch_size of the DataLoader and the batch_size of the LSTM are different, since the batch_size of the DataLoader refers to the number of files, whereas the batch_size of the LSTM model refers to the number of instances; will I get another error because of this?
3. I have no idea how to scale this dataset, since the MinMaxScaler has to be applied to the dataset in its entirety.

Responses are appreciated. Please let me know if I have to create separate posts for each question.

Thank you.
Answer 1:
Here's a summary of how pytorch does things:

- You have a `dataset`, which is an object with a `__len__` method and a `__getitem__` method (a minimal sketch follows this list).
- You create a `dataloader` from that `dataset` and a `collate_fn`.
- You iterate through the `dataloader` and pass a batch of data to your model.
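As a minimal sketch of those two pieces, assuming one CSV file per index with a single class label per file (the names `PerFileDataset`, `paths`, and `labels` are hypothetical, and `WINDOW` is borrowed from the question):

```python
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

class PerFileDataset(Dataset):
    """Hypothetical dataset: one CSV file per index, one class label per file."""
    def __init__(self, paths, labels, window):
        self.paths = paths    # list of csv file paths
        self.labels = labels  # one label (0 or 1) per file
        self.window = window

    def __len__(self):
        return len(self.paths)  # number of files = number of samples

    def __getitem__(self, index):
        frame = pd.read_csv(self.paths[index], header=None)
        # Same reshape as in the question: rows of sliding windows
        windows = frame.iloc[1:, 1:].values.reshape(-1, self.window + 1, 1)
        x = torch.from_numpy(windows).float()
        y = torch.full((len(x), 1), self.labels[index], dtype=torch.int64)
        return x, y

# Placeholder file list and labels, just so the sketch is self-contained
paths, labels = ['One_0.csv', 'Zero_0.csv'], [1, 0]
# Files differ in length, so default collation cannot stack them:
# keep batch_size=1 or pass a custom collate_fn (shown further below).
loader = DataLoader(PerFileDataset(paths, labels, window=50), batch_size=1)
```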
So basically your training loop will look like:

```python
for x, y in dataloader:
    output = model(x)
    ...
```
or
```python
for x, y in dataloader:
    output = model(*x)
    ...
```

if your model's `forward` method takes multiple arguments.
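For example (a hypothetical model, shown only to make the `*x` unpacking concrete):

```python
import torch
import torch.nn as nn

class SeqClassifier(nn.Module):
    """Hypothetical model whose forward takes two arguments."""
    def __init__(self, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 2)

    def forward(self, seq, lengths):
        out, _ = self.lstm(seq)  # (batch, time, hidden_size)
        # Pick the last valid timestep of each (possibly padded) sequence
        last = out[torch.arange(len(out)), lengths - 1]
        return self.head(last)

# If a batch x is the tuple (seq, lengths), model(*x) expands to model(seq, lengths).
```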
So how does this work? Basically you have a generator of batch indices, the `batch_sampler`, and here's what the loop inside your dataloader acts like:
```python
for indices in batch_sampler:
    yield collate_fn([dataset[i] for i in indices])
```
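That last line is where the question's code fails: the per-file tensors differ in length along dimension 0, so the default `collate_fn` cannot stack them. One way out (a sketch with a hypothetical `collate_pad`, assuming each sample is a `(sequence, label)` pair as in the question and that padding is acceptable for the model) is to pad inside a custom `collate_fn`:

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def collate_pad(batch):
    # batch is a list of (sequence, label) pairs, where the sequences
    # have shape (n_i, WINDOW + 1, 1) and n_i varies from file to file
    seqs, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in seqs])
    padded = pad_sequence(seqs, batch_first=True)  # (batch, max_n, WINDOW + 1, 1)
    return padded, lengths, torch.cat(labels)      # labels concatenated across files

training_generator = DataLoader(training_set, batch_size=2,
                                num_workers=4, collate_fn=collate_pad)
```

If padding does not suit the model, another option for this particular layout is to `torch.cat` the per-file tensors along dimension 0 instead of stacking them, since every file already produces windows of the same width.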
So if you want everything to work well, you must look at the `forward` method of your model and see how many arguments it takes (in my experience, the forward method of an LSTM can take multiple arguments), and make sure that you use a `collate_fn` that passes those correctly.
Source: https://stackoverflow.com/questions/57644452/loading-a-huge-dataset-batch-wise-to-train-pytorch