Question
I am training an LSTM in order to classify time-series data into 2 classes (0 and 1). I have a huge dataset on the drive, where the class 0 and class 1 data are located in different folders. I am trying to train the LSTM batch-wise by creating a Dataset class and wrapping a DataLoader around it. I have to do pre-processing such as reshaping. Here's my code which does that:
```python
class LoadingDataset(Dataset):
    def __init__(self, data_root1, data_root2, file_name):
        self.data_root1 = data_root1   # Path to the class 1 data
        self.data_root2 = data_root2   # Path to the class 0 data
        self.fileap1 = pd.DataFrame()  # Stores class 1 data
        self.fileap0 = pd.DataFrame()  # Stores class 0 data
        self.file_name = file_name     # List of all the files at data_root1 and data_root2
        self.labs1 = None              # Will store the class 1 labels
        self.labs0 = None              # Will store the class 0 labels

    def __len__(self):
        return len(self.fileap1)

    def __getitem__(self, index):
        self.fileap1 = pd.read_csv(self.data_root1 + self.file_name[index], header=None)  # Read the csv file for class 1
        self.fileap1 = self.fileap1.iloc[1:, 1:].values.reshape(-1, WINDOW + 1, 1)         # Reshape the file for the LSTM
        self.fileap0 = pd.read_csv(self.data_root2 + self.file_name[index], header=None)  # Read the csv file for class 0
        self.fileap0 = self.fileap0.iloc[1:, 1:].values.reshape(-1, WINDOW + 1, 1)         # Reshape the file for the LSTM
        self.labs1 = np.array([1] * len(self.fileap1)).reshape(-1, 1)  # Create the 1 labels for the csv file
        self.labs0 = np.array([0] * len(self.fileap0)).reshape(-1, 1)  # Create the 0 labels for the csv file
        self.fileap1 = np.append(self.fileap1, self.fileap0, axis=0)   # Combine the class 0 and class 1 data
        self.fileap1 = torch.from_numpy(self.fileap1).float()
        self.labs1 = np.append(self.labs1, self.labs0, axis=0)         # Combine the class 0 and class 1 labels
        self.labs1 = torch.from_numpy(self.labs1).int()
        return self.fileap1, self.labs1

data_root1 = '/content/gdrive/My Drive/Data/Processed_Data/Folder1/One_'   # Location of the class 1 data
data_root2 = '/content/gdrive/My Drive/Data/Processed_Data/Folder0/Zero_'  # Location of the class 0 data

training_set = LoadingDataset(data_root1, data_root2, train_ind)  # train_ind is a list of file names to read from data_root1 and data_root2
training_generator = DataLoader(training_set, batch_size=2, num_workers=4)

for epoch in range(num_epochs):
    model.train()  # Set the model back to train mode for the next epoch once evaluation for the previous epoch is finished
    for i, (inputs, targets) in enumerate(training_generator):  # iterate over the DataLoader created above
        ...
```
I get this error when I run this code:
```
RuntimeError: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 68, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 68, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py", line 43, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 96596 and 25060 in dimension 1 at /pytorch/aten/src/TH/generic/THTensor.cpp:711
```
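The mismatch is reproducible in isolation: as the last frames of the traceback show, `default_collate` calls `torch.stack`, which needs all samples to agree in every dimension. A minimal sketch (the two lengths come from the traceback; the window width of 51 is assumed purely for illustration):

```python
import torch

a = torch.zeros(96596, 51, 1)  # windows read from one file
b = torch.zeros(25060, 51, 1)  # windows read from a shorter file
torch.stack([a, b], 0)         # raises: sizes of tensors must match except in dimension 0
```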
My questions are:

1. Have I implemented this correctly? Is this how you pre-process and then train a dataset batch-wise?
2. The batch_size of the DataLoader and the batch_size of the LSTM are different, since the batch_size of the DataLoader refers to the number of files, whereas the batch_size of the LSTM model refers to the number of instances; will I get another error because of this?
3. I have no idea how to scale this dataset, since the MinMaxScaler has to be applied to the dataset in its entirety.

Responses are appreciated. Please let me know if I have to create separate posts for each question.

Thank you.
Answer 1:
Here's a summary of how pytorch does things:

- You have a `dataset`, which is an object with a `__len__` method and a `__getitem__` method (a minimal sketch follows this list).
- You create a `dataloader` from that `dataset` and a `collate_fn`.
- You iterate through the `dataloader` and pass a batch of data to your model.
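As a minimal sketch of those two pieces, assuming one CSV file per index with a single class label per file (the names `PerFileDataset`, `paths`, and `labels` are hypothetical, and `WINDOW` is borrowed from the question):

```python
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

class PerFileDataset(Dataset):
    """Hypothetical dataset: one CSV file per index, one class label per file."""
    def __init__(self, paths, labels, window):
        self.paths = paths    # list of csv file paths
        self.labels = labels  # one label (0 or 1) per file
        self.window = window

    def __len__(self):
        return len(self.paths)  # number of files = number of samples

    def __getitem__(self, index):
        frame = pd.read_csv(self.paths[index], header=None)
        # Same reshape as in the question: rows of sliding windows
        windows = frame.iloc[1:, 1:].values.reshape(-1, self.window + 1, 1)
        x = torch.from_numpy(windows).float()
        y = torch.full((len(x), 1), self.labels[index], dtype=torch.int64)
        return x, y

# Placeholder file list and labels, just so the sketch is self-contained
paths, labels = ['One_0.csv', 'Zero_0.csv'], [1, 0]
# Files differ in length, so default collation cannot stack them:
# keep batch_size=1 or pass a custom collate_fn (shown further below).
loader = DataLoader(PerFileDataset(paths, labels, window=50), batch_size=1)
```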
So basically your training loop will look like:

```python
for x, y in dataloader:
    output = model(x)
    ...
```
or
```python
for x, y in dataloader:
    output = model(*x)
    ...
```

if your model's `forward` method takes multiple arguments.
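For example (a hypothetical model, shown only to make the `*x` unpacking concrete):

```python
import torch
import torch.nn as nn

class SeqClassifier(nn.Module):
    """Hypothetical model whose forward takes two arguments."""
    def __init__(self, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 2)

    def forward(self, seq, lengths):
        out, _ = self.lstm(seq)  # (batch, time, hidden_size)
        # Pick the last valid timestep of each (possibly padded) sequence
        last = out[torch.arange(len(out)), lengths - 1]
        return self.head(last)

# If a batch x is the tuple (seq, lengths), model(*x) expands to model(seq, lengths).
```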
So how does this work? Basically you have a generator of batch indices, the `batch_sampler`, and here's what the loop inside your dataloader acts like:
```python
for indices in batch_sampler:
    yield collate_fn([dataset[i] for i in indices])
```
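That last line is where the question's code fails: the per-file tensors differ in length along dimension 0, so the default `collate_fn` cannot stack them. One way out (a sketch with a hypothetical `collate_pad`, assuming each sample is a `(sequence, label)` pair as in the question and that padding is acceptable for the model) is to pad inside a custom `collate_fn`:

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def collate_pad(batch):
    # batch is a list of (sequence, label) pairs, where the sequences
    # have shape (n_i, WINDOW + 1, 1) and n_i varies from file to file
    seqs, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in seqs])
    padded = pad_sequence(seqs, batch_first=True)  # (batch, max_n, WINDOW + 1, 1)
    return padded, lengths, torch.cat(labels)      # labels concatenated across files

training_generator = DataLoader(training_set, batch_size=2,
                                num_workers=4, collate_fn=collate_pad)
```

If padding does not suit the model, another option for this particular layout is to `torch.cat` the per-file tensors along dimension 0 instead of stacking them, since every file already produces windows of the same width.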
So if you want everything to work well, you must look at the `forward` method of your model and see how many arguments it takes (in my experience, the forward method of an LSTM can take multiple arguments), and make sure that you use a `collate_fn` that passes those correctly.
Source: https://stackoverflow.com/questions/57644452/loading-a-huge-dataset-batch-wise-to-train-pytorch