How to get mini-batches in PyTorch in a clean and efficient way?

北恋 2021-01-30 08:48

I was trying to do something simple: train a linear model with Stochastic Gradient Descent (SGD) using torch:

import numpy as np

import torch
from torch.         


        
6 Answers
  • 2021-01-30 09:17

    Create a subclass of torch.utils.data.Dataset and pass it to a torch.utils.data.DataLoader. Below is an example from my project.

    import numpy as np
    import torch
    from torch.utils.data import Dataset, DataLoader
    
    class CandidateDataset(Dataset):
        def __init__(self, x, y):
            self.len = x.shape[0]
            # keep the whole dataset on the GPU when one is available
            device = 'cuda' if torch.cuda.is_available() else 'cpu'
            self.x_data = torch.as_tensor(x, device=device, dtype=torch.float)
            self.y_data = torch.as_tensor(y, device=device, dtype=torch.long)
    
        def __getitem__(self, index):
            return self.x_data[index], self.y_data[index]
    
        def __len__(self):
            return self.len
    
    # fit is a method of the training class (not shown), which holds
    # self.model, self.optimizer, self.loss_f, BATCH_SIZE and N_EPOCHS
    def fit(self, candidate_count):
        feature_matrix = np.empty(shape=(candidate_count, 600))
        target_matrix = np.empty(shape=(candidate_count, 1))
        fill_matrices(feature_matrix, target_matrix)
        candidate_ds = CandidateDataset(feature_matrix, target_matrix)
        train_loader = DataLoader(dataset=candidate_ds, batch_size=self.BATCH_SIZE, shuffle=True)
        for epoch in range(self.N_EPOCHS):
            print('starting epoch ' + str(epoch))
            for batch_idx, (inputs, labels) in enumerate(train_loader):
                print('starting batch ' + str(batch_idx) + ' epoch ' + str(epoch))
                self.optimizer.zero_grad()
                inputs = inputs.view(1, inputs.size(0), 600)
                # init hidden with number of rows in input
                y_pred = self.model(inputs, self.model.initHidden(inputs.size(1)))
                # labels should have batch_size entries, each the class index (0 or 1);
                # squeeze only dim 1 so a batch of size 1 keeps its batch dimension
                labels = labels.squeeze(1)
                loss = self.loss_f(y_pred, labels)
                loss.backward()
                self.optimizer.step()
                print('done batch ' + str(batch_idx) + ' epoch ' + str(epoch))
    
  • 2021-01-30 09:23

    I'm not sure what you were trying to do. With respect to batching, you don't have to convert to NumPy. You can just use index_select(), e.g.:

    for epoch in range(500):
        k = 0
        loss = 0
        while k < X_mdl.size(0):
    
            # draw M random row indices (with replacement) out of the N samples
            random_batch = torch.randint(0, N, (M,))
            batch_xs = X_mdl.index_select(0, random_batch)
            batch_ys = y.index_select(0, random_batch)
    
            # Forward pass: compute predicted y
            y_pred = batch_xs.mul(W)
            # etc..
    
            k += M  # advance, otherwise the while loop never terminates
    

    The rest of the code would have to be changed as well though.
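
To make the idea concrete, here is a minimal self-contained sketch of index_select batching. The sizes N and M and the synthetic tensors X_mdl and y are assumptions made up for illustration; they are not taken from the question.

```python
import torch

torch.manual_seed(0)

N, M = 10, 4  # hypothetical dataset size and batch size
# Row i of X_mdl is [2i, 2i + 1]; y[i] is simply i.
X_mdl = torch.arange(N * 2, dtype=torch.float32).reshape(N, 2)
y = torch.arange(N, dtype=torch.float32)

# Draw M row indices uniformly at random (with replacement).
random_batch = torch.randint(0, N, (M,))

# index_select picks whole rows along dimension 0.
batch_xs = X_mdl.index_select(0, random_batch)
batch_ys = y.index_select(0, random_batch)

print(batch_xs.shape, batch_ys.shape)
```

Because the same index tensor is used for both calls, each row of batch_xs stays paired with its label in batch_ys.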


    My guess is that you would like a make_batch function that concatenates your X tensors and y tensors. Something like:

    def make_batch(list_of_tensors):
        X, y = list_of_tensors[0]
        # may need to unsqueeze X and y to get right dimensions
        for i, (sample, label) in enumerate(list_of_tensors[1:]):
            X = torch.cat((X, sample), dim=0)
            y = torch.cat((y, label), dim=0)
        return X, y
    

    Then during training you select max_batch_size examples at a time (e.g. max_batch_size = 32) through slicing.

    max_batch_size = 32
    for epoch in range(n_epochs):  # n_epochs is defined elsewhere
        X, y = make_batch(list_of_tensors)
    
        k = 0
        while k < X.size(0):
            # slice along dim 0 only, so this works for 1-D and 2-D labels
            inputs = X[k:k+max_batch_size]
            labels = y[k:k+max_batch_size]
            # some computation
            k += max_batch_size
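
As a runnable illustration, the sketch below is a variant of make_batch that calls torch.cat once on the collected tensors instead of concatenating inside a loop. The toy list_of_tensors (ten samples with three features each) is an assumption for the example only.

```python
import torch


def make_batch(list_of_tensors):
    # Split the (sample, label) pairs and concatenate each group in one call.
    samples, labels = zip(*list_of_tensors)
    return torch.cat(samples, dim=0), torch.cat(labels, dim=0)


# Hypothetical data: sample i is a 1x3 row filled with i, its label is [[i]].
list_of_tensors = [
    (torch.full((1, 3), float(i)), torch.tensor([[float(i)]]))
    for i in range(10)
]

X, y = make_batch(list_of_tensors)

max_batch_size = 4
sizes = []
k = 0
while k < X.size(0):
    inputs = X[k:k + max_batch_size]
    labels = y[k:k + max_batch_size]
    sizes.append(inputs.size(0))
    k += max_batch_size

print(X.shape, y.shape, sizes)
```

Slicing past the end of a tensor is safe, so the final batch simply comes out smaller when the dataset size is not a multiple of max_batch_size.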
    
  • 2021-01-30 09:25

    If I'm understanding your code correctly, your get_batch2 function appears to be taking random mini-batches from your dataset without tracking which indices you've used already in an epoch. The issue with this implementation is that it likely will not make use of all of your data.
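
The effect is easy to check numerically. The sketch below compares how many distinct samples are touched in one "epoch" when batch indices are drawn independently versus sliced from a single permutation. The dataset size N and batch size M are made-up values.

```python
import torch

torch.manual_seed(0)

N, M = 1000, 50      # hypothetical dataset size and batch size
n_batches = N // M   # number of batches in one "epoch"

# Independent random batches: indices can repeat and some are never drawn.
seen_random = set()
for _ in range(n_batches):
    idx = torch.randint(0, N, (M,))
    seen_random.update(idx.tolist())

# Batches sliced from one permutation: every index appears exactly once.
seen_perm = set()
permutation = torch.randperm(N)
for i in range(0, N, M):
    seen_perm.update(permutation[i:i + M].tolist())

print(f"random sampling covered {len(seen_random)} of {N} samples")
print(f"permutation covered     {len(seen_perm)} of {N} samples")
```

When the number of draws equals N, independent sampling is expected to touch only about 63% (1 - 1/e) of the samples.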

    The way I usually do batching is to create a random permutation of all the sample indices using torch.randperm(N) and loop through it in batches. For example:

    n_epochs = 100 # or whatever
    batch_size = 128 # or whatever
    
    for epoch in range(n_epochs):
    
        # X is a torch tensor
        permutation = torch.randperm(X.size()[0])
    
        for i in range(0,X.size()[0], batch_size):
            optimizer.zero_grad()
    
            indices = permutation[i:i+batch_size]
            batch_x, batch_y = X[indices], Y[indices]
    
            # in case you wanted a semi-full example
            outputs = model(batch_x)
            loss = lossfunction(outputs,batch_y)
    
            loss.backward()
            optimizer.step()
    

    If you like to copy and paste, make sure you define your optimizer, model, and lossfunction somewhere before the start of the epoch loop.
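
For reference, a self-contained version of that loop might look like the sketch below. The synthetic data, the nn.Linear model, the MSE loss, and the learning rate are all assumptions chosen so the example runs on its own.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical synthetic regression data: y = 3x + 1 plus a little noise.
N = 512
X = torch.rand(N, 1)
Y = 3 * X + 1 + 0.01 * torch.randn(N, 1)

model = nn.Linear(1, 1)
lossfunction = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

n_epochs = 200
batch_size = 128

for epoch in range(n_epochs):
    permutation = torch.randperm(N)  # fresh shuffle every epoch
    for i in range(0, N, batch_size):
        indices = permutation[i:i + batch_size]
        batch_x, batch_y = X[indices], Y[indices]

        optimizer.zero_grad()
        loss = lossfunction(model(batch_x), batch_y)
        loss.backward()
        optimizer.step()

# The learned weight and bias should end up close to 3 and 1.
print(model.weight.item(), model.bias.item())
```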

    With regard to your error, try using torch.from_numpy(np.random.randint(0,N,size=M)).long() instead of torch.LongTensor(np.random.randint(0,N,size=M)). I'm not sure if this will solve the error you are getting, but it will solve a future error.
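
A small sketch of that conversion, with N, M, and the tensor X invented for the example:

```python
import numpy as np
import torch

N, M = 100, 8  # hypothetical dataset size and batch size
X = torch.rand(N, 3)

# NumPy's default integer type differs between platforms,
# so .long() is used to guarantee an int64 index tensor.
idx_np = np.random.randint(0, N, size=M)
idx = torch.from_numpy(idx_np).long()

batch = X[idx]
print(idx.dtype, batch.shape)
```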

  • 2021-01-30 09:25

    Use data loaders.

    Data Set

    First you define a dataset. You can use one of the ready-made datasets in torchvision.datasets, or the ImageFolder dataset class, which expects the same directory structure as ImageNet (one sub-folder per class).

    trainset=torchvision.datasets.ImageFolder(root='/path/to/your/data/trn', transform=generic_transform)
    testset=torchvision.datasets.ImageFolder(root='/path/to/your/data/val', transform=generic_transform)
    

    Transforms

    Transforms are very useful for preprocessing loaded data on the fly. If you are using images, you have to use the ToTensor() transform to convert loaded images from PIL to torch.Tensor. Several transforms can be packed into a composite transform as follows.

    generic_transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.ToPILImage(),
        #transforms.CenterCrop(size=128),
        transforms.Lambda(lambda x: myimresize(x, (128, 128))),
        transforms.ToTensor(),
        transforms.Normalize((0., 0., 0.), (6, 6, 6))
    ])
    

    Data Loader

    Then you define a data loader, which prepares the next batch while training. You can set the number of worker processes used for data loading with num_workers.

    trainloader=torch.utils.data.DataLoader(trainset, batch_size=32, shuffle=True, num_workers=8)
    testloader=torch.utils.data.DataLoader(testset, batch_size=32, shuffle=False, num_workers=8)
    

    For training, you just enumerate over the data loader.

    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        # move the batch to the GPU (the Variable wrapper is no longer needed)
        inputs, labels = inputs.cuda(), labels.cuda()
        # continue training...
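
To try the loader without any image files on disk, the sketch below swaps ImageFolder for a TensorDataset filled with random tensors. The shapes, the class count, and num_workers=0 are assumptions picked only to keep the example self-contained.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for an image dataset:
# 100 random 3x8x8 "images" with labels from 10 classes.
images = torch.rand(100, 3, 8, 8)
targets = torch.randint(0, 10, (100,))
trainset = TensorDataset(images, targets)

# num_workers=0 loads batches in the main process;
# a larger value starts that many worker processes.
trainloader = DataLoader(trainset, batch_size=32, shuffle=True, num_workers=0)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

batch_sizes = []
for i, (inputs, labels) in enumerate(trainloader):
    inputs, labels = inputs.to(device), labels.to(device)
    batch_sizes.append(inputs.shape[0])
    # continue training...

print(batch_sizes)
```

With 100 samples and batch_size=32 the last batch holds only 4 samples; passing drop_last=True to DataLoader would skip it.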
    

    NumPy Stuff

    Yes, to work on a tensor with NumPy you have to convert it using the .numpy() method. If you are using CUDA, you have to copy the data from the GPU to the CPU first with .cpu() before calling .numpy(). Personally, coming from a MATLAB background, I prefer to do most of the work with torch tensors and convert to NumPy only for visualisation. Also bear in mind that torch image tensors are stored channel-first (C x H x W), while PIL and most NumPy-based plotting code expect channel-last (H x W x C). This means you need np.rollaxis to move the channel axis to the end. Sample code is below.

    np.rollaxis(make_grid(mynet.ftrextractor(inputs).data, nrow=8, padding=1).cpu().numpy(), 0, 3)
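
The axis move can be checked in isolation. In the sketch below, the random tensor img_chw is a hypothetical stand-in for the make_grid output, and np.moveaxis and Tensor.permute are shown as alternatives that give the same result.

```python
import numpy as np
import torch

# Hypothetical image tensor in channel-first layout: channels x height x width.
img_chw = torch.rand(3, 32, 48)

# Three equivalent ways to obtain height x width x channels for plotting.
hwc_rollaxis = np.rollaxis(img_chw.cpu().numpy(), 0, 3)
hwc_moveaxis = np.moveaxis(img_chw.cpu().numpy(), 0, -1)
hwc_permute = img_chw.permute(1, 2, 0).cpu().numpy()

print(hwc_rollaxis.shape, hwc_moveaxis.shape, hwc_permute.shape)
```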
    

    Logging

    The best method I have found for visualising the feature maps is TensorBoard. Example code is available in the yunjey/pytorch-tutorial repository.

  • 2021-01-30 09:33

    An alternative is to use pd.DataFrame.sample:

    train = pd.read_csv(TrainSetPath)
    test = pd.read_csv(TestSetPath)
    
    # use df.sample() to shuffle the data frame 
    train = train.sample(frac=1)
    test = test.sample(frac=1)
    
    for i in range(epochs):
        for j in range(batch_per_epoch):
            # replace=True draws rows with replacement, so a batch can repeat rows
            train_batch = train.sample(n=BatchSize, axis='index', replace=True)
            y_train = train_batch['Target']
            X_train = train_batch.drop(['Target'], axis=1)
    
            # convert data frames to tensors and send them to GPU (if used);
            # to_numpy() replaces np.mat, which newer NumPy versions no longer provide
            X_train = torch.tensor(X_train.to_numpy()).float().to(device)
            y_train = torch.tensor(y_train.to_numpy()).float().to(device)
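
If every row should be seen exactly once per epoch, one option is to shuffle the frame once and slice it with iloc. This is only a sketch of that variation: the small data frame, its column names f1 and f2, and the batch size are invented for the example.

```python
import numpy as np
import pandas as pd
import torch

# Hypothetical data frame: two feature columns and a 'Target' column.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "f1": rng.normal(size=10),
    "f2": rng.normal(size=10),
    "Target": rng.integers(0, 2, size=10),
})

BatchSize = 4
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Shuffle the rows once, then walk through them in consecutive slices.
shuffled = train.sample(frac=1, random_state=0)
batches = []
for start in range(0, len(shuffled), BatchSize):
    chunk = shuffled.iloc[start:start + BatchSize]
    X_train = torch.tensor(chunk.drop(columns=["Target"]).to_numpy(),
                           dtype=torch.float32, device=device)
    y_train = torch.tensor(chunk["Target"].to_numpy(),
                           dtype=torch.float32, device=device)
    batches.append((X_train, y_train))

print([x.shape[0] for x, _ in batches])
```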
    
  • 2021-01-30 09:39

    You can use torch.utils.data.

    Assuming you have loaded the data from the directory into train and test NumPy arrays, you can inherit from the torch.utils.data.Dataset class to create your dataset object.

    class MyDataset(Dataset):
        def __init__(self, x, y):
            super(MyDataset, self).__init__()
            assert x.shape[0] == y.shape[0] # assuming shape[0] = dataset size
            self.x = x
            self.y = y
    
    
        def __len__(self):
            return self.y.shape[0]
    
        def __getitem__(self, index):
            return self.x[index], self.y[index]
    

    Then, create your dataset object

    traindata = MyDataset(train_x, train_y)
    

    Finally, use DataLoader to create your mini-batches

    trainloader = torch.utils.data.DataLoader(traindata, batch_size=64, shuffle=True)
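
Put together, the pieces can be run as in the sketch below. The random arrays train_x and train_y are hypothetical stand-ins for the loaded training data, and the class is repeated so the block runs on its own.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class MyDataset(Dataset):
    def __init__(self, x, y):
        super().__init__()
        assert x.shape[0] == y.shape[0]  # shape[0] is the dataset size
        self.x = x
        self.y = y

    def __len__(self):
        return self.y.shape[0]

    def __getitem__(self, index):
        return self.x[index], self.y[index]


# Hypothetical arrays standing in for the loaded training data.
train_x = np.random.rand(200, 5).astype(np.float32)
train_y = np.random.randint(0, 2, size=200)

traindata = MyDataset(train_x, train_y)
trainloader = DataLoader(traindata, batch_size=64, shuffle=True)

for batch_x, batch_y in trainloader:
    # The default collate function stacks the NumPy samples into tensors.
    print(type(batch_x).__name__, tuple(batch_x.shape), tuple(batch_y.shape))
```

The dataset returns NumPy values, and the DataLoader's default collate function converts each batch to torch tensors, so no manual conversion is needed inside the training loop.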
    