Most efficient way to use a large data set for PyTorch?


Perhaps this question has been asked before, but I'm having trouble finding relevant info for my situation.

I'm using PyTorch to create a CNN for regression with image data. In short: should I, in some way, put the image data somewhere before I need it, or should the data be loaded directly from the image files? And in either case, what is the best, most efficient way to do that, independent of file structure?

3 Answers
  • 2021-02-09 08:47

    For speed I would advise using HDF5 or LMDB:

    Reasons to use LMDB:

    LMDB uses memory-mapped files, giving much better I/O performance, and it works well with really large datasets. According to the post linked below, HDF5 files are read entirely into memory, so no single HDF5 file can exceed your memory capacity. You can easily split your data into several HDF5 files (just put several paths to h5 files in your text file), but compared to LMDB's page caching the I/O performance won't be nearly as good. (Source: http://deepdish.io/2015/04/28/creating-lmdb-in-python/)

    If you decide to use LMDB:

    ml-pyxis is a tool for creating and reading deep learning datasets using LMDBs.

    It lets you create binary blobs (LMDB) that can be read quite fast. The link above comes with some simple examples of how to create and read the data, including Python generators/iterators.

    This notebook has an example of how to create a dataset and read it in parallel while using PyTorch.
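
    For illustration only, here is a minimal sketch of consuming an LMDB file from a PyTorch Dataset using the plain lmdb package rather than ml-pyxis. It assumes the images were written as encoded JPEG/PNG bytes under zero-padded integer keys, which is a convention chosen here, not something prescribed by either library:

    import io
    import lmdb
    from PIL import Image
    from torch.utils.data import Dataset

    class LMDBImageDataset(Dataset):
        # Assumes each sample was stored as encoded image bytes under a
        # zero-padded key such as b"00000042" -- adapt to your own schema.
        def __init__(self, lmdb_path, transform=None):
            # readonly + lock=False lets several DataLoader workers share the env
            self.env = lmdb.open(lmdb_path, readonly=True, lock=False,
                                 readahead=False, meminit=False)
            with self.env.begin() as txn:
                self.length = txn.stat()["entries"]
            self.transform = transform

        def __len__(self):
            return self.length

        def __getitem__(self, idx):
            with self.env.begin() as txn:
                buf = txn.get(f"{idx:08d}".encode("ascii"))
            img = Image.open(io.BytesIO(buf)).convert("RGB")
            return self.transform(img) if self.transform else img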

    If you decide to use HDF5:

    PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data.

    https://www.pytables.org/
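
    If you go the PyTables route, a rough sketch of writing an extendable on-disk array and reading slices back might look like the following (the file name, array name, image shape and dtype are illustrative assumptions):

    import numpy as np
    import tables

    # Write: create an extendable array on disk and append batches to it
    with tables.open_file("train_images.h5", mode="w") as h5file:
        images = h5file.create_earray(
            h5file.root, "images",
            atom=tables.Float32Atom(),
            shape=(0, 3, 224, 224),      # first dimension (0) is extendable
            expectedrows=100000,
        )
        images.append(np.zeros((64, 3, 224, 224), dtype=np.float32))  # dummy batch

    # Read: slice without loading the whole array into memory
    with tables.open_file("train_images.h5", mode="r") as h5file:
        batch = h5file.root.images[0:64]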

  • 2021-02-09 08:51

    Here is a concrete example to demonstrate what I meant. This assumes that you've already dumped the images into an hdf5 file (train_images.hdf5) using h5py.

    import h5py

    hf = h5py.File('train_images.hdf5', 'r')

    # grab the first dataset/group in the file
    group_key = list(hf.keys())[0]
    ds = hf[group_key]

    # load only one example
    x = ds[0]

    # load a slice of n examples (n must be defined)
    arr = ds[:n]

    # loads the whole dataset into memory -- avoid this for large files
    arr = ds[:]
    

    In simple terms, ds can now be used as an iterator that yields images on the fly (i.e. it doesn't load the whole dataset into memory, only the items you access). This keeps memory usage low and the whole run reasonably fast.

    for idx, img in enumerate(ds):
       # do something with `img`
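
    To plug this into torch.utils.data.DataLoader, one option is a small Dataset wrapper. This is only a sketch: it takes the first key in the file (as above), opens the file lazily so each DataLoader worker gets its own handle, and assumes each element converts cleanly to a tensor:

    import h5py
    import torch
    from torch.utils.data import Dataset, DataLoader

    class HDF5Dataset(Dataset):
        def __init__(self, path):
            self.path = path
            self.file = None
            self.ds = None
            with h5py.File(path, 'r') as hf:
                self.key = list(hf.keys())[0]
                self.length = len(hf[self.key])

        def __len__(self):
            return self.length

        def __getitem__(self, idx):
            # open lazily so each DataLoader worker gets its own file handle
            if self.ds is None:
                self.file = h5py.File(self.path, 'r')
                self.ds = self.file[self.key]
            return torch.from_numpy(self.ds[idx])

    loader = DataLoader(HDF5Dataset('train_images.hdf5'),
                        batch_size=32, num_workers=4)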
    
  • 2021-02-09 08:54

    In addition to the above answers, the following may be useful given some recent (2020) additions to the PyTorch ecosystem.

    Your question: Should I, in some way, put the image data somewhere before I need it, or should the data be loaded directly from the image files? And in either case, what is the best, most efficient way to do that, independent of file structure?

    You can leave the image files in their original format (.jpg, .png, etc.) on your local disk or on cloud storage, but with one added step: archive the directory as a tar file. Please read this for more details:

    Pytorch Blog (Aug 2020): Efficient PyTorch I/O library for Large Datasets, Many Files, Many GPUs (https://pytorch.org/blog/efficient-pytorch-io-library-for-large-datasets-many-files-many-gpus/)
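
    For illustration, here is one way such a tar shard could be built with the standard library, assuming the convention that the files belonging to one sample share a basename (the file names and the .cls label extension are made up for this sketch):

    import os
    import tarfile

    # Assumed layout: images/000000.jpg with a matching images/000000.cls
    # label file per sample; arcname keeps only the basename so files that
    # share a basename are grouped into one sample.
    with tarfile.open("train-000000.tar", "w") as tar:
        for fname in sorted(os.listdir("images")):
            tar.add(os.path.join("images", fname), arcname=fname)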

    The library introduced in that post (WebDataset) is designed for situations where the data files are too large to fit in memory for training. You give it the URL of the dataset location (local, cloud, ...) and it brings in the data in batches and in parallel.

    The only (current) requirement is that the dataset must be in a tar file format.

    The tar file can be on a local disk or in the cloud. With this, you don't have to load the entire dataset into memory every time; you can use torch.utils.data.DataLoader to load it in batches for stochastic gradient descent.
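
    A minimal loading sketch under those assumptions (the shard names, the .jpg/.cls extensions, and the batch size are illustrative, not taken from the blog post):

    import torch
    import webdataset as wds

    # Shards written as train-000000.tar ... train-000009.tar (assumed names)
    urls = "train-{000000..000009}.tar"

    dataset = (
        wds.WebDataset(urls)
        .shuffle(1000)               # shuffle within a buffer of samples
        .decode("torchrgb")          # decode images to CHW float tensors
        .to_tuple("jpg", "cls")      # (image, label) pairs by file extension
    )

    loader = torch.utils.data.DataLoader(dataset, batch_size=32, num_workers=4)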
