Loading large data into TensorFlow 2.0 without loading it on the RAM

Submitted by 匆匆过客 on 2021-02-08 08:30:31

Question


I have processed and saved a large dataset of video and audio files (about 8 to 9 GB of data). The data is saved as 2 numpy arrays, one per modality. The shape of each file is (number_of_examples, maximum_time_length, feature_length).

I want to use this data to train my neural network for a classification task. I am using the TensorFlow 2.0 Beta version and running all the code on Google Colab (after installing tf-2.0 beta). Each time I load the data into tf.data, the entire RAM of the virtual machine is used and the session is forced to restart.

Previous Approaches:

I tried 2 approaches

1) Loading both variables entirely into RAM and converting them to tensors

2) Loading the data as memory-mapped arrays (from disk) and passing them to tf.data

However, both approaches filled the RAM and forced the VM to restart.

Code:

import numpy as np
import pickle as pkl
import tensorflow as tf

# Access the Audio memory from disk without loading
X_audio = np.memmap('gdrive/My Drive/Codes/audio_data.npy', dtype='float32', mode='r').reshape(2198,3860,74)

# Access the Video memory from disk without loading
X_video = np.memmap('gdrive/My Drive/Codes/video_data.npy', dtype='float32', mode='r').reshape(2198,1158,711)

# Load labels
with open('gdrive/My Drive/Codes/label_data_3','rb') as f:
    Y = pkl.load(f)

dataset = tf.data.Dataset.from_tensor_slices((X_audio, X_video, Y)).shuffle(2198).batch(32)

Error: Your session crashed after using all available RAM


Answer 1:


With the TensorFlow 2.x tf.data API you can use tf.data.Dataset.from_generator to create a dataset from a generator function. This generator function does the reading via a numpy memmap.

The code below creates a dummy data file and then reads one example at a time from the file on disk. It can easily be updated to read multiple examples at a time to increase IO throughput (a sketch of that batched variant follows the output below).

# imports
import numpy as np
import pathlib
import tensorflow as tf

# create huge numpy array and save it to disk
file = pathlib.Path("huge_data.npy")
examples = 5000
example_shape = (256, 256)
huge_data_shape = (examples, *example_shape)
huge_data_dtype = np.float64

# create file if does not exist
if not file.is_file():
    print("creating file with random data and saving to disk")
    numpy_data = np.random.rand(*huge_data_shape).astype(huge_data_dtype)
    np.save(file, numpy_data)

# memmap the file
numpy_data_memmap = np.load(file, mmap_mode='r')


# generator function
def data_generator():
    return iter(numpy_data_memmap)


# create tf dataset from generator fn
dataset = tf.data.Dataset.from_generator(
    generator=data_generator,
    output_types=huge_data_dtype,
    output_shapes=example_shape,
)

# consume huge dataset
for i, ex in enumerate(dataset):
    print(i, ex.shape, ex.dtype)

Output:

0 (256, 256) <dtype: 'float64'>
1 (256, 256) <dtype: 'float64'>
2 (256, 256) <dtype: 'float64'>
3 (256, 256) <dtype: 'float64'>
...
4995 (256, 256) <dtype: 'float64'>
4996 (256, 256) <dtype: 'float64'>
4997 (256, 256) <dtype: 'float64'>
4998 (256, 256) <dtype: 'float64'>
4999 (256, 256) <dtype: 'float64'>
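
As noted above, one way to raise IO throughput is to read several contiguous examples per generator step instead of one. The following is only a minimal sketch of that idea, reusing numpy_data_memmap, huge_data_dtype and example_shape from the code above; chunk_size is an arbitrary placeholder value.

# sketch: yield several contiguous examples per step to reduce per-example overhead
def chunked_generator(chunk_size=64):
    for start in range(0, numpy_data_memmap.shape[0], chunk_size):
        # slicing a memmap only reads the requested rows from disk
        yield numpy_data_memmap[start:start + chunk_size]

chunked_dataset = tf.data.Dataset.from_generator(
    generator=chunked_generator,
    output_types=huge_data_dtype,
    output_shapes=(None, *example_shape),  # leading dimension varies per chunk
)

# optionally flatten back to single examples and re-batch to a fixed size
# (Dataset.unbatch() is available in TF 2.x)
chunked_dataset = chunked_dataset.unbatch().batch(32)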



Answer 2:


You should probably use the HDF5 file format, which is a good way to store multidimensional arrays on your hard drive. Specifically, I recommend the h5py package, which provides a seamless interface for working with HDF5 files in Python.

Now, I haven't used TensorFlow 2, but in TF1 we could create TensorFlow dataset objects from a Python generator. Below is a generator that opens an HDF5 file and yields its elements (along the first axis) in random order; a sketch of wiring it into tf.data follows the generator code.

import h5py
import random

def iterate_dataset(dataset_file, dataset_name):
    h5 = h5py.File(dataset_file, 'r')
    idxs = list(range(len(h5[dataset_name])))
    random.shuffle(idxs)

    for i in idxs:
        yield h5[dataset_name][i]
    h5.close()
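
To actually feed this into TensorFlow, the generator can be wrapped with tf.data.Dataset.from_generator, much like in the first answer. This is only a sketch: 'filename.hdf5' and 'data1' are the placeholder names used further below, and the per-example shape (3860, 74) is taken from the audio array in the question; adjust both to your data.

import tensorflow as tf

# sketch: wrap the HDF5 generator in a tf.data pipeline (TF 2.x)
audio_dataset = tf.data.Dataset.from_generator(
    lambda: iterate_dataset('filename.hdf5', 'data1'),  # placeholder file/dataset names
    output_types=tf.float32,
    output_shapes=(3860, 74),  # per-example shape of the audio array in the question
)
audio_dataset = audio_dataset.batch(32).prefetch(tf.data.experimental.AUTOTUNE)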

Here is also code to save your arrays as an HDF5 file:

import h5py

def save_array(arr, dataset_file, dataset_name, compress=True):
    with h5py.File(dataset_file, 'a') as h5:
        if compress:
            h5.create_dataset(
                dataset_name,
                data=arr,
                chunks=(1, *arr.shape[1:]),
                compression='lzf'
            )
            return
        h5[dataset_name] = arr

save_array(data1, 'filename.hdf5', 'data1')
save_array(data2, 'filename.hdf5', 'data2')

Finally, there might be some code errors, so I'll read it over once I'm on my computer.



Source: https://stackoverflow.com/questions/56573622/loading-large-data-into-tensorflow-2-0-without-loading-it-on-the-ram
