Question
I have processed and saved a large dataset of video and audio files (about 8 to 9 GB of data). The data is saved as two numpy arrays, one per modality, each with shape (number_of_examples, maximum_time_length, feature_length).
I want to use this data to train my neural network for a classification task. I am using the TensorFlow 2.0 Beta version and running all the code on Google Colab (after installing tf-2.0 beta). Each time I load the data into tf.data, the entire RAM of the virtual machine is used and the session is forced to restart.
Previous approaches:
I tried two approaches:
1) Loading both variables entirely into RAM and converting them to tensors
2) Memory-mapping the data from disk and feeding the memmapped arrays to tf.data
Both approaches filled up the RAM and forced the VM to restart.
Code:
import pickle as pkl
import numpy as np
import tensorflow as tf

# Access the audio data on disk without loading it into RAM
X_audio = np.memmap('gdrive/My Drive/Codes/audio_data.npy', dtype='float32', mode='r').reshape(2198, 3860, 74)

# Access the video data on disk without loading it into RAM
X_video = np.memmap('gdrive/My Drive/Codes/video_data.npy', dtype='float32', mode='r').reshape(2198, 1158, 711)

# Load labels
with open('gdrive/My Drive/Codes/label_data_3', 'rb') as f:
    Y = pkl.load(f)

dataset = tf.data.Dataset.from_tensor_slices((X_audio, X_video, Y)).shuffle(2198).batch(32)
Error: Your session crashed after using all available RAM
Answer 1:
With the TensorFlow 2.x dataset API you can use tf.data.Dataset.from_generator to create a dataset from a generator function. The generator does the reading through the numpy memmap, so nothing is loaded into RAM up front.
The code below creates a dummy data file and then reads one example at a time from the file on disk. It can easily be updated to read multiple examples per call to increase IO throughput (see the sketch after the output below).
# imports
import numpy as np
import pathlib
import tensorflow as tf

# create a huge numpy array and save it to disk
file = pathlib.Path("huge_data.npy")
examples = 5000
example_shape = (256, 256)
huge_data_shape = (examples, *example_shape)
huge_data_dtype = np.float64

# create the file if it does not exist
if not file.is_file():
    print("creating file with random data and saving to disk")
    numpy_data = np.random.rand(*huge_data_shape).astype(huge_data_dtype)
    np.save(file, numpy_data)

# memmap the file
numpy_data_memmap = np.load(file, mmap_mode='r')

# generator function
def data_generator():
    return iter(numpy_data_memmap)

# create a tf dataset from the generator fn
dataset = tf.data.Dataset.from_generator(
    generator=data_generator,
    output_types=huge_data_dtype,
    output_shapes=example_shape,
)

# consume the huge dataset
for i, ex in enumerate(dataset):
    print(i, ex.shape, ex.dtype)
Output:
0 (256, 256) <dtype: 'float64'>
1 (256, 256) <dtype: 'float64'>
2 (256, 256) <dtype: 'float64'>
3 (256, 256) <dtype: 'float64'>
...
4995 (256, 256) <dtype: 'float64'>
4996 (256, 256) <dtype: 'float64'>
4997 (256, 256) <dtype: 'float64'>
4998 (256, 256) <dtype: 'float64'>
4999 (256, 256) <dtype: 'float64'>
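As a minimal sketch of the IO-throughput idea mentioned above (not part of the original answer), the generator can yield a chunk of examples per call instead of one at a time. It reuses numpy_data_memmap, huge_data_dtype and example_shape from the example above; the chunk size of 64 is an arbitrary assumption.
# Sketch: yield a chunk of examples per generator call to reduce per-element overhead.
# Reuses numpy_data_memmap, huge_data_dtype and example_shape defined above;
# chunk_size=64 is an arbitrary choice.
def chunked_generator(chunk_size=64):
    for start in range(0, numpy_data_memmap.shape[0], chunk_size):
        yield numpy_data_memmap[start:start + chunk_size]

chunked_dataset = tf.data.Dataset.from_generator(
    generator=chunked_generator,
    output_types=huge_data_dtype,
    output_shapes=(None, *example_shape),  # leading dim varies for the last chunk
)
Each element of chunked_dataset is then a block of up to chunk_size examples rather than a single example.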
Answer 2:
You should probably use the HDF5 file format, which is a good way to store multidimensional arrays on your hard drive. Specifically, I recommend the h5py
package, which provides a seamless interface for working with HDF5 files in Python.
Now, I haven't used TensorFlow 2, but in TF1 we could create TensorFlow dataset objects from a Python generator. Below is a generator that opens an HDF5 file and yields its elements (along the first axis) in random order.
import h5py
import random

def iterate_dataset(dataset_file, dataset_name):
    h5 = h5py.File(dataset_file, 'r')
    idxs = list(range(len(h5[dataset_name])))
    random.shuffle(idxs)

    for i in idxs:
        yield h5[dataset_name][i]
    h5.close()
Here's also code to save your arrays as an HDF5 file:
import h5py

def save_array(arr, dataset_file, dataset_name, compress=True):
    with h5py.File(dataset_file, 'a') as h5:
        if compress:
            h5.create_dataset(
                dataset_name,
                data=arr,
                chunks=(1, *arr.shape[1:]),
                compression='lzf'
            )
            return
        h5[dataset_name] = arr

save_array(data1, 'filename.hdf5', 'data1')
save_array(data2, 'filename.hdf5', 'data2')
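Putting the pieces together, here is a hedged sketch (not from the original answer) of feeding the generator above into tf.data.Dataset.from_generator in TF 2.x. The file and dataset names come from the save_array calls above, and the per-example shape (3860, 74) is an assumption based on the question's audio array.
# Sketch: wrap the HDF5 generator in a tf.data pipeline (TF 2.x).
# 'filename.hdf5' / 'data1' match the save_array calls above; the
# per-example shape (3860, 74) is assumed from the question's audio data.
import tensorflow as tf

dataset = tf.data.Dataset.from_generator(
    lambda: iterate_dataset('filename.hdf5', 'data1'),
    output_types=tf.float32,
    output_shapes=(3860, 74),
).batch(32).prefetch(tf.data.experimental.AUTOTUNE)
Because the generator reads one example at a time from disk, only roughly a batch's worth of data needs to sit in memory at any moment.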
Finally, there might be some code errors, so I'll read it over once I'm on my computer.
Source: https://stackoverflow.com/questions/56573622/loading-large-data-into-tensorflow-2-0-without-loading-it-on-the-ram