Question
The inputs and labels of my dataset are each stored in 10000 .npy files, for example inputs/0000.npy,...inputs/9999.npy and labels/0000.npy,...labels/9999.npy. While each file can be held in memory on its own, the whole dataset of 20k arrays cannot. I would like to implement a multi-threaded CPU pipeline that imports the dataset in batches of, say, batch_size=8.
I have tried the functions mentioned in the new TensorFlow data API but haven't found any example matching my requirements; all the examples seem to assume the whole dataset fits in RAM. Any idea how to approach this?
Answer 1:
I would use tf.data.Dataset.from_generator(), which lets you drive the TensorFlow data API from a custom Python generator function. This way you can load the .npy files one at a time, keeping only a single numpy.ndarray in memory at once. Assuming that each loaded numpy.ndarray is a single instance, example code for your case might look like the following:
import tensorflow as tf
import numpy as np
import os

def gen():
    inputs_path = ""   # directory containing the input .npy files
    labels_path = ""   # directory containing the label .npy files
    # Sort both listings so each input is paired with its matching label;
    # os.listdir() returns files in arbitrary order.
    for input_file, label_file in zip(sorted(os.listdir(inputs_path)),
                                      sorted(os.listdir(labels_path))):
        x = np.load(os.path.join(inputs_path, input_file))
        y = np.load(os.path.join(labels_path, label_file))
        yield x, y

INPUT_SHAPE = []   # shape of a single input array
LABEL_SHAPE = []   # shape of a single label array

# Input pipeline
ds = tf.data.Dataset.from_generator(
    gen,
    (tf.float32, tf.int64),
    (tf.TensorShape(INPUT_SHAPE), tf.TensorShape(LABEL_SHAPE)))
ds = ds.batch(8)
ds_iter = ds.make_initializable_iterator()
inputs_batch, labels_batch = ds_iter.get_next()
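Since this is graph-mode (TF 1.x) code, the batch tensors are consumed inside a session. A minimal sketch of that loop, assuming the graph built above and the standard initializable-iterator pattern:

with tf.Session() as sess:
    sess.run(ds_iter.initializer)
    while True:
        try:
            x_batch, y_batch = sess.run([inputs_batch, labels_batch])
            # ... run your training step on x_batch / y_batch here ...
        except tf.errors.OutOfRangeError:
            break   # one pass over all files completed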
I have not tested the code. Hope it helps!
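If you also want the loading itself to run on multiple threads (from_generator pulls from a single Python generator), one possible variant is to build a dataset of filename pairs and wrap np.load in tf.py_func with num_parallel_calls and prefetch. The directory paths and thread count below are placeholders, and the array shapes are unknown to TensorFlow after py_func, so treat this only as a sketch:

import tensorflow as tf
import numpy as np
import os

inputs_path = ""   # same input directory as above
labels_path = ""   # same label directory as above
input_files = sorted(os.path.join(inputs_path, f) for f in os.listdir(inputs_path))
label_files = sorted(os.path.join(labels_path, f) for f in os.listdir(labels_path))

def load_pair(input_file, label_file):
    # Filenames arrive as byte strings from TensorFlow.
    x = np.load(input_file.decode()).astype(np.float32)
    y = np.load(label_file.decode()).astype(np.int64)
    return x, y

ds = tf.data.Dataset.from_tensor_slices((input_files, label_files))
ds = ds.map(
    lambda i, l: tf.py_func(load_pair, [i, l], [tf.float32, tf.int64]),
    num_parallel_calls=4)        # number of loader threads; tune for your machine
ds = ds.batch(8).prefetch(1)     # overlap loading with training
ds_iter = ds.make_initializable_iterator()
inputs_batch, labels_batch = ds_iter.get_next()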
Source: https://stackoverflow.com/questions/50216747/how-to-implement-multi-threaded-import-of-numpy-arrays-stored-on-disk-as-dataset