How to implement multi-threaded import of numpy arrays stored on disk as dataset in Tensorflow

Submitted by 不问归期 on 2020-06-17 03:43:23

Question


The inputs and labels of my dataset are stored in 10000 .npy files each. For example inputs/0000.npy,...inputs/9999.npy and labels/0000.npy,...labels/9999.npy. While each file can be held in memory on its own, the whole dataset of 20k arrays cannot. I would like to implement a multi-threaded CPU pipeline that imports the dataset in batches of, say, batch_size=8.

I have tried the functions in the new TensorFlow Data API but haven't found any example that matches my requirements. All examples seem to cover cases where the whole dataset can be loaded into RAM. Any idea how to approach this?


Answer 1:


I would use tf.data.Dataset.from_generator(), which lets you use the TensorFlow Data API through a custom Python generator function. This way, you can load each .npy file iteratively, having only one numpy.ndarray loaded in memory at a time. Assuming that each loaded numpy.ndarray is a single instance, example code for your case might look something like the following:

import tensorflow as tf
import numpy as np
import os


def gen():
    inputs_path = ""   # e.g. the "inputs" directory
    labels_path = ""   # e.g. the "labels" directory
    # Sort both listings so inputs/0000.npy is paired with labels/0000.npy;
    # os.listdir() returns files in arbitrary order.
    for input_file, label_file in zip(sorted(os.listdir(inputs_path)),
                                      sorted(os.listdir(labels_path))):
        # Only one pair of arrays is held in memory at a time.
        x = np.load(os.path.join(inputs_path, input_file))
        y = np.load(os.path.join(labels_path, label_file))
        yield x, y


INPUT_SHAPE = []   # shape of a single input array
LABEL_SHAPE = []   # shape of a single label array

# Input pipeline
ds = tf.data.Dataset.from_generator(
    gen, (tf.float32, tf.int64), (tf.TensorShape(INPUT_SHAPE), tf.TensorShape(LABEL_SHAPE)))
ds = ds.batch(8)
ds_iter = ds.make_initializable_iterator()
inputs_batch, labels_batch = ds_iter.get_next()

I have not tested the code. Hope it helps!
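Note that the generator above runs in a single Python thread, so by itself it will not give you multi-threaded loading. If you want the files read in parallel, one option is to build the dataset from the file paths and do the np.load inside a tf.py_func that is mapped with num_parallel_calls. The following is a minimal, untested sketch, assuming TF 1.x (the same API style as the answer's code) and the inputs/ and labels/ directory layout from the question:

import os

import numpy as np
import tensorflow as tf

# Directory names taken from the question; adjust to your actual paths.
inputs_path = "inputs"
labels_path = "labels"

input_paths = [os.path.join(inputs_path, f) for f in sorted(os.listdir(inputs_path))]
label_paths = [os.path.join(labels_path, f) for f in sorted(os.listdir(labels_path))]


def load_pair(input_path, label_path):
    # Plain Python/NumPy loading; several copies of this function run in
    # parallel threads because of num_parallel_calls in the map() below.
    x = np.load(input_path.decode())
    y = np.load(label_path.decode())
    return x.astype(np.float32), y.astype(np.int64)


ds = tf.data.Dataset.from_tensor_slices((input_paths, label_paths))
ds = ds.map(
    lambda xp, yp: tf.py_func(load_pair, [xp, yp], [tf.float32, tf.int64]),
    num_parallel_calls=4)   # number of loader threads; tune for your machine
ds = ds.batch(8)
ds = ds.prefetch(1)         # prepare the next batch while the current one is used

ds_iter = ds.make_one_shot_iterator()
inputs_batch, labels_batch = ds_iter.get_next()

with tf.Session() as sess:
    x_batch, y_batch = sess.run([inputs_batch, labels_batch])

Two caveats: tensors coming out of tf.py_func have unknown static shapes, so you may need to restore them (e.g. with set_shape using the same INPUT_SHAPE/LABEL_SHAPE placeholders as above) if downstream ops require them; and with the initializable iterator in the answer's code, remember to run sess.run(ds_iter.initializer) once before the first get_next().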



Source: https://stackoverflow.com/questions/50216747/how-to-implement-multi-threaded-import-of-numpy-arrays-stored-on-disk-as-dataset
