Question
The inputs and labels of my dataset are each stored in 10000 .npy files, for example inputs/0000.npy,...inputs/9999.npy and labels/0000.npy,...labels/9999.npy. While each file can be held in memory on its own, the whole dataset of 20k arrays cannot. I would like to implement a multi-threaded CPU pipeline that imports the dataset in batches of, say, batch_size=8.
I have tried the functions mentioned in the new TensorFlow data API but haven't found any example matching my requirements; all the examples seem to assume the whole dataset fits in RAM. Any idea how to approach this?
Answer 1:
I would use tf.data.Dataset.from_generator(), which lets you drive the TensorFlow data API from a custom Python generator function. This way you can load the .npy files one at a time, keeping only a single numpy.ndarray in memory at once. Assuming that each loaded numpy.ndarray is a single instance, example code for your case might look like the following:
import tensorflow as tf
import numpy as np
import os

def gen():
    inputs_path = ""   # directory containing the input .npy files
    labels_path = ""   # directory containing the label .npy files
    # Sort both listings so each input is paired with its matching label;
    # os.listdir() returns files in arbitrary order.
    for input_file, label_file in zip(sorted(os.listdir(inputs_path)),
                                      sorted(os.listdir(labels_path))):
        x = np.load(os.path.join(inputs_path, input_file))
        y = np.load(os.path.join(labels_path, label_file))
        yield x, y

INPUT_SHAPE = []   # shape of a single input array
LABEL_SHAPE = []   # shape of a single label array

# Input pipeline
ds = tf.data.Dataset.from_generator(
    gen,
    (tf.float32, tf.int64),
    (tf.TensorShape(INPUT_SHAPE), tf.TensorShape(LABEL_SHAPE)))
ds = ds.batch(8)
ds_iter = ds.make_initializable_iterator()
inputs_batch, labels_batch = ds_iter.get_next()
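Since this is graph-mode (TF 1.x) code, the batch tensors are consumed inside a session. A minimal sketch of that loop, assuming the graph built above and the standard initializable-iterator pattern:

with tf.Session() as sess:
    sess.run(ds_iter.initializer)
    while True:
        try:
            x_batch, y_batch = sess.run([inputs_batch, labels_batch])
            # ... run your training step on x_batch / y_batch here ...
        except tf.errors.OutOfRangeError:
            break   # one pass over all files completed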
I have not tested the code. Hope it helps!
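If you also want the loading itself to run on multiple threads (from_generator pulls from a single Python generator), one possible variant is to build a dataset of filename pairs and wrap np.load in tf.py_func with num_parallel_calls and prefetch. The directory paths and thread count below are placeholders, and the array shapes are unknown to TensorFlow after py_func, so treat this only as a sketch:

import tensorflow as tf
import numpy as np
import os

inputs_path = ""   # same input directory as above
labels_path = ""   # same label directory as above
input_files = sorted(os.path.join(inputs_path, f) for f in os.listdir(inputs_path))
label_files = sorted(os.path.join(labels_path, f) for f in os.listdir(labels_path))

def load_pair(input_file, label_file):
    # Filenames arrive as byte strings from TensorFlow.
    x = np.load(input_file.decode()).astype(np.float32)
    y = np.load(label_file.decode()).astype(np.int64)
    return x, y

ds = tf.data.Dataset.from_tensor_slices((input_files, label_files))
ds = ds.map(
    lambda i, l: tf.py_func(load_pair, [i, l], [tf.float32, tf.int64]),
    num_parallel_calls=4)        # number of loader threads; tune for your machine
ds = ds.batch(8).prefetch(1)     # overlap loading with training
ds_iter = ds.make_initializable_iterator()
inputs_batch, labels_batch = ds_iter.get_next()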
Source: https://stackoverflow.com/questions/50216747/how-to-implement-multi-threaded-import-of-numpy-arrays-stored-on-disk-as-dataset