TensorFlow Dataset using many compressed numpy files

Submitted by 岁酱吖の on 2020-06-11 09:48:29

Question


I have a large dataset that I would like to use for training in Tensorflow.

The data is stored in compressed numpy format (using numpy.savez_compressed). There are variable numbers of images per file due to the way they are produced.
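
For context, here is a minimal sketch of how such files might be written; the shapes, filenames, and per-file image counts below are illustrative assumptions, not taken from the question:

import numpy as np

# Each archive stores a different number of images plus matching labels.
for i, n in enumerate([10, 7, 23]):  # hypothetical per-file image counts
    features = np.random.rand(n, 64, 64, 3).astype(np.float32)
    labels = np.random.rand(n).astype(np.float32)
    np.savez_compressed('batch_%d.npz' % i, features=features, labels=labels)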

Currently I use a Keras Sequence-based generator object for training, but I'd like to move entirely to TensorFlow without Keras.

I'm looking at the Dataset API on the TF website, but it is not obvious how I might use this to read numpy data.

My first idea was this:

import glob
import tensorflow as tf
import numpy as np

def get_data_from_filename(filename):
    npdata = np.load(open(filename))
    return npdata['features'], npdata['labels']

# get files
filelist = glob.glob('*.npz')

# create dataset of filenames
ds = tf.data.Dataset.from_tensor_slices(filelist)
ds = ds.flat_map(get_data_from_filename)

However, this passes a symbolic TF Tensor to a real numpy function, which expects a standard string. This results in the error:

File "test.py", line 6, in get_data_from_filename
   npdata = np.load(open(filename))
TypeError: coercing to Unicode: need string or buffer, Tensor found

The other option I'm considering (though it seems messy) is to create a Dataset object built on TF placeholders, which I would then fill from my numpy files during my epoch/batch loop.
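
For reference, a rough sketch of that placeholder-based approach in TF 1.x might look like the following; the image shape and batch size are assumptions:

import numpy as np
import tensorflow as tf

# Placeholders to be fed from each numpy file's arrays.
features_ph = tf.placeholder(tf.float32, shape=[None, 64, 64, 3])
labels_ph = tf.placeholder(tf.float32, shape=[None])

ds = tf.data.Dataset.from_tensor_slices((features_ph, labels_ph)).batch(32)
iterator = ds.make_initializable_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
    for filename in filelist:
        npdata = np.load(filename)
        # Re-initialize the iterator with this file's arrays.
        sess.run(iterator.initializer,
                 feed_dict={features_ph: npdata['features'],
                            labels_ph: npdata['labels']})
        while True:
            try:
                batch = sess.run(next_batch)  # run a training step on batch here
            except tf.errors.OutOfRangeError:
                break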

Any suggestions?


Answer 1:


You can define a wrapper and use tf.py_func like this:

def get_data_from_filename(filename):
    # np.load handles the .npz archive; keys match those used at save time.
    npdata = np.load(filename)
    return npdata['features'], npdata['labels']

def get_data_wrapper(filename):
    # Assuming here that both your data and labels are float type.
    features, labels = tf.py_func(
        get_data_from_filename, [filename], (tf.float32, tf.float32))
    return tf.data.Dataset.from_tensor_slices((features, labels))

# Create dataset of filenames.
ds = tf.data.Dataset.from_tensor_slices(filelist)
ds = ds.flat_map(get_data_wrapper)

If your dataset is very large and you run into memory issues, you can consider using a combination of interleave or parallel_interleave and from_generator instead. The from_generator method uses py_func internally, so you can read your npz file directly and then define your generator in Python.
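
A minimal sketch of that combination, assuming float features and labels as above; the generator name and cycle_length are illustrative, and for genuinely parallel reads you could swap in parallel_interleave (under tf.data.experimental or tf.contrib.data, depending on the TF 1.x version):

def npz_generator(filename):
    # Filenames arrive as bytes when passed through tf.data args.
    npdata = np.load(filename.decode() if isinstance(filename, bytes) else filename)
    for feature, label in zip(npdata['features'], npdata['labels']):
        yield feature, label

ds = tf.data.Dataset.from_tensor_slices(filelist)
ds = ds.interleave(
    lambda filename: tf.data.Dataset.from_generator(
        npz_generator,
        output_types=(tf.float32, tf.float32),
        args=(filename,)),
    cycle_length=4)  # interleave elements from up to 4 files at a time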



Source: https://stackoverflow.com/questions/53544809/tensorflow-dataset-using-many-compressed-numpy-files
