TensorFlow Dataset using many compressed numpy files

Submitted by 岁酱吖の on 2020-06-11 09:48:29

Question


I have a large dataset that I would like to use for training in Tensorflow.

The data is stored in compressed numpy format (using numpy.savez_compressed). There are variable numbers of images per file due to the way they are produced.
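
For context, here is a minimal sketch of how such files might be written; the shapes, filenames, and per-file image counts below are illustrative assumptions, not taken from the question:

import numpy as np

# Each archive stores a different number of images plus matching labels.
for i, n in enumerate([10, 7, 23]):  # hypothetical per-file image counts
    features = np.random.rand(n, 64, 64, 3).astype(np.float32)
    labels = np.random.rand(n).astype(np.float32)
    np.savez_compressed('batch_%d.npz' % i, features=features, labels=labels)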

Currently I use a Keras Sequence-based generator object for training, but I'd like to move entirely to TensorFlow without Keras.

I'm looking at the Dataset API on the TF website, but it is not obvious how I might use this to read numpy data.

My first idea was this:

import glob
import tensorflow as tf
import numpy as np

def get_data_from_filename(filename):
    npdata = np.load(open(filename))
    return npdata['features'], npdata['labels']

# get files
filelist = glob.glob('*.npz')

# create dataset of filenames
ds = tf.data.Dataset.from_tensor_slices(filelist)
ds = ds.flat_map(get_data_from_filename)

However, this passes a symbolic TF Tensor to a real numpy function, which expects a standard string. This results in the error:

File "test.py", line 6, in get_data_from_filename
   npdata = np.load(open(filename))
TypeError: coercing to Unicode: need string or buffer, Tensor found

The other option I'm considering (though it seems messy) is to create a Dataset object built on TF placeholders, which I would then fill from my numpy files during my epoch/batch loop.
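
For reference, a rough sketch of that placeholder-based approach in TF 1.x might look like the following; the image shape and batch size are assumptions:

import numpy as np
import tensorflow as tf

# Placeholders to be fed from each numpy file's arrays.
features_ph = tf.placeholder(tf.float32, shape=[None, 64, 64, 3])
labels_ph = tf.placeholder(tf.float32, shape=[None])

ds = tf.data.Dataset.from_tensor_slices((features_ph, labels_ph)).batch(32)
iterator = ds.make_initializable_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
    for filename in filelist:
        npdata = np.load(filename)
        # Re-initialize the iterator with this file's arrays.
        sess.run(iterator.initializer,
                 feed_dict={features_ph: npdata['features'],
                            labels_ph: npdata['labels']})
        while True:
            try:
                batch = sess.run(next_batch)  # run a training step on batch here
            except tf.errors.OutOfRangeError:
                break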

Any suggestions?


Answer 1:


You can define a wrapper and use tf.py_func like this:

def get_data_from_filename(filename):
    # np.load handles the .npz archive; keys match those used at save time.
    npdata = np.load(filename)
    return npdata['features'], npdata['labels']

def get_data_wrapper(filename):
    # Assuming here that both your data and labels are float type.
    features, labels = tf.py_func(
        get_data_from_filename, [filename], (tf.float32, tf.float32))
    return tf.data.Dataset.from_tensor_slices((features, labels))

# Create dataset of filenames.
ds = tf.data.Dataset.from_tensor_slices(filelist)
ds = ds.flat_map(get_data_wrapper)

If your dataset is very large and you run into memory issues, you can consider using a combination of interleave or parallel_interleave and from_generator instead. The from_generator method uses py_func internally, so you can read your npz file directly and then define your generator in Python.
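
A minimal sketch of that combination, assuming float features and labels as above; the generator name and cycle_length are illustrative, and for genuinely parallel reads you could swap in parallel_interleave (under tf.data.experimental or tf.contrib.data, depending on the TF 1.x version):

def npz_generator(filename):
    # Filenames arrive as bytes when passed through tf.data args.
    npdata = np.load(filename.decode() if isinstance(filename, bytes) else filename)
    for feature, label in zip(npdata['features'], npdata['labels']):
        yield feature, label

ds = tf.data.Dataset.from_tensor_slices(filelist)
ds = ds.interleave(
    lambda filename: tf.data.Dataset.from_generator(
        npz_generator,
        output_types=(tf.float32, tf.float32),
        args=(filename,)),
    cycle_length=4)  # interleave elements from up to 4 files at a time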



Source: https://stackoverflow.com/questions/53544809/tensorflow-dataset-using-many-compressed-numpy-files
