Question
I would like to feed compressed pandas dataframes, loaded with pd.read_pickle(filename, compression='xz'), into a TensorFlow input pipeline. I want to use the high-level tf.estimator classifier API, which requires an input function.
My data files are large matrices (~1400x16) of floats, and each matrix corresponds to a particular type (label). Each type (label) is stored in a different directory, so I know a matrix's label from its directory. At the low level, I know I could populate the pipeline with feed_dict={X: batch_X, Y_: batch_Y} (a minimal sketch of this appears after the example below), but tf.estimator requires an input function. For example, assuming I have two labels, my function would probably look something like this:
def my_input_fn(directory, file_name):
    data = pd.read_pickle(directory + file_name, compression='xz')
    # sometimes I need to operate on columns
    data = data['col1'] * data['col2']
    if directory == './dir1':
        label = [1, 0]
    elif directory == './dir2':
        label = [0, 1]
    return data, label
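For reference, the low-level feeding I mean is roughly the following minimal sketch (X, Y_, and train_op are hypothetical placeholders and a training op I would define elsewhere; the shapes assume the ~1400x16 matrices and two one-hot labels):

import tensorflow as tf

# hypothetical placeholders for one batch of matrices and one-hot labels
X = tf.placeholder(tf.float32, shape=[None, 1400, 16])
Y_ = tf.placeholder(tf.float32, shape=[None, 2])

with tf.Session() as sess:
    # batch_X, batch_Y would be numpy arrays from my own batching code
    sess.run(train_op, feed_dict={X: batch_X, Y_: batch_Y})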
But I am having a lot of trouble understanding how to map my input into the TensorFlow graph and how tf.estimator consumes what my function returns. What is the correct way to return my data and labels so that they enter the pipeline?
Answer 1:
Just in case someone else has a similar problem, here is a small version of a solution to this question.
I defined a function called extract and mapped it over a dataset of filenames with tf.py_func, which yields the extracted numpy arrays. Another dataset is built from labels (a one-hot numpy array) in correspondence with the filenames of the extracted pandas dataframes. The extracted dataframes and one-hot arrays are zipped with tf.data.Dataset.zip into a final dataset, which can then be prefetched, iterated, etc.
import pandas as pd
import tensorflow as tf

def extract(file_name):
    df = pd.read_pickle(file_name, compression='xz')
    df = df.astype(dtype=float)
    ...  # extra manipulations
    return df.values.astype('float32', copy=False)
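One caveat: tf.py_func returns a tensor with no static shape, which some downstream layers and estimators need. A hedged sketch of a wrapper that could be mapped instead of the bare lambda below, assuming the matrices are exactly 1400x16 (adjust to your data):

def extract_with_shape(filename):
    # tf.py_func output carries no static shape information
    matrix = tf.py_func(extract, [filename], tf.float32)
    matrix.set_shape([1400, 16])  # assumed fixed shape
    return matrix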
dataset1 = tf.data.Dataset.list_files(file_names, shuffle=False)  # keep file order aligned with the labels
dataset1 = dataset1.map(lambda filename: tf.py_func(extract, [filename], tf.float32),
                        num_parallel_calls=10)
dataset2 = tf.data.Dataset.from_tensor_slices(labels)
dataset = tf.data.Dataset.zip((dataset1, dataset2))
iterator = dataset.make_one_shot_iterator()
get_batch = iterator.get_next()
X, Y_ = get_batch
with tf.Session() as sess:
    # a one-shot iterator needs no explicit initialization
    xx, yy = sess.run([X, Y_])
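To connect this back to the original question: recent TF 1.x estimators accept an input function that simply returns a tf.data.Dataset of (features, labels) pairs, so the pipeline above can be wrapped directly. A minimal sketch (the batch size and the estimator call are illustrative assumptions):

def my_input_fn():
    dataset1 = tf.data.Dataset.list_files(file_names, shuffle=False)
    dataset1 = dataset1.map(lambda filename: tf.py_func(extract, [filename], tf.float32),
                            num_parallel_calls=10)
    dataset2 = tf.data.Dataset.from_tensor_slices(labels)
    dataset = tf.data.Dataset.zip((dataset1, dataset2))
    dataset = dataset.batch(32).prefetch(1)  # illustrative batch size
    return dataset

# estimator is a hypothetical tf.estimator.Estimator instance:
# estimator.train(input_fn=my_input_fn, steps=1000)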
Source: https://stackoverflow.com/questions/51623671/tensorflow-pipeline-for-pickled-pandas-data-input