tensorflow pipeline for pickled pandas data input


Question


I would like to input compressed pandas dataframes, loaded with pd.read_pickle(filename, compression='xz'), into a tensorflow pipeline. I want to use the high-level API tf.estimator classifier, which requires an input function. My data files are large matrices of floats, roughly (1400×16), and each matrix corresponds to a particular type (label). Each type (label) is contained in a different directory, so I know a matrix's label from its directory. At the low level, I know I could populate the pipeline with feed_dict={X: batch_X, Y_: batch_Y}, but tf.estimator requires an input function. For example, assuming I have two labels, my function would probably be something like this:

import os
import pandas as pd

def my_input_fn(directory, file_name):
    data = pd.read_pickle(os.path.join(directory, file_name), compression='xz')
    # sometimes I need to operate on columns
    data = data['col1'] * data['col2']
    if directory == './dir1':
        label = [1, 0]
    elif directory == './dir2':
        label = [0, 1]
    return data, label

But I am having a lot of trouble understanding how to map my input into a tensorflow graph, and how tf.estimator consumes what my function returns. What is the correct way to return my data and labels so that they enter the pipeline?


Answer 1:


Just in case someone else runs into a similar problem, here is a small version of a solution to this question.

I defined a function called "extract" and mapped the numpy array it extracts into a dataset using tf.py_func. A second dataset is then created from "labels" (a one-hot numpy array) whose rows are in correspondence with the filenames of the extracted pandas dataframes. The extracted dataframes and one-hot arrays are zipped together with tf.data.Dataset.zip into a final dataset, which can then be prefetched, iterated, etc.

import pandas as pd
import tensorflow as tf

def extract(file_name):
    # file_name arrives as a bytes tensor from tf.py_func
    df = pd.read_pickle(file_name, compression='xz')
    df = df.astype(dtype=float)
    ...  # extra manipulations
    return df.values.astype('float32', copy=False)

dataset1 = tf.data.Dataset.list_files(file_names)
dataset1 = dataset1.map(
    lambda filename: tf.py_func(extract, [filename], tf.float32),
    num_parallel_calls=10)
dataset2 = tf.data.Dataset.from_tensor_slices(labels)
dataset = tf.data.Dataset.zip((dataset1, dataset2))
iter = dataset.make_one_shot_iterator()
X, Y_ = iter.get_next()  # a one-shot iterator needs no explicit initialization

with tf.Session() as sess:
    xx, yy = sess.run([X, Y_])
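For the zip above to pair each matrix with the right class, the "labels" array has to be aligned one-to-one with the entries of file_names. A minimal sketch of building such a one-hot array from each file's parent directory (the file layout and the one_hot_labels helper are assumptions mirroring the ./dir1 / ./dir2 setup in the question, not part of the original answer):

```python
import os
import numpy as np

def one_hot_labels(file_names, class_dirs):
    """One-hot label row per file, keyed on its parent directory."""
    labels = np.zeros((len(file_names), len(class_dirs)), dtype='float32')
    for i, path in enumerate(file_names):
        labels[i, class_dirs.index(os.path.dirname(path))] = 1.0
    return labels

# hypothetical file layout mirroring the question
file_names = ['./dir1/a.pkl.xz', './dir2/b.pkl.xz', './dir1/c.pkl.xz']
labels = one_hot_labels(file_names, class_dirs=['./dir1', './dir2'])
# labels is [[1, 0], [0, 1], [1, 0]]
```

Note this alignment only holds if the dataset of filenames is built in the same order (e.g. from a fixed list rather than a shuffled listing).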


Source: https://stackoverflow.com/questions/51623671/tensorflow-pipeline-for-pickled-pandas-data-input
