How to correctly map a Python function and then batch the Dataset in TensorFlow

Submitted by 点点圈 on 2020-01-05 07:15:07

Question


I wish to create a pipeline to feed non-standard files (for example, files with the extension *.xxx) to a neural network. Currently my code is structured as follows:

  1) I define a list of paths where to find training files

  2) I define an instance of the tf.data.Dataset object containing these paths

  3) I map onto the Dataset a Python function that takes each path and returns the associated NumPy array (loaded from a folder on disk); this array is a matrix with dimensions [256, 256, 192].

  4) I define an initializable iterator and then use it during network training.

My doubt concerns the size of the batches I feed to the network. I would like to supply batches of size 64. How can I do this? For example, if I call train_data.batch(b_size) with b_size = 1, the iterator yields a single element of shape [256, 256, 192] per step; what if I wanted to feed the network just 64 slices of this array instead?
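To make the shapes concrete, the arithmetic behind "64 slices of one array" can be illustrated with NumPy alone (a sketch standing in for the dataset elements, assuming slicing along the first axis):

```python
import numpy as np

# One dataset element as described above: a [256, 256, 192] volume.
volume = np.zeros((256, 256, 192), dtype=np.float32)

# "Unbatching" along the first axis turns it into 256 slices of shape [256, 192].
slices = list(volume)          # 256 slices, each of shape (256, 192)

# Re-batching 64 of those slices gives the desired network input shape.
batch = np.stack(slices[:64])  # shape: (64, 256, 192)

print(len(slices), slices[0].shape, batch.shape)
```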

This is an extract of my code:

    with tf.name_scope('data'):
        train_filenames = tf.constant(list_of_files_train)

        train_data = tf.data.Dataset.from_tensor_slices(train_filenames)
        train_data = train_data.map(lambda filename: tf.py_func(
            self._parse_xxx_data, [filename], [tf.float32]))

        # shuffle() and batch() return new datasets; the result must be reassigned
        train_data = train_data.shuffle(buffer_size=len(list_of_files_train))
        train_data = train_data.batch(b_size)

        iterator = tf.data.Iterator.from_structure(train_data.output_types, train_data.output_shapes)

        input_data = iterator.get_next()
        train_init = iterator.make_initializer(train_data)

  [...]

  with tf.Session() as sess:
      sess.run(train_init)
      _ = sess.run([self.train_op])

Thanks in advance

----------

I posted a solution to my problem in the comments below. I would still be happy to receive any comment or suggestion on possible improvements. Thank you ;)


Answer 1:


It's been a long time, but I'll post a possible solution for batching a dataset with a custom shape in TensorFlow, in case someone needs it.

The tf.data module offers the unbatch() transformation to unwrap the content of each dataset element. One can first unbatch and then batch the dataset again in the desired way. It is often also a good idea to shuffle the unbatched dataset before batching it again, so that each batch contains random slices drawn from random elements:

with tf.name_scope('data'):
    train_filenames = tf.constant(list_of_files_train)

    train_data = tf.data.Dataset.from_tensor_slices(train_filenames)
    train_data = train_data.map(lambda filename: tf.py_func(
        self._parse_xxx_data, [filename], [tf.float32]))

    # un-batch first, then shuffle and batch the data
    # (each transformation returns a new dataset and must be reassigned)
    train_data = train_data.apply(tf.data.experimental.unbatch())
    train_data = train_data.shuffle(buffer_size=BSIZE)
    train_data = train_data.batch(b_size)

    # [...]
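For readers without a TensorFlow session at hand, the unbatch → shuffle → batch semantics can be sketched with plain Python generators (a toy model of the behavior, not the tf.data implementation; note that this toy batch() drops the final partial batch, whereas tf.data keeps it by default):

```python
import random

def unbatch(dataset):
    # Yield the sub-elements of each item along its first axis.
    for item in dataset:
        yield from item

def batch(dataset, b_size):
    # Group consecutive elements into lists of length b_size.
    buf = []
    for elem in dataset:
        buf.append(elem)
        if len(buf) == b_size:
            yield buf
            buf = []

# Two "volumes" of 4 slices each; unbatching yields 8 slices,
# which are shuffled and re-batched into groups of 3.
data = [[f"a{i}" for i in range(4)], [f"b{i}" for i in range(4)]]
slices = list(unbatch(data))
random.shuffle(slices)
batches = list(batch(slices, 3))
print(len(slices), len(batches))  # 8 slices -> 2 full batches of 3
```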



Answer 2:


If I understand your question correctly, you can try slicing the array into the shape you want inside your self._parse_xxx_data function.
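A sketch of that idea (the helper name and the choice of 64 random slices are assumptions for illustration, not code from the answer):

```python
import numpy as np

def parse_and_slice(volume, n_slices=64, rng=None):
    # Hypothetical helper: after loading a [256, 256, 192] volume,
    # keep only n_slices randomly chosen slices along the first axis,
    # so each dataset element already has the desired leading dimension.
    rng = rng or np.random.default_rng(0)
    idx = rng.choice(volume.shape[0], size=n_slices, replace=False)
    return volume[idx]  # shape: (n_slices, 256, 192)

vol = np.zeros((256, 256, 192), dtype=np.float32)
print(parse_and_slice(vol).shape)  # (64, 256, 192)
```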



Source: https://stackoverflow.com/questions/51043703/how-to-correctly-map-a-python-function-and-then-batch-the-dataset-in-tensorflow
