Question
I would like to double the size of an existing dataset that I'm using to train a neural network in TensorFlow, on the fly, by adding random noise to it. When I'm done I'll have all the existing examples and also all of the examples with noise added to them. I'd also like to interleave these as I transform them, so they come out in this order: example 1 without noise, example 1 with noise, example 2 without noise, example 2 with noise, etc. I'm struggling to accomplish this with the Dataset API. I've tried to use unbatch, like so:
def generate_permutations(features, labels):
    return [
        [features, labels],
        [add_noise(features), labels]
    ]

dataset.map(generate_permutations).apply(tf.contrib.data.unbatch())
but I get an error saying "Shapes must be equal rank, but are 2 and 1". I'm guessing TensorFlow is trying to make a single tensor out of the batch I'm returning, but features and labels have different shapes, so that doesn't work. I could probably do this by just making two datasets and concatenating them together, but I'm worried that would result in very skewed training, where I train nicely for half the epoch and then suddenly all of the data has this new transformation applied to it for the second half. How can I accomplish this on the fly, without writing these transformations to disk before feeding them into TensorFlow?
Answer 1:
The Dataset.flat_map() transformation is the tool you need: it lets you map a single input element to a Dataset of multiple elements, and then flattens the resulting datasets into a single one. Your code would look something like the following:
def generate_permutations(features, labels):
    regular_ds = tf.data.Dataset.from_tensors((features, labels))
    noisy_ds = tf.data.Dataset.from_tensors((add_noise(features), labels))
    return regular_ds.concatenate(noisy_ds)

dataset = dataset.flat_map(generate_permutations)
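
To see the interleaving end to end, here is a minimal, self-contained sketch in TF 2.x eager style. The add_noise function and the toy feature/label tensors are hypothetical stand-ins (the question doesn't show them); add_noise here just adds zero-mean Gaussian noise:

import tensorflow as tf

def add_noise(features):
    # Hypothetical noise function: add zero-mean Gaussian noise with stddev 0.1.
    return features + tf.random.normal(tf.shape(features), stddev=0.1)

def generate_permutations(features, labels):
    # Wrap the original and the noisy example in single-element datasets
    # and chain them, so each original is immediately followed by its noisy copy.
    regular_ds = tf.data.Dataset.from_tensors((features, labels))
    noisy_ds = tf.data.Dataset.from_tensors((add_noise(features), labels))
    return regular_ds.concatenate(noisy_ds)

# Toy dataset: three examples with two features each.
features = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
labels = tf.constant([0, 1, 0])

dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.flat_map(generate_permutations)

# Prints: example 1, noisy example 1, example 2, noisy example 2, ...
for f, l in dataset:
    print(f.numpy(), l.numpy())

Because flat_map preserves the order of the elements produced for each input, every original example is followed directly by its noisy copy, which gives exactly the interleaving described in the question and avoids the half-clean/half-noisy epoch you would get from concatenating two full datasets.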
Source: https://stackoverflow.com/questions/47337031/how-to-expand-tf-data-dataset-with-additional-example-transformations-in-tensorf