Does TensorFlow's `sample_from_datasets` still sample from a Dataset when getting a `DirectedInterleave selected an exhausted input` warning?

Submitted by 巧了我就是萌 on 2020-05-25 23:46:25

Question


When using TensorFlow's tf.data.experimental.sample_from_datasets to sample equally from two very unbalanced Datasets, I end up getting a DirectedInterleave selected an exhausted input: 0 warning. Based on this GitHub issue, it appears that this occurs when one of the Datasets passed to sample_from_datasets has run out of examples and would need to resample examples it has already produced.

Does the depleted dataset then still produce samples (thereby maintaining the desired balanced training ratio), or does it stop producing samples, so that training once again becomes unbalanced? If the latter, is there a method to produce the desired balanced training ratio with sample_from_datasets?

Note: TensorFlow 2 Beta is being used
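
For reference, a minimal setup along the lines described (the dataset names and sizes here are illustrative, not the actual data) would look like:

import tensorflow as tf

# Illustrative stand-ins for the two unbalanced datasets.
majority_ds = tf.data.Dataset.range(10000)
minority_ds = tf.data.Dataset.range(5)

# Sample from both with equal probability. Once minority_ds is exhausted,
# the "DirectedInterleave selected an exhausted input" warning appears.
balanced_ds = tf.data.experimental.sample_from_datasets(
    [majority_ds, minority_ds], weights=[0.5, 0.5])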


Answer 1:


The smaller dataset does NOT repeat: once it is exhausted, the remainder just comes from the larger dataset that still has examples.

You can verify this behaviour by doing something like this:

import tensorflow as tf

def data1():
  # The small dataset: only 5 examples.
  for i in range(5):
    yield "data1-{}".format(i)

def data2():
  # The large dataset: 10000 examples.
  for i in range(10000):
    yield "data2-{}".format(i)

ds1 = tf.data.Dataset.from_generator(data1, tf.string)
ds2 = tf.data.Dataset.from_generator(data2, tf.string)

# No weights given, so both datasets are sampled with equal probability.
sampled_ds = tf.data.experimental.sample_from_datasets([ds2, ds1], seed=1)

Then, if we iterate over sampled_ds, we see that no samples from data1 are produced once it is exhausted.
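
For example, with a simple loop (eager execution is the default in TF 2):

for elem in sampled_ds:
  # Each element is a scalar string tensor from one of the two datasets.
  print(elem)

the output looks like: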

tf.Tensor(b'data1-0', shape=(), dtype=string)
tf.Tensor(b'data2-0', shape=(), dtype=string)
tf.Tensor(b'data2-1', shape=(), dtype=string)
tf.Tensor(b'data2-2', shape=(), dtype=string)
tf.Tensor(b'data2-3', shape=(), dtype=string)
tf.Tensor(b'data2-4', shape=(), dtype=string)
tf.Tensor(b'data1-1', shape=(), dtype=string)
tf.Tensor(b'data1-2', shape=(), dtype=string)
tf.Tensor(b'data1-3', shape=(), dtype=string)
tf.Tensor(b'data2-5', shape=(), dtype=string)
tf.Tensor(b'data1-4', shape=(), dtype=string)
tf.Tensor(b'data2-6', shape=(), dtype=string)
tf.Tensor(b'data2-7', shape=(), dtype=string)
tf.Tensor(b'data2-8', shape=(), dtype=string)
tf.Tensor(b'data2-9', shape=(), dtype=string)
tf.Tensor(b'data2-10', shape=(), dtype=string)
tf.Tensor(b'data2-11', shape=(), dtype=string)
tf.Tensor(b'data2-12', shape=(), dtype=string)
...
---[no more 'data1-x' examples]--
...

Of course, you could make data1 repeat with something like this:

sampled_ds = tf.data.experimental.sample_from_datasets([ds2, ds1.repeat()], seed=1)

but it seems from comments that you are aware of this and it doesn't work for your scenario.

If the latter, is there a method to produce the desired balanced training ratio with sample_from_datasets?

Well, if you have 2 datasets of differing lengths and you are sampling evenly from them, then it seems like you only have 2 choices:

  • repeat the smaller dataset n times (where n ≃ len(ds2)/len(ds1))
  • stop sampling once the smaller dataset is exhausted

To achieve the first you can use ds1.repeat(n).
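
As a sketch, with the dataset sizes hardcoded to the hypothetical lengths used above (in practice you would know them up front or count them):

# Hypothetical sizes matching the generators above.
len_ds1, len_ds2 = 5, 10000
n = len_ds2 // len_ds1  # repeat the smaller dataset roughly n times

balanced_ds = tf.data.experimental.sample_from_datasets(
    [ds2, ds1.repeat(n)], seed=1)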

To achieve the second you could use ds2.take(m) where m=len(ds1).
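
Again as a sketch, with m hardcoded to the known length of ds1:

m = 5  # assumed length of ds1 from the example above
truncated_ds = tf.data.experimental.sample_from_datasets(
    [ds2.take(m), ds1], seed=1)

Both inputs then hold about m examples, so the combined dataset ends after roughly 2m elements with an overall 50/50 mix.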



Source: https://stackoverflow.com/questions/57278214/does-tensorflows-sample-from-datasets-still-sample-from-a-dataset-when-gettin
