Does TensorFlow's `sample_from_datasets` still sample from a Dataset when getting a `DirectedInterleave selected an exhausted input` warning?

Submitted by 巧了我就是萌 on 2020-05-25 23:46:25

Question


When using TensorFlow's tf.data.experimental.sample_from_datasets to sample equally from two very unbalanced Datasets, I end up getting a DirectedInterleave selected an exhausted input: 0 warning. Based on this GitHub issue, it appears that this occurs when one of the Datasets passed to sample_from_datasets has run out of examples and would need to resample examples it has already produced.

Does the depleted dataset then still produce samples (thereby maintaining the desired balanced training ratio), or does it stop producing samples, so that training once again becomes unbalanced? If the latter, is there a method to produce the desired balanced training ratio with sample_from_datasets?

Note: TensorFlow 2 Beta is being used
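
For reference, a minimal setup along the lines described (the dataset names and sizes here are illustrative, not the actual data) would look like:

import tensorflow as tf

# Illustrative stand-ins for the two unbalanced datasets.
majority_ds = tf.data.Dataset.range(10000)
minority_ds = tf.data.Dataset.range(5)

# Sample from both with equal probability. Once minority_ds is exhausted,
# the "DirectedInterleave selected an exhausted input" warning appears.
balanced_ds = tf.data.experimental.sample_from_datasets(
    [majority_ds, minority_ds], weights=[0.5, 0.5])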


Answer 1:


The smaller dataset does NOT repeat: once it is exhausted, the remainder just comes from the larger dataset that still has examples.

You can verify this behaviour by doing something like this:

import tensorflow as tf

def data1():
  # The small dataset: only 5 examples.
  for i in range(5):
    yield "data1-{}".format(i)

def data2():
  # The large dataset: 10000 examples.
  for i in range(10000):
    yield "data2-{}".format(i)

ds1 = tf.data.Dataset.from_generator(data1, tf.string)
ds2 = tf.data.Dataset.from_generator(data2, tf.string)

# No weights given, so both datasets are sampled with equal probability.
sampled_ds = tf.data.experimental.sample_from_datasets([ds2, ds1], seed=1)

Then, if we iterate over sampled_ds, we see that no samples from data1 are produced once it is exhausted.
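
For example, with a simple loop (eager execution is the default in TF 2):

for elem in sampled_ds:
  # Each element is a scalar string tensor from one of the two datasets.
  print(elem)

the output looks like: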

tf.Tensor(b'data1-0', shape=(), dtype=string)
tf.Tensor(b'data2-0', shape=(), dtype=string)
tf.Tensor(b'data2-1', shape=(), dtype=string)
tf.Tensor(b'data2-2', shape=(), dtype=string)
tf.Tensor(b'data2-3', shape=(), dtype=string)
tf.Tensor(b'data2-4', shape=(), dtype=string)
tf.Tensor(b'data1-1', shape=(), dtype=string)
tf.Tensor(b'data1-2', shape=(), dtype=string)
tf.Tensor(b'data1-3', shape=(), dtype=string)
tf.Tensor(b'data2-5', shape=(), dtype=string)
tf.Tensor(b'data1-4', shape=(), dtype=string)
tf.Tensor(b'data2-6', shape=(), dtype=string)
tf.Tensor(b'data2-7', shape=(), dtype=string)
tf.Tensor(b'data2-8', shape=(), dtype=string)
tf.Tensor(b'data2-9', shape=(), dtype=string)
tf.Tensor(b'data2-10', shape=(), dtype=string)
tf.Tensor(b'data2-11', shape=(), dtype=string)
tf.Tensor(b'data2-12', shape=(), dtype=string)
...
---[no more 'data1-x' examples]--
...

Of course, you could make data1 repeat with something like this:

sampled_ds = tf.data.experimental.sample_from_datasets([ds2, ds1.repeat()], seed=1)

but it seems from comments that you are aware of this and it doesn't work for your scenario.

If the latter, is there a method to produce the desired balanced training ratio with sample_from_datasets?

Well, if you have 2 datasets of differing lengths and you are sampling evenly from them, then it seems like you only have 2 choices:

  • repeat the smaller dataset n times (where n ≃ len(ds2)/len(ds1))
  • stop sampling once the smaller dataset is exhausted

To achieve the first you can use ds1.repeat(n).
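
As a sketch, with the dataset sizes hardcoded to the hypothetical lengths used above (in practice you would know them up front or count them):

# Hypothetical sizes matching the generators above.
len_ds1, len_ds2 = 5, 10000
n = len_ds2 // len_ds1  # repeat the smaller dataset roughly n times

balanced_ds = tf.data.experimental.sample_from_datasets(
    [ds2, ds1.repeat(n)], seed=1)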

To achieve the second you could use ds2.take(m) where m=len(ds1).
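
Again as a sketch, with m hardcoded to the known length of ds1:

m = 5  # assumed length of ds1 from the example above
truncated_ds = tf.data.experimental.sample_from_datasets(
    [ds2.take(m), ds1], seed=1)

Both inputs then hold about m examples, so the combined dataset ends after roughly 2m elements with an overall 50/50 mix.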



Source: https://stackoverflow.com/questions/57278214/does-tensorflows-sample-from-datasets-still-sample-from-a-dataset-when-gettin
