TFRecords and record shuffling

一整个雨季 2021-02-19 07:15

My understanding is that it is good practice to shuffle training samples for each epoch so that each mini-batch contains a nice random sample of the entire dataset. If I convert …

2 Answers
  •  悲&欢浪女
    2021-02-19 07:49

    It's not - you can improve the mixing somewhat by sharding your input into multiple input data files, and then treating them as explained in this answer.
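
    As a rough sketch of what that sharding step could look like in TensorFlow (the shard count, file naming scheme, and the "value"/"label" feature names here are just placeholders, not anything from the question):

        import tensorflow as tf

        NUM_SHARDS = 100  # on the order of 100-1000 shards, as suggested below

        def write_sharded(examples, prefix="train"):
            """Round-robin already-shuffled (value, label) pairs across NUM_SHARDS TFRecord files."""
            writers = [
                tf.io.TFRecordWriter(f"{prefix}-{i:05d}-of-{NUM_SHARDS:05d}.tfrecord")
                for i in range(NUM_SHARDS)
            ]
            for i, (value, label) in enumerate(examples):
                # value: list of floats, label: int (illustrative feature layout)
                ex = tf.train.Example(features=tf.train.Features(feature={
                    "value": tf.train.Feature(float_list=tf.train.FloatList(value=value)),
                    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
                }))
                writers[i % NUM_SHARDS].write(ex.SerializeToString())
            for w in writers:
                w.close()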

    If you need anything close to "perfect" shuffling, you would need to read the whole dataset into memory, but in practice for most things you'll probably get "good enough" shuffling by just splitting into 100 or 1000 files and then using a shuffle queue that's big enough to hold 8-16 files' worth of data.
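
    A minimal sketch of the same idea with tf.data, whose shuffle buffer plays the role of the shuffle queue described above: file order is reshuffled each epoch, reads are interleaved across shards, and the buffer is sized to hold several files' worth of records. The shard size (~10k records per file) and batch size are assumptions, not numbers from this thread:

        import tensorflow as tf

        # Reshuffle the shard order every epoch.
        files = tf.data.Dataset.list_files("train-*.tfrecord", shuffle=True)

        # Pull records from 16 shards at a time so the stream already mixes files.
        dataset = files.interleave(
            tf.data.TFRecordDataset,
            cycle_length=16,
            num_parallel_calls=tf.data.AUTOTUNE,
        )

        # Buffer roughly 8-16 files' worth of records; with ~10k records per shard
        # (an assumption - size this to your own shards) that is ~100k elements.
        dataset = dataset.shuffle(buffer_size=100_000, reshuffle_each_iteration=True)
        dataset = dataset.batch(256).prefetch(tf.data.AUTOTUNE)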

    I have an itch in the back of my head to write an external random shuffle queue that can spill to disk, but it's very low on my priority list -- if someone wanted to contribute one, I'm volunteering to review it. :)
