My understanding is that it is good practice to shuffle training samples for each epoch so that each mini-batch contains a nice random sample of the entire dataset. If I convert
It's not, but you can improve the mixing somewhat by sharding your input into multiple data files and then treating them as explained in this answer.
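For the sharding step itself, here is a minimal sketch; it assumes the TF 1.x `tf.python_io.TFRecordWriter` API, and `write_shards`, the file prefix, and the shard count are illustrative names, not anything from the original answer:

```python
import tensorflow as tf  # assumes the TF 1.x API

def write_shards(serialized_examples, num_shards=100, prefix="data/train"):
    """Spread serialized tf.train.Example records round-robin across many files."""
    writers = [
        tf.python_io.TFRecordWriter("%s-%03d.tfrecord" % (prefix, i))
        for i in range(num_shards)
    ]
    for i, example in enumerate(serialized_examples):
        writers[i % num_shards].write(example)
    for writer in writers:
        writer.close()
```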
If you need anything close to "perfect" shuffling, you would need to read the entire dataset into memory, but in practice, for most workloads, you'll probably get "good enough" shuffling by splitting the data into 100 or 1000 files and then using a shuffle queue big enough to hold 8-16 files' worth of data.
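On the reading side, a rough sketch of that setup with the TF 1.x queue-runner API might look like the following; the filenames, batch size, and queue capacities are assumptions for illustration, and the capacities in particular depend on how many records each of your shards holds:

```python
import tensorflow as tf  # assumes the TF 1.x queue-runner API

# Hypothetical shard names matching the sketch above.
filenames = ["data/train-%03d.tfrecord" % i for i in range(100)]

# Visit the files themselves in a different random order each epoch.
filename_queue = tf.train.string_input_producer(filenames, shuffle=True)

reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)

# Shuffle queue sized to hold roughly 8-16 files' worth of records;
# the numbers below assume a few thousand records per shard, so tune them for your data.
batch = tf.train.shuffle_batch(
    [serialized_example],
    batch_size=128,
    num_threads=4,
    capacity=60000,
    min_after_dequeue=40000,
)
```

As with any queue-based pipeline, you'd pull `batch` inside a session after starting the queue runners with `tf.train.start_queue_runners`.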
I have an itch in the back of my head to write an external random shuffle queue that can spill to disk, but it's very low on my priority list; if someone wants to contribute one, I volunteer to review it. :)