I'm working with TensorFlow hoping to train a deep CNN to do move prediction for the game Go. The dataset I created consists of 100,000 binary data files, where each datafi
In your case it is not a problem to do some preprocessing and create one file out of all the files you have. For this type of game, where the history is not important and the position determines everything, your dataset can consist of just position -> next_move pairs.
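As a rough illustration of that preprocessing step, the sketch below flattens per-game records into independent (position, next_move) pairs and shuffles them once. The on-disk format of the data files is not shown in the question, so the in-memory `games` structure, the `flatten_games` name, and the string placeholders for positions and moves are all assumptions made for illustration.

```python
import random


def flatten_games(games, seed=None):
    """Turn a list of games into one shuffled list of (position, next_move) pairs.

    Assumption: each game is a sequence of (position, next_move) tuples.
    """
    examples = [pair for game in games for pair in game]
    random.Random(seed).shuffle(examples)
    return examples


# Two tiny hypothetical games; real positions would be board encodings.
games = [
    [("g1_pos0", "D4"), ("g1_pos1", "Q16")],
    [("g2_pos0", "C3"), ("g2_pos1", "R4"), ("g2_pos2", "K10")],
]
examples = flatten_games(games, seed=0)
```

Once the pairs are independent of the game they came from, they can be written to a single file in this shuffled order.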
For the more general case, TF provides everything needed for the shuffling you want. There are two types of shuffling, which serve different purposes and shuffle different things:
File-level shuffling (the shuffle argument of tf.train.string_input_producer): if you have a list of files such as ['file1', 'file2', ..., 'filen'] and shuffle is true, a file is randomly selected from this list. If it is false, the files follow one after another.
Example-level shuffling (tf.train.shuffle_batch): this takes batch_size tensors from your queue (you will need to start the queue runners with tf.train.start_queue_runners so that the queue gets filled) and shuffles them.
Yes - what you want is to use a combination of two things. (Note that this answer was written for TensorFlow v1, and some of the functionality has been replaced by the new tf.data pipelines; I've updated the answer to point to the v1 compat versions of things, but if you're coming to this answer for new code, please consult tf.data instead.)
First, randomly shuffle the order in which you input your data files, by reading from them using a tf.train.string_input_producer with shuffle=True that feeds into whatever input method you use (if you can put your examples into tf.Example proto format, that's easy to use with parse_example). To be very clear, you put the list of filenames in the string_input_producer and then read them with another method such as read_file, etc.
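To make the file-level step concrete, here is a plain-Python sketch of what shuffling the filenames once per epoch amounts to. It only mimics the behaviour described above and is not TensorFlow code; `shuffled_filenames` and the file names are hypothetical.

```python
import random


def shuffled_filenames(filenames, num_epochs, seed=None):
    """Yield every filename once per epoch, in a fresh random order each epoch."""
    rng = random.Random(seed)
    for _ in range(num_epochs):
        order = list(filenames)  # copy so the caller's list is left untouched
        rng.shuffle(order)
        yield from order


files = ["file1", "file2", "file3", "file4"]
stream = list(shuffled_filenames(files, num_epochs=2, seed=0))
first_epoch, second_epoch = stream[:4], stream[4:]
```

Each file still appears exactly once per epoch; only the order changes, and the examples inside a file are still read in their stored order.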
Second, you need to mix at a finer granularity. You can accomplish this by feeding the input examples into a tf.train.shuffle_batch node with a large capacity and a large value of min_after_dequeue. One particularly nice way is to use a shuffle_batch_join that receives input from multiple files, so that you get a lot of mixing. Set the capacity of the batch big enough to mix well without exhausting your RAM. Tens of thousands of examples usually works pretty well.
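The roles of capacity and min_after_dequeue can be seen in a simplified, single-threaded model of a shuffling buffer. This is a sketch of the idea only, not how TensorFlow implements its queue (which is filled by background threads); the function name `shuffle_batches` and the refill policy are assumptions chosen to keep the example short.

```python
import random


def shuffle_batches(examples, batch_size, capacity, min_after_dequeue, seed=None):
    """Yield batches drawn at random from a bounded buffer of pending examples.

    The buffer never holds more than `capacity` items. While input remains,
    an example is only dequeued when more than `min_after_dequeue` items are
    buffered, so every random pick has a reasonably large pool to choose from.
    """
    if not min_after_dequeue < capacity:
        raise ValueError("capacity must be larger than min_after_dequeue")
    rng = random.Random(seed)
    source = iter(examples)
    buffer, batch, exhausted = [], [], False
    while True:
        # Top the buffer up to capacity whenever it has drained to the minimum.
        if not exhausted and len(buffer) <= min_after_dequeue:
            while len(buffer) < capacity:
                try:
                    buffer.append(next(source))
                except StopIteration:
                    exhausted = True
                    break
        if not buffer:
            break
        # Dequeue one element chosen uniformly at random from the buffer.
        i = rng.randrange(len(buffer))
        buffer[i], buffer[-1] = buffer[-1], buffer[i]
        batch.append(buffer.pop())
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch, if the input did not divide evenly


examples = list(range(20))
batches = list(
    shuffle_batches(examples, batch_size=4, capacity=10, min_after_dequeue=5, seed=0)
)
```

A larger min_after_dequeue means each pick comes from a bigger pool, so examples from the same file end up further apart, at the cost of holding more examples in memory.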
Keep in mind that the batch functions add a QueueRunner to the QUEUE_RUNNERS collection, so you need to run tf.train.start_queue_runners().