I am trying to optimize my data input pipeline. The dataset is a set of 450 TFRecord files of ~70MB each, hosted on GCS. The job is executed with GCP ML Engine. There is no GPU.
I have a further suggestion to add:
According to the documentation of interleave(), its first parameter is a mapping function.
This means one can write:
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

# List the TFRecord shards, parse each shard's records, and interleave the results.
dataset = tf.data.Dataset.list_files(file_pattern)
dataset = dataset.interleave(
    lambda filename: tf.data.TFRecordDataset(filename).map(parse_fn, num_parallel_calls=AUTOTUNE),
    cycle_length=AUTOTUNE,
    num_parallel_calls=AUTOTUNE,
)
As I understand it, this applies the parsing function to the records of each shard and then interleaves the results, which eliminates the need for a separate dataset.map(...) call later on.
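For comparison, here is a minimal sketch of the unfused variant that the fused snippet above replaces, with the usual batching and prefetching steps added at the end. `file_pattern` and `parse_fn` are assumed to be defined as in the question, and the batch size of 128 is only a placeholder:

import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

# Unfused variant: interleave raw records first, then parse in a separate map().
dataset = tf.data.Dataset.list_files(file_pattern)
dataset = dataset.interleave(
    tf.data.TFRecordDataset,
    cycle_length=AUTOTUNE,
    num_parallel_calls=AUTOTUNE,
)
dataset = dataset.map(parse_fn, num_parallel_calls=AUTOTUNE)  # the call the fused version removes
dataset = dataset.batch(128)          # placeholder batch size
dataset = dataset.prefetch(AUTOTUNE)  # overlap preprocessing with training

Both versions produce the same elements; the only difference is whether parse_fn runs inside the per-shard dataset that interleave() creates or in a separate map() stage afterwards.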