How to improve data input pipeline performance?

礼貌的吻别 2021-02-04 01:28

I am trying to optimize my data input pipeline. The dataset is a set of 450 TFRecord files of ~70MB each, hosted on GCS. The job is executed with GCP ML Engine. There is no GPU.

2 Answers
  •  遇见更好的自我
    2021-02-04 02:07

    I have a further suggestion to add:

    According to the documentation of interleave(), its first parameter is a mapping function.

    This means one can write:

     AUTOTUNE = tf.data.experimental.AUTOTUNE

     dataset = tf.data.Dataset.list_files(file_pattern)
     dataset = dataset.interleave(
         lambda x: tf.data.TFRecordDataset(x).map(parse_fn,
                                                  num_parallel_calls=AUTOTUNE),
         cycle_length=AUTOTUNE,
         num_parallel_calls=AUTOTUNE)
    

    As I understand it, this maps the parsing function over the records of each shard and then interleaves the parsed results, which eliminates the need for a separate dataset.map(...) later on.
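    To make the fused interleave-plus-map pattern concrete without needing TFRecord files on disk, here is a minimal, self-contained sketch. The three in-memory "shards" and the squaring `parse_fn` are stand-ins I invented for illustration; in the real pipeline the shards would be `TFRecordDataset`s and `parse_fn` would decode serialized `tf.train.Example` protos.

    ```python
    import tensorflow as tf

    # tf.data.AUTOTUNE in newer TF2; the experimental alias works in both.
    AUTOTUNE = tf.data.experimental.AUTOTUNE

    # Stand-in for parse_fn: the real one would decode a serialized record.
    def parse_fn(x):
        return x * x

    # Stand-in for list_files(): three tiny integer "shards" instead of files.
    shards = tf.data.Dataset.from_tensor_slices([[0, 1], [2, 3], [4, 5]])

    # Each shard is turned into a dataset, parsed with map(), and the parsed
    # streams are interleaved -- the same shape as the TFRecord pipeline above.
    dataset = shards.interleave(
        lambda shard: tf.data.Dataset.from_tensor_slices(shard).map(
            parse_fn, num_parallel_calls=AUTOTUNE),
        cycle_length=2,
        num_parallel_calls=AUTOTUNE)

    print(sorted(int(x) for x in dataset))
    ```

    Sorting the output makes the check deterministic, since the interleave order of parallel shards is not guaranteed.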
