I am trying to optimize my data input pipeline. The dataset is a set of 450 TFRecord files of ~70MB each, hosted on GCS. The job is executed with GCP ML Engine. There is no GPU.
I have a further suggestion to add:
According to the documentation of interleave(), its first parameter is a mapping function.
This means one can write:
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

# List the TFRecord shards, parse each shard's records, and interleave the results.
dataset = tf.data.Dataset.list_files(file_pattern)
dataset = dataset.interleave(
    lambda filename: tf.data.TFRecordDataset(filename).map(parse_fn, num_parallel_calls=AUTOTUNE),
    cycle_length=AUTOTUNE,
    num_parallel_calls=AUTOTUNE,
)
As I understand it, this applies the parsing function to the records of each shard and then interleaves the results, which eliminates the need for a separate dataset.map(...) call later on.
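For comparison, here is a minimal sketch of the unfused variant that the fused snippet above replaces, with the usual batching and prefetching steps added at the end. `file_pattern` and `parse_fn` are assumed to be defined as in the question, and the batch size of 128 is only a placeholder:

import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

# Unfused variant: interleave raw records first, then parse in a separate map().
dataset = tf.data.Dataset.list_files(file_pattern)
dataset = dataset.interleave(
    tf.data.TFRecordDataset,
    cycle_length=AUTOTUNE,
    num_parallel_calls=AUTOTUNE,
)
dataset = dataset.map(parse_fn, num_parallel_calls=AUTOTUNE)  # the call the fused version removes
dataset = dataset.batch(128)          # placeholder batch size
dataset = dataset.prefetch(AUTOTUNE)  # overlap preprocessing with training

Both versions produce the same elements; the only difference is whether parse_fn runs inside the per-shard dataset that interleave() creates or in a separate map() stage afterwards.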