Google Dataflow Pipeline with Instance Local Cache + External REST API calls

Asked 2020-12-16 06:03

We want to build a Cloud Dataflow streaming pipeline that ingests events from Pub/Sub and performs multiple ETL-like operations on each individual event. One of these operations requires calling an external REST API, and we would like to cache the API responses in a cache local to each worker instance (keyed by device id) so that the API is not called for every single event.

1 Answer
  • Answered 2020-12-16 06:21

    Here are a few things you can do:

    • Your DoFns can have instance variables, and you can put the cache there.
    • It's also fine to use regular Java static variables for a cache local to the VM, as long as you properly manage multithreaded access to it. Guava's CacheBuilder can be really helpful here (see the cache sketch after this list).
    • Using regular Java APIs for temp files on a worker is safe (but again, be mindful of multithreaded/multiprocess access to your files, and make sure to clean them up; the DoFn @Setup and @Teardown methods are useful for this, as in the temp-file sketch below).
    • You can do a GroupByKey by the device id; then, most of the time, at least with the Cloud Dataflow runner, the same key will be processed by the same worker (key assignments can change while the pipeline runs, but usually not too frequently). You'll probably want a windowing/triggering strategy with immediate triggering, though, as in the grouping sketch below.
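
    For the first two points, here is a minimal sketch of a DoFn that keeps a VM-local cache in a static Guava Cache field; the String device-id input and the fetchFromRestApi helper are hypothetical stand-ins for your own event type and REST client:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Enriches each device id with a value fetched from an external REST API,
// caching responses in a cache shared by all DoFn instances (and threads)
// in the same worker JVM.
public class EnrichFn extends DoFn<String, KV<String, String>> {

  // Static field: local to the worker VM, shared across bundles and threads.
  // Guava's Cache is thread-safe, so no extra synchronization is needed.
  private static final Cache<String, String> CACHE =
      CacheBuilder.newBuilder()
          .maximumSize(100_000)
          .expireAfterWrite(10, TimeUnit.MINUTES)
          .build();

  @ProcessElement
  public void processElement(ProcessContext c) throws ExecutionException {
    String deviceId = c.element();
    // On a miss, get() invokes the loader and caches the result; concurrent
    // callers for the same key block on a single load instead of racing.
    String response = CACHE.get(deviceId, () -> fetchFromRestApi(deviceId));
    c.output(KV.of(deviceId, response));
  }

  // Hypothetical REST call; substitute your actual HTTP client here.
  private String fetchFromRestApi(String deviceId) {
    return "response-for-" + deviceId;
  }
}
```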
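
    For the temp-file point, a sketch of @Setup/@Teardown usage with java.nio, assuming per-element scratch files that are deleted as soon as they are used:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.beam.sdk.transforms.DoFn;

// Keeps a per-instance scratch directory on the worker's local disk, so
// concurrent DoFn instances on the same worker never collide on file names.
public class ScratchFileFn extends DoFn<String, String> {

  private transient Path scratchDir;

  @Setup
  public void setup() throws IOException {
    // Runs once per DoFn instance, before any bundles are processed.
    scratchDir = Files.createTempDirectory("dataflow-scratch-");
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws IOException {
    Path tmp = Files.createTempFile(scratchDir, "event-", ".tmp");
    Files.write(tmp, c.element().getBytes(StandardCharsets.UTF_8));
    // ... do whatever file-based work you need here ...
    c.output(c.element());
    Files.deleteIfExists(tmp); // clean up per-element files eagerly
  }

  @Teardown
  public void teardown() throws IOException {
    // Best-effort cleanup; @Teardown is not guaranteed to run if the worker
    // crashes, so treat local files as disposable. The directory must be
    // empty at this point for the delete to succeed.
    if (scratchDir != null) {
      Files.deleteIfExists(scratchDir);
    }
  }
}
```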
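
    And for the last point, a sketch of grouping by device id with immediate triggering; the Event type and its getDeviceId() accessor are assumed:

```java
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

// Groups events by device id so the runner tends to route each key to the
// same worker (and hence the same local cache), while firing a pane for
// every element so the GroupByKey adds minimal latency.
static PCollection<KV<String, Iterable<Event>>> groupByDevice(
    PCollection<Event> events) {
  return events
      .apply(WithKeys.of((Event e) -> e.getDeviceId())
          .withKeyType(TypeDescriptors.strings()))
      // A non-default trigger on a streaming GroupByKey also requires an
      // allowed lateness and an accumulation mode, hence the extra calls.
      .apply(Window.<KV<String, Event>>into(new GlobalWindows())
          .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
          .withAllowedLateness(Duration.ZERO)
          .discardingFiredPanes())
      .apply(GroupByKey.create());
}
```

    Note that key-to-worker affinity is a runner implementation detail, not a guarantee, so treat the local cache as an optimization rather than a correctness requirement.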