Google Dataflow Pipeline with Instance Local Cache + External REST API calls

Asked 2020-12-16 06:03

We want to build a Cloud Dataflow streaming pipeline that ingests events from Pub/Sub and performs multiple ETL-like operations on each individual event. One of these operations requires calling an external REST API, and we would like to cache the API responses in a cache local to each worker instance (keyed by device id) so that the API is not called for every single event.

1 Answer
  • Answered 2020-12-16 06:21

    Here are a few things you can do:

    • Your DoFns can have instance variables, and you can put the cache there.
    • It's also fine to use regular Java static variables for a cache local to the VM, as long as you properly manage multithreaded access to it. Guava's CacheBuilder can be really helpful here (see the cache sketch after this list).
    • Using regular Java APIs for temp files on a worker is safe (but again, be mindful of multithreaded/multiprocess access to your files, and make sure to clean them up; the DoFn @Setup and @Teardown methods are useful for this, as in the temp-file sketch below).
    • You can do a GroupByKey by the device id; then, most of the time, at least with the Cloud Dataflow runner, the same key will be processed by the same worker (key assignments can change while the pipeline runs, but usually not too frequently). You'll probably want a windowing/triggering strategy with immediate triggering, though, as in the grouping sketch below.
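
    For the first two points, here is a minimal sketch of a DoFn that keeps a VM-local cache in a static Guava Cache field; the String device-id input and the fetchFromRestApi helper are hypothetical stand-ins for your own event type and REST client:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Enriches each device id with a value fetched from an external REST API,
// caching responses in a cache shared by all DoFn instances (and threads)
// in the same worker JVM.
public class EnrichFn extends DoFn<String, KV<String, String>> {

  // Static field: local to the worker VM, shared across bundles and threads.
  // Guava's Cache is thread-safe, so no extra synchronization is needed.
  private static final Cache<String, String> CACHE =
      CacheBuilder.newBuilder()
          .maximumSize(100_000)
          .expireAfterWrite(10, TimeUnit.MINUTES)
          .build();

  @ProcessElement
  public void processElement(ProcessContext c) throws ExecutionException {
    String deviceId = c.element();
    // On a miss, get() invokes the loader and caches the result; concurrent
    // callers for the same key block on a single load instead of racing.
    String response = CACHE.get(deviceId, () -> fetchFromRestApi(deviceId));
    c.output(KV.of(deviceId, response));
  }

  // Hypothetical REST call; substitute your actual HTTP client here.
  private String fetchFromRestApi(String deviceId) {
    return "response-for-" + deviceId;
  }
}
```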
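
    For the temp-file point, a sketch of @Setup/@Teardown usage with java.nio, assuming per-element scratch files that are deleted as soon as they are used:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.beam.sdk.transforms.DoFn;

// Keeps a per-instance scratch directory on the worker's local disk, so
// concurrent DoFn instances on the same worker never collide on file names.
public class ScratchFileFn extends DoFn<String, String> {

  private transient Path scratchDir;

  @Setup
  public void setup() throws IOException {
    // Runs once per DoFn instance, before any bundles are processed.
    scratchDir = Files.createTempDirectory("dataflow-scratch-");
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws IOException {
    Path tmp = Files.createTempFile(scratchDir, "event-", ".tmp");
    Files.write(tmp, c.element().getBytes(StandardCharsets.UTF_8));
    // ... do whatever file-based work you need here ...
    c.output(c.element());
    Files.deleteIfExists(tmp); // clean up per-element files eagerly
  }

  @Teardown
  public void teardown() throws IOException {
    // Best-effort cleanup; @Teardown is not guaranteed to run if the worker
    // crashes, so treat local files as disposable. The directory must be
    // empty at this point for the delete to succeed.
    if (scratchDir != null) {
      Files.deleteIfExists(scratchDir);
    }
  }
}
```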
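
    And for the last point, a sketch of grouping by device id with immediate triggering; the Event type and its getDeviceId() accessor are assumed:

```java
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

// Groups events by device id so the runner tends to route each key to the
// same worker (and hence the same local cache), while firing a pane for
// every element so the GroupByKey adds minimal latency.
static PCollection<KV<String, Iterable<Event>>> groupByDevice(
    PCollection<Event> events) {
  return events
      .apply(WithKeys.of((Event e) -> e.getDeviceId())
          .withKeyType(TypeDescriptors.strings()))
      // A non-default trigger on a streaming GroupByKey also requires an
      // allowed lateness and an accumulation mode, hence the extra calls.
      .apply(Window.<KV<String, Event>>into(new GlobalWindows())
          .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
          .withAllowedLateness(Duration.ZERO)
          .discardingFiredPanes())
      .apply(GroupByKey.create());
}
```

    Note that key-to-worker affinity is a runner implementation detail, not a guarantee, so treat the local cache as an optimization rather than a correctness requirement.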