Hi, I am new to Azure Data Factory and not at all familiar with the back-end processing that runs behind the scenes. I am wondering if there is a performance impact to running a couple of data flows in parallel in a pipeline versus putting all of the logic into a single data flow.
1: If you execute data flows in a pipeline in parallel, ADF will spin up a separate Spark cluster for each one, based on the settings in the Azure Integration Runtime attached to each activity.
2: If you put all of your logic inside a single data flow, then it will all execute in that same job execution context on a single Spark cluster instance.
3: Another option is to execute the activities serially in the pipeline. If you have set a TTL on the Azure IR configuration, ADF will reuse the compute resources (VMs), but you will still get a brand-new Spark context for each execution (a rough sketch of setting that TTL follows this list).
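To make option 3 concrete, here is a minimal sketch, using the azure-mgmt-datafactory Python SDK, of a managed Azure IR whose data flow compute keeps its VMs warm for 15 minutes after a job finishes. The subscription, resource group, factory, and IR names are placeholders, and property names can shift between SDK versions, so treat this as an illustration rather than a drop-in script.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    ManagedIntegrationRuntime,
    IntegrationRuntimeComputeProperties,
    IntegrationRuntimeDataFlowProperties,
)

# Placeholder subscription ID; in practice pull this from your environment.
client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Managed (Azure) IR whose data flow compute keeps its VMs alive for 15 minutes
# after a job finishes, so serial executions can reuse them (option 3 above).
ir = IntegrationRuntimeResource(
    properties=ManagedIntegrationRuntime(
        compute_properties=IntegrationRuntimeComputeProperties(
            location="AutoResolve",
            data_flow_properties=IntegrationRuntimeDataFlowProperties(
                compute_type="General",
                core_count=8,
                time_to_live=15,  # TTL in minutes
            ),
        )
    )
)

client.integration_runtimes.create_or_update(
    "my-resource-group", "my-data-factory", "DataFlowIR-TTL", ir
)
```

Note that the TTL only keeps the VMs warm; as mentioned above, each data flow execution still gets its own Spark context.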
All are valid practices, and which one you choose should be driven by the requirements of your ETL process.
No. 3 will likely take the longest time to execute end-to-end. But it does provide a clean separation of operations in each data flow step.
No. 2 could be more difficult to follow logically and doesn't give you much reusability.
No. 1 is really similar to No. 3, but you run the data flows in parallel. Of course, not every end-to-end process can run in parallel; you may require one data flow to finish before starting the next, in which case you're back in No. 3's serial mode (the sketch below shows how that dependency is wired).
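Here is a rough sketch, under the same assumptions as the previous snippet, of wiring two Execute Data Flow activities in a pipeline, first in parallel (no dependencies, option 1) and then serially via an activity dependency (option 3). The data flow and pipeline names are made up for illustration, and `client` is the management client from the earlier sketch.

```python
from azure.mgmt.datafactory.models import (
    PipelineResource,
    ExecuteDataFlowActivity,
    DataFlowReference,
    ActivityDependency,
)

def flow(name, data_flow_name, depends_on=None):
    """Helper that builds an Execute Data Flow activity for a named data flow."""
    return ExecuteDataFlowActivity(
        name=name,
        data_flow=DataFlowReference(type="DataFlowReference",
                                    reference_name=data_flow_name),
        depends_on=depends_on or [],
    )

# Option 1: no dependencies between the activities, so ADF starts both data
# flows (and their Spark clusters) at the same time.
parallel = PipelineResource(activities=[
    flow("TransformOrders", "OrdersDataFlow"),
    flow("TransformCustomers", "CustomersDataFlow"),
])

# Option 3: the second activity only starts after the first one succeeds.
serial = PipelineResource(activities=[
    flow("TransformOrders", "OrdersDataFlow"),
    flow("TransformCustomers", "CustomersDataFlow",
         depends_on=[ActivityDependency(activity="TransformOrders",
                                        dependency_conditions=["Succeeded"])]),
])

client.pipelines.create_or_update("my-resource-group", "my-data-factory",
                                  "ParallelFlows", parallel)
client.pipelines.create_or_update("my-resource-group", "my-data-factory",
                                  "SerialFlows", serial)
```

The same wiring applies if you build the pipelines in the ADF Studio UI instead of the SDK: the green "on success" arrow between activities is what turns option 1 into option 3.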