Multiple Data flows vs all Transformations in one

Backend · Open · 1 answer · 484 views
轻奢々 2021-01-29 04:54

Hi, I am new to Azure Data Factory and not at all familiar with the back-end processing that runs behind the scenes. I am wondering if there is a performance impact to running a couple of data flows in parallel versus putting all transformations in one data flow.

1 Answer
  • 2021-01-29 05:39

    1: If you execute data flows in a pipeline in parallel, ADF will spin-up separate Spark clusters for each based on the settings in your Azure Integration Runtime attached to each activity.

    2: If you put all of your logic inside a single data flow, then it will all execute in that same job execution context on a single Spark cluster instance.

    3: Another option is to execute the data flow activities serially in the pipeline. If you have set a TTL on the Azure IR configuration, then ADF will reuse the compute resources (VMs), but you will still get a brand-new Spark context for each execution.
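    For reference, the TTL mentioned in option 3 lives in the data flow compute properties of the Azure Integration Runtime definition. A minimal sketch of such an IR resource (the name, core count, and TTL value here are illustrative, not from the question):

    ```json
    {
      "name": "DataFlowIR",
      "properties": {
        "type": "Managed",
        "typeProperties": {
          "computeProperties": {
            "location": "AutoResolve",
            "dataFlowProperties": {
              "computeType": "General",
              "coreCount": 8,
              "timeToLive": 10
            }
          }
        }
      }
    }
    ```

    With `timeToLive` set (in minutes), serial data flow activities attached to this IR can reuse the warm VMs instead of paying the cluster start-up cost each time, though, as noted above, each execution still gets a fresh Spark context.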

    All are valid practices and which one you choose should be driven by your requirements for your ETL process.

    No. 3 will likely take the longest time to execute end-to-end. But it does provide a clean separation of operations in each data flow step.

    No. 2 could be more difficult to follow logically and doesn't give you much re-usability.

    No. 1 is really similar to No. 3, except that you run the data flows in parallel. Of course, not every end-to-end process can run in parallel. You may require one data flow to finish before starting the next, in which case you are back in No. 3's serial mode.
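    As a sketch of the difference between options 1 and 3 in pipeline JSON: Execute Data Flow activities with no `dependsOn` entry run in parallel (option 1, one Spark cluster each), while adding a `dependsOn` entry chains them serially (option 3). The pipeline, activity, and data flow names here are illustrative:

    ```json
    {
      "name": "OrchestrationPipeline",
      "properties": {
        "activities": [
          {
            "name": "RunTransformA",
            "type": "ExecuteDataFlow",
            "typeProperties": {
              "dataFlow": { "referenceName": "TransformA", "type": "DataFlowReference" }
            }
          },
          {
            "name": "RunTransformB",
            "type": "ExecuteDataFlow",
            "dependsOn": [
              { "activity": "RunTransformA", "dependencyConditions": [ "Succeeded" ] }
            ],
            "typeProperties": {
              "dataFlow": { "referenceName": "TransformB", "type": "DataFlowReference" }
            }
          }
        ]
      }
    }
    ```

    Deleting the `dependsOn` block on `RunTransformB` would make the two activities eligible to run in parallel, each on its own cluster per option 1.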
