Azure Data Factory - Multiple activities in Pipeline execution order


Question


I have two blob files to copy to Azure SQL tables. Here is my pipeline with its two activities:

{
    "name": "NutrientDataBlobToAzureSqlPipeline",
    "properties": {
        "description": "Copy nutrient data from Azure BLOB to Azure SQL",
        "activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "BlobSource"
                    },
                    "sink": {
                        "type": "SqlSink",
                        "writeBatchSize": 10000,
                        "writeBatchTimeout": "60.00:00:00"
                    }
                },
                "inputs": [
                    {
                        "name": "FoodGroupDescriptionsAzureBlob"
                    }
                ],
                "outputs": [
                    {
                        "name": "FoodGroupDescriptionsSQLAzure"
                    }
                ],
                "policy": {
                    "timeout": "01:00:00",
                    "concurrency": 1,
                    "executionPriorityOrder": "NewestFirst"
                },
                "scheduler": {
                    "frequency": "Minute",
                    "interval": 15
                },
                "name": "FoodGroupDescriptions",
                "description": "#1 Bulk Import FoodGroupDescriptions"
            },
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "BlobSource"
                    },
                    "sink": {
                        "type": "SqlSink",
                        "writeBatchSize": 10000,
                        "writeBatchTimeout": "60.00:00:00"
                    }
                },
                "inputs": [
                    {
                        "name": "FoodDescriptionsAzureBlob"
                    }
                ],
                "outputs": [
                    {
                        "name": "FoodDescriptionsSQLAzure"
                    }
                ],
                "policy": {
                    "timeout": "01:00:00",
                    "concurrency": 1,
                    "executionPriorityOrder": "NewestFirst"
                },
                "scheduler": {
                    "frequency": "Minute",
                    "interval": 15
                },
                "name": "FoodDescriptions",
                "description": "#2 Bulk Import FoodDescriptions"
            }
        ],
        "start": "2015-07-14T00:00:00Z",
        "end": "2015-07-14T00:00:00Z",
        "isPaused": false,
        "hubName": "gymappdatafactory_hub",
        "pipelineMode": "Scheduled"
    }
}

As I understand it, once the first activity is done, the second one starts. How do I execute this pipeline, instead of going to the dataset slices and running them manually? Also, how can I set pipelineMode to OneTime instead of Scheduled?


Answer 1:


In order to have the activities run sequentially (in order), the output dataset of the first activity needs to be an input dataset of the second activity.

{
"name": "NutrientDataBlobToAzureSqlPipeline",
"properties": {
    "description": "Copy nutrient data from Azure BLOB to Azure SQL",
    "activities": [
        {
            "type": "Copy",
            "typeProperties": {
                "source": {
                    "type": "BlobSource"
                },
                "sink": {
                    "type": "SqlSink",
                    "writeBatchSize": 10000,
                    "writeBatchTimeout": "60.00:00:00"
                }
            },
            "inputs": [
                {
                    "name": "FoodGroupDescriptionsAzureBlob"
                }
            ],
            "outputs": [
                {
                    "name": "FoodGroupDescriptionsSQLAzureFirst"
                }
            ],
            "policy": {
                "timeout": "01:00:00",
                "concurrency": 1,
                "executionPriorityOrder": "NewestFirst"
            },
            "scheduler": {
                "frequency": "Minute",
                "interval": 15
            },
            "name": "FoodGroupDescriptions",
            "description": "#1 Bulk Import FoodGroupDescriptions"
        },
        {
            "type": "Copy",
            "typeProperties": {
                "source": {
                    "type": "BlobSource"
                },
                "sink": {
                    "type": "SqlSink",
                    "writeBatchSize": 10000,
                    "writeBatchTimeout": "60.00:00:00"
                }
            },
            "inputs": [
                {
                    "name": "FoodGroupDescriptionsSQLAzureFirst"
                },
                {
                    "name": "FoodDescriptionsAzureBlob"
                }
            ],
            "outputs": [
                {
                    "name": "FoodDescriptionsSQLAzureSecond"
                }
            ],
            "policy": {
                "timeout": "01:00:00",
                "concurrency": 1,
                "executionPriorityOrder": "NewestFirst"
            },
            "scheduler": {
                "frequency": "Minute",
                "interval": 15
            },
            "name": "FoodDescriptions",
            "description": "#2 Bulk Import FoodDescriptions"
        }
    ],
    "start": "2015-07-14T00:00:00Z",
    "end": "2015-07-14T00:00:00Z",
    "isPaused": false,
    "hubName": "gymappdatafactory_hub",
    "pipelineMode": "Scheduled"
}
}

Notice that the output of the first activity, "FoodGroupDescriptionsSQLAzureFirst", becomes an input of the second activity.
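
For this chaining to actually work, "FoodGroupDescriptionsSQLAzureFirst" must also exist as a dataset definition in the data factory. A minimal sketch of what that definition might look like, assuming an Azure SQL table dataset with a hypothetical linked service name AzureSqlLinkedService and a hypothetical table name FoodGroupDescriptions:

{
    "name": "FoodGroupDescriptionsSQLAzureFirst",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": "AzureSqlLinkedService",
        "typeProperties": {
            "tableName": "FoodGroupDescriptions"
        },
        "availability": {
            "frequency": "Minute",
            "interval": 15
        }
    }
}

The availability here (every 15 minutes) is chosen to line up with the scheduler of the activities in the pipeline.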




Answer 2:


If I understand correctly, you want to execute both activities without manually running the dataset slices.

You can do that simply by defining the dataset as external.

As an example:

{
    "name": "FoodGroupDescriptionsAzureBlob",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "AzureBlobStore",
        "typeProperties": {
            "folderPath": "mycontainer/folder",
            "format": {
                "type": "TextFormat",
                "rowDelimiter": "\n",
                "columnDelimiter": "|"
            }
        },
        "external": true,
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}

Observe that the external property is set to true. This moves the dataset slices into the Ready state automatically. Sadly, there is no way to mark the pipeline as run-once. After running the pipeline once, you can set the isPaused property to true to prevent further executions.
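
As a sketch, that just means redeploying the pipeline with the isPaused flag flipped; only the relevant top-level properties from the question's pipeline are shown here, with the activities omitted for brevity:

{
    "name": "NutrientDataBlobToAzureSqlPipeline",
    "properties": {
        "start": "2015-07-14T00:00:00Z",
        "end": "2015-07-14T00:00:00Z",
        "isPaused": true,
        "pipelineMode": "Scheduled"
    }
}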

Note: the external property can be set to true only for input datasets. All activities whose input datasets are marked as external will be executed in parallel.
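
Applied to this pipeline, that means the other blob input, FoodDescriptionsAzureBlob, should be marked external in the same way so its slices also become ready without manual runs. A sketch under the same assumptions as above (the linked service name and folder path are illustrative):

{
    "name": "FoodDescriptionsAzureBlob",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "AzureBlobStore",
        "typeProperties": {
            "folderPath": "mycontainer/folder",
            "format": {
                "type": "TextFormat",
                "rowDelimiter": "\n",
                "columnDelimiter": "|"
            }
        },
        "external": true,
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}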



Source: https://stackoverflow.com/questions/35970079/azure-data-factory-multiple-activities-in-pipeline-execution-order
