Azure Data Factory - Multiple activities in Pipeline execution order


Question


I have two blob files to copy to Azure SQL tables. Here is my pipeline with its two activities:

{
    "name": "NutrientDataBlobToAzureSqlPipeline",
    "properties": {
        "description": "Copy nutrient data from Azure BLOB to Azure SQL",
        "activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "BlobSource"
                    },
                    "sink": {
                        "type": "SqlSink",
                        "writeBatchSize": 10000,
                        "writeBatchTimeout": "60.00:00:00"
                    }
                },
                "inputs": [
                    {
                        "name": "FoodGroupDescriptionsAzureBlob"
                    }
                ],
                "outputs": [
                    {
                        "name": "FoodGroupDescriptionsSQLAzure"
                    }
                ],
                "policy": {
                    "timeout": "01:00:00",
                    "concurrency": 1,
                    "executionPriorityOrder": "NewestFirst"
                },
                "scheduler": {
                    "frequency": "Minute",
                    "interval": 15
                },
                "name": "FoodGroupDescriptions",
                "description": "#1 Bulk Import FoodGroupDescriptions"
            },
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "BlobSource"
                    },
                    "sink": {
                        "type": "SqlSink",
                        "writeBatchSize": 10000,
                        "writeBatchTimeout": "60.00:00:00"
                    }
                },
                "inputs": [
                    {
                        "name": "FoodDescriptionsAzureBlob"
                    }
                ],
                "outputs": [
                    {
                        "name": "FoodDescriptionsSQLAzure"
                    }
                ],
                "policy": {
                    "timeout": "01:00:00",
                    "concurrency": 1,
                    "executionPriorityOrder": "NewestFirst"
                },
                "scheduler": {
                    "frequency": "Minute",
                    "interval": 15
                },
                "name": "FoodDescriptions",
                "description": "#2 Bulk Import FoodDescriptions"
            }
        ],
        "start": "2015-07-14T00:00:00Z",
        "end": "2015-07-14T00:00:00Z",
        "isPaused": false,
        "hubName": "gymappdatafactory_hub",
        "pipelineMode": "Scheduled"
    }
}

As I understand it, once the first activity is done, the second one starts. How do I execute this pipeline, instead of going to the dataset slices and running them manually? Also, how can I set pipelineMode to OneTime instead of Scheduled?


Answer 1:


In order to have the activities run sequentially (in order), the output dataset of the first activity needs to be an input dataset of the second activity.

{
"name": "NutrientDataBlobToAzureSqlPipeline",
"properties": {
    "description": "Copy nutrient data from Azure BLOB to Azure SQL",
    "activities": [
        {
            "type": "Copy",
            "typeProperties": {
                "source": {
                    "type": "BlobSource"
                },
                "sink": {
                    "type": "SqlSink",
                    "writeBatchSize": 10000,
                    "writeBatchTimeout": "60.00:00:00"
                }
            },
            "inputs": [
                {
                    "name": "FoodGroupDescriptionsAzureBlob"
                }
            ],
            "outputs": [
                {
                    "name": "FoodGroupDescriptionsSQLAzureFirst"
                }
            ],
            "policy": {
                "timeout": "01:00:00",
                "concurrency": 1,
                "executionPriorityOrder": "NewestFirst"
            },
            "scheduler": {
                "frequency": "Minute",
                "interval": 15
            },
            "name": "FoodGroupDescriptions",
            "description": "#1 Bulk Import FoodGroupDescriptions"
        },
        {
            "type": "Copy",
            "typeProperties": {
                "source": {
                    "type": "BlobSource"
                },
                "sink": {
                    "type": "SqlSink",
                    "writeBatchSize": 10000,
                    "writeBatchTimeout": "60.00:00:00"
                }
            },
            "inputs": [
                {
                    "name": "FoodGroupDescriptionsSQLAzureFirst"
                },
                {
                    "name": "FoodDescriptionsAzureBlob"
                }
            ],
            "outputs": [
                {
                    "name": "FoodDescriptionsSQLAzureSecond"
                }
            ],
            "policy": {
                "timeout": "01:00:00",
                "concurrency": 1,
                "executionPriorityOrder": "NewestFirst"
            },
            "scheduler": {
                "frequency": "Minute",
                "interval": 15
            },
            "name": "FoodDescriptions",
            "description": "#2 Bulk Import FoodDescriptions"
        }
    ],
    "start": "2015-07-14T00:00:00Z",
    "end": "2015-07-14T00:00:00Z",
    "isPaused": false,
    "hubName": "gymappdatafactory_hub",
    "pipelineMode": "Scheduled"
}
}

Notice that the output of the first activity, "FoodGroupDescriptionsSQLAzureFirst", becomes an input of the second activity.
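
For this chaining to actually work, "FoodGroupDescriptionsSQLAzureFirst" must also exist as a dataset definition in the data factory. A minimal sketch of what that definition might look like, assuming an Azure SQL table dataset with a hypothetical linked service name AzureSqlLinkedService and a hypothetical table name FoodGroupDescriptions:

{
    "name": "FoodGroupDescriptionsSQLAzureFirst",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": "AzureSqlLinkedService",
        "typeProperties": {
            "tableName": "FoodGroupDescriptions"
        },
        "availability": {
            "frequency": "Minute",
            "interval": 15
        }
    }
}

The availability here (every 15 minutes) is chosen to line up with the scheduler of the activities in the pipeline.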




Answer 2:


If I understand correctly, you want to execute both activities without manually running the dataset slices.

You can do that simply by defining the dataset as external.

As an example:

{
    "name": "FoodGroupDescriptionsAzureBlob",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "AzureBlobStore",
        "typeProperties": {
            "folderPath": "mycontainer/folder",
            "format": {
                "type": "TextFormat",
                "rowDelimiter": "\n",
                "columnDelimiter": "|"
            }
        },
        "external": true,
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}

Observe that the external property is set to true. This moves the dataset slices into the Ready state automatically. Sadly, there is no way to mark the pipeline as run-once. After running the pipeline once, you can set the isPaused property to true to prevent further executions.
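
As a sketch, that just means redeploying the pipeline with the isPaused flag flipped; only the relevant top-level properties from the question's pipeline are shown here, with the activities omitted for brevity:

{
    "name": "NutrientDataBlobToAzureSqlPipeline",
    "properties": {
        "start": "2015-07-14T00:00:00Z",
        "end": "2015-07-14T00:00:00Z",
        "isPaused": true,
        "pipelineMode": "Scheduled"
    }
}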

Note: the external property can be set to true only for input datasets. All activities whose input datasets are marked as external will be executed in parallel.
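
Applied to this pipeline, that means the other blob input, FoodDescriptionsAzureBlob, should be marked external in the same way so its slices also become ready without manual runs. A sketch under the same assumptions as above (the linked service name and folder path are illustrative):

{
    "name": "FoodDescriptionsAzureBlob",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "AzureBlobStore",
        "typeProperties": {
            "folderPath": "mycontainer/folder",
            "format": {
                "type": "TextFormat",
                "rowDelimiter": "\n",
                "columnDelimiter": "|"
            }
        },
        "external": true,
        "availability": {
            "frequency": "Day",
            "interval": 1
        }
    }
}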



Source: https://stackoverflow.com/questions/35970079/azure-data-factory-multiple-activities-in-pipeline-execution-order
