How do I store run-time data in Azure Data Factory between pipeline executions?


Question


I have been following Microsoft's tutorial on incrementally (delta) loading data from a SQL Server database.

It uses a watermark (a timestamp) to keep track of rows that have changed since the last run. The tutorial stores the watermark in an Azure SQL database, updating it with a Stored Procedure activity in the pipeline so it can be reused in the next execution.
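
For context, the watermark plumbing in that tutorial boils down to a small table plus an update procedure along these lines (names follow the tutorial's examples; a sketch, not the exact script):

CREATE TABLE watermarktable
(
    TableName varchar(255),
    WatermarkValue datetime
);

-- Called from the pipeline's Stored Procedure activity after each run
CREATE PROCEDURE usp_write_watermark @LastModifiedtime datetime, @TableName varchar(50)
AS
BEGIN
    UPDATE watermarktable
    SET WatermarkValue = @LastModifiedtime
    WHERE TableName = @TableName
END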

It seems overkill to have an Azure SQL database just to store that tiny bit of meta information (my source database is read-only, by the way). I'd rather store it somewhere else in Azure, such as Blob Storage.

In short: is there an easy way of keeping track of this kind of data, or are we limited to stored procedures (or Azure Functions and the like) for this?


Answer 1:


I came across a very similar scenario, and from what I found you can't store watermark information in ADF itself - at least not in a way that you can easily access.

In the end I just created a basic-tier Azure SQL database to store my watermark/config information, on a SQL server I was already using in my pipelines.

The nice thing about this is that when my solution scaled out to multiple business units, all with different databases, I could still maintain watermark information for each of them simply by adding a column that tracks which BU a given watermark row belongs to.
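
A minimal sketch of what such a table can look like (all names here are illustrative, not from an actual solution):

-- One watermark row per business unit and source table (illustrative schema)
CREATE TABLE dbo.WatermarkConfig
(
    BusinessUnit   varchar(50)  NOT NULL,
    TableName      varchar(255) NOT NULL,
    WatermarkValue datetime     NOT NULL,
    CONSTRAINT PK_WatermarkConfig PRIMARY KEY (BusinessUnit, TableName)
);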

Blob storage is indeed a cheaper option, but I've found it requires a little more effort than just using an additional table in an existing database.

I agree it would be really useful to be able to maintain a small dataset in ADF itself for small config items - probably a good suggestion to make to Microsoft!




Answer 2:


There is a way to achieve this with a Copy activity, but getting the latest watermark back in 'LookupOldWaterMarkActivity' is complicated; take the following as a reference.

Dataset settings and Copy activity settings (screenshots in the original answer):

The source and sink use the same dataset. In the Copy activity source, change the expression in the additional columns setting to @{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}
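
In pipeline JSON, that additional-column setting on the Copy activity source looks roughly like this (the activity name matches the answer; the source type and column name are assumptions for a delimited-text dataset):

"source": {
    "type": "DelimitedTextSource",
    "additionalColumns": [
        {
            "name": "Watermark",
            "value": "@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}"
        }
    ]
}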

This way, the watermark is saved as a column in a .txt file. But it is difficult to get the latest watermark back with a Lookup activity, because the output of 'LookupOldWaterMarkActivity' will look like this:

{
    "count": 1,
    "value": [
        {
            "Prop_0": "11/24/2020 02:39:14",
            "Prop_1": "11/24/2020 08:31:42"
        }
    ]
}

The key names are generated by ADF. If you want to get "11/24/2020 08:31:42", you need the column count and then an expression of the form @activity('LookupOldWaterMarkActivity').output.value[0][Prop_(column count - 1)], where Prop_(column count - 1) stands for the dynamically built key name.

How to get the latest watermark:

  1. Use a Get Metadata activity to get the file's columnCount.

  2. Use this expression: @activity('LookupOldWaterMarkActivity').output.value[0][concat('Prop_',string(sub(activity('Get Metadata1').output.columnCount,1)))]
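
Applied to the sample output above, the expression resolves step by step:

    columnCount = 2                                (from Get Metadata1)
    sub(2, 1) = 1
    concat('Prop_', '1') = 'Prop_1'
    value[0]['Prop_1'] = "11/24/2020 08:31:42"     (the latest watermark)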



Source: https://stackoverflow.com/questions/64971514/how-do-i-store-run-time-data-in-azure-data-factory-between-pipeline-executions
