How to copy CosmosDb docs to Blob storage (each doc in single json file) with Azure Data Factory

清歌不尽 2021-01-21 05:12

I'm trying to back up my Cosmos DB storage using Azure Data Factory (v2). In general, it's doing its job, but I want each doc in the Cosmos collection to correspond to a new JSON file

4 answers
  •  一向 (original poster)
    2021-01-21 05:40

    I also struggled a bit with this, especially getting around the size limits of the Lookup activity, since we have a LOT of data to migrate. I ended up creating a JSON file with a list of timestamp ranges to query the Cosmos data with. For each range, I get the document IDs in that range, and for each of those IDs, I get the full document data and save it to a path such as PartitionKey/DocumentID. Here are the pipelines I created:

    LookupTimestamps - loops through each timestamp range from a times.json file, and for each range, executes the ExportFromCosmos pipeline

    {
        "name": "LookupTimestamps",
        "properties": {
            "activities": [
                {
                    "name": "LookupTimestamps",
                    "type": "Lookup",
                    "policy": {
                        "timeout": "7.00:00:00",
                        "retry": 0,
                        "retryIntervalInSeconds": 30,
                        "secureOutput": false,
                        "secureInput": false
                    },
                    "typeProperties": {
                        "source": {
                            "type": "BlobSource",
                            "recursive": false
                        },
                        "dataset": {
                            "referenceName": "BlobStorageTimestamps",
                            "type": "DatasetReference"
                        },
                        "firstRowOnly": false
                    }
                },
                {
                    "name": "ForEachTimestamp",
                    "type": "ForEach",
                    "dependsOn": [
                        {
                            "activity": "LookupTimestamps",
                            "dependencyConditions": [
                                "Succeeded"
                            ]
                        }
                    ],
                    "typeProperties": {
                        "items": {
                            "value": "@activity('LookupTimestamps').output.value",
                            "type": "Expression"
                        },
                        "isSequential": false,
                        "activities": [
                            {
                                "name": "Execute Pipeline1",
                                "type": "ExecutePipeline",
                                "typeProperties": {
                                    "pipeline": {
                                        "referenceName": "ExportFromCosmos",
                                        "type": "PipelineReference"
                                    },
                                    "waitOnCompletion": true,
                                    "parameters": {
                                        "from": {
                                            "value": "@{item().From}",
                                            "type": "Expression"
                                        },
                                        "to": {
                                            "value": "@{item().To}",
                                            "type": "Expression"
                                        }
                                    }
                                }
                            }
                        ]
                    }
                }
            ]
        },
        "type": "Microsoft.DataFactory/factories/pipelines"
    }
    

    ExportFromCosmos - nested pipeline that's executed from the above pipeline. This gets around the fact that you can't put a ForEach activity directly inside another ForEach.

    {
        "name": "ExportFromCosmos",
        "properties": {
            "activities": [
                {
                    "name": "LookupDocuments",
                    "type": "Lookup",
                    "policy": {
                        "timeout": "7.00:00:00",
                        "retry": 0,
                        "retryIntervalInSeconds": 30,
                        "secureOutput": false,
                        "secureInput": false
                    },
                    "typeProperties": {
                        "source": {
                            "type": "DocumentDbCollectionSource",
                            "query": {
                                "value": "select c.id, c.partitionKey from c where c._ts >= @{pipeline().parameters.from} and c._ts <= @{pipeline().parameters.to} order by c._ts desc",
                                "type": "Expression"
                            },
                            "nestingSeparator": "."
                        },
                        "dataset": {
                            "referenceName": "CosmosDb",
                            "type": "DatasetReference"
                        },
                        "firstRowOnly": false
                    }
                },
                {
                    "name": "ForEachDocument",
                    "type": "ForEach",
                    "dependsOn": [
                        {
                            "activity": "LookupDocuments",
                            "dependencyConditions": [
                                "Succeeded"
                            ]
                        }
                    ],
                    "typeProperties": {
                        "items": {
                            "value": "@activity('LookupDocuments').output.value",
                            "type": "Expression"
                        },
                        "activities": [
                            {
                                "name": "Copy1",
                                "type": "Copy",
                                "policy": {
                                    "timeout": "7.00:00:00",
                                    "retry": 0,
                                    "retryIntervalInSeconds": 30,
                                    "secureOutput": false,
                                    "secureInput": false
                                },
                                "typeProperties": {
                                    "source": {
                                        "type": "DocumentDbCollectionSource",
                                        "query": {
                                            "value": "select * from c where c.id = \"@{item().id}\" and c.partitionKey = \"@{item().partitionKey}\"",
                                            "type": "Expression"
                                        },
                                        "nestingSeparator": "."
                                    },
                                    "sink": {
                                        "type": "BlobSink"
                                    },
                                    "enableStaging": false
                                },
                                "inputs": [
                                    {
                                        "referenceName": "CosmosDb",
                                        "type": "DatasetReference"
                                    }
                                ],
                                "outputs": [
                                    {
                                        "referenceName": "BlobStorageDocuments",
                                        "type": "DatasetReference",
                                        "parameters": {
                                            "id": {
                                                "value": "@item().id",
                                                "type": "Expression"
                                            },
                                            "partitionKey": {
                                                "value": "@item().partitionKey",
                                                "type": "Expression"
                                            }
                                        }
                                    }
                                ]
                            }
                        ]
                    }
                }
            ],
            "parameters": {
                "from": {
                    "type": "int"
                },
                "to": {
                    "type": "int"
                }
            }
        }
    }
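
To make the control flow of the two pipelines easier to follow, here is a rough Python sketch that mimics it on an in-memory list of documents. It is not ADF code: `export_documents` and the sample data shape are made up for illustration, and the real ForEachTimestamp runs its iterations in parallel rather than in order.

```python
def export_documents(docs, ranges):
    """Mimic LookupTimestamps -> ExportFromCosmos on a list of dicts.

    docs   -- hypothetical stand-in for the Cosmos collection
    ranges -- same shape as times.json: [{"From": ..., "To": ...}, ...]
    """
    exported = []
    for r in ranges:  # ForEachTimestamp (sequential here, parallel in ADF)
        # LookupDocuments: select only id and partitionKey, newest first
        matches = sorted(
            (d for d in docs if r["From"] <= d["_ts"] <= r["To"]),
            key=lambda d: d["_ts"],
            reverse=True,
        )
        keys = [(d["id"], d["partitionKey"]) for d in matches]
        for doc_id, pk in keys:  # ForEachDocument
            # Copy1: fetch the full document by id and partitionKey
            full = [d for d in docs
                    if d["id"] == doc_id and d["partitionKey"] == pk]
            exported.append((pk, doc_id, full))
    return exported
```

The point of the first, narrow query is that the Lookup only has to carry two small fields per document, while the full document is read later, one at a time, by the Copy activity.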
    

    BlobStorageTimestamps - dataset for the times.json file

    {
        "name": "BlobStorageTimestamps",
        "properties": {
            "linkedServiceName": {
                "referenceName": "AzureBlobStorage1",
                "type": "LinkedServiceReference"
            },
            "type": "AzureBlob",
            "typeProperties": {
                "format": {
                    "type": "JsonFormat",
                    "filePattern": "arrayOfObjects"
                },
                "fileName": "times.json",
                "folderPath": "mycollection"
            }
        },
        "type": "Microsoft.DataFactory/factories/datasets"
    }
    

    BlobStorageDocuments - dataset for where the documents will be saved

    {
        "name": "BlobStorageDocuments",
        "properties": {
            "linkedServiceName": {
                "referenceName": "AzureBlobStorage1",
                "type": "LinkedServiceReference"
            },
            "parameters": {
                "id": {
                    "type": "string"
                },
                "partitionKey": {
                    "type": "string"
                }
            },
            "type": "AzureBlob",
            "typeProperties": {
                "format": {
                    "type": "JsonFormat",
                    "filePattern": "arrayOfObjects"
                },
                "fileName": {
                    "value": "@{dataset().partitionKey}/@{dataset().id}.json",
                    "type": "Expression"
                },
                "folderPath": "mycollection"
            }
        },
        "type": "Microsoft.DataFactory/factories/datasets"
    }
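
As a quick way to check where a given document should end up, this small Python sketch mirrors the `folderPath` and `fileName` expression from the dataset above. `render_blob` is a hypothetical helper, and wrapping the document in a one-element array is my assumption about what `arrayOfObjects` produces for a single row.

```python
import json


def render_blob(doc, folder="mycollection"):
    """Return (blob path, file body) for one document.

    Mirrors folderPath + "@{dataset().partitionKey}/@{dataset().id}.json".
    """
    path = f"{folder}/{doc['partitionKey']}/{doc['id']}.json"
    # Assumption: arrayOfObjects writes the single row as a one-element array.
    body = json.dumps([doc])
    return path, body
```

For a document with `id` "42" and `partitionKey` "tenantA", this gives the path `mycollection/tenantA/42.json`. Keep in mind that a `/` inside an id or partition key would show up as an extra virtual folder in the blob name.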
    

    The times.json file is just a list of epoch time ranges (in seconds, the same unit as Cosmos DB's _ts) and looks like this:

    [{
        "From": 1556150400,
        "To": 1556236799
    },
    {
        "From": 1556236800,
        "To": 1556323199
    }]
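
Writing these ranges by hand gets tedious for a long history, so here is a minimal Python sketch for generating them. It assumes one-day windows in UTC, which matches the two entries above. `daily_ranges` is a hypothetical helper name, and the window size is something you would tune so that each Lookup stays under its limits.

```python
import json
from datetime import datetime, timedelta, timezone


def daily_ranges(start, end):
    """Return [{"From": ..., "To": ...}] covering [start, end) in one-day windows.

    Values are epoch seconds. "To" is one second before the next "From",
    because the Cosmos query uses >= and <= (both ends inclusive).
    """
    ranges = []
    cursor = start
    while cursor < end:
        nxt = min(cursor + timedelta(days=1), end)
        ranges.append({
            "From": int(cursor.timestamp()),
            "To": int(nxt.timestamp()) - 1,
        })
        cursor = nxt
    return ranges


if __name__ == "__main__":
    start = datetime(2019, 4, 25, tzinfo=timezone.utc)
    end = datetime(2019, 4, 27, tzinfo=timezone.utc)
    # Print the content for times.json; redirect to a file and upload it.
    print(json.dumps(daily_ranges(start, end), indent=4))
```

If one day still holds more documents than a single Lookup can return, use a smaller `timedelta` for that period.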
    
