How to copy CosmosDb docs to Blob storage (each doc in single json file) with Azure Data Factory

清歌不尽 2021-01-21 05:12

I'm trying to back up my Cosmos DB storage using Azure Data Factory (v2). In general it's doing its job, but I want each document in the Cosmos collection to correspond to a new JSON file in Blob storage.

4 Answers
  • 2021-01-21 05:31

    Have you considered implementing this in a different way using Azure Functions? ADF is designed for moving data in bulk from one place to another and only generates a single file per collection.

    You could consider having an Azure Function that is triggered when documents are added / updated in your collection and have the Azure Function output the document to blob storage. This should scale well and would be relatively easy to implement.
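    The core of such a function, independent of the trigger wiring, is just mapping each changed document to its own blob name and payload. A minimal sketch (Python assumed; the docs_to_blobs helper and the partitionKey/id naming scheme are hypothetical, not part of the answer above):

```python
import json

def docs_to_blobs(documents):
    """Map a batch of Cosmos DB documents (dicts) to blob-name/payload pairs.

    Hypothetical naming scheme: <partitionKey>/<id>.json, one blob per document.
    """
    blobs = {}
    for doc in documents:
        name = "{}/{}.json".format(doc["partitionKey"], doc["id"])
        blobs[name] = json.dumps(doc, indent=2)
    return blobs

# Inside the Azure Function itself (binding setup omitted), each pair would
# then be uploaded with the azure-storage-blob SDK, e.g.:
#   container_client.upload_blob(name, payload, overwrite=True)
```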

  • 2021-01-21 05:40

    I also struggled a bit with this, especially getting around the size limits of the Lookup activity, since we have a LOT of data to migrate. I ended up creating a JSON file with a list of timestamps to query the Cosmos data with, then for each of those, getting the document IDs in that range, and then for each of those, getting the full document data and saving it to a path such as PartitionKey/DocumentID. Here are the pipelines I created:

    LookupTimestamps - loops through each timestamp range from a times.json file, and for each timestamp, executes the ExportFromCosmos pipeline

    {
        "name": "LookupTimestamps",
        "properties": {
            "activities": [
                {
                    "name": "LookupTimestamps",
                    "type": "Lookup",
                    "policy": {
                        "timeout": "7.00:00:00",
                        "retry": 0,
                        "retryIntervalInSeconds": 30,
                        "secureOutput": false,
                        "secureInput": false
                    },
                    "typeProperties": {
                        "source": {
                            "type": "BlobSource",
                            "recursive": false
                        },
                        "dataset": {
                            "referenceName": "BlobStorageTimestamps",
                            "type": "DatasetReference"
                        },
                        "firstRowOnly": false
                    }
                },
                {
                    "name": "ForEachTimestamp",
                    "type": "ForEach",
                    "dependsOn": [
                        {
                            "activity": "LookupTimestamps",
                            "dependencyConditions": [
                                "Succeeded"
                            ]
                        }
                    ],
                    "typeProperties": {
                        "items": {
                            "value": "@activity('LookupTimestamps').output.value",
                            "type": "Expression"
                        },
                        "isSequential": false,
                        "activities": [
                            {
                                "name": "Execute Pipeline1",
                                "type": "ExecutePipeline",
                                "typeProperties": {
                                    "pipeline": {
                                        "referenceName": "ExportFromCosmos",
                                        "type": "PipelineReference"
                                    },
                                    "waitOnCompletion": true,
                                    "parameters": {
                                        "from": {
                                            "value": "@{item().From}",
                                            "type": "Expression"
                                        },
                                        "to": {
                                            "value": "@{item().To}",
                                            "type": "Expression"
                                        }
                                    }
                                }
                            }
                        ]
                    }
                }
            ]
        },
        "type": "Microsoft.DataFactory/factories/pipelines"
    }
    

    ExportFromCosmos - nested pipeline that's executed from the above pipeline. This is to get around the fact that you can't nest ForEach activities.

    {
        "name": "ExportFromCosmos",
        "properties": {
            "activities": [
                {
                    "name": "LookupDocuments",
                    "type": "Lookup",
                    "policy": {
                        "timeout": "7.00:00:00",
                        "retry": 0,
                        "retryIntervalInSeconds": 30,
                        "secureOutput": false,
                        "secureInput": false
                    },
                    "typeProperties": {
                        "source": {
                            "type": "DocumentDbCollectionSource",
                            "query": {
                                "value": "select c.id, c.partitionKey from c where c._ts >= @{pipeline().parameters.from} and c._ts <= @{pipeline().parameters.to} order by c._ts desc",
                                "type": "Expression"
                            },
                            "nestingSeparator": "."
                        },
                        "dataset": {
                            "referenceName": "CosmosDb",
                            "type": "DatasetReference"
                        },
                        "firstRowOnly": false
                    }
                },
                {
                    "name": "ForEachDocument",
                    "type": "ForEach",
                    "dependsOn": [
                        {
                            "activity": "LookupDocuments",
                            "dependencyConditions": [
                                "Succeeded"
                            ]
                        }
                    ],
                    "typeProperties": {
                        "items": {
                            "value": "@activity('LookupDocuments').output.value",
                            "type": "Expression"
                        },
                        "activities": [
                            {
                                "name": "Copy1",
                                "type": "Copy",
                                "policy": {
                                    "timeout": "7.00:00:00",
                                    "retry": 0,
                                    "retryIntervalInSeconds": 30,
                                    "secureOutput": false,
                                    "secureInput": false
                                },
                                "typeProperties": {
                                    "source": {
                                        "type": "DocumentDbCollectionSource",
                                        "query": {
                                            "value": "select * from c where c.id = \"@{item().id}\" and c.partitionKey = \"@{item().partitionKey}\"",
                                            "type": "Expression"
                                        },
                                        "nestingSeparator": "."
                                    },
                                    "sink": {
                                        "type": "BlobSink"
                                    },
                                    "enableStaging": false
                                },
                                "inputs": [
                                    {
                                        "referenceName": "CosmosDb",
                                        "type": "DatasetReference"
                                    }
                                ],
                                "outputs": [
                                    {
                                        "referenceName": "BlobStorageDocuments",
                                        "type": "DatasetReference",
                                        "parameters": {
                                            "id": {
                                                "value": "@item().id",
                                                "type": "Expression"
                                            },
                                            "partitionKey": {
                                                "value": "@item().partitionKey",
                                                "type": "Expression"
                                            }
                                        }
                                    }
                                ]
                            }
                        ]
                    }
                }
            ],
            "parameters": {
                "from": {
                    "type": "int"
                },
                "to": {
                    "type": "int"
                }
            }
        }
    }
    

    BlobStorageTimestamps - dataset for the times.json file

    {
        "name": "BlobStorageTimestamps",
        "properties": {
            "linkedServiceName": {
                "referenceName": "AzureBlobStorage1",
                "type": "LinkedServiceReference"
            },
            "type": "AzureBlob",
            "typeProperties": {
                "format": {
                    "type": "JsonFormat",
                    "filePattern": "arrayOfObjects"
                },
                "fileName": "times.json",
                "folderPath": "mycollection"
            }
        },
        "type": "Microsoft.DataFactory/factories/datasets"
    }
    

    BlobStorageDocuments - dataset for where the documents will be saved

    {
        "name": "BlobStorageDocuments",
        "properties": {
            "linkedServiceName": {
                "referenceName": "AzureBlobStorage1",
                "type": "LinkedServiceReference"
            },
            "parameters": {
                "id": {
                    "type": "string"
                },
                "partitionKey": {
                    "type": "string"
                }
            },
            "type": "AzureBlob",
            "typeProperties": {
                "format": {
                    "type": "JsonFormat",
                    "filePattern": "arrayOfObjects"
                },
                "fileName": {
                    "value": "@{dataset().partitionKey}/@{dataset().id}.json",
                    "type": "Expression"
                },
                "folderPath": "mycollection"
            }
        },
        "type": "Microsoft.DataFactory/factories/datasets"
    }
    

    The times.json file is just a list of epoch time ranges and looks like this:

    [{
        "From": 1556150400,
        "To": 1556236799
    },
    {
        "From": 1556236800,
        "To": 1556323199
    }]
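    A file of such ranges can be generated with a short script. A sketch (day-long windows assumed; the make_ranges helper is mine, not part of the pipelines above):

```python
import json
from datetime import datetime, timedelta, timezone

def make_ranges(start, end, window=timedelta(days=1)):
    """Yield {"From": ..., "To": ...} epoch-second windows covering [start, end)."""
    cur = start
    while cur < end:
        nxt = min(cur + window, end)
        # "To" is inclusive, so end each window one second before the next starts
        yield {"From": int(cur.timestamp()), "To": int(nxt.timestamp()) - 1}
        cur = nxt

start = datetime(2019, 4, 25, tzinfo=timezone.utc)
end = datetime(2019, 4, 27, tzinfo=timezone.utc)
ranges = list(make_ranges(start, end))
print(json.dumps(ranges, indent=4))  # same shape as the times.json sample above
```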
    
  • 2021-01-21 05:42

    Since your Cosmos DB documents contain arrays and ADF doesn't support serializing arrays for Cosmos DB, this is the workaround I can offer.

    First, export all your documents to JSON files with an as-is export (to Blob, ADLS, or any file storage). I think you already know how to do that. This way, each collection ends up in one JSON file.

    Second, process each JSON file, extracting each row in the file into its own file.

    I only provide the pipeline for step 2. You could use an Execute Pipeline activity to chain steps 1 and 2, and you could even handle all the collections in step 2 with a ForEach activity.
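    For comparison, step 2 done outside of ADF is only a few lines. A sketch (paths and the split_export helper are hypothetical) that splits one exported JSON array file into per-document files:

```python
import json
import os

def split_export(src_path, out_dir):
    """Split an exported JSON array file into one <id>.json file per document."""
    with open(src_path) as f:
        docs = json.load(f)  # an as-is export is a JSON array of documents
    os.makedirs(out_dir, exist_ok=True)
    for doc in docs:
        with open(os.path.join(out_dir, "{}.json".format(doc["id"])), "w") as out:
            json.dump(doc, out, indent=2)
    return len(docs)
```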

    Pipeline json

    {
        "name": "pipeline27",
        "properties": {
            "activities": [
                {
                    "name": "Lookup1",
                    "type": "Lookup",
                    "policy": {
                        "timeout": "7.00:00:00",
                        "retry": 0,
                        "retryIntervalInSeconds": 30,
                        "secureOutput": false
                    },
                    "typeProperties": {
                        "source": {
                            "type": "BlobSource",
                            "recursive": true
                        },
                        "dataset": {
                            "referenceName": "AzureBlob7",
                            "type": "DatasetReference"
                        },
                        "firstRowOnly": false
                    }
                },
                {
                    "name": "ForEach1",
                    "type": "ForEach",
                    "dependsOn": [
                        {
                            "activity": "Lookup1",
                            "dependencyConditions": [
                                "Succeeded"
                            ]
                        }
                    ],
                    "typeProperties": {
                        "items": {
                            "value": "@activity('Lookup1').output.value",
                            "type": "Expression"
                        },
                        "activities": [
                            {
                                "name": "Copy1",
                                "type": "Copy",
                                "policy": {
                                    "timeout": "7.00:00:00",
                                    "retry": 0,
                                    "retryIntervalInSeconds": 30,
                                    "secureOutput": false
                                },
                                "typeProperties": {
                                    "source": {
                                        "type": "DocumentDbCollectionSource",
                                        "query": {
                                            "value": "select @{item()}",
                                            "type": "Expression"
                                        },
                                        "nestingSeparator": "."
                                    },
                                    "sink": {
                                        "type": "BlobSink"
                                    },
                                    "enableStaging": false,
                                    "cloudDataMovementUnits": 0
                                },
                                "inputs": [
                                    {
                                        "referenceName": "DocumentDbCollection1",
                                        "type": "DatasetReference"
                                    }
                                ],
                                "outputs": [
                                    {
                                        "referenceName": "AzureBlob6",
                                        "type": "DatasetReference",
                                        "parameters": {
                                            "id": {
                                                "value": "@item().id",
                                                "type": "Expression"
                                            },
                                            "PartitionKey": {
                                                "value": "@item().PartitionKey",
                                                "type": "Expression"
                                            }
                                        }
                                    }
                                ]
                            }
                        ]
                    }
                }
            ]
        },
        "type": "Microsoft.DataFactory/factories/pipelines"
    }

    dataset json for lookup

    {
        "name": "AzureBlob7",
        "properties": {
            "linkedServiceName": {
                "referenceName": "bloblinkedservice",
                "type": "LinkedServiceReference"
            },
            "type": "AzureBlob",
            "typeProperties": {
                "format": {
                    "type": "JsonFormat",
                    "filePattern": "arrayOfObjects"
                },
                "fileName": "cosmos.json",
                "folderPath": "aaa"
            }
        },
        "type": "Microsoft.DataFactory/factories/datasets"
    }

    Source dataset for the copy. Actually, this dataset has no real use; it only exists to host the query (select @{item()}).

    {
        "name": "DocumentDbCollection1",
        "properties": {
            "linkedServiceName": {
                "referenceName": "CosmosDB-r8c",
                "type": "LinkedServiceReference"
            },
            "type": "DocumentDbCollection",
            "typeProperties": {
                "collectionName": "test"
            }
        },
        "type": "Microsoft.DataFactory/factories/datasets"
    }

    Destination dataset. With its two parameters, it also addresses your file-name requirement.

    {
        "name": "AzureBlob6",
        "properties": {
            "linkedServiceName": {
                "referenceName": "AzureStorage-eastus",
                "type": "LinkedServiceReference"
            },
            "parameters": {
                "id": {
                    "type": "String"
                },
                "PartitionKey": {
                    "type": "String"
                }
            },
            "type": "AzureBlob",
            "typeProperties": {
                "format": {
                    "type": "JsonFormat",
                    "filePattern": "setOfObjects"
                },
                "fileName": {
                    "value": "@{dataset().PartitionKey}-@{dataset().id}.json",
                    "type": "Expression"
                },
                "folderPath": "aaacosmos"
            }
        },
        "type": "Microsoft.DataFactory/factories/datasets"
    }

    Please also note the limitations of the Lookup activity: it can return at most 5000 rows and up to 2 MB of data, and the maximum duration of a Lookup activity before timeout is currently one hour.
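    One way to stay under that row cap is to split any timestamp window that would return too many rows. A sketch (safe_windows is mine; count_docs stands in for a callback that asks Cosmos DB how many documents fall in a window):

```python
def safe_windows(start, end, count_docs, max_rows=5000):
    """Recursively halve inclusive [start, end] epoch windows until each
    window holds at most max_rows documents, per count_docs(start, end)."""
    if count_docs(start, end) <= max_rows or end - start <= 1:
        return [{"From": start, "To": end}]
    mid = (start + end) // 2
    return (safe_windows(start, mid, count_docs, max_rows)
            + safe_windows(mid + 1, end, count_docs, max_rows))

# Example with a fake density of one document per second:
windows = safe_windows(0, 19999, lambda s, e: e - s + 1)
```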

  • 2021-01-21 05:45

    Just take one collection as an example.

    And inside the foreach:

    And your lookup and copy activity source datasets reference the same Cosmos DB dataset.

    If you want to copy all 5 of your collections, you could put this pipeline inside an Execute Pipeline activity, and have the master pipeline run it from a ForEach activity.
