Microsoft Cosmos DB for MongoDB: merge an unsharded collection into a sharded one

Question

I have two collections of similar documents (i.e. the same object with different values). One collection (X) is unsharded and lives in database A; the other (Y) is sharded and lives in database B. When I try to copy collection X into database B, I get an error saying "Shared throughput collection should have a partition key". I also tried copying the data with a forEach insert, but it takes too long.

So my question is, how can I append the data from collection X to collection Y in an efficient way?

The MongoDB version on Cosmos DB is 3.4.6.

Answer 1:

You can run an aggregation on the source collection and add a $merge stage as its last stage.

| $merge                                | $out                                       |
| ------------------------------------- | ------------------------------------------ |
| Can output to a sharded collection.   | Cannot output to a sharded collection.     |
| Input collection can also be sharded. | Input collection, however, can be sharded. |

https://docs.mongodb.com/manual/reference/operator/aggregation/merge/#comparison-with-out
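
As a rough illustration of this suggestion, the sketch below builds a one-stage pipeline ending in $merge and runs it through a pymongo-style collection object. The helper names, the target names "B" and "Y", and the choice of whenMatched="keepExisting" are assumptions made for the example, not part of the answer. Note that $merge requires MongoDB 4.2 or later, so it would not run against a 3.4.6 server.

```python
def build_merge_pipeline(target_db, target_coll):
    """Return an aggregation pipeline whose last (and only) stage is $merge."""
    return [
        {
            "$merge": {
                # Target may live in a different database and may be sharded.
                "into": {"db": target_db, "coll": target_coll},
                # Assumption: keep the target's document when _id already exists.
                "whenMatched": "keepExisting",
                "whenNotMatched": "insert",
            }
        }
    ]


def append_with_merge(source_collection, target_db, target_coll):
    """Run the pipeline on a pymongo-style collection (anything with .aggregate())."""
    pipeline = build_merge_pipeline(target_db, target_coll)
    # $merge returns no documents; draining the cursor just completes the call.
    list(source_collection.aggregate(pipeline))
    return pipeline
```

With pymongo this would be called as append_with_merge(client["A"]["X"], "B", "Y"), where client is an already connected MongoClient.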

Answer 2:

So my question is, how can I append the data from collection X to collection Y in an efficient way?

The server tools mongodump and mongorestore can be used. You can export the source collection's data into BSON dump files and import them into the target collection. These processes are very quick because the data in the database is already in BSON format.

Data can be exported from a non-sharded collection and imported into a sharded collection using these tools. In this case, every document in the source collection must have the shard-key field (or fields) with values. Note that these tools also export and import the indexes of the source collection.
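
Before dumping, it may be worth checking that every source document actually carries the shard-key field. The following is a minimal sketch of such a check over plain dictionaries; the function name and the "region" field are hypothetical, and only top-level fields are handled.

```python
def find_docs_missing_shard_key(docs, shard_key_fields):
    """Return the _id of each document that lacks a value for any shard-key field."""
    missing = []
    for doc in docs:
        for field in shard_key_fields:
            # A field that is absent or set to None counts as "no value".
            if doc.get(field) is None:
                missing.append(doc.get("_id"))
                break
    return missing


sample = [
    {"_id": 1, "region": "eu"},
    {"_id": 2},                  # field absent
    {"_id": 3, "region": None},  # field present but empty
]
print(find_docs_missing_shard_key(sample, ["region"]))  # prints [2, 3]
```

Any _id reported here would need a shard-key value added before the import.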

Here is an example of the scenario in question:

mongodump --db=srcedb --collection=srcecoll --out="C:\mongo\dumps"

This creates a dump directory named after the database inside the output directory. It contains a "srcecoll.bson" file, which is used for the import.

mongorestore --port 26xxxx --db=trgtdb --collection=trgtcoll --dir="C:\mongo\dumps\srcedb\srcecoll.bson"

The host/port must point to the mongos of the sharded cluster. Note that the BSON file name needs to be specified in the --dir option.

The import adds data and indexes to the existing sharded collection. The process only inserts data; existing documents cannot be updated. If an _id value from the source collection already exists in the target collection, the process does not overwrite that document: the source document is simply not imported, and this is not treated as an error.

mongorestore has some useful options, such as --noIndexRestore and --dryRun.
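
The two commands can also be assembled from a script. The sketch below only builds the argument lists, using the flags mentioned in this answer; the function names and the port value in the usage note are placeholders, and the BSON path assumes mongodump's layout of <out>/<database>/<collection>.bson.

```python
import subprocess
from pathlib import Path


def build_dump_cmd(db, coll, out_dir):
    """Argument list for dumping one collection to BSON."""
    return ["mongodump", f"--db={db}", f"--collection={coll}", f"--out={out_dir}"]


def build_restore_cmd(port, db, coll, out_dir, src_db, src_coll,
                      no_index_restore=False, dry_run=False):
    """Argument list for restoring one dumped BSON file into a target collection."""
    # mongodump writes <out_dir>/<src_db>/<src_coll>.bson
    bson_file = Path(out_dir) / src_db / f"{src_coll}.bson"
    cmd = ["mongorestore", "--port", str(port),
           f"--db={db}", f"--collection={coll}", f"--dir={bson_file}"]
    if no_index_restore:
        cmd.append("--noIndexRestore")  # skip recreating the source indexes
    if dry_run:
        cmd.append("--dryRun")          # go through the motions without writing data
    return cmd


def run(cmd):
    """Execute a command list; raises CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)
```

For example, run(build_restore_cmd(27017, "trgtdb", "trgtcoll", "dumps", "srcedb", "srcecoll", dry_run=True)) would try the import without writing; 27017 is a placeholder for the real mongos port.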

Answer 3:

Because the MongoDB version on Cosmos DB is currently 3.4.6, it doesn't support $merge or a lot of other commands, such as collection.copyTo. Studio 3T's import feature didn't help either.

The solution I used was to download the target collection to my local MongoDB, clean it, and then write Java code that reads the cleaned data from the local database and writes it to the target collection with insertMany (or bulkWrite). This way, the data is appended to the target collection. The speed I measured was 2 hours for 1 million documents (~750 MB); of course, these numbers may vary depending on factors such as network and document size.
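
The batching idea can be sketched as follows. This is a Python analogue written for illustration, not the Java code described above; the function names and the batch size of 1000 are assumptions, and the target is any pymongo-style object exposing insert_many.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield successive lists of at most batch_size items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def append_in_batches(source_docs, target_collection, batch_size=1000):
    """Insert documents into the target in batches; return how many were sent."""
    total = 0
    for batch in batched(source_docs, batch_size):
        # ordered=False asks the server to attempt every document in the batch
        # instead of stopping at the first failure; pymongo still raises
        # BulkWriteError afterwards if any document failed.
        target_collection.insert_many(batch, ordered=False)
        total += len(batch)
    return total
```

With pymongo, source_docs could be a cursor such as local_db["X"].find() and target_collection the sharded collection Y; sending one request per batch rather than per document avoids a network round trip for every insert.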

Source: https://stackoverflow.com/questions/59614419/microsoft-cosmosdb-for-mongodb-merge-unsharded-collection-into-sharded-ones

Tags

mongodb

azure-cosmosdb-mongoapi