Create MongoDB document with subdocuments atomically?

Question

I hope I'm just having a big brainfart moment, but here's my situation in a scraping scenario:

I want to be able to scrape across multiple machines and cores. Per site, I scrape several Front pages (for example, for the site stackoverflow I'd have the fronts stackoverflow.com/questions/tagged/javascript and stackoverflow.com/questions/tagged/nodejs).

An article can appear on every Front. When I discover an article, I want to create an Article if its url is unknown. If the url is known, I want to add a Front entry to article.discover if that Front is unknown; otherwise I want to insert my FrontDiscovery into the appropriate Front.

Here are my schemas:

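var mongoose = require('mongoose');
var Schema = mongoose.Schema;
var ObjectId = Schema.Types.ObjectId;

// One sighting of an article on a front page
var FrontDiscovery = new Schema({
    _id         : { type: ObjectId, auto: true },
    date        : { type: Date, default: Date.now },
    dims        : { type: Object, default: null },
    pos         : { type: Object, default: null }
});

// A front page and every sighting of the article on it
var Front = new Schema({
    _id         : { type: ObjectId, auto: true },
    url         : { type: String }, // front
    found       : [ FrontDiscovery ]
});

// An article, unique by url, with the fronts it was discovered on
var Article = new Schema({
    _id         : { type: ObjectId, auto: true },
    url         : { type: String, index: { unique: true } },
    site        : { type: String },
    discover    : [ Front ]
});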

The problem I expect to run into eventually is a race condition: two job-runners (in parallel) find the same, previously unknown article and both create a new one. Yes, I have a unique index on url and could handle it that way, but that seems quite inelegant, imho.

But let's go further. When, for whatever reason, two of my job-runners scrape the same front at the same time, both notice that there is no entry for that Front yet, and both create a new one and add their FrontDiscovery, I'd end up with two entries for the same Front.

What are your strategies for avoiding such a situation? findByIdAndUpdate with upsert:true for each document separately? If so, how can I push something to the embedded document collection without overwriting everything else, while still creating the defaults if the document hasn't been created yet?

Thank you for any help pointing me in the right direction! I really hope I'm just having a massive brainfart.

Answer 1:

An update with upsert=true can be used to perform an atomic "insert or update" (http://docs.mongodb.org/manual/core/update/#update-operations-with-the-upsert-flag).

For instance, if we want to make sure that a document with a specific url is inserted into the Front collection exactly once, we could run something like:

db.Front.update(
    {url: 'http://example.com'},
    {$set: {
       url: 'http://example.com',
       found: true
    }},
    {upsert: true}  // insert the document if no match exists
)
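
One caveat with upserts against a uniquely indexed field, such as Article.url in the question: two runners that upsert the same new url at the same moment can still collide, and one of them may receive a duplicate-key error (MongoDB error code 11000). Retrying the losing call is usually enough, since by then the document exists and the upsert becomes a plain update. Below is a minimal sketch of such a retry wrapper; the name withDuplicateKeyRetry and the attempt limit are illustrative assumptions, not part of MongoDB or Mongoose.

```javascript
// Hypothetical helper: retries an async operation when it fails with
// MongoDB's duplicate-key error (code 11000), which a concurrent upsert
// on a uniquely indexed field can raise.
async function withDuplicateKeyRetry(operation, maxAttempts = 3) {
    for (let attempt = 1; ; attempt++) {
        try {
            return await operation();
        } catch (err) {
            const isDuplicateKey = Boolean(err) && err.code === 11000;
            // Give up on any other error, or once the attempts are used up.
            if (!isDuplicateKey || attempt >= maxAttempts) throw err;
            // Otherwise loop: the competing insert has finished, so the
            // retried upsert should now match the existing document.
        }
    }
}
```

With Mongoose this might be called as withDuplicateKeyRetry(() => Article.updateOne({url: url}, {$setOnInsert: {site: site}}, {upsert: true})), assuming Article is the model compiled from the question's schema.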

Operations on a single document in MongoDB are always atomic. Updates that span multiple documents come with no atomicity guarantee. In such cases, ask yourself: do I really need the operations to be atomic? If the answer is no, then you will probably find a way to work with potentially inconsistent data. If the answer is yes and you want to stick with MongoDB, check out the Two Phase Commits design pattern.
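
Because the schemas in the question embed Front and FrontDiscovery inside Article, everything for one article lives in a single document, so each update against it is atomic on its own. One possible way to use that is to split the work into three guarded single-document updates, sketched below. The helper name buildDiscoveryUpdates and its return shape are assumptions made for illustration; only the operators ($setOnInsert, $push, $ne, and the positional $) are MongoDB's.

```javascript
// Hypothetical helper: builds three single-document updates for the
// Article collection. Field names follow the schemas in the question.
function buildDiscoveryUpdates(articleUrl, site, frontUrl, discovery) {
    return [
        // 1. Create the article if its url is unknown; changes nothing
        //    when the article already exists.
        {
            filter: { url: articleUrl },
            update: { $setOnInsert: { site: site, discover: [] } },
            options: { upsert: true }
        },
        // 2. Add the front only if no element of discover has this url,
        //    so a second runner's attempt matches nothing.
        {
            filter: { url: articleUrl, 'discover.url': { $ne: frontUrl } },
            update: { $push: { discover: { url: frontUrl, found: [] } } },
            options: {}
        },
        // 3. Push the discovery into the matching front; the positional
        //    operator $ refers to the element matched by the filter.
        {
            filter: { url: articleUrl, 'discover.url': frontUrl },
            update: { $push: { 'discover.$.found': discovery } },
            options: {}
        }
    ];
}

// Possible usage with a Mongoose model named Article (assumed):
// const discovery = { date: new Date(), dims: null, pos: null };
// for (const op of buildDiscoveryUpdates(url, site, frontUrl, discovery)) {
//     await Article.updateOne(op.filter, op.update, op.options);
// }
```

The sequence as a whole is not atomic, but each step is guarded by its filter, so two runners executing it concurrently should end up with one Article, one Front entry per front url, and one FrontDiscovery per runner.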

Source: https://stackoverflow.com/questions/17774929/create-mongodb-document-with-subdocuments-atomically

Tags

node.js

mongodb

parallel-processing

mongoose

database-schema