问题
I hope I'm having a big brainfart moment. But here's my situation in a scraping szenario;
I'm wanting to be able to scrape over multiple machines and cores. Per site, I have different Front
pages, I scrape (exmpl. for the site stackoverflow I'd have fronts stackoverflow.com/questions/tagged/javascript and stackoverflow.com/questions/tagged/nodejs).
An article
could be on every Front
and when I discover an article I want to create an Article
if the url is unknown, if its known I want to make an Front
entry in article.discover
if Front
is unknown and otherwise insert my FrontDiscovery
to the apropriate Front
.
Here are my Schemas;
FrontDiscovery = new Schema({
_id :{ type:ObjectId, auto:true },
date :{ type: Date, default:Date.now},
dims :{ type: Object, default:null},
pos :{ type: Object, default:null}
});
Front = new Schema({
_id :{ type:ObjectId, auto:true },
url :{type:String}, //front
found :[ FrontDiscovery ]
});
Article = new Schema({
_id :{ type:ObjectId, auto:true },
url :{ type: String , index: { unique: true } },
site :{ type: String },
discover:[ Front]
});
The Problem I am thinking I will eventually be running into is a race condition. When two job-runners (in parallel) find the same (before unknown) article and create a new one. Yes, I have a unique index on it and could handle it that way - quite inelegantly imho.
But lets go further; When - for what ever reason - my 2 job-runners are scraping the same front at the same time and both notice that for Front
there is no entry yet and create a new one adding the FrontDiscovery
, I'd end with two entry's for the same Front
.
What are your strategies to circumvent such a situation? findByIdAndUpdate with the upsert:true for each document seperately? If so, how can I only push something to the embedded document collection and not overwrite everything else at the same time but still create the defaults if it hasnt been created?
Thank you for any help in directing me in the right direction! I really hope I'm having a massive brainfart..
回答1:
Update with upsert=true
can be used to perform an atomic "insert or update" (http://docs.mongodb.org/manual/core/update/#update-operations-with-the-upsert-flag).
For instance if we want to make sure a document in Front collection with specific url
is inserted exactly once, we could run something like:
db.Front.update(
{url: 'http://example.com'},
{$set: {
url: 'http://example.com'},
found: true
}
)
Operations on a single document in MongoDB are always atomic. If you make updates that span over multiple documents, then no atomicity is guaranteed. In such cases, you can ask yourself: do I really need the operations to be atomic? If the answer is no, then you probably will find your way around working with potentially unconsistent data. If the answer is yes and you want to stick with MongoDB, check out the design pattern on Two Phase Commits.
来源:https://stackoverflow.com/questions/17774929/create-mongodb-document-with-subdocuments-atomically