问题
I'm working on a MongoDB project which stores tweets and was created by someone else.
This person decided to use the Twitter tweet ID for the _id
field in MongoDB, which means I now have no way to sort the tweets deterministically.
Example:
> db.tweets.find().sort({_id : 1}).limit(4)
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(1)}
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(2)}
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(3)}
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(5)}
The reason sorting on the field ID is non-deterministic is that at a later date, my system could add the existing tweet that has an ID of 4 to the database, meaning that the same command would give a different result set:
> db.tweets.find().sort({_id : 1}).limit(4)
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(1)}
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(2)}
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(3)}
{'message' : '...', 'userId' : NumberLong(123), '_id' : NumberLong(4)}
My question is: is there a way to add a new 'field' to every entry in a collection, with a value of type ObjectID
, so that I can sort on that?
Or if not, what would the recommendations be for 'renaming' the _id
field to say tweetId
and then making the _id
field of type ObjectID
Thanks
回答1:
The only way to actually change the _id
field would be to copy the document, change the _id
, and delete the old document, as described in this answer:
How update the _id of one MongoDB Document?
To simply add a new field, passing an update function to a cursor's forEach
function should work:
db.tweets.find().forEach(
function (tweet) {
db.tweets.update({_id: tweet._id}, {$set: {newFieldName: tweet._id}});
}
);
回答2:
Some of the snippets in the post that Shawn linked to had several flaws. Whilst the idea was right, using the command line mongo
could cause several problems.
Getting a 'snapshot' of all the tweets before any new ones are added is difficult in mongo
. The only way I could find to do it was to use:
$ db.tweets.find({}, {_id : 1}).toArray()
or perhaps
$ db.tweets.distinct('_id')
Unfortunately, as I had over 2 million tweets in my database this caused mongo
to run out of memory. I got a "exception: distinct too big, 16mb cap"
errir,
Instead I used Python, here's the script:
#!/usr/bin/env python
"""A tool to work through all tweets, and convert the '_id'
from the Tweet ID into an ObjectID (saving the tweet)
ID in the 'tweetID' field
"""
import pymongo
from bson.objectid import ObjectId
if __name__ == "__main__":
client = pymongo.MongoClient()
db = client.guaiamum
ids = list(t['_id'] for t in db.tweets.find({'_id': {'$type' : 18}}, {'_id' : 1}))
for _id in ids:
tweet = db.tweets.find_one({'_id' : _id})
tweet['_id'] = ObjectId()
tweet['twitterId'] = _id
db.tweets.insert(tweet)
db.tweets.remove(_id, multi=False)
It still took a good 1.5 hrs to run, but bizarrely that's still much quicker than using mongo
来源:https://stackoverflow.com/questions/27948431/sorting-mongodb-collection-deterministically-add-new-objectid-field