I have some 25k documents (4 GB in raw JSON) of data that I want to perform a few JavaScript operations on to make it more accessible to my end data consumer (R).
I faced the same situation. I was able to accomplish this via a Mongo query and projection; see Mongo Query.
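For example, a projection that keeps only the fields the end consumer needs might look like this (a minimal sketch; the field names title and content are my own assumptions, not from the question):
db.collection.find(
    {},                         // empty filter: match all documents
    { title: 1, content: 1 }    // projection: return only these fields
)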
Once you have access to the mongo shell, it accepts some JavaScript commands and then it's simpler:
// Insert each document from the source collection into db.result.
map = function(item) {
    db.result.insert(item);
}
db.collection.find().forEach(map);
When using map/reduce you'll always end up with
{ "value" : { <reduced data> } }
In order to remove the value key you'll have to use a finalize function.
Here's the simplest thing you can do to copy data from one collection to another:
// Emit each document unchanged, keyed by its own _id.
map = function() { emit(this._id, this); }
// Every key is unique, so values always contains exactly one document.
reduce = function(key, values) { return values[0]; }
// Write each reduced document into the target collection.
finalize = function(key, value) { db.collection_2.insert(value); }
Then you run it as normal:
db.collection_1.mapReduce(map, reduce, { finalize: finalize });
Using only map, without reduce, is like copying a collection: http://www.mongodb.org/display/DOCS/Developer+FAQ#DeveloperFAQ-HowdoIcopyallobjectsfromonedatabasecollectiontoanother%3F
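For what it's worth, on MongoDB 2.6 and newer the aggregation framework's $out stage copies a collection directly, without map/reduce at all (a minimal sketch, reusing the collection names from above):
db.collection_1.aggregate([
    { $out: "collection_2" }   // write every document into collection_2
]);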
But that seems awkward and I don't know why it even works, since my emit call arguments in my mapper are not equivalent to the return argument of my reducer.
They are equivalent. The reduce function takes in an array of T values and should return a single value in the same T format. The format of T is defined by your map function. Your reduce function simply returns the first item in the values array, which will always be of type T. That's why it works :)
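To illustrate the contract with a hypothetical example (the count field and the grouping key are mine, purely for illustration): whatever shape map emits is the shape reduce both receives and must return.
map = function() { emit(this.category, { count: 1 }); }
reduce = function(key, values) {
    var total = 0;
    values.forEach(function(v) { total += v.count; });
    return { count: total };   // same { count: ... } shape as each emitted value
}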
You seem to be on the right track. I did some experimenting and it seems you cannot do a db.collection.save() from the map function, but you can do this from the reduce function. Your map function should simply construct the document format you need:
function map() {
    emit(this._id, { _id: this._id, heading: this.title, body: this.content });
}
The map function reuses the ID of the original document. This should prevent any re-reduce steps, since no values will share the same key.
The reduce function can simply return null. But in addition, you can write the value to a separate collection:
function reduce(key, values) {
    db.result.save(values[0]);   // each key is unique, so values holds exactly one document
    return null;
}
Now db.result should contain the transformed documents, without any additional map-reduce noise you'd have in the temporary collection. I haven't actually tested this on large amounts of data, but this approach should take advantage of the parallelized execution of map-reduce functions.
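Putting the pieces together, the invocation might look like this (a sketch; the output collection name tmp_mr is my own choice, and newer servers require an explicit out option even though the real results land in db.result):
db.collection.mapReduce(map, reduce, { out: "tmp_mr" });
db.result.find();   // the transformed documents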