MongoDB: Terrible MapReduce Performance

攒了一身酷 2020-12-07 10:42

I have a long history with relational databases, but I'm new to MongoDB and MapReduce, so I'm almost positive I must be doing something wrong. I'll jump right into the qu

4 Answers
  • 2020-12-07 11:12

    Excerpts from MongoDB: The Definitive Guide (O'Reilly):

    The price of using MapReduce is speed: group is not particularly speedy, but MapReduce is slower and is not supposed to be used in “real time.” You run MapReduce as a background job, it creates a collection of results, and then you can query that collection in real time.

    options for map/reduce:
    
    "keeptemp" : boolean 
    If the temporary result collection should be saved when the connection is closed. 
    
    "output" : string 
    Name for the output collection. Setting this option implies keeptemp : true. 
    
  • 2020-12-07 11:12

    You are not doing anything wrong. (Besides sorting on the wrong value as you already noticed in your comments.)

    MongoDB map/reduce performance just isn't that great. This is a known issue; see for example http://jira.mongodb.org/browse/SERVER-1197 where a naive approach is ~350x faster than M/R.

    One advantage, though, is that you can specify a permanent output collection name with the out argument of the mapReduce call. Once the M/R is completed, the temporary collection is atomically renamed to the permanent name. That way you can schedule your statistics updates and query the M/R output collection in real time.
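
    A minimal sketch of that pattern, not the asker's actual job: the document shape (username, day, hits) is assumed from the fields indexed elsewhere in this thread, and the output collection name views_by_user is hypothetical. The mapReduce call itself only runs inside a MongoDB shell, where db is defined.

    ```javascript
    // Assumed document shape: { username: "...", day: <date>, hits: <number> }
    function mapViews() {
      // In the MongoDB shell, `this` is the current document and `emit` is provided.
      emit(this.username, this.hits);
    }

    function reduceHits(key, values) {
      // Sum the emitted hit counts; reduce may be re-run on its own output,
      // so it returns the same type (a number) that it receives.
      var total = 0;
      for (var i = 0; i < values.length; i++) {
        total += values[i];
      }
      return total;
    }

    // "views_by_user" is a hypothetical name for the permanent output collection.
    var mapReduceOptions = { out: "views_by_user" };

    // Guarded so the file also loads outside the shell, where `db` does not exist.
    if (typeof db !== "undefined") {
      // Background job: rebuild the statistics collection.
      db.views.mapReduce(mapViews, reduceHits, mapReduceOptions);
      // Real-time side: read the precomputed results (output docs are { _id, value }).
      printjson(db.views_by_user.find().sort({ value: -1 }).limit(5).toArray());
    }
    ```

    The scheduled job would run the mapReduce call, while the application only ever reads from the output collection.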

  • 2020-12-07 11:14

    Have you already tried the Hadoop connector for MongoDB?

    Look at this link here: http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/

    Since you are using only 3 shards, I don't know whether this approach would improve your case.

  • 2020-12-07 11:23

    Maybe I'm too late, but...

    First, you are querying the collection that feeds the MapReduce without an index. You should create an index on "day".

    MongoDB MapReduce is single-threaded on a single server, but it runs in parallel across shards. The data on the shards is kept together in contiguous chunks sorted by shard key.

    Because your shard key is "day" and you are querying on it, you are probably using only one of your three servers. The shard key should only be used to spread the data; on each shard, MapReduce will then query using the "day" index, which will be very fast.

    Add something in front of the day key to spread the data. The username can be a good choice.

    That way the MapReduce will be launched on all servers, hopefully cutting the time to about a third.
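
    To see why the leading field of the shard key matters, here is a toy model in plain JavaScript. It is only an illustration under strong assumptions: each "shard" gets one contiguous third of the sorted key space, whereas real MongoDB splits data into many chunks and balances them. The sample data and the helper names (compareKeys, shardsHoldingDay) are hypothetical.

    ```javascript
    // Compare two keys given as arrays, field by field.
    function compareKeys(a, b) {
      for (let i = 0; i < a.length; i++) {
        if (a[i] < b[i]) return -1;
        if (a[i] > b[i]) return 1;
      }
      return 0;
    }

    // Returns the set of shard numbers holding at least one document for `day`,
    // when documents are sorted by `keyOf` and cut into equal contiguous ranges.
    function shardsHoldingDay(docs, keyOf, shardCount, day) {
      const sorted = docs.slice().sort((x, y) => compareKeys(keyOf(x), keyOf(y)));
      const perShard = Math.ceil(sorted.length / shardCount);
      const shards = new Set();
      sorted.forEach((doc, i) => {
        if (doc.day === day) shards.add(Math.floor(i / perShard));
      });
      return shards;
    }

    // Hypothetical sample data: 9 users, one document per user per day, 9 days.
    const docs = [];
    for (let u = 0; u < 9; u++) {
      for (let d = 1; d <= 9; d++) {
        docs.push({ username: "user" + u, day: d, hits: 1 });
      }
    }

    const byDay = shardsHoldingDay(docs, (doc) => [doc.day], 3, 9);
    const byUserDay = shardsHoldingDay(docs, (doc) => [doc.username, doc.day], 3, 9);

    console.log("shards touched with key {day}:", byDay.size);            // 1
    console.log("shards touched with key {username, day}:", byUserDay.size); // 3
    ```

    In this model, a query for the most recent day lands on a single shard when the key is {day}, but on all three when username comes first, which is the effect the suggestion above is after.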

    Something like this:

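    // Register the shards and enable sharding for the "profiles" database.
    use admin
    db.runCommand( { addshard : "127.20.90.1:10000", name: "M1" } );
    db.runCommand( { addshard : "127.20.90.7:10000", name: "M2" } );
    db.runCommand( { enablesharding : "profiles" } );
    // Compound shard key: username first, so each day's data is spread across shards.
    db.runCommand( { shardcollection : "profiles.views", key : {username : 1,day: 1} } );
    use profiles
    // ensureIndex is the older name; newer shells use createIndex with the same arguments.
    db.views.ensureIndex({ hits: -1 });
    db.views.ensureIndex({ day: -1 });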
    

    I think with those additions you can match MySQL's speed, or even beat it.

    Also, it is better not to run it in real time. If your data doesn't need to be accurate to the minute, schedule a MapReduce task every now and then and query the result collection.
