Using UUIDs instead of ObjectIDs in MongoDB

误落风尘 2021-01-30 03:32

We are migrating a database from MySQL to MongoDB for performance reasons and considering what to use for IDs of the MongoDB documents. We are debating between using ObjectIDs,

5 Answers
  • 2021-01-30 04:23

    Consider the amount of data you would store in each case.

    A MongoDB ObjectID is 12 bytes, is packed for storage, and has its parts organized for performance (the timestamp is stored first, which is a logical ordering criterion).

    Conversely, a standard UUID is 36 bytes in its canonical text form, contains dashes, and is typically stored as a string. Further, even if you strip the dashes and intend to store it numerically, you must still contend with the fact that its "indexy" portion (the part of a UUID v1 that is timestamp-based) sits in the middle of the UUID and doesn't lend itself well to sorting. Studies have been done on performant UUID storage, and I even wrote a Node.js library to assist in its management.

    If you're intent on using a UUID, consider reorganizing it for optimal indexing and sorting; otherwise you'll likely hit a performance wall.
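
    As a rough illustration of the reorganization described above, here is a minimal Node.js sketch. It assumes a version-1 UUID in canonical 8-4-4-4-12 form and moves the timestamp fields so they run from most to least significant before packing everything into 16 bytes. The function name `reorderV1Uuid` and the sample UUIDs are made up for illustration; this is not the library mentioned in the answer.

```javascript
// Sketch only: reorder a version-1 UUID so that byte order follows time order.
// Canonical v1 layout: time_low - time_mid - time_hi_and_version - clock_seq - node
function reorderV1Uuid(uuid) {
  const v1Pattern =
    /^([0-9a-f]{8})-([0-9a-f]{4})-(1[0-9a-f]{3})-([0-9a-f]{4})-([0-9a-f]{12})$/i;
  const match = v1Pattern.exec(uuid);
  if (!match) {
    throw new Error("expected a canonical version-1 UUID string");
  }
  const [, timeLow, timeMid, timeHiAndVersion, clockSeq, node] = match;
  // Most significant timestamp bits first, so a bytewise sort is a time sort.
  const hex = (timeHiAndVersion + timeMid + timeLow + clockSeq + node).toLowerCase();
  return Buffer.from(hex, "hex"); // 16 bytes instead of a 36-character string
}

// Two hypothetical v1 UUIDs; the second carries the later timestamp.
const earlierUuid = "58e0a7d7-eebc-11d8-9669-0800200c9a66";
const laterUuid = "00000000-0000-11d9-9669-0800200c9a66";

// Plain string order puts the later UUID first...
console.log(earlierUuid < laterUuid);
// ...while the reordered bytes compare in timestamp order (negative result).
console.log(Buffer.compare(reorderV1Uuid(earlierUuid), reorderV1Uuid(laterUuid)));
```

    The resulting buffer could then be stored as binary data rather than as a string, which addresses the size concern as well as the ordering one.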

  • 2021-01-30 04:26

    The _id field of MongoDB can have any value you want as long as you can guarantee that it is unique for the collection. When your data already has a natural key, there is no reason not to use this in place of the auto-generated ObjectIDs.

    ObjectIDs are provided as a reasonable default to save you the time of generating your own unique key (and to discourage beginners from trying to copy SQL's AUTO INCREMENT, which is a bad idea in a distributed database).

    By not using ObjectIDs you also miss out on another convenience feature: an ObjectID includes a Unix timestamp of when it was generated, and many drivers provide a function to extract it and convert it to a date. This can sometimes make a separate create-date field redundant.

    But when neither is a concern for you, you are free to use your UUIDs as _id field.
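
    To make the timestamp point concrete, here is a small sketch of what such a driver helper does under the hood, assuming only that the first 4 bytes of an ObjectID are a big-endian count of seconds since the Unix epoch. The function name `objectIdHexToDate` is made up for illustration; in practice you would more likely call the driver's own method (for example `getTimestamp()` on an ObjectId in the shell or the Node.js driver).

```javascript
// Sketch only: read the creation time out of an ObjectID's 24-character hex form.
function objectIdHexToDate(hex) {
  if (!/^[0-9a-f]{24}$/i.test(hex)) {
    throw new Error("expected 24 hexadecimal characters");
  }
  // The first 8 hex digits (4 bytes) are seconds since the Unix epoch.
  const seconds = parseInt(hex.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

// An ObjectID value quoted elsewhere in this thread.
const createdAt = objectIdHexToDate("5e0e6a804888946fa61a1976");
console.log(createdAt.toISOString());
```

    Note that the resolution is whole seconds, so this replaces a create-date field only when that granularity is enough.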

  • 2021-01-30 04:29

    I found these benchmarks some time ago when I had the same question. They basically show that using a Guid instead of an ObjectId causes a drop in index performance.

    I would nonetheless recommend that you customize the benchmarks to imitate your specific real-life scenario and see what the numbers look like; one cannot rely 100% on generic benchmarks.

  • 2021-01-30 04:34

    We must be careful to distinguish the cost of MongoDB inserting a thing from the cost of generating the thing in the first place, and to weigh both against the size of the payload. Below is a small matrix that crosses the method of generating the _id against the size, in bytes, of an optional extra payload. The tests use JavaScript only and were run against localhost on a MacBook Pro: 100,000 inserts using insertMany in batches of 100, without transactions, to try to remove network latency, chattiness, and other factors. Two runs with batch = 1 were also done just to highlight the dramatic difference.

    
    Method                                                                                         
    A  :  Simple int:          _id:0, _id:1, ...                                                   
    B  :  ObjectId             _id:ObjectId("5e0e6a804888946fa61a1976"), ...                       
    C  :  Simple string:       _id:"A0", _id:"A1", ...                                             
    
    D  :  UUID length string   _id:"9575edcc-cb70-4d63-97ed-ee5d624de87b0", ...                    
          (but not actually                                                                        
          generated by UUID())
    
    E  :  Real generated UUID  _id: UUID("35992974-21ea-4f61-b715-2dfaed663b73"), ...              
          (stored UUID() object)                                                                   
    
    F  :  Real generated UUID  _id: "6b16f733-ff24-4172-83f9-e4f96ace6775"                         
          (stored as string, e.g.                                                                  
          UUID().toString().substr(6,36))
    
    Time in milliseconds to perform 100,000 inserts on fresh (empty) collection.
    
    Extra                M E T H O D   (Batch = 100)                                                               
    Payload   A     B     C     D     E     F       % drop A to F                                  
    --------  ----  ----  ----  ----  ----  ----    ------------                                   
    None      2379  2386  2418  2492  3472  4267    80%                                            
    512       2934  2928  3048  3128  4151  4870    66%                                            
    1024      3249  3309  3375  3390  4847  5237    61%                                            
    2048      3953  3832  3987  4342  5448  5888    49% 
    4096      6299  6343  6199  6449  7634  8640    37%                                            
    8192      9716  9292  9397 10816 11212 11321    16% 
    
    Extra              M E T H O D   (Batch = 1)                                          
    Payload   A      B      C      D      E      F       % drop A to F              
    --------  -----  -----  -----  -----  -----  -----                              
    None      48006  48419  49136  48757  50649  51280   6.8%                       
    1024      50986  50894  49383  49373  51200  51821   1.2%                       
    
    
    

    This was a quick test, but it seems clear that basic strings and ints as _id are roughly the same speed, while actually generating a UUID adds time, especially if you take the string version of the UUID() object, e.g. UUID().toString().substr(6,36). It is also worth noting that constructing an ObjectId appears to be just as quick as using a simple int or string.
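
    One way to separate generation cost from insert cost is to time ID generation alone, with no database involved. The sketch below is not the script behind the table above: it assumes Node.js 14.17 or later (for `crypto.randomUUID()`), the helper names are made up, and it covers only rough stand-ins for methods A, C, and F, since ObjectId() and UUID() objects need the shell or a driver.

```javascript
const crypto = require("crypto");
const { performance } = require("perf_hooks");

// Time `count` calls of makeId and keep the last value as a sample.
function timeGeneration(label, makeId, count) {
  const start = performance.now();
  let sample;
  for (let i = 0; i < count; i++) {
    sample = makeId(i);
  }
  return { label, count, ms: performance.now() - start, sample };
}

// Generation only: nothing here talks to MongoDB.
function runGenerationBenchmark(count = 100000) {
  return [
    timeGeneration("A: simple int", (i) => i, count),
    timeGeneration("C: simple string", (i) => "A" + i, count),
    timeGeneration("F: random UUID as string", () => crypto.randomUUID(), count),
  ];
}

for (const result of runGenerationBenchmark()) {
  console.log(`${result.label}: ${result.ms.toFixed(1)} ms for ${result.count} ids`);
}
```

    Subtracting these figures from end-to-end insert timings gives a rough idea of how much of the gap is generation rather than storage. Absolute numbers will vary by machine and Node.js version.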

  • I think this is a great idea and so does Mongo; they list UUIDs as one of the common options for the _id field.

    Considerations:

    • Performance -- As other answers mention, benchmarks show UUIDs cause a performance drop for inserts. In the worst case measured (going from 10M to 20M docs in a collection) they're about 2-3x slower -- the difference between inserting 2,000 (UUID) and 7,500 (ObjectID) docs per second. This is a large difference but its significance depends entirely on your use case. Will you be batch inserting millions of docs at a time? For most apps I've built, the common case is inserting individual documents. In that test the difference is much smaller (6,250 vs. 7,500; ~20%). The ID type is simply not the limiting factor.
    • Portability -- Other DBs certainly do tend to have good UUID support so portability would be improved. Alternatively, since UUIDs are larger (more bits) it is possible to repack an ObjectID into the "shape" of a UUID. This approach isn't as nice as direct portability but it does give you a path forward.
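
    The repacking idea can be sketched as follows. The answer does not say how the 12 ObjectID bytes should be placed in the 16 available, so this assumes one arbitrary choice: four leading zero bytes, then the ObjectID bytes, formatted as 8-4-4-4-12. The function names are made up, and the result only has the shape of a UUID; it does not set the RFC version and variant bits, so strict UUID validators may reject it.

```javascript
// Sketch only: put a 12-byte ObjectID (24 hex chars) into a 16-byte, UUID-shaped string.
function objectIdToUuidShape(oidHex) {
  if (!/^[0-9a-f]{24}$/i.test(oidHex)) {
    throw new Error("expected 24 hexadecimal characters");
  }
  // Assumption: pad with 4 zero bytes at the front to reach 16 bytes.
  const h = "00000000" + oidHex.toLowerCase();
  return [h.slice(0, 8), h.slice(8, 12), h.slice(12, 16), h.slice(16, 20), h.slice(20)].join("-");
}

// Reverse the repacking; refuses values that were not produced by the function above.
function uuidShapeToObjectId(uuid) {
  const h = uuid.replace(/-/g, "").toLowerCase();
  if (!/^0{8}[0-9a-f]{24}$/.test(h)) {
    throw new Error("not a repacked ObjectID");
  }
  return h.slice(8);
}

const shaped = objectIdToUuidShape("5e0e6a804888946fa61a1976");
console.log(shaped);
console.log(uuidShapeToObjectId(shaped));
```

    Because the mapping is reversible, existing ObjectIDs could be carried into a UUID-typed column elsewhere and recovered later.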

    Counter to some of the other answers:

    • UUIDs have native support -- You can use the UUID() function in the Mongo Shell exactly the same way you'd use ObjectId(): to convert a string into the equivalent BSON object.
    • UUIDs are not especially large -- They're 128 bit compared to ObjectIDs which are 96 bit. (They should be encoded using binary subtype 0x04.)
    • UUIDs can include a timestamp -- Specifically, UUIDv1 encodes a timestamp with 60 bits of precision, compared to 32 bits in ObjectIDs. This is over 6 orders of magnitude more precision: 100-nanosecond intervals instead of seconds. It can actually be a decent way of storing create timestamps with more accuracy than Mongo/JS Date objects support, however...
      • The built-in UUID() function only generates v4 (random) UUIDs so, to leverage this, you'd have to lean on your app or Mongo driver for ID creation.
      • Unlike ObjectIDs, because of the way UUIDs are chunked, the timestamp doesn't give you a natural order. This can be good or bad depending on your use case.
      • Including timestamps in your IDs is often a Bad Idea. You end up leaking the created time of documents anywhere an ID is exposed. To make matters worse, v1 UUIDs also encode a unique identifier for the machine they're generated on, which can expose additional information about your infrastructure (e.g. number of servers). Of course ObjectIDs also encode a timestamp so this is partly true for them too.
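
    For the timestamp bullet above, here is a minimal sketch of reading the 60-bit timestamp back out of a v1 UUID. It assumes the standard v1 layout, in which the timestamp counts 100-nanosecond intervals since 1582-10-15, and it assumes a timestamp at or after the Unix epoch. The function name, the constant name, and the sample UUID are made up for illustration.

```javascript
// 100-ns intervals between 1582-10-15 (the UUID epoch) and 1970-01-01 (the Unix epoch).
const UUID_EPOCH_TO_UNIX_TICKS = 0x01b21dd213814000n;

// Sketch only: extract the creation time from a version-1 UUID string.
function v1UuidTimestamp(uuid) {
  const match =
    /^([0-9a-f]{8})-([0-9a-f]{4})-1([0-9a-f]{3})-[0-9a-f]{4}-[0-9a-f]{12}$/i.exec(uuid);
  if (!match) {
    throw new Error("expected a canonical version-1 UUID string");
  }
  const [, timeLow, timeMid, timeHi] = match;
  // Reassemble the 60-bit value: 12 high bits, 16 middle bits, 32 low bits.
  const ticks = BigInt("0x" + timeHi + timeMid + timeLow);
  const unixTicks = ticks - UUID_EPOCH_TO_UNIX_TICKS;
  return {
    date: new Date(Number(unixTicks / 10000n)), // a JS Date holds milliseconds only
    extraTicks: Number(unixTicks % 10000n), // leftover 100-ns intervals within that millisecond
  };
}

// A hand-built v1 UUID whose timestamp is one second after the Unix epoch.
const stamp = v1UuidTimestamp("1419d680-1dd2-11b2-8000-000000000000");
console.log(stamp.date.toISOString(), stamp.extraTicks);
```

    The `extraTicks` field is the precision a Date object cannot hold, which is the extra accuracy the bullet refers to.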