What is the fastest structure to read in MongoDB: multiple documents or subdocuments?


Question


Intro

I use Mongo to store moderately long financial timeseries, which I can read in 2 ways:

  • retrieve 1 series for its entire length

  • retrieve N series on a specific date

To facilitate the second type of query, I slice each series by year. This reduces the data load when querying for a large number of series on a specific day (example: if I query the value of 1,000 timeseries on a specific day, it is not feasible to pull back the entire history of each, which can go back 40 years, i.e. ~28k values each).

Question

Writes are not time-sensitive. Storage space is plentiful. Reads are time-sensitive. What is the best option to archive data for fast reads of both first and second kind?

Option A - Separate documents

{_id:xxx, stock:IBM, year:2014, prices:[<daily prices for 2014>]}
{_id:xxx, stock:IBM, year:2015, prices:[<daily prices for 2015>]}

In option A, I would find() with a compound index on year and stock
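
For concreteness, a minimal pymongo sketch of how Option A could be indexed and read (the collection name, stock symbols and the day index are illustrative, not taken from my actual code):

from pymongo import MongoClient, ASCENDING

coll = MongoClient()["market"]["series_by_year"]  # hypothetical collection

# Compound index on year and stock, supporting both read patterns.
coll.create_index([("year", ASCENDING), ("stock", ASCENDING)])

# Read 1: one series over its entire length (all yearly slices for IBM).
ibm_history = list(coll.find({"stock": "IBM"}).sort("year", ASCENDING))

# Read 2: many series on a specific date, e.g. day index 42 of 2014,
# projecting only that element of the prices array.
snapshot = coll.find(
    {"year": 2014, "stock": {"$in": ["IBM", "AAPL", "MSFT"]}},
    {"stock": 1, "prices": {"$slice": [42, 1]}},
)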

Option B - Sub-documents

{
 _id:xxx,
 stock:IBM,
 2014:[<daily prices for 2014>],
 2015:[<daily prices for 2015>],
 }

In option B, I would find() with a simple index on stock, and add a projection to return only the year I am looking for (see the sketch below).
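
A hedged sketch of that read under Option B, again in pymongo (collection name illustrative):

from pymongo import MongoClient

coll = MongoClient()["market"]["series_by_stock"]  # hypothetical collection
coll.create_index("stock")

# Return only the 2014 slice for a batch of stocks.
docs = coll.find(
    {"stock": {"$in": ["IBM", "AAPL", "MSFT"]}},
    {"stock": 1, "2014": 1},
)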

Option B.1 - Sub-documents with zipped content

Same as above, but the <daily prices for 201x> are serialized to JSON and compressed with zlib.
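
One way this "JSON + zlib" round trip might look in Python (a sketch, not my actual code); the compressed bytes are stored as BSON binary:

import json
import zlib

from bson.binary import Binary
from pymongo import MongoClient

coll = MongoClient()["market"]["series_zipped"]  # hypothetical collection
daily_prices_2014 = [101.2, 101.7, 100.9]        # illustrative data

# Write: serialize the year's prices to JSON, compress, store as binary.
payload = Binary(zlib.compress(json.dumps(daily_prices_2014).encode("utf-8")))
coll.update_one({"stock": "IBM"}, {"$set": {"2014": payload}}, upsert=True)

# Read: project the single year, then decompress client-side.
doc = coll.find_one({"stock": "IBM"}, {"2014": 1})
prices_2014 = json.loads(zlib.decompress(doc["2014"]).decode("utf-8"))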

Option C - Sub-documents with daily data

{
 _id:xxx,
 stock:IBM,
     0:<price for day 0 of 2014>,
     1:<price for day 1 of 2014>,
     ...
     n:<price for day n of 2015>,  // n can be as large as 10,000
 }

Option D - Nested Sub-documents

{
 _id:xxx,
 stock:IBM,
 2014:{
     0:<price for day 0>,
     1:<price for day 1>,
     ...
     },
 2015:{
     0:<price for day 0>,
     1:<price for day 1>,
     ...
     }
 }

I would then have to query with nested (dotted-path) projections, as sketched below. Note that option D might double the data required to do a read of the first type described above.
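
A hedged sketch of such a read for Option D in pymongo (names illustrative):

from pymongo import MongoClient

coll = MongoClient()["market"]["series_nested"]  # hypothetical collection
coll.create_index("stock")

# Price at day index 42 of 2014 for a batch of stocks, via a dotted-path projection.
snapshot = coll.find(
    {"stock": {"$in": ["IBM", "AAPL", "MSFT"]}},
    {"stock": 1, "2014.42": 1},
)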


Answer 1:


Hm, I think I can simplify your model:

{
  _id: new ObjectId(),
  key: "IBM",
  date: someISODate,
  price: somePrice,
  exchange: "NASDAQ"
}
db.stocks.createIndex({key:1, date:1, exchange:1})

In this model, you have all the information you need:

db.stocks.find({
  key: "IBM", 
  date: { 
    $gte: new ISODate("2014-01-01T00:00:00Z"),
    $lt: new ISODate("2015-01-01T00:00:00Z")
  }
})
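
Since your current code is Python, loading daily prices into this model via pymongo might look roughly like this (a sketch; the collection name and values are made up):

from datetime import datetime, timezone
from pymongo import MongoClient, ASCENDING

stocks = MongoClient()["market"]["stocks"]  # hypothetical database/collection
stocks.create_index([("key", ASCENDING), ("date", ASCENDING), ("exchange", ASCENDING)])

# One document per stock per trading day.
stocks.insert_many([
    {"key": "IBM", "date": datetime(2014, 5, 1, tzinfo=timezone.utc), "price": 196.5, "exchange": "NASDAQ"},
    {"key": "IBM", "date": datetime(2014, 5, 2, tzinfo=timezone.utc), "price": 191.4, "exchange": "NASDAQ"},
])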

For example, if you wanted to know the average price of the IBM stock in May 2014, you'd use an aggregation:

db.stocks.aggregate([
  { $match: {
      key: "IBM",
      date:{
        $gte: new ISODate("2014-05-01T00:00:00Z"),
        $lt: new ISODate("2014-06-01T00:00:00Z")
      }
    }
  },
  { $group: {
      _id: {
        stock: "$key",
        month: { $month:"$date"},
        year: { $year:"$date" }
      },
      avgPrice: {$avg: "$price" }
    }
  }
])

This would result in a returned document like:

{
  _id: {
    stock: "IBM",
    year: "2014",
    month: "5"
  },
  avgPrice: "8000.42"
}

You could even precalculate the averages for every stock and every month rather easily

db.stocks.aggregate([
  {
    $group: {
        _id: {
          stock: "$key",
          month: { $month: "$date" },
          year: { $year: "$date" }
        },
        averagePrice: {$avg:"$price"}
    }
  },
  { $out: "avgPerMonth" }
])

Finding the average for IBM in May 2014 then becomes a simple query:

db.avgPerMonth.find({
   "_id":{
     "stock":"IBM",
     "month":"5",
     "year":"2014"
   }
})

And so on. You really want to use aggregations with stocks. For example: "In which month of the year was the IBM stock most expensive historically?"
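
A hedged sketch of that last question as a pymongo aggregation (field names as in the model above):

from pymongo import MongoClient

stocks = MongoClient()["market"]["stocks"]  # hypothetical database/collection

# Average price per calendar month for IBM, most expensive month first.
pipeline = [
    {"$match": {"key": "IBM"}},
    {"$group": {
        "_id": {"year": {"$year": "$date"}, "month": {"$month": "$date"}},
        "avgPrice": {"$avg": "$price"},
    }},
    {"$sort": {"avgPrice": -1}},
    {"$limit": 1},
]
most_expensive_month = list(stocks.aggregate(pipeline))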

Nice, easy, with optimal performance for both reads and writes. You also avoid the multiple $unwind stages (which are not straightforward with arbitrary keys anyway) that the nested models would need in aggregation queries.

Granted, we have the redundancy of the duplicate values for key, but we circumvent a few problems:

  1. BSON documents are limited to a size of 16 MB, which puts a theoretical cap on how much history your model can hold in a single document.
  2. When using MongoDB's mmapv1 storage engine (the only one available before MongoDB 3.0, and the default until 3.2), growing a document can trigger a rather expensive document migration within a data file, since documents are guaranteed never to be fragmented.
  3. Complicated models lead to complicated code. Complicated code is harder to maintain. The harder code is to maintain, the longer it takes. The longer you need for a task, the more expensive (money wise and/or time wise) code maintenance becomes. Conclusion: Complicated models are more expensive than easy ones.

Edit

For the dates, keep the different time zones in mind and either normalize them to Zulu (UTC) time or stay within the time zone of an exchange when doing aggregations, so that your date boundaries are precise.
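
For example, normalizing an exchange-local timestamp to UTC before insertion could look like this in Python (a sketch of one reasonable approach; values are made up):

from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Exchange-local close time, converted to UTC ("Zulu") before storing.
local_close = datetime(2014, 5, 2, 16, 0, tzinfo=ZoneInfo("America/New_York"))
utc_close = local_close.astimezone(timezone.utc)

doc = {"key": "IBM", "date": utc_close, "price": 191.4, "exchange": "NASDAQ"}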




Answer 2:


Current solution:

I have found this approach, based on Option A, to perform quite well for both kinds of reads described above:

cursor = mycollection.find({'year':{ '$in': years}, 'stocks':{ '$in': stocks }}).hint('year_1_ind_1')

docs = [d for d in cursor]


Source: https://stackoverflow.com/questions/32486038/what-is-fastest-structure-to-read-in-mongodb-multiple-documents-or-subdocuments
