Real-time statistics: MySQL(/Drizzle) or MongoDB?

野性不改 2021-02-01 23:27

We are working on a project that will feature real-time statistics of some actions (e.g. clicks). On every click, we will log information like date, age and gender (these come f

2 answers
  •  天涯浪人
    2021-02-02 00:27

    So BuddyMedia is using some of this. The Gilt Groupe has done something pretty cool with Hummingbird (node.js + MongoDB).

    Having worked for a large online advertiser in the social media space, I can attest that real-time reporting is really a pain. Trying to "roll up" 500M impressions a day is already a challenge. Doing it in real time worked, but it carried some significant limitations (for one, it was actually delayed by 5 minutes :)

    Frankly, this type of problem is one of the reasons I started using MongoDB. And I'm not the only one. People are using MongoDB for all kinds of real-time analytics: server monitoring, centralized logging, as well as dashboard reporting.

    The real key when doing this type of reporting is to understand that the data structure is completely different with MongoDB. You're going to avoid "aggregation" queries, so the queries and the output charts are going to be different. There's also some extra coding work on the client side.

    Here's the key that may point you in the right direction for doing this with MongoDB. Take a look at the following data structure:

    {
      date: "20110430",
      gender: "M",
      age: 1, // 1 is probably a bucket
      impression_hour: [ 100, 50, ...], // 24 of these
      impression_minute: [ 2, 5, 19, 8, ... ], // 1440 of these
      clicks_hour: [ 10, 2, ... ],
      ...
    }
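
One practical detail worth sketching: the counter arrays are easiest to work with if they already exist, zero-filled, before the first increment. If the array is missing, an update on a dotted path such as "clicks_hour.5" creates an embedded document with a key "5" rather than an array. A minimal sketch in plain JavaScript, assuming the field names from the structure above; the helper name newDailyDoc is hypothetical:

```javascript
// Hypothetical helper: builds a zero-filled stats document for one
// date / gender / age-bucket combination. Field names follow the
// structure shown in the answer; everything else is an assumption.
function newDailyDoc(date, gender, ageBucket) {
  const zeros = (n) => new Array(n).fill(0);
  return {
    date,
    gender,
    age: ageBucket,
    impression_hour: zeros(24),        // one counter per hour
    impression_minute: zeros(24 * 60), // one counter per minute of the day
    clicks_hour: zeros(24),
  };
}

const doc = newDailyDoc("20110430", "M", 1);
console.log(doc.date, doc.impression_hour.length, doc.impression_minute.length);
```

Such documents could be inserted ahead of time (for example, once per day per bucket) so that later increments only touch existing array slots.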
    

    There are obviously some tweaks to make here: appropriate indexes, maybe mashing date+gender+age into an _id. But that's the basic structure of click analytics with MongoDB. It's really easy to update impressions and clicks with something like { $inc : { "clicks_hour.0" : 1 } } (the dotted key has to be quoted). The update to the document is atomic, so several counters can be incremented in one operation. And it's actually pretty natural to report on: you already have an array containing your hourly or minute-level data points.
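
To make the update and reporting steps a bit more concrete, here is a rough sketch in plain JavaScript. It only builds the filter and $inc update for a click, and sums hourly arrays on the client. The helper names buildClickUpdate and sumHourly, the event shape, and the use of UTC are all assumptions, and the sketch presumes the matching document already exists with zero-filled arrays.

```javascript
// Hypothetical helper: turns one click event into a filter + update pair.
// Timestamps are read in UTC here (an assumption; use whatever time zone
// defines a "day" for your reports).
function buildClickUpdate(event) {
  const ts = event.timestamp;
  const pad = (n) => String(n).padStart(2, "0");
  const date =
    `${ts.getUTCFullYear()}${pad(ts.getUTCMonth() + 1)}${pad(ts.getUTCDate())}`;
  return {
    filter: { date, gender: event.gender, age: event.ageBucket },
    // Computed key, e.g. "clicks_hour.14" for a click at 14:05
    update: { $inc: { [`clicks_hour.${ts.getUTCHours()}`]: 1 } },
  };
}

// Hypothetical reporting helper: element-wise sum of one hourly array
// across several documents, e.g. all gender/age buckets for a single day.
function sumHourly(docs, field) {
  const total = new Array(24).fill(0);
  for (const doc of docs) {
    doc[field].forEach((count, hour) => {
      total[hour] += count;
    });
  }
  return total;
}

const { filter, update } = buildClickUpdate({
  timestamp: new Date(Date.UTC(2011, 3, 30, 14, 5)),
  gender: "M",
  ageBucket: 1,
});
console.log(JSON.stringify(filter), JSON.stringify(update));
```

With the official Node.js driver or mongosh, the pair could then be passed to something like collection.updateOne(filter, update), where collection is a hypothetical handle to the stats collection. The summed array from sumHourly can be fed straight to a chart, which is the "extra coding work on the client side" mentioned earlier.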

    Hopefully that points you in the right direction.
