Google Analytics database [closed]


北恋 2021-02-07 07:40

6 answers

太阳男子 (original poster)

2021-02-07 08:14

I can't know exactly how they implement it, but because I've built a product that extracts non-sampled, non-aggregated data from Google Analytics, I have learned a thing or two about the structure.

It makes sense that the data is populated via BigTable. BT offers data-locality awareness and map/reduce querying across n nodes.

Distinct counts

Whether a data service can provide distinct counts is a simple measure of how flexible its data model is, but it is typically also a measure of cost and performance.

Google Analytics is not built to do distinct counts, even though GA can count users across almost any dimension. It can't count, for example, sessions per ga:pagePath. How so? They only register a session with the first pageview in that session. This means we can only count how many landing pages have had a session. We have no session count for the other 99% of pages on your site. :/
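To make the landing-page effect concrete, here is a minimal sketch of what "register a session only with the first pageview" does to per-page session counts. The hit log, the `sessions_by_first_pageview` function and the page paths are all made up for illustration; this is not how GA is actually implemented.

```python
from collections import Counter

# Hypothetical hit log: (session_id, page_path), in arrival order.
hits = [
    ("s1", "/home"), ("s1", "/pricing"), ("s1", "/home"),
    ("s2", "/pricing"), ("s2", "/signup"),
    ("s3", "/home"),
]

def sessions_by_first_pageview(hits):
    """Credit each session only to the page of its first hit."""
    seen_sessions = set()
    counts = Counter()
    for session_id, page_path in hits:
        if session_id not in seen_sessions:
            seen_sessions.add(session_id)
            counts[page_path] += 1
    return counts

landing_counts = sessions_by_first_pageview(hits)
print(dict(landing_counts))
```

In this toy log, `/signup` never gets a session at all, and `/pricing` gets one even though two sessions viewed it.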

The reason for this is that Google made the choice NOT to do distinct counts at all. It simply doesn't scale well economically when serving millions of sites for free. They needed an approach where they could avoid counting distinct. A distinct count is all about sorting and grouping lists of IDs for every cell in a data intersection.

But... isn't it simple to count the distinct number of sessions on a ga:pagePath value? I'll answer this in a bit.

The user and data partitioning

The choice they made was to partition data on users (clientIds or userIds). When they know that clientId/userId X is only present in a certain table in BT, they can run a map/reduce function that counts users without worrying that the same user is present in another dataset. Otherwise they would be forced to store all clientIds/userIds in a list, group them, and then count them distinct. Since the current GA tracking script is called Universal Analytics, they have to be able to count users correctly, especially when focusing on cross-device tracking.
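A small sketch of why partitioning on the user helps, under the assumption that each partition counts its own users and the results are simply added. The hash function, partition count and function names are assumptions for illustration only, not GA's actual scheme.

```python
import zlib
from collections import defaultdict

# Hypothetical hits: (client_id, page_path).
hits = [
    ("u1", "/a"), ("u2", "/b"), ("u1", "/c"),
    ("u3", "/a"), ("u2", "/d"), ("u1", "/e"),
]

def partition_by_user(hits, n_partitions):
    """Send every hit of a given client_id to the same partition."""
    partitions = defaultdict(list)
    for client_id, page_path in hits:
        key = zlib.crc32(client_id.encode("utf-8")) % n_partitions
        partitions[key].append((client_id, page_path))
    return partitions

def partition_round_robin(hits, n_partitions):
    """Spread hits evenly, ignoring which user they belong to."""
    partitions = defaultdict(list)
    for i, hit in enumerate(hits):
        partitions[i % n_partitions].append(hit)
    return partitions

def sum_of_local_user_counts(partitions):
    """Map: count users inside each partition. Reduce: add the counts."""
    return sum(len({cid for cid, _ in part}) for part in partitions.values())

true_users = len({cid for cid, _ in hits})
by_user = sum_of_local_user_counts(partition_by_user(hits, 2))
round_robin = sum_of_local_user_counts(partition_round_robin(hits, 2))
print(true_users, by_user, round_robin)
```

With user-based partitioning the per-partition counts add up to the true number of users. With round-robin placement the same user appears in several partitions and is counted more than once, so a correct answer would need the ID lists to be merged first.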

OK, but how does this affect session count?

You have a set of users, each having multiple sessions, each having a list of page hits. When counting within a specific session looking for pagePaths, you will find the same page multiple times, but you must not count the page more than once. You need to write down that you've already seen this page. When you have traversed all pages within that session, you count the session only once per page. This procedure requires state/memory. And since the counting is probably done in parallel on the same server, you can't be sure that a specific session is handled by the same process, which makes the counting even more memory-consuming. Google decided not to chase that rabbit any longer and to simply accept that the session count is wrong for pagePath and other hit-scoped dimensions.
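The bookkeeping described above can be sketched as follows. This is only an illustration of the state a correct per-page session count needs; the hit log and the `sessions_per_page` function are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical hit log: (session_id, page_path).
hits = [
    ("s1", "/home"), ("s1", "/pricing"), ("s1", "/home"),
    ("s2", "/pricing"), ("s2", "/signup"),
    ("s3", "/home"),
]

def sessions_per_page(hits):
    """Count each session once per page it viewed.

    Returns the counts and the number of (session, page) pairs that
    had to be remembered while counting.
    """
    seen = defaultdict(set)  # session_id -> pages already counted
    counts = Counter()
    for session_id, page_path in hits:
        if page_path not in seen[session_id]:
            seen[session_id].add(page_path)
            counts[page_path] += 1
    remembered_pairs = sum(len(pages) for pages in seen.values())
    return counts, remembered_pairs

page_counts, remembered_pairs = sessions_per_page(hits)
print(dict(page_counts), remembered_pairs)
```

The `seen` structure grows with every distinct (session, page) pair, which is the memory cost the paragraph above refers to. If hits from one session were spread over several workers, each worker would need its own copy of that state, or the pairs would have to be merged afterwards.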

"Cube" storage

The reason I write "cube" is that I don't know exactly whether they use a traditional OLAP cube structure, but I know they have up to 100 cubes populated for answering different dimension/metric combinations.

By isolating/grouping dimensions in smaller cubes, the data won't explode exponentially like it would if they put everything in a single cube. The drawback is that not all data combinations are allowed, which we know is true. For example, ga:transactionId and ga:eventCategory can't be queried together.
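A rough back-of-the-envelope sketch of that trade-off. The dimension cardinalities and the three cube layouts below are invented for illustration; only the fact that transactionId and eventCategory cannot be combined comes from the observation above.

```python
from math import prod

# Hypothetical number of distinct values per dimension.
cardinality = {
    "pagePath": 10_000,
    "country": 200,
    "deviceCategory": 3,
    "eventCategory": 50,
    "transactionId": 100_000,
}

# Hypothetical grouping of dimensions into smaller cubes.
cubes = [
    {"pagePath", "country", "deviceCategory"},
    {"eventCategory", "country", "deviceCategory"},
    {"transactionId", "country"},
]

def max_cells(dims):
    """Upper bound on cells: the product of the dimension cardinalities."""
    return prod(cardinality[d] for d in dims)

def can_query(dims):
    """A combination is answerable only if one cube holds all its dimensions."""
    return any(set(dims) <= cube for cube in cubes)

single_cube_cells = max_cells(cardinality)
split_cube_cells = sum(max_cells(cube) for cube in cubes)

print(single_cube_cells, split_cube_cells)
print(can_query(["pagePath", "country"]))
print(can_query(["transactionId", "eventCategory"]))
```

With these made-up numbers the single cube has an upper bound of 3 * 10^13 cells, while the three small cubes together stay around 2.6 * 10^7. The price is that a combination spanning two cubes cannot be answered.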

By choosing this structure, the dataset can scale well both economically and performance-wise.
