Question
I have a web widget with 15,000,000 hits per month, and I log every session. When I generate a report, I'd like to know how many unique IPs there are. In normal SQL that would be easy, as I'd just do:
SELECT COUNT(*) FROM (SELECT DISTINCT IP FROM SESSIONS)
But as that's not possible on App Engine, I'm now looking into ways to do it. It doesn't need to be fast.
One solution I was thinking of is to have an empty Unique-IP table, then run a MapReduce job over all session entities: if an entity's IP is not in the table, I add it and add one to a counter. Then I'd have another MapReduce job that clears the table. Would this be crazy? If so, how would you do it?
Thanks!
Answer 1:
The mapreduce approach you suggest is exactly what you want. Don't forget to update the record inside a transaction in your task queue task, so that the job can run safely in parallel across many mappers.
In the future, reduce support will make this possible with a single straightforward mapreduce, with no hacking around with your own transactions and models.
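To make the map step concrete, here is a minimal sketch in plain Python rather than App Engine code. The names `UniqueIPStore`, `add_if_new` and `map_session` are hypothetical, and a `threading.Lock` stands in for the datastore transaction so that parallel mappers cannot both count the same IP.

```python
import threading


class UniqueIPStore:
    """In-memory stand-in (an assumption) for the Unique-IP table plus a counter."""

    def __init__(self):
        self._seen = set()
        self._count = 0
        # Plays the role of a datastore transaction in this sketch.
        self._lock = threading.Lock()

    def add_if_new(self, ip):
        """Atomically record ip; bump the counter only the first time it is seen."""
        with self._lock:
            if ip in self._seen:
                return False
            self._seen.add(ip)
            self._count += 1
            return True

    @property
    def count(self):
        return self._count


def map_session(session, store):
    """Mapper body: called once per session entity."""
    store.add_if_new(session["ip"])


# Three sessions, two distinct IPs, mapped in parallel.
sessions = [{"ip": "10.0.0.1"}, {"ip": "10.0.0.2"}, {"ip": "10.0.0.1"}]
store = UniqueIPStore()
threads = [threading.Thread(target=map_session, args=(s, store)) for s in sessions]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(store.count)
```

Without the check and the increment happening atomically, two mappers that see the same new IP at the same moment could both increment the counter, which is why the transaction matters.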
Answer 2:
If time is not important, you could try a task queue with a task limit of 1. Basically, you'd use a recursive task that queries through a batch of log records until it hits DeadlineExceededError. Then you'd write the results to the datastore, and the task would enqueue itself again with the query's end cursor (or the last record's key) so that the next run resumes the fetch where the previous one stopped.
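The control flow of such a self-re-enqueuing task can be sketched in plain Python, under some loud assumptions: a list index stands in for the query cursor, a fixed `budget` of records per run stands in for hitting DeadlineExceededError, a plain list stands in for the task queue, and the `seen` set stands in for results persisted to the datastore between runs. The names `process_batch` and `run` are hypothetical.

```python
def process_batch(records, cursor, seen, budget):
    """Process up to `budget` records starting at `cursor`.

    Returns the cursor to resume from, or None when all records are consumed.
    """
    end = min(cursor + budget, len(records))
    for ip in records[cursor:end]:
        seen.add(ip)  # stand-in for writing results to the datastore
    return end if end < len(records) else None


def run(records, budget=2):
    """Drive the task until it stops re-enqueuing itself."""
    seen = set()
    queue = [0]  # stand-in for a task queue that runs one task at a time
    runs = 0
    while queue:
        cursor = queue.pop(0)
        next_cursor = process_batch(records, cursor, seen, budget)
        runs += 1
        if next_cursor is not None:
            queue.append(next_cursor)  # the task enqueues itself with the cursor
    return len(seen), runs


log = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.2"]
unique_count, task_runs = run(log, budget=2)
print(unique_count, task_runs)
```

Because only one task runs at a time, no transaction is needed around the shared state; the trade-off is that the whole count is strictly sequential.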
Source: https://stackoverflow.com/questions/5566908/calculating-unique-elements-from-huge-list-in-google-app-engine