Hey all, just getting started on Hadoop and curious what the best way to count unique visitors in MapReduce would be if your log files looked like this...
DAT
You could do it as a 2-stage operation:
First step, emit (username => siteID) and have the reducer collapse multiple occurrences of each siteID using a set.
Since you'd typically have far fewer sites than users, each user's set stays small, so this should be fine.
Then in the second step, you can emit (siteID => username)
and have the reducer do a simple count per siteID, since the duplicates have already been removed.
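
To make the two stages concrete, here is a rough in-memory sketch in Python. It is not real Hadoop code: run_job is a made-up helper that just imitates map, shuffle (group by key) and reduce, and it assumes each log line has already been parsed into a (username, siteID) pair, since the actual log format isn't shown here. All function names are hypothetical.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Imitate one MapReduce job in memory: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # the "shuffle": collect values per key
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Stage 1: key by username, then drop repeated siteIDs for that user.
def stage1_map(record):
    username, site_id = record  # assumed pre-parsed (username, siteID) pair
    yield username, site_id


def stage1_reduce(username, site_ids):
    # The set holds at most one entry per site, so it stays small.
    for site_id in sorted(set(site_ids)):
        yield username, site_id


# Stage 2: key by siteID and count; every (username, siteID) pair is now unique.
def stage2_map(record):
    username, site_id = record
    yield site_id, username


def stage2_reduce(site_id, usernames):
    yield site_id, len(usernames)


def count_unique_visitors(records):
    unique_pairs = run_job(records, stage1_map, stage1_reduce)
    return dict(run_job(unique_pairs, stage2_map, stage2_reduce))
```

For example, feeding it [("alice", "s1"), ("alice", "s1"), ("bob", "s1"), ("alice", "s2")] should give two unique visitors for s1 and one for s2. On a real cluster the same mapper/reducer logic would go into two chained jobs, with the output of the first used as the input of the second.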