What's the best way to count unique visitors with Hadoop?

無奈伤痛 2021-01-02 09:33

Hey all, I'm just getting started with Hadoop and am curious what the best way would be to count unique visitors in MapReduce if your log files looked like this...

DAT         


        
4 Answers
  •  迷失自我
    2021-01-02 10:16

    You could do it as a 2-stage operation:

    First step: have the mapper emit (username => siteID), and have the reducer collapse multiple occurrences of the same siteID using a set. Since you'd typically have far fewer sites than users, the set held for any one user stays small, so this should be fine.

    Then, in the second step, emit (siteID => username) and do a simple count per siteID, since the duplicates have already been removed.
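
To make the two stages concrete, here is a rough in-memory sketch in plain Python. It is not Hadoop code and does not use any Hadoop API; it only mimics the map/shuffle/reduce flow. The log sample in the question is cut off, so the input format here (a list of `(username, site_id)` pairs) is an assumption, and the function names `stage1`, `stage2` and `count_unique_visitors` are made up for illustration.

```python
from collections import defaultdict


def stage1(records):
    """Stage 1: key by username and drop repeated siteIDs with a set.

    `records` is assumed to be an iterable of (username, site_id) pairs;
    the real log format is not shown in the question.
    """
    sites_by_user = defaultdict(set)
    for username, site_id in records:        # map: emit (username => siteID)
        sites_by_user[username].add(site_id)  # reduce: the set removes duplicates
    # Output one (username, siteID) pair per distinct combination.
    return [(user, site)
            for user, sites in sites_by_user.items()
            for site in sorted(sites)]


def stage2(pairs):
    """Stage 2: key by siteID and count the (already distinct) usernames."""
    visitors_per_site = defaultdict(int)
    for _username, site_id in pairs:          # map: emit (siteID => username)
        visitors_per_site[site_id] += 1       # reduce: simple count
    return dict(visitors_per_site)


def count_unique_visitors(records):
    """Chain the two stages, as a two-job MapReduce pipeline would."""
    return stage2(stage1(records))


# Hypothetical sample data: alice hits s1 twice and bob hits s1 twice.
records = [
    ("alice", "s1"),
    ("alice", "s1"),
    ("bob", "s1"),
    ("alice", "s2"),
    ("bob", "s1"),
]
unique = count_unique_visitors(records)
print(unique)
```

In a real job, each stage would be its own mapper/reducer pair, with the output of the first job used as the input of the second.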
