Hey all, just getting started with Hadoop and curious what the best way in MapReduce would be to count unique visitors if your logfiles looked like this...
DAT
You could do it as a two-stage operation:

First step: emit (username => siteID), and have the reducer just collapse multiple occurrences of siteID using a set. Since you'd typically have far fewer sites than users, this should be fine.

Then in the second step, you can emit (siteID => username) and do a simple count, since the duplicates have already been removed.
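A rough sketch of those two jobs in Java (the tab delimiter and the field positions in the log line are placeholders, since the exact log format isn't shown here):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class UniqueVisitorsTwoStage {

    // Job 1 map: parse a log line and emit (username => siteID).
    // The tab split and field positions are assumptions about the log format.
    public static class UserSiteMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            String siteId = fields[1];     // assumed position
            String username = fields[3];   // assumed position
            context.write(new Text(username), new Text(siteId));
        }
    }

    // Job 1 reduce: collapse duplicate siteIDs for one user with a set and emit each
    // (username => siteID) pair exactly once; the set only ever holds one user's sites.
    public static class DedupReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text username, Iterable<Text> siteIds, Context context)
                throws IOException, InterruptedException {
            Set<String> seen = new HashSet<>();
            for (Text siteId : siteIds) {
                if (seen.add(siteId.toString())) {
                    context.write(username, new Text(siteId.toString()));
                }
            }
        }
    }

    // Job 2 map: read job 1's "username <TAB> siteID" output (e.g. with
    // KeyValueTextInputFormat) and swap it to (siteID => username).
    public static class SwapMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text username, Text siteId, Context context)
                throws IOException, InterruptedException {
            context.write(siteId, username);
        }
    }

    // Job 2 reduce: the pairs are already distinct, so the number of values per
    // siteID is the number of unique visitors.
    public static class CountReducer extends Reducer<Text, Text, Text, IntWritable> {
        @Override
        protected void reduce(Text siteId, Iterable<Text> usernames, Context context)
                throws IOException, InterruptedException {
            int uniqueVisitors = 0;
            for (Text ignored : usernames) {
                uniqueVisitors++;
            }
            context.write(siteId, new IntWritable(uniqueVisitors));
        }
    }
}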
It is often quicker to use HiveQL for simple tasks like this; Hive will translate your queries into Hadoop MapReduce jobs. In this case you may use:
SELECT COUNT(DISTINCT username) FROM logviews
You may find a more advanced example here: http://www.dataminelab.com/blog/calculating-unique-visitors-in-hadoop-and-hive/
My approach is similar to the one tzaman gave, with a small twist.
Note that the first reduce does not need to go over any of the records it gets presented with; you can simply examine the key and produce the output.
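One way to read that (the exact key layout is my assumption, it isn't spelled out above) is to make the first job's map output key the (siteID, username) pair itself, with a NullWritable value, so the reducer never touches the value iterator:

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// First-stage reducer for the composite-key reading of the answer above: the key
// already carries both siteID and username, so the values are never iterated;
// each call simply re-emits its key, which deduplicates the (siteID, username) pairs.
public class KeyOnlyDedupReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    @Override
    protected void reduce(Text siteAndUser, Iterable<NullWritable> ignored, Context context)
            throws IOException, InterruptedException {
        context.write(siteAndUser, NullWritable.get());
    }
}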
HTH
Use a secondary sort to sort on user ID. That way, you don't need to hold anything in memory -- just stream the data through, and increment your distinct counter every time you see the value change for a particular site ID.
Here is some documentation.
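A rough sketch of that reducer plus the partitioner/grouping-comparator wiring it relies on, using a composite text key of the form "siteID<TAB>username" (the key layout, field positions and driver wiring are my assumptions, not something spelled out in the answer above):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

public class UniqueVisitorsSecondarySort {

    // Map: key is "siteID<TAB>username" (natural Text ordering then sorts by siteID
    // first, username second), value repeats the username. Field positions are
    // placeholders for the real log format.
    public static class CompositeKeyMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            String siteId = fields[1];     // assumed position
            String username = fields[3];   // assumed position
            context.write(new Text(siteId + "\t" + username), new Text(username));
        }
    }

    // Partition on the siteID part only, so all keys for one site reach the same reducer.
    public static class SiteIdPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            String siteId = key.toString().split("\t", 2)[0];
            return (siteId.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // Group on the siteID part only, so one reduce() call sees a whole site's usernames.
    public static class SiteIdGroupingComparator extends WritableComparator {
        public SiteIdGroupingComparator() {
            super(Text.class, true);
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            String siteA = a.toString().split("\t", 2)[0];
            String siteB = b.toString().split("\t", 2)[0];
            return siteA.compareTo(siteB);
        }
    }

    // Usernames arrive sorted within each site, so counting distinct visitors only
    // needs a single "previous value" variable; nothing is buffered in memory.
    public static class DistinctCountReducer extends Reducer<Text, Text, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> sortedUsernames, Context context)
                throws IOException, InterruptedException {
            String siteId = key.toString().split("\t", 2)[0];
            long distinct = 0;
            String previous = null;
            for (Text username : sortedUsernames) {
                String current = username.toString();
                if (!current.equals(previous)) {   // value changed: a new distinct visitor
                    distinct++;
                    previous = current;
                }
            }
            context.write(new Text(siteId), new LongWritable(distinct));
        }
    }
}

The driver would then call job.setPartitionerClass(SiteIdPartitioner.class) and job.setGroupingComparatorClass(SiteIdGroupingComparator.class); the default sort on the composite key takes care of ordering the usernames within each site.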