Geographical distance by group - Applying a function on each pair of rows


清歌不尽 2020-12-21 05:04

I want to calculate the average geographical distance between a number of houses per province.

Suppose I have the following data.

df1 <- data.fram


      
      
        
          7 Answers

        
                    
            
            
                         
                
              
              
                
                   隐瞒了意图╮
                                             
                
                
                (original poster)
            
              
              
                2020-12-21 05:20
              

            
            
                        
Given that your data has millions of rows, this sounds like an "XY problem": the answer you really need is not the answer to the question you asked.

Let me give an analogy: if you want to know the average height of trees in a forest you do not measure every tree.  You just measure a large enough sample to ensure that your estimate has a high enough probability of being as close to the true average as you need.

Performing a brute-force calculation of the distance from every house to every other house will not only take excessive resources (even with optimised code), it will also provide far more decimal places than you could possibly need or than the data accuracy justifies (GPS coordinates are typically only correct to within a few meters at best).
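
To put a rough number on "excessive resources": the count of unordered pairs grows quadratically with the number of houses. The sketch below is only back-of-envelope arithmetic for one province of a million houses; the 8 bytes per distance is an assumption (one R double per stored distance), not something stated in the question.

```r
# Number of unordered house pairs for n houses: n * (n - 1) / 2
n <- 1e6
n_pairs <- n * (n - 1) / 2
n_pairs

# Assumed storage cost: one 8-byte double per pairwise distance, in terabytes
n_pairs * 8 / 1e12
```

That is on the order of 5e11 distances, which is why sampling pairs is the more practical route.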

So, I would recommend doing the calculation on a sample size that is only as large as required for the level of accuracy your problem demands.  For example, the following will provide an estimate on two million rows that is good to 4 significant figures within only a few seconds.  You can increase the accuracy by increasing the sample size, but given the uncertainty in the GPS coordinates themselves, I doubt this is warranted.

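library(geosphere)  # provides distHaversine()

sample.size <- 1e6
# distHaversine() expects columns in (lon, lat) order, so select them by name
lapply(split(df1[c("lon", "lat")], df1$province),
  function(x) {
    # two independent samples of rows, drawn with replacement
    s1 <- x[sample(nrow(x), sample.size, replace = TRUE), ]
    s2 <- x[sample(nrow(x), sample.size, replace = TRUE), ]
    mean(distHaversine(s1, s2))  # mean distance in metres
  })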
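
The same idea can be sketched in base R alone, which also makes it easy to report a standard error next to the estimate. This is a minimal illustration, not the method above verbatim: `haversine_m`, `mean_dist_estimate`, and the `toy` data are hypothetical names introduced here, and the standard error treats the sampled pairs as independent draws.

```r
# Haversine great-circle distance in metres; inputs in degrees.
# r = 6378137 m matches the default radius of geosphere::distHaversine().
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Estimate the mean pairwise distance from n random pairs, with its standard error.
mean_dist_estimate <- function(lon, lat, n) {
  i <- sample(length(lon), n, replace = TRUE)
  j <- sample(length(lon), n, replace = TRUE)
  d <- haversine_m(lon[i], lat[i], lon[j], lat[j])
  c(mean = mean(d), se = sd(d) / sqrt(n))
}

set.seed(1)  # hypothetical toy data: 10,000 houses scattered around lon 39, lat -76
toy <- data.frame(lon = rnorm(1e4, 39, 0.1), lat = rnorm(1e4, -76, 0.1))
mean_dist_estimate(toy$lon, toy$lat, n = 1e4)
mean_dist_estimate(toy$lon, toy$lat, n = 1e5)  # 10x the pairs, roughly sqrt(10) smaller se
```

Comparing the two calls shows how the standard error shrinks as the number of sampled pairs grows, without ever touching the full set of pairs.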


Some big data to test on:

N <- 1e6
df1 <- data.frame(
  province = c(rep(1, N), rep(2, N)),
  house = 1:(2 * N),
  # pmax() keeps the simulated latitudes within the valid range (>= -90)
  lat = c(rnorm(N, -76), pmax(rnorm(N, -85), -90)),
  lon = c(rnorm(N, 39), rnorm(N, -55, 2)))


To get a sense of the accuracy of this method, we can use bootstrapping.  For the following demo, I use just 100,000 rows of data so that we can perform 1000 bootstrap iterations in a short time:

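library(geosphere)  # provides distHaversine()

N <- 1e5
df1 <- data.frame(lat = rnorm(N, -76, 0.1), lon = rnorm(N, 39, 0.1))

# One bootstrap iteration: mean distance over N randomly paired rows.
# Caution: distHaversine() reads column 1 as longitude and column 2 as latitude.
# The figures printed below correspond to the column order as written here;
# for real coordinates, pass df1[c("lon", "lat")] instead.
dist.f <- function(i) {
    s1 <- df1[sample(N, replace = TRUE), ]
    s2 <- df1[sample(N, replace = TRUE), ]
    mean(distHaversine(s1, s2))
    }

boot.dist <- sapply(1:1000, dist.f)
mean(boot.dist)
# [1] 17580.63
sd(boot.dist)
# [1] 29.39302

hist(boot.dist, 20)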


In other words, for these test data the mean distance is 17,580 +/- 29 m, a coefficient of variation of about 0.17% (29 / 17,580), which is likely accurate enough for most purposes.  As I said, you can get more accuracy by increasing the sample size if you really need to.
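
Because the standard error of a mean shrinks roughly as 1/sqrt(n), the bootstrap figures give a rough way to size the sample for a target precision. A minimal sketch, assuming the spread of pairwise distances stays about the same as the sample grows; `required_n` is a hypothetical helper, not part of any package.

```r
# Scale a pilot run to the sample size needed for a target standard error.
# Relies on the standard error being proportional to 1 / sqrt(n).
required_n <- function(pilot_n, pilot_se, target_se) {
  ceiling(pilot_n * (pilot_se / target_se)^2)
}

# Pilot figures from the bootstrap demo: n = 1e5 pairs, sd of the estimate about 29.4 m
required_n(pilot_n = 1e5, pilot_se = 29.4, target_se = 10)   # pairs needed for +/- 10 m
required_n(pilot_n = 1e5, pilot_se = 29.4, target_se = 100)  # pairs needed for +/- 100 m
```

Halving the error therefore costs about four times as many sampled pairs, which is worth weighing against the few meters of uncertainty already present in the coordinates.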


    
             
                                                        
            
            
              
                