Geographical distance by group - Applying a function on each pair of rows

清歌不尽 2020-12-21 05:04

I want to calculate the average geographical distance between a number of houses per province.

Suppose I have the following data.

df1 <- data.fram         


        
7 Answers
  •  隐瞒了意图╮
    2020-12-21 05:40

    My 10 cents. You can:

    library(geosphere)  # provides distm() and distHaversine()

    # subset the province
    df1 <- df1[df1$province == 1, ]
    
    # get all pairs of row positions (house ids need not match row positions)
    all <- combn(nrow(df1), 2)
    
    # compute the haversine distance (in metres) for every pair
    distances <- numeric(ncol(all))
    for (col in seq_len(ncol(all))) {
      a <- all[1, col]
      b <- all[2, col]
      distances[col] <- distm(c(df1$lon[a], df1$lat[a]), c(df1$lon[b], df1$lat[b]), fun = distHaversine)
    }
    
    # calculate mean:
    mean(distances)
    # [1] 15379.21
    

    This gives you the mean distance for the selected province, which you can compare with the results of other methods, for example the sapply approach that was mentioned in the comments:

    df1 <- df1[which(df1$province==1),]
    mean(sapply(split(df1, df1$province), dist))
    # [1] 1.349036
    

    As you can see, the results differ. The dist function computes distances such as Euclidean over all columns of the data frame (here including province and house), and it cannot compute haversine or other geodesic distances. The geodist package has options that bring you closer to the sapply approach:

    library(geodist)
    library(magrittr)
    
    # defining the data
    df1 <- data.frame(province = c(1, 1, 1, 2, 2, 2),
                      house = c(1, 2, 3, 4, 5, 6),
                      lat = c(-76.6, -76.5, -76.4, -75.4, -80.9, -85.7), 
                      lon = c(39.2, 39.1, 39.3, 60.8, 53.3, 40.2))
    
    # defining the function: takes the list returned by split()
    give_distance <- function(resultofsplit){
      distances <- c()
      for (i in seq_along(resultofsplit)){
        sdf <- resultofsplit[[i]]
        sdf2 <- as.matrix(sdf[c("lon", "lat")])              # keep only the coordinates
        sdf3 <- geodist(x = sdf2, measure = "haversine")     # full pairwise matrix, in metres
        sdf4 <- sdf3[upper.tri(sdf3)]                        # each pair once, without the zero diagonal
        mean_dist <- mean(sdf4)
        distances <- c(distances, mean_dist)
      }
      return(distances)
    }
    
    split(df1, df1$province) %>% give_distance()
    #[1]  15379.21 793612.04
    

    That is, the function returns the mean distance for each province. I didn't manage to get give_distance to work with sapply, but this should already be more efficient.
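
One possible way to make this work with sapply is to write the helper so that it takes a single data frame (one province) rather than the whole list returned by split. The sketch below is only an illustration under a few assumptions: the names haversine and mean_pair_dist are hypothetical, and the haversine formula is written out in base R in place of geodist so that the block runs without extra packages. The radius 6378137 m is the default used by geosphere::distHaversine.

```r
# Haversine distance in metres between points given in degrees.
# Vectorised over its arguments; r defaults to 6378137 m (assumption:
# same radius as the default of geosphere::distHaversine).
haversine <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Mean pairwise distance for ONE data frame (a single province).
mean_pair_dist <- function(d) {
  if (nrow(d) < 2) return(NA_real_)   # a single house has no pairs
  idx <- combn(nrow(d), 2)            # pairs of row positions
  mean(haversine(d$lon[idx[1, ]], d$lat[idx[1, ]],
                 d$lon[idx[2, ]], d$lat[idx[2, ]]))
}

# Same sample data as in the answer above.
df1 <- data.frame(province = c(1, 1, 1, 2, 2, 2),
                  house = c(1, 2, 3, 4, 5, 6),
                  lat = c(-76.6, -76.5, -76.4, -75.4, -80.9, -85.7),
                  lon = c(39.2, 39.1, 39.3, 60.8, 53.3, 40.2))

# sapply calls the helper once per province and returns a named vector.
res <- sapply(split(df1, df1$province), mean_pair_dist)
print(res)
```

On the sample data this should come out close to the two per-province means printed above. Because the helper works on one group at a time, it can also be passed to tapply-style or dplyr-style grouping without changes.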
