R: Distm for big data? Calculating minimum distances between two matrices

南方客 2021-01-06 10:50

I have two matrices, one is 200K rows long, the other is 20K. For each row (which is a point) in the first matrix, I am trying to find which row (also a point) in the second

2 answers

广开言路 (original poster)

2021-01-06 11:17

The following uses much less memory because it handles one row at a time instead of building the full distance matrix, though it will be slower.

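library(geosphere)
# for each row of pixels.latlon, compute the distance to every row of grwl.latlon
# and keep only the index of the nearest one
rnum <- apply(pixels.latlon, 1, function(x) {
  dm <- distm(x, grwl.latlon, fun = distHaversine)
  which.min(dm)
})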
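For a sense of scale, the arithmetic behind the memory saving can be sketched as follows. This assumes the sizes from the question (200K rows in pixels.latlon, 20K in grwl.latlon) and 8 bytes per double; the variable names are illustrative only.

```r
# Assumed sizes, taken from the question text
n_pixels <- 200000   # rows in the larger matrix
n_grwl   <- 20000    # rows in the smaller matrix

# Full distance matrix: one 8-byte double per (pixel, grwl) pair
full_bytes <- n_pixels * n_grwl * 8
full_gib   <- full_bytes / 2^30      # about 29.8 GiB

# Row-at-a-time: only one row of distances is held at once
row_bytes <- n_grwl * 8              # 160000 bytes, about 156 KiB

cat(sprintf("full matrix: %.1f GiB, one row: %.0f KiB\n",
            full_gib, row_bytes / 1024))
```

The result vector of 200K indices is small by comparison, so peak memory is dominated by a single row of distances.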

Most of the time goes into evaluating the Haversine formula. Since you only need to know which point is closest, not the exact distances, a simpler distance measure will do. Here is an alternative based on this article, http://jonisalonen.com/2014/computing-distance-between-coordinates-can-be-simple-and-fast/, which also uses a quadratic approximation to the cosine (itself expensive to calculate):

# quadratic approximation to cos(latitude in degrees), fitted with lm over 0-90 (run once)
qcos <- lm(y ~ x + I(x^2), data.frame(x = 0:90, y = cos((0:90) * 2 * pi / 360)))$coefficients
cosadj <- function(lat) qcos[1] + lat * (qcos[2] + qcos[3] * lat)

# rough squared distance, in squared degrees
# x should be a single (lon, lat) point, y an (n*2) matrix of (lon, lat)
roughDist <- function(x, y) {
  latDev <- x[2] - y[, 2]
  lonDev <- (x[1] - y[, 1]) * cosadj(abs(x[2]))
  # no square root or scaling parameters: neither changes which point is nearest
  latDev * latDev + lonDev * lonDev
}
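Because only the ranking of distances matters, it is worth checking on a small case that the rough measure picks the same nearest point as an exact great-circle distance. The sketch below does this in base R. The haversine helper is hand-written for illustration (it is not geosphere's distHaversine), and the sample coordinates are made up.

```r
# Quadratic cosine approximation and rough distance, as defined in the answer
qcos <- lm(y ~ x + I(x^2),
           data.frame(x = 0:90, y = cos((0:90) * pi / 180)))$coefficients
cosadj <- function(lat) qcos[1] + lat * (qcos[2] + qcos[3] * lat)
roughDist <- function(x, y) {
  latDev <- x[2] - y[, 2]
  lonDev <- (x[1] - y[, 1]) * cosadj(abs(x[2]))
  latDev * latDev + lonDev * lonDev
}

# Hypothetical helper: plain Haversine distance in metres on a sphere
haversine <- function(x, y, r = 6378137) {
  rad  <- pi / 180
  dlat <- (y[, 2] - x[2]) * rad
  dlon <- (y[, 1] - x[1]) * rad
  a <- sin(dlat / 2)^2 + cos(x[2] * rad) * cos(y[, 2] * rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Made-up query point and candidates, as (lon, lat)
x <- c(10, 50)
y <- rbind(c(10.1, 50), c(12, 52), c(5, 45), c(10, 51))

rough <- roughDist(x, y)
hav   <- haversine(x, y)

which.min(rough)  # index 1 in both cases
which.min(hav)
```

The rough values are in squared degrees and the Haversine values in metres, so only their ordering is comparable, not their magnitudes.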

You can then replace distHaversine with this new function:

rnum <- apply(pixels.latlon, 1, function(x) {
  dm <- distm(x, grwl.latlon, fun = roughDist)
  which.min(dm)
})

On my machine this runs about three times faster than the Haversine version.
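
To reproduce a comparison like this on your own machine, a minimal timing sketch in base R might look like the following. It assumes small synthetic stand-ins for pixels.latlon and grwl.latlon (the sizes and coordinate ranges are arbitrary) and a hand-written haversine helper rather than geosphere, so the ratio you see may differ from the figure above.

```r
set.seed(1)
# Synthetic stand-ins (hypothetical sizes and extent); columns are (lon, lat)
pixels <- cbind(runif(200,  -10, 10), runif(200,  40, 60))
grwl   <- cbind(runif(2000, -10, 10), runif(2000, 40, 60))

# Rough distance, as defined in the answer
qcos <- lm(y ~ x + I(x^2),
           data.frame(x = 0:90, y = cos((0:90) * pi / 180)))$coefficients
cosadj <- function(lat) qcos[1] + lat * (qcos[2] + qcos[3] * lat)
roughDist <- function(x, y) {
  latDev <- x[2] - y[, 2]
  lonDev <- (x[1] - y[, 1]) * cosadj(abs(x[2]))
  latDev * latDev + lonDev * lonDev
}

# Hypothetical helper: plain Haversine distance in metres on a sphere
haversine <- function(x, y, r = 6378137) {
  rad  <- pi / 180
  dlat <- (y[, 2] - x[2]) * rad
  dlon <- (y[, 1] - x[1]) * rad
  a <- sin(dlat / 2)^2 + cos(x[2] * rad) * cos(y[, 2] * rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Index of the nearest reference point for every query point
nearest <- function(pts, ref, fun) {
  apply(pts, 1, function(x) which.min(fun(x, ref)))
}

t_hav   <- system.time(idx_hav   <- nearest(pixels, grwl, haversine))
t_rough <- system.time(idx_rough <- nearest(pixels, grwl, roughDist))

# Elapsed seconds for each version, and the share of points where both agree
print(c(haversine = t_hav[["elapsed"]], rough = t_rough[["elapsed"]]))
print(mean(idx_hav == idx_rough))
```

Timings on inputs this small are noisy, so increasing the row counts or repeating the runs gives a steadier estimate.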
