I want to calculate the average geographical distance between a number of houses per province.
Suppose I have the following data.
df1 <- data.fram
Given that your data has millions of rows, this sounds like an "XY problem", i.e. the answer you really need is not the answer to the question you asked.
Let me give an analogy: if you want to know the average height of trees in a forest, you do not measure every tree. You just measure a large enough sample to ensure that your estimate has a high enough probability of being as close to the true average as you need.
Performing a brute-force calculation of the distance from every house to every other house will not only take excessive resources (even with optimised code), it will also provide far more decimal places than you could possibly need, or than the data accuracy justifies (GPS coordinates are typically only correct to within a few meters at best).
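To put a rough number on the brute-force cost, here is a back-of-the-envelope sketch. The figure of one million houses in a single province is an assumption for illustration, as is storing each distance as an 8-byte double.

```r
# Number of unordered house pairs in a brute-force calculation.
n_houses <- 1e6                   # assumed: houses in one province
n_pairs  <- choose(n_houses, 2)   # n * (n - 1) / 2 distinct pairs
bytes    <- n_pairs * 8           # one double (8 bytes) per stored distance

cat(sprintf("pairs: %.3g\n", n_pairs))
cat(sprintf("storage if kept in memory: %.1f TB\n", bytes / 1e12))
```

That is about 5 x 10^11 distances per province, or roughly 4 TB if you tried to hold them all at once.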
So, I would recommend doing the calculation on a sample that is only as large as the level of accuracy your problem demands. For example, the following provides an estimate on two million rows that is good to roughly 3 to 4 significant figures within only a few seconds. You can increase the accuracy by increasing the sample size, but given the uncertainty in the GPS coordinates themselves, I doubt this is warranted.
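library(geosphere)  # provides distHaversine()

sample.size <- 1e6
lapply(split(df1[c("lon", "lat")], df1$province),
  function(x) {
    # draw two independent samples of houses (with replacement) and pair them up
    s1 <- x[sample(nrow(x), sample.size, replace = TRUE), ]
    s2 <- x[sample(nrow(x), sample.size, replace = TRUE), ]
    # distHaversine() expects columns in (longitude, latitude) order; result is in metres
    mean(distHaversine(s1, s2))
  })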
Some big data to test on:
N <- 1e6
df1 <- data.frame(
  province = c(rep(1, N), rep(2, N)),
  house = 1:(2 * N),
  # clamp at -90 so that every simulated latitude is valid
  lat = c(rnorm(N, -76), pmax(rnorm(N, -85), -90)),
  lon = c(rnorm(N, 39), rnorm(N, -55, 2)))
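As a sanity check on the sampling idea, here is a minimal sketch that compares the sampled estimate with the exact all-pairs mean on a data set small enough to enumerate. It uses a hand-written base-R haversine (`haversine_m`, a hypothetical helper, not part of geosphere) so that it runs without extra packages; the 500 simulated points and the 200,000 sampled pairs are arbitrary choices.

```r
set.seed(42)

# Haversine great-circle distance in metres, base R only.
# r defaults to 6378137 m, the default radius of geosphere::distHaversine().
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  rad  <- pi / 180
  dlat <- (lat2 - lat1) * rad
  dlon <- (lon2 - lon1) * rad
  a <- sin(dlat / 2)^2 + cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Small simulated data set, so that all pairs can be enumerated.
n   <- 500
pts <- data.frame(lon = rnorm(n, 39, 0.1), lat = rnorm(n, -76, 0.1))

# Exact answer: mean over all n * (n - 1) / 2 distinct pairs.
idx   <- which(upper.tri(diag(n)), arr.ind = TRUE)
exact <- mean(haversine_m(pts$lon[idx[, 1]], pts$lat[idx[, 1]],
                          pts$lon[idx[, 2]], pts$lat[idx[, 2]]))

# Sampled estimate: random pairs drawn with replacement.
k <- 2e5
i <- sample(n, k, replace = TRUE)
j <- sample(n, k, replace = TRUE)
estimate <- mean(haversine_m(pts$lon[i], pts$lat[i], pts$lon[j], pts$lat[j]))

rel_diff <- abs(estimate - exact) / exact
cat(sprintf("exact: %.1f m, sampled: %.1f m, relative difference: %.4f\n",
            exact, estimate, rel_diff))
```

Note that sampling with replacement occasionally pairs a house with itself (distance zero). This happens with probability 1/n per pair, so the resulting downward bias is negligible when each province has many houses.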
To get a sense of the accuracy of this method, we can use bootstrapping. For the following demo, I use just 100,000 rows of data so that we can perform 1000 bootstrap iterations in a short time:
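N <- 1e5
# columns are in the (longitude, latitude) order that distHaversine() expects
df1 <- data.frame(lon = rnorm(N, -76, 0.1), lat = rnorm(N, 39, 0.1))
dist.f <- function(i) {
  # one bootstrap iteration: resample both ends of each pair with replacement
  s1 <- df1[sample(N, replace = TRUE), ]
  s2 <- df1[sample(N, replace = TRUE), ]
  mean(distHaversine(s1, s2))
}
boot.dist <- sapply(1:1000, dist.f)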
mean(boot.dist)
# [1] 17580.63
sd(boot.dist)
# [1] 29.39302
hist(boot.dist, 20)
I.e. for these test data, the mean distance is 17,580 +/- 29 m. That is a coefficient of variation of about 0.17%, which is likely accurate enough for most purposes. As I said, you can get more accuracy by increasing the sample size if you really need to.
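If you do need more accuracy, a rough way to choose the sample size is to assume that the standard error of the mean shrinks as 1/sqrt(n), so halving the error takes four times as many pairs. The sketch below plugs in the bootstrap mean and standard deviation printed above; `pairs_needed` is a hypothetical helper, and the 0.05% target is only an example.

```r
# Values reported by the bootstrap above (n_boot pairs per iteration).
n_boot    <- 1e5
boot_mean <- 17580.63
boot_sd   <- 29.39302

# Assumption: the standard error of a sample mean shrinks as 1 / sqrt(n).
rel_se <- boot_sd / boot_mean   # relative standard error at n_boot pairs

# Hypothetical helper: pairs needed to reach a target relative standard error.
pairs_needed <- function(target_rel_se) {
  ceiling(n_boot * (rel_se / target_rel_se)^2)
}

pairs_needed(0.0005)   # example target: 0.05% relative error
```

For these test data that works out to roughly 1.1 million sampled pairs, which is still a tiny fraction of the full set of pairs.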