Removing Spatial Outliers (lat and long coordinates) in R

Question

I've done my best to read up on this, and I think I've found a process that fits, but if anyone has other ideas, functions, or methods, they would be much appreciated. I have a list of small data frames with different numbers of rows; each data frame contains several latitude and longitude coordinates in separate columns. For each item in the list separately, I need to remove any coordinate pair that may be an outlier and then find the mean center of the remaining coordinates (so in the end there should be one coordinate pair for each item in the list).

The approach I've read about is to compute the mean center (the mean of the latitudes and the mean of the longitudes), calculate the Euclidean distance from that mean center to each coordinate pair, and remove any point farther away than a chosen distance (let's say 100 m). The mean center of the remaining points is then the final result. This seems a bit convoluted to me, though, so if anyone has a better suggestion for removing coordinate outliers, I'd welcome it.

Here's some code that I have so far:

dfList <- structure(list(`43` = structure(list(date = c("43 2011-04-06", "43 2011-04-07", "43 2011-04-08"), identifier = c(43, 43, 43), lon = c(-117.23041303, -117.23040817, -117.23039471), lat = c(32.81217294, 32.81218158, 32.81218645)), .Names = c("date", "identifier", "lon", "lat"), row.names = 13:15, class = "data.frame"), `44` = structure(list(date = c("44 2011-04-06", "44 2011-04-07", "44 2011-04-08"), identifier = c(44, 44, 44), lon = c(-117.22864227, -117.22861559, -117.22862265), lat = c(32.81257756, 32.81257089, 32.81257197)), .Names = c("date", "identifier", "lon", "lat"), row.names = 19:21, class = "data.frame"), `46` = structure(list(date = c("46 2011-04-06", "46 2011-04-07", "46 2011-04-08", "46 2011-04-09", "46 2011-04-10", "46 2011-04-11"), identifier = c(46, 46, 46, 46, 46, 46), lon = c(-117.22992617, -117.2289396895, -117.22965116, -117.23003928, -117.229922602, -117.22969664), lat = c(32.81295118, 32.8128226975, 32.81317299, 32.81224457, 32.813018734, 32.81276993)), .Names = c("date", "identifier", "lon", "lat"), row.names = 25:30, class = "data.frame"), `47` = structure(list(date = c("47 2011-04-06", "47 2011-04-07"), identifier = c(47, 47), lon = c(-117.2274484, -117.22747116), lat = c(32.81205838, 32.81207607)), .Names = c("date", "identifier", "lon", "lat"), row.names = 31:32, class = "data.frame")), .Names = c("43", "44", "46", "47"))

lonMean <- lapply(dfList, function(x) mean(x$lon)) #taking mean for longs
latMean <- lapply(dfList, function(x) mean(x$lat)) #taking mean for lats
latLon <- mapply(c, lonMean, latMean, SIMPLIFY=FALSE)#combining coord lists into one

EDIT: What I need now is to calculate the distance between every coordinate pair in each item of the first list and the matching mean coordinate in the second list, and to remove any points from the first list whose distance is greater than 100 m, so that possible outliers are dropped. I've used dist and geodist (from the 'gmt' package) before, but I'm not sure how to use them with these two lists. Thanks so much for your help in advance; I'm not the most R-savvy person, so any help is much appreciated!

Answer 1:

Try this.

df <- do.call("rbind", dfList) # Flattens list into data frame, preserving 
                               # group identifier

# This function calculates the great-circle (haversine) distance
# in kilometers between two points given in decimal degrees
earth.dist <- function (long1, lat1, long2, lat2)
{
  rad <- pi/180
  a1 <- lat1 * rad
  a2 <- long1 * rad
  b1 <- lat2 * rad
  b2 <- long2 * rad
  dlon <- b2 - a2
  dlat <- b1 - a1
  a <- (sin(dlat/2))^2 + cos(a1) * cos(b1) * (sin(dlon/2))^2
  c <- 2 * atan2(sqrt(a), sqrt(1 - a))
  R <- 6378.145 # Earth radius in km
  d <- R * c
  return(d)
}

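# Distance (km) from each point to the mean center of its own group;
# ave() returns the per-identifier mean repeated for every row
df$dist <- earth.dist(df$lon, df$lat,
                      ave(df$lon, df$identifier), ave(df$lat, df$identifier))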

df[df$dist >= 0.1,] # Rows 100 m or more from the mean center (the outliers);
                    # use df[df$dist < 0.1,] to keep the remaining points
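
A minimal sketch of the per-item workflow described in the question: for each data frame, drop points farther than a threshold from that item's mean center, then take the mean center of what remains. The names haversine_m, robust_center, and toyList are hypothetical and only for illustration, and the Earth radius default is an assumption matching the value used in this answer.

```r
# Haversine distance in metres between points given in decimal degrees.
# radius_m is an assumed Earth radius (6378.145 km, as in earth.dist).
haversine_m <- function(lon1, lat1, lon2, lat2, radius_m = 6378145) {
  rad <- pi / 180
  dlat <- (lat2 - lat1) * rad
  dlon <- (lon2 - lon1) * rad
  a <- sin(dlat / 2)^2 + cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
  2 * radius_m * atan2(sqrt(a), sqrt(1 - a))
}

# Drop points farther than max_dist_m from the item's mean center,
# then return the mean center of the points that remain.
robust_center <- function(d, max_dist_m = 100) {
  dist_m <- haversine_m(d$lon, d$lat, mean(d$lon), mean(d$lat))
  keep <- d[dist_m <= max_dist_m, , drop = FALSE]
  c(lon = mean(keep$lon), lat = mean(keep$lat), n_kept = nrow(keep))
}

# Hypothetical toy input with the same shape as dfList:
# item "a" has five clustered points and one point roughly 190 m east of them.
toyList <- list(
  a = data.frame(lon = c(rep(-117.2304, 5), -117.2284),
                 lat = c(32.8121, 32.8122, 32.8123, 32.8122, 32.8122, 32.8122)),
  b = data.frame(lon = c(-117.2286, -117.2286),
                 lat = c(32.8125, 32.8126))
)

# One row per list item: final mean center and number of points kept
centers <- do.call(rbind, lapply(toyList, robust_center))
print(centers)
```

Replacing toyList with dfList should give one coordinate pair per item. One caveat with this method: an extreme outlier pulls the initial mean center toward itself, so with a tight threshold every point in an item can end up beyond the cutoff, in which case the result for that item is NaN. Measuring distances from the per-item median instead of the mean is one possible way to reduce that effect.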

Source: https://stackoverflow.com/questions/24439073/removing-spatial-outliers-lat-and-long-coordinates-in-r

Tags

list

latitude-longitude

distance