I started a \"free\" open-sourced project to create a new data set for pH of the earth oceans.
I started from the open data-set from NOAA and created a 2.45 millions row
So there are several things going on here. First, your dataset seems to have pH vs. depth, so while there are ~2.5 million rows in total, only ~200,000 of them have Depth == 0 - still a lot.
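If you want to verify those counts yourself, a quick check (re-using the same OSD_All.csv and the Depth column name assumed throughout this answer) is:

df <- read.csv("OSD_All.csv")
nrow(df)                          # total rows, ~2.45 million
sum(df$Depth == 0, na.rm=TRUE)    # surface measurements, ~200,000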
Second, to get the distance to the nearest coast you need a shapefile of coastlines. Fortunately one is available from the excellent Natural Earth website (the 10m-resolution ne_10m_coastline layer used below).
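If you prefer to fetch it from R rather than through a browser, something along these lines works (base R only; substitute the actual download link for the ne_10m_coastline zip from the Natural Earth site):

# optional: download and unpack the coastline shapefile from R
url <- " < URL of the ne_10m_coastline zip from naturalearthdata.com > "
download.file(url, destfile="ne_10m_coastline.zip", mode="wb")
unzip("ne_10m_coastline.zip")   # extracts ne_10m_coastline.shp, .dbf, .prj, ...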
Third, your data is in long/lat (so, units = degrees), but you want distance in km, so you need to transform your data (the coastline data above is also in long/lat and also needs to be transformed). One problem with transformations is that your data is evidently global, and any global transformation will necessarily be non-planar, so the accuracy will depend on the actual location. The right way to do this is to grid your data and then use a set of planar transformations appropriate to whichever grid cell your points fall in. This is beyond the scope of this question, though, so we'll use a global transformation (Mollweide) just to give you an idea of how it's done in R.
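Purely to illustrate that per-grid idea (it is not used anywhere below), one possible grid is UTM zones picked from longitude; the point coordinates here are made up:

library(sp); library(rgdal)   # for CRS(...), SpatialPoints(...), spTransform(...)
# pick a UTM zone from longitude and build the matching planar CRS
utm.zone <- function(long) as.integer(floor((long + 180)/6) + 1)
utm.crs  <- function(long, lat)
  CRS(sprintf("+proj=utm +zone=%d%s +datum=WGS84 +units=m +no_defs",
              utm.zone(long), if (lat >= 0) "" else " +south"))
# e.g. a single illustrative point, transformed into its local UTM zone
p     <- SpatialPoints(cbind(-70.5, 42.3),
                       proj4string=CRS("+proj=longlat +datum=WGS84"))
p.utm <- spTransform(p, utm.crs(-70.5, 42.3))

The local stretch of coastline would have to be transformed into the same zone before calling gDistance(...). The rest of the answer sticks with the single Mollweide transformation.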
library(rgdal) # for readOGR(...); loads package sp as well
library(rgeos) # for gDistance(...)
setwd(" < directory with all your files > ")
# WGS84 long/lat
wgs.84 <- "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
# ESRI:54009 world mollweide projection, units = meters
# see http://www.spatialreference.org/ref/esri/54009/
mollweide <- "+proj=moll +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"
df <- read.csv("OSD_All.csv")
sp.points <- SpatialPoints(df[df$Depth==0,c("Long","Lat")], proj4string=CRS(wgs.84))
coast <- readOGR(dsn=".",layer="ne_10m_coastline",p4s=wgs.84)
coast.moll <- spTransform(coast,CRS(mollweide))
point.moll <- spTransform(sp.points,CRS(mollweide))
set.seed(1) # for reproducible example
test <- sample(1:length(sp.points),10) # random sample of ten points
result <- sapply(test,function(i)gDistance(point.moll[i],coast.moll))
result/1000 # distance in km
# [1] 0.2185196 5.7132447 0.5302977 28.3381043 243.5410571 169.8712255 0.4182755 57.1516195 266.0498881 360.6789699
plot(coast)
points(sp.points[test],pch=20,col="red")
So this reads your dataset, extracts the rows where Depth==0, and converts them to a SpatialPoints object. Then we read the coastline shapefile from Natural Earth into a SpatialLines object, transform both to the Mollweide projection using spTransform(...), and finally use gDistance(...) from the rgeos package to calculate the minimum distance between each point and the nearest coast.
Again, it is important to remember that despite all the decimal places, these distances are just approximate.
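If you want a feel for how approximate, the geosphere package (assuming you have it installed) can compute geodesic distances on the WGS84 ellipsoid for a few of the sampled points; the gap between the two columns gives a rough idea of the projection error at each location:

library(geosphere)   # for dist2Line(...)
# geodesic point-to-line distances; slow on the 10m coastline, so only 3 points
check <- dist2Line(sp.points[test[1:3]], coast)     # distances in meters
cbind(mollweide.km=result[1:3]/1000, geodesic.km=check[,"distance"]/1000)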
One very big problem is speed: this process takes ~ 2 min for 1000 distances (on my system), so to run all 200,000 distances would take about 6.7 hours. One option, theoretically, would be to find a coastline database with a lower resolution.
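Natural Earth does publish lower-resolution coastlines (ne_50m_coastline and ne_110m_coastline); if you download one of those alongside the 10m layer, the swap is a one-liner, at the cost of cruder distances near shore:

# lower-resolution option: far fewer vertices, so gDistance(...) is much faster
coast.110      <- readOGR(dsn=".", layer="ne_110m_coastline", p4s=wgs.84)
coast.110.moll <- spTransform(coast.110, CRS(mollweide))
system.time(r110 <- sapply(test, function(i)gDistance(point.moll[i],coast.110.moll)))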
The code below will calculate all 201,000 distances.
## not run
## estimated run time ~ 7 hours
result <- sapply(1:length(point.moll), function(i)gDistance(point.moll[i],coast.moll))
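Once that (or the parallel version below) finishes, one way to get the distances back into the surface subset of the original data frame - the new column name and output file name here are arbitrary choices:

df.surface <- df[df$Depth==0,]             # same subset used to build sp.points
df.surface$dist.to.coast.km <- result/1000
write.csv(df.surface, "OSD_surface_with_dist.csv", row.names=FALSE)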
EDIT: OP's comment about the cores got me to thinking that this could be an instance where the improvement from parallelization might be worth the effort. So here is how you would run this (on Windows) using parallel processing.
library(foreach) # for foreach(...)
library(snow) # for makeCluster(...)
library(doSNOW) # for registerDoSNOW(...)
cl <- makeCluster(4,type="SOCK") # create a 4-processor cluster
registerDoSNOW(cl) # register the cluster
get.dist.parallel <- function(n) {
foreach(i=1:n, .combine=c, .packages="rgeos", .inorder=TRUE,
.export=c("point.moll","coast.moll")) %dopar% gDistance(point.moll[i],coast.moll)
}
get.dist.seq <- function(n) sapply(1:n,function(i)gDistance(point.moll[i],coast.moll))
identical(get.dist.seq(10),get.dist.parallel(10)) # same result?
# [1] TRUE
library(microbenchmark) # run "benchmark"
microbenchmark(get.dist.seq(1000),get.dist.parallel(1000),times=1)
# Unit: seconds
#                     expr       min        lq      mean    median        uq       max neval
#       get.dist.seq(1000) 140.19895 140.19895 140.19895 140.19895 140.19895 140.19895     1
#  get.dist.parallel(1000)  50.71218  50.71218  50.71218  50.71218  50.71218  50.71218     1
Using 4 cores improves processing speed by about a factor of 3. So, since 1,000 distances take a bit under a minute in parallel, the full ~200,000 should take a little under 3 hours.
Note that using times=1 is really an abuse of microbenchmark(...), as the whole point is to run the process multiple times and average the results, but I just didn't have the patience.
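If you do run the full set this way, remember to shut the workers down afterwards; something like:

## not run
## full parallel run over all ~200,000 surface points
result <- get.dist.parallel(length(point.moll))
stopCluster(cl)   # release the 4 worker processes when done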