ocean latitude longitude point distance from shore

隐瞒了意图╮ 2021-02-09 17:58

I started a "free" open-source project to create a new data set for the pH of the Earth's oceans.

I started from NOAA's open data set and created a 2.45-million-row

1 answer
  •  旧巷少年郎
    2021-02-09 18:03

    So there are several things going on here. First, your dataset seems to have pH vs. depth. So while there are ~ 2.5MM rows, there are only ~200,000 rows with depth=0 - still a lot.
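
    As a rough illustration of that filtering step, the sketch below builds a tiny made-up data frame and counts its surface rows. The values are invented; only the column names (Long, Lat, Depth) follow the code further down, and the pH column is an assumption about how the real file is laid out.

```r
# Toy stand-in for the real file; all values are invented for illustration.
toy <- data.frame(
  Long  = c(-30.5, -30.5, 12.1, 12.1, 150.0),
  Lat   = c( 40.2,  40.2, -5.3, -5.3,  10.0),
  Depth = c(  0,    10,    0,   25,     0),
  pH    = c(  8.1,   8.0,  8.2,  7.9,   8.1)   # hypothetical column
)

# Keep only the surface measurements (Depth == 0) and their coordinates.
surface <- toy[toy$Depth == 0, c("Long", "Lat")]
nrow(surface)   # number of surface points
```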

    Second, to get the distance to the nearest coast you need a shapefile of coastlines. Fortunately, one is available from the excellent Natural Earth website.

    Third, your data is in long/lat (so, units = degrees), but you want distance in km, so you need to transform your data (the coastline data above is also in long/lat and also needs to be transformed). One problem with transformations is that your data is evidently global, and any global transformation will necessarily be non-planar, so the accuracy will depend on the actual location. The right way to do this is to grid your data and then use a set of planar transformations appropriate to whichever grid cell each point falls in. That is beyond the scope of this question, though, so we'll use a global transformation (Mollweide) just to give you an idea of how it's done in R.
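
    The gridding approach is not worked out here, but one common way to pick a locally planar projection is by UTM zone. The sketch below is only an assumption about what such a grid could look like: the helper names utm.zone and utm.epsg are hypothetical, the Norway/Svalbard zone exceptions are ignored, and UTM is not meant for the polar regions.

```r
# Standard UTM zone number (1..60) from longitude in degrees.
utm.zone <- function(lon) {
  lon <- ((lon + 180) %% 360) - 180        # wrap into [-180, 180)
  floor((lon + 180) / 6) + 1               # zones are 6 degrees wide
}

# EPSG code of the matching WGS84 / UTM projection:
# 326xx for the northern hemisphere, 327xx for the southern.
utm.epsg <- function(lon, lat) {
  ifelse(lat >= 0, 32600, 32700) + utm.zone(lon)
}

utm.zone(c(-179.9, -3, 0, 3, 179.9))   # zone per longitude
utm.epsg(lon = 2.35, lat = 48.85)      # a point near Paris
```

    Points could then be split by zone and each group transformed with its own projection, at the cost of handling coastline segments that cross zone boundaries.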

    library(rgdal)   # for readOGR(...); loads package sp as well
    library(rgeos)   # for gDistance(...)
    
    setwd(" < directory with all your files > ")
    # WGS84 long/lat
    wgs.84    <- "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
    # ESRI:54009 world mollweide projection, units = meters
    # see http://www.spatialreference.org/ref/esri/54009/
    mollweide <- "+proj=moll +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"
    df        <- read.csv("OSD_All.csv")
    sp.points <- SpatialPoints(df[df$Depth==0,c("Long","Lat")], proj4string=CRS(wgs.84))
    
    coast  <- readOGR(dsn=".",layer="ne_10m_coastline",p4s=wgs.84)
    coast.moll <- spTransform(coast,CRS(mollweide))
    point.moll <- spTransform(sp.points,CRS(mollweide))
    
    set.seed(1)   # for reproducible example
    test   <- sample(1:length(sp.points),10)  # random sample of ten points
    result <- sapply(test,function(i)gDistance(point.moll[i],coast.moll))
    result/1000   # distance in km
    #  [1]   0.2185196   5.7132447   0.5302977  28.3381043 243.5410571 169.8712255   0.4182755  57.1516195 266.0498881 360.6789699
    
    plot(coast)
    points(sp.points[test],pch=20,col="red")
    

    So this reads your dataset, extracts rows where Depth==0, and converts that to a SpatialPoints object. Then we read the coastlines database downloaded from the link above into a SpatialLines object. Then we transform both to the Mollweide projection using spTransform(...), then we use gDistance(...) in the rgeos package to calculate the minimum distance between each point and the nearest coast.
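
    To make the last step less of a black box: in planar coordinates, the minimum distance from a point to a line is just the smallest of the distances to its individual segments. The base-R sketch below shows that idea on an invented L-shaped "coast"; the function names are hypothetical and this is not how rgeos is implemented internally, only the same geometry in miniature.

```r
# Shortest distance from point p to the segment a-b.
# p, a, b are length-2 numeric vectors in the same planar units.
dist.point.segment <- function(p, a, b) {
  ab   <- b - a
  len2 <- sum(ab^2)
  # Relative position of the perpendicular foot along a-b, clamped to [0, 1].
  u <- if (len2 == 0) 0 else max(0, min(1, sum((p - a) * ab) / len2))
  sqrt(sum((p - (a + u * ab))^2))
}

# Minimum over all consecutive vertex pairs of a polyline (n x 2 matrix).
dist.point.polyline <- function(p, line) {
  n <- nrow(line)
  min(sapply(seq_len(n - 1), function(i) {
    dist.point.segment(p, line[i, ], line[i + 1, ])
  }))
}

# Invented "coast": an L-shaped line, coordinates in meters.
coast.xy <- rbind(c(0, 0), c(1000, 0), c(1000, 1000))
dist.point.polyline(c(500, 300), coast.xy)     # nearest to the bottom edge
dist.point.polyline(c(1400, 1300), coast.xy)   # nearest to the end vertex
```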

    Again, it is important to remember that despite all the decimal places, these distances are just approximate.
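
    One way to get a feel for how approximate a projected distance is at a given location is to compare it with a great-circle distance between the same two points. The sketch below uses the haversine formula, which is a different technique from the projection used above, and assumes a spherical Earth with a mean radius of 6371 km; the function name is hypothetical.

```r
# Great-circle distance in km between (lon1, lat1) and (lon2, lat2), in degrees.
# Assumes a spherical Earth (mean radius 6371 km by default).
haversine.km <- function(lon1, lat1, lon2, lat2, radius = 6371) {
  to.rad <- pi / 180
  dlat <- (lat2 - lat1) * to.rad
  dlon <- (lon2 - lon1) * to.rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to.rad) * cos(lat2 * to.rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))   # pmin guards against rounding above 1
}

haversine.km(0, 0, 1, 0)     # one degree of longitude at the equator
haversine.km(0, 60, 1, 60)   # the same step at 60 degrees N is about half as long
```

    The second call also shows why degrees cannot be used directly as a distance: the length of one degree of longitude depends on latitude.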

    One very big problem is speed: this process takes ~ 2 min for 1000 distances (on my system), so to run all 200,000 distances would take about 6.7 hours. One option, theoretically, would be to find a coastline database with a lower resolution.

    The code below will calculate all 201,000 distances.

    ## not run
    ## estimated run time ~ 7 hours
    result <- sapply(1:length(point.moll), function(i)gDistance(point.moll[i],coast.moll))  # projected objects, so distances are in meters
    

    EDIT: OP's comment about the cores got me to thinking that this could be an instance where the improvement from parallelization might be worth the effort. So here is how you would run this (on Windows) using parallel processing.

    library(foreach)   # for foreach(...)
    library(snow)      # for makeCluster(...)
    library(doSNOW)    # for registerDoSNOW(...)
    
    cl <- makeCluster(4,type="SOCK")  # create a 4-processor cluster
    registerDoSNOW(cl)                # register the cluster
    
    get.dist.parallel <- function(n) {
      foreach(i=1:n, .combine=c, .packages="rgeos", .inorder=TRUE, 
              .export=c("point.moll","coast.moll")) %dopar% gDistance(point.moll[i],coast.moll)
    }
    get.dist.seq <- function(n) sapply(1:n,function(i)gDistance(point.moll[i],coast.moll))
    
    identical(get.dist.seq(10),get.dist.parallel(10))  # same result?
    # [1] TRUE
    library(microbenchmark)  # run "benchmark"
    microbenchmark(get.dist.seq(1000),get.dist.parallel(1000),times=1)
    # Unit: seconds
    #                     expr       min        lq      mean    median        uq       max neval
    #       get.dist.seq(1000) 140.19895 140.19895 140.19895 140.19895 140.19895 140.19895     1
    #  get.dist.parallel(1000)  50.71218  50.71218  50.71218  50.71218  50.71218  50.71218     1
    

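    Using 4 cores improves processing speed by just under a factor of 3 (about 140 s sequentially vs. 51 s in parallel for 1000 distances). So, since 1000 distances take a little under a minute in parallel, 100,000 should take a little less than 2 hours, and the full set of ~200,000 surface points roughly twice that.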
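
    The arithmetic behind that estimate can be checked in a few lines. The two timings are copied from the benchmark output above; the figure of 200,000 points is the approximate count of Depth==0 rows, and the extrapolation assumes run time scales linearly with the number of points.

```r
# Timings for 1000 distances, copied from the benchmark output (seconds).
t.seq <- 140.19895
t.par <- 50.71218

speedup  <- t.seq / t.par   # observed speed-up on 4 cores
n.points <- 200000          # approximate number of Depth == 0 rows

# Linear extrapolation to the full surface data set, in hours.
hours.seq <- t.seq / 1000 * n.points / 3600
hours.par <- t.par / 1000 * n.points / 3600

round(c(speedup = speedup, hours.seq = hours.seq, hours.par = hours.par), 1)
```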

    Note that using times=1 is an abuse of microbenchmark(...) really, as the whole point is to run the process multiple times and average the results, but I just didn't have the patience.
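
    If you would rather not depend on snow and doSNOW, the parallel package that ships with R offers a similar socket cluster that also works outside Windows. The sketch below is a swapped-in alternative, not the method benchmarked above, and slow.square is a hypothetical stand-in for the per-point distance call so that the example runs on its own.

```r
library(parallel)   # ships with base R

# Hypothetical stand-in for the per-point work.
slow.square <- function(i) { Sys.sleep(0.01); i^2 }

cl <- makeCluster(2)                          # 2 local worker processes
res.par <- parSapply(cl, 1:20, slow.square)   # same call shape as sapply()
stopCluster(cl)                               # always release the workers

res.seq <- sapply(1:20, slow.square)
identical(res.seq, res.par)                   # same result as the sequential run?
```

    For the real computation, the workers would also need the data and the package, for example via clusterExport(cl, c("point.moll","coast.moll")) and clusterEvalQ(cl, library(rgeos)) before the parSapply(...) call.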
