How can I speed up spatial operations in `dplyr::mutate()`?

匿名 (未验证) 提交于 2019-12-03 08:46:08

问题:

I am working on a spatial problem using the sf package in conjunction with dplyr and purrr.

I would prefer to perform spatial operations inside a mutate call, like so:

simple_feature %>%   mutate(geometry_area = map_dbl(geometry, ~ as.double(st_area(.x)))) 

I like that this approach allows me to run a series of spatial operations using %>% and mutate.

I dislike that this approach seems to significantly increase the run-time of the sf functions (sometimes prohibitively) and I would appreciate hearing suggestions about how to overcome this speed loss.

Here is a reprex that illustrates the speed loss problem in detail.


Please note: this is not a minimal example and requires downloading a few packages and one file from an ESRI REST API. I hope you'll be kind with me ;)

The objective in this example is to add a new column indicating whether each North Carolina county (nc) intersects with any of the waterbodies polygons (nc_wtr), as shown in the image below:

I created a function that performs this calculation: st_intersects_any()

Then I benchmark that function on two datasets (nc and nc_1e4), first using st_intersects_any() by itself and then using it inside a mutate call.

## |TEST               |  ELAPSED| ## |:------------------|--------:| ## |bm_sf_small        |     0.01| ## |bm_sf_dplyr_small  |     1.22| ## |bm_sf_large        |     0.95| ## |bm_sf_dplyr_large  |   122.88| 

The benchmarks clearly show that the dplyr approach is substantially slower, and I'm hoping that someone has a suggestion for reducing or eliminating this speed loss while still using the dplyr approach.

If there are significantly faster ways to do this using data.table or some other method that I should check out please let me know about those as well.

Thanks!

Reprex

# Setup ----  library(lwgeom) # devtools::install_github('r-spatial/lwgeom)  library(tidyverse)  library(sf)  library(esri2sf) # devtools::install_github('yonghah/esri2sf') library(rbenchmark)  library(knitr)  # Create the new sf function: st_intersects_any ----  st_intersects_any <- function(x, y) {   st_intersects(x, y) %>%     map_lgl(~ length(.x) > 0) }  # Load data ---- # NC counties  nc <- read_sf(system.file("shape/nc.shp", package = "sf")) %>%   st_transform(32119)  nc_1e4 <- list(nc) %>%   rep(times = 1e2) %>%   reduce(rbind)  # NC watersheds  url <- "https://services.nconemap.gov/secure/rest/services/NC1Map_Watersheds/MapServer/2"  nc_wtr <- esri2sf(url) ## Warning: package 'httr' was built under R version 3.4.2 ##  ## Attaching package: 'jsonlite' ## The following object is masked from 'package:purrr': ##  ##     flatten ## [1] "Feature Layer" ## [1] "esriGeometryPolygon"  nc_wtr <- st_transform(nc_wtr, 32119) %>%   st_simplify(dTolerance = 100) # simplify the waterbodies geometries  # plot the data  par(mar = rep(.1, 4)) plot(st_geometry(nc), lwd = 1) plot(st_geometry(nc_wtr), col = alpha("blue", .3), lwd = 1.5, add = TRUE)  # Benchmark the two approaches  cols <- c("elapsed", "relative")  bm_sf_small <- benchmark({   st_intersects_any(nc, nc_wtr) }, columns = cols, replications = 1)  bm_sf_dplyr_small <- benchmark({   nc %>% transmute(INT = map_lgl(geometry, st_intersects_any, y = nc_wtr)) }, columns = cols, replications = 1) ## Warning: package 'bindrcpp' was built under R version 3.4.2  bm_sf_large <- benchmark({   st_intersects_any(nc_1e4, nc_wtr) }, columns = cols, replications = 1)  bm_sf_dplyr_large <- benchmark({   nc_1e4 %>% transmute(INT = map_lgl(geometry, st_intersects_any, y = nc_wtr)) }, columns = cols, replications = 1)  tests <- list(bm_sf_small, bm_sf_dplyr_small, bm_sf_large, bm_sf_dplyr_large)  tbl <- tibble(   TEST = c("bm_sf_small", "bm_sf_dplyr_small", "bm_sf_large", "bm_sf_dplyr_large"),   ELAPSED = map_dbl(tests, "elapsed") )  kable(tbl,format = "markdown", padding = 2)  ## |TEST               |  ELAPSED| ## |:------------------|--------:| ## |bm_sf_small        |     0.01| ## |bm_sf_dplyr_small  |     1.22| ## |bm_sf_large        |     0.95| ## |bm_sf_dplyr_large  |   122.88|      devtools::session_info() ## Session info ------------------------------------------------------------- ##  setting  value                        ##  version  R version 3.4.0 (2017-04-21) ##  system   x86_64, mingw32              ##  ui       RTerm                        ##  language (EN)                         ##  collate  English_United States.1252   ##  tz       America/Los_Angeles          ##  date     2018-01-31 ## Packages ----------------------------------------------------------------- ##  package    * version     date       source                             ##  assertthat   0.2.0       2017-04-11 CRAN (R 3.4.2)                     ##  backports    1.1.0       2017-05-22 CRAN (R 3.4.0)                     ##  base       * 3.4.0       2017-04-21 local                              ##  bindr        0.1         2016-11-13 CRAN (R 3.4.2)                     ##  bindrcpp   * 0.2         2017-06-17 CRAN (R 3.4.2)                     ##  broom        0.4.3       2017-11-20 CRAN (R 3.4.3)                     ##  cellranger   1.1.0       2016-07-27 CRAN (R 3.4.2)                     ##  class        7.3-14      2015-08-30 CRAN (R 3.4.0)                     ##  classInt     0.1-24      2017-04-16 CRAN (R 3.4.2)                     ##  cli          1.0.0       2017-11-05 CRAN (R 3.4.2)                     ##  colorspace   1.3-2       2016-12-14 CRAN (R 3.4.2)                     ##  compiler     3.4.0       2017-04-21 local                              ##  crayon       1.3.4       2017-10-30 Github (r-lib/crayon@b5221ab)      ##  curl         3.0         2017-10-06 CRAN (R 3.4.2)                     ##  datasets   * 3.4.0       2017-04-21 local                              ##  DBI          0.7         2017-06-18 CRAN (R 3.4.2)                     ##  devtools     1.13.2      2017-06-02 CRAN (R 3.4.0)                     ##  digest       0.6.13      2017-12-14 CRAN (R 3.4.3)                     ##  dplyr      * 0.7.4       2017-09-28 CRAN (R 3.4.2)                     ##  e1071        1.6-8       2017-02-02 CRAN (R 3.4.2)                     ##  esri2sf    * 0.1.0       2017-12-12 Github (yonghah/esri2sf@81d211f)   ##  evaluate     0.10.1      2017-06-24 CRAN (R 3.4.3)                     ##  forcats    * 0.2.0       2017-01-23 CRAN (R 3.4.3)                     ##  foreign      0.8-67      2016-09-13 CRAN (R 3.4.0)                     ##  ggplot2    * 2.2.1.9000  2017-12-02 Github (tidyverse/ggplot2@7b5c185) ##  glue         1.2.0.9000  2018-01-13 Github (tidyverse/glue@1592ee1)    ##  graphics   * 3.4.0       2017-04-21 local                              ##  grDevices  * 3.4.0       2017-04-21 local                              ##  grid         3.4.0       2017-04-21 local                              ##  gtable       0.2.0       2016-02-26 CRAN (R 3.4.2)                     ##  haven        1.1.0       2017-07-09 CRAN (R 3.4.2)                     ##  hms          0.4.0       2017-11-23 CRAN (R 3.4.3)                     ##  htmltools    0.3.6       2017-04-28 CRAN (R 3.4.0)                     ##  httr       * 1.3.1       2017-08-20 CRAN (R 3.4.2)                     ##  jsonlite   * 1.5         2017-06-01 CRAN (R 3.4.0)                     ##  knitr        1.18        2017-12-27 CRAN (R 3.4.3)                     ##  lattice      0.20-35     2017-03-25 CRAN (R 3.4.0)                     ##  lazyeval     0.2.1       2017-10-29 CRAN (R 3.4.2)                     ##  lubridate    1.7.1       2017-11-03 CRAN (R 3.4.2)                     ##  lwgeom     * 0.1-1       2017-12-16 Github (r-spatial/lwgeom@baf22c6)  ##  magrittr     1.5         2014-11-22 CRAN (R 3.4.0)                     ##  memoise      1.1.0       2017-04-21 CRAN (R 3.4.0)                     ##  methods    * 3.4.0       2017-04-21 local                              ##  mnormt       1.5-5       2016-10-15 CRAN (R 3.4.1)                     ##  modelr       0.1.1       2017-07-24 CRAN (R 3.4.2)                     ##  munsell      0.4.3       2016-02-13 CRAN (R 3.4.2)                     ##  nlme         3.1-131     2017-02-06 CRAN (R 3.4.0)                     ##  parallel     3.4.0       2017-04-21 local                              ##  pillar       1.0.99.9001 2018-01-16 Github (r-lib/pillar@9d96835)      ##  pkgconfig    2.0.1       2017-03-21 CRAN (R 3.4.2)                     ##  plyr         1.8.4       2016-06-08 CRAN (R 3.4.2)                     ##  psych        1.7.8       2017-09-09 CRAN (R 3.4.2)                     ##  purrr      * 0.2.4.9000  2017-12-05 Github (tidyverse/purrr@62b135a)   ##  R6           2.2.2       2017-06-17 CRAN (R 3.4.0)                     ##  rbenchmark * 1.0.0       2012-08-30 CRAN (R 3.4.1)                     ##  Rcpp         0.12.15     2018-01-20 CRAN (R 3.4.3)                     ##  readr      * 1.1.1       2017-05-16 CRAN (R 3.4.2)                     ##  readxl       1.0.0       2017-04-18 CRAN (R 3.4.2)                     ##  reshape2     1.4.2       2016-10-22 CRAN (R 3.4.2)                     ##  rlang        0.1.6       2017-12-21 CRAN (R 3.4.3)                     ##  rmarkdown    1.8         2017-11-17 CRAN (R 3.4.2)                     ##  rprojroot    1.3-2       2018-01-03 CRAN (R 3.4.3)                     ##  rvest        0.3.2       2016-06-17 CRAN (R 3.4.2)                     ##  scales       0.5.0.9000  2017-12-02 Github (hadley/scales@d767915)     ##  sf         * 0.6-1       2018-01-24 Github (r-spatial/sf@7ea67a5)      ##  stats      * 3.4.0       2017-04-21 local                              ##  stringi      1.1.6       2017-11-17 CRAN (R 3.4.2)                     ##  stringr    * 1.2.0       2017-02-18 CRAN (R 3.4.0)                     ##  tibble     * 1.4.1.9000  2018-01-18 Github (tidyverse/tibble@64fedbd)  ##  tidyr      * 0.7.2.9000  2018-01-13 Github (tidyverse/tidyr@74bd48f)   ##  tidyverse  * 1.2.1       2017-11-14 CRAN (R 3.4.3)                     ##  tools        3.4.0       2017-04-21 local                              ##  udunits2     0.13        2016-11-17 CRAN (R 3.4.1)                     ##  units        0.5-1       2018-01-08 CRAN (R 3.4.3)                     ##  utf8         1.1.3       2018-01-03 CRAN (R 3.4.3)                     ##  utils      * 3.4.0       2017-04-21 local                              ##  withr        2.1.1.9000  2018-01-13 Github (jimhester/withr@df18523)   ##  xml2         1.1.1       2017-01-24 CRAN (R 3.4.2)                     ##  yaml         2.1.14      2016-11-12 CRAN (R 3.4.0) 

回答1:

you can considerably speed-up this by simply dropping the unnecessary map_lgl call in the pipe:

bm_sf_dplyr_large_fast <- benchmark({   int_new <- nc_1e4 %>% mutate(INT = st_intersects_any(., nc_wtr)) }, columns = cols, replications = 1) bm_sf_dplyr_large_fast  # bm_sf_dplyr_large_fast # elapsed relative # 1   0.829        1 

The huge slow down depends from the fact that mapping over geometry rows is in this case detrimental, because you then do a looped one-to-multi polygon intersection.

Besides the overhead introduced by subsetting, I believe this is much slower than a straight-on multi-to-multi because you are probably mostly losing the "spatial indexing" capabilities of sf objects, which considerably speed-up intersect operations (see http://r-spatial.org/r/2017/06/22/spatial-index.html). (Also note that I substituted transmute' withmutate` - also that was introducing some overhead).

HTH



标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!