Closest point to a path

后端 未结 3 1285
感动是毒
感动是毒 2020-12-10 18:01

I have two sets of points, called path and centers. For each point in path, I would like an efficient method for finding the ID of the

相关标签:
3条回答
  • 2020-12-10 18:44

    Here is an Rcpp11 solution. Something similar might work with Rcpp with a few changes.

    #define RCPP11_PARALLEL_MINIMUM_SIZE 1000
    #include <Rcpp11>
    
    inline double square(double x){
        return x*x ;
    }
    
    // [[Rcpp::export]]
    IntegerVector closest( DataFrame path, DataFrame centers ){
    
        NumericVector path_x = path["x"], path_y = path["y"] ;
        NumericVector centers_x = centers["x"], centers_y = centers["y"] ;
    
        int n_paths = path_x.size(), n_centers = centers_x.size() ; 
    
    
        IntegerVector ids = sapply( seq_len(n_paths), [&](int i){
                double px = path_x[i], py=path_y[i] ;
    
                auto get_distance = [&](int j){
                    return  square(px - centers_x[j]) + square(py-centers_y[j]) ;
                } ;
    
                double distance = get_distance(0) ;
                int res=0;
    
                for( int j=1; j<n_centers; j++){
                    double d = get_distance(j)  ;
                    if(d < distance){
                        distance = d ;
                        res = j ;
                    }
                }
    
                return res + 1 ;
        }) ;
    
        return unique(ids) ;
    
    }
    

    I get :

    > set.seed(1)
    
    > n <- 10000
    
    > x <- 100 * cumprod(1 + rnorm(n, 1e-04, 0.002))
    
    > y <- 50 * cumprod(1 + rnorm(n, 1e-04, 0.002))
    
    > path <- data.frame(cbind(x = x, y = y))
    
    > centers <- expand.grid(x = seq(0, 500, by = 0.5) +
    +     rnorm(1001), y = seq(0, 500, by = 0.2) + rnorm(2501))
    
    > system.time(closest(path, centers))
       user  system elapsed
     84.740   0.141  21.392
    

    This takes advantage of automatic parallelization of sugar, i.e. sapply is run in parallel. The #define RCPP11_PARALLEL_MINIMUM_SIZE 1000 part is to force the parallel, which is otherwise by default only kicked in from 10000. But in that case since the inner computation are time consuming, it's worth it.

    Note that you need a development version of Rcpp11 because unique is broken in the released version.

    0 讨论(0)
  • 2020-12-10 18:48

    You can do this with nn2 from the RANN package. On my system, this computes the nearest center to each of your path points in under 2 seconds.

    library(RANN)
    system.time(closest <- nn2(centers[, 1:2], path, 1))
    
    #   user  system elapsed 
    #   1.41    0.14    1.55 
    
    
    
    sapply(closest, head)
    
    #      nn.idx   nn.dists
    # [1,] 247451 0.20334929
    # [2,] 250454 0.12326323
    # [3,] 250454 0.28540127
    # [4,] 253457 0.05178687
    # [5,] 253457 0.13324137
    # [6,] 253457 0.09009626
    

    Here's another example with 2.5 million candidate points that all fall within the extent of the path points (in your example, the centers have a much larger x and y range than do the path points). It's a little slower in this case.

    set.seed(1)
    centers2 <- cbind(runif(2.5e6, min(x), max(x)), runif(2.5e6, min(y), max(y)))
    system.time(closest2 <- nn2(centers2, path, 1))
    
    #   user  system elapsed 
    #   2.96    0.11    3.07 
    
    sapply(closest2, head)
    
    #       nn.idx    nn.dists
    # [1,]  730127 0.025803703
    # [2,]  375514 0.025999069
    # [3,] 2443707 0.047259283
    # [4,]   62780 0.022747930
    # [5,] 1431847 0.002482623
    # [6,] 2199405 0.028815865
    

    This can be compared to the output using sp::spDistsN1 (which is much slower for this problem):

    library(sp)
    apply(head(path), 1, function(x) which.min(spDistsN1(centers, x)))
    
    #       1       2       3       4       5       6 
    #  730127  375514 2443707   62780 1431847 2199405 
    

    Adding the point id to the path data.frame and reducing to unique values is trivial:

    path$closest.id <- closest$nn.idx
    output <- unique(path$closest.id)
    
    0 讨论(0)
  • 2020-12-10 18:54

    This solution reduces processing time for the sample dataset by almost half that achieved by the RANN solution.

    It can be installed using devtools::install_github("thell/Rcppnanoflann")

    The Rcppnanoflann solution takes advantage of Rcpp, RcppEigen and the nanoflann EigenMatrixAdaptor along with the c++11 to yield identical unique indexes to the original question.

    library(Rcppnanoflann)
    system.time(o.nano<-nnIndex(centers,path))
    
    ##    user  system elapsed 
    ##    0.62    0.05    0.67
    

    * using path and centers values as defined in the original question

    To achieve identical results to the original sample the RANN solution needs slight modification which we time here...

    library(RANN)
    system.time(o.flann<-unique(as.numeric(nn2(centers,path,1)$nn.idx)))
    
    ##    user  system elapsed 
    ##    1.24    0.07    1.30
    

    identical(o.flann,o.nano)
    
    ## [1] TRUE
    

    The working function of Rcppnanoflann takes advantage of Eigen's Map capabilities to create the input for a fixed type Eigen matrix from the given P dataframe.

    Testing was done with the RcppParallel package but the kd_tree object does not have a copy constructor, so the tree needed to be created for each thread which ate up any gains in the parallel query processing.

    RcppEigen and Rcpp11 currently don't play together so the idea of using Rcpp11's parallel sapply for the query isn't easily tested.


    // [[Rcpp::export]]
    std::vector<double> nnIndex(const Rcpp::DataFrame & P, const Rcpp::DataFrame & Q )
    {
      using namespace Eigen;
      using namespace Rcpp;
      using namespace nanoflann;
    
      // Matrix of points to be queried against.
      const NumericVector & Px(P[0]);
      const NumericVector & Py(P[1]);
      MatrixX2d M(Px.size(), 2);
      M.col(0) = VectorXd::Map(&Px[0],Px.size());
      M.col(1) = VectorXd::Map(&Py[0],Py.size());
    
      // The points to query.
      const NumericVector & Qx(Q[0]);
      const NumericVector & Qy(Q[1]);
      double query_pt[2];
      size_t query_count(Qx.size());
    
      // Populate a 2d tree.
      KD_Tree kd_tree( 2, M, 10 );
      kd_tree.index->buildIndex();
    
      std::set<size_t> nn;
      std::vector<double> out;
      out.reserve(query_count);
    
      size_t index(0);
      double quadrance;
      for( size_t i=0 ; i < query_count; ++i ) {
        query_pt[0] = Qx[i];
        query_pt[1] = Qy[i];
        kd_tree.index->knnSearch( &query_pt[0],1, &index, &quadrance);
        if( nn.emplace(index).second ) out.emplace_back(index+1);
      }
    
      return out;
    }
    
    0 讨论(0)
提交回复
热议问题