Question
I'm having some trouble optimising my algorithm:
I have a disk (centered at 0, with radius 1) filled with triangles (not necessarily of the same area/length). There could be a HUGE number of triangles (let's say from 1k to 300k triangles).
My goal is to find, as quickly as possible, which triangle a point belongs to.
The operation has to be repeated a large number of times (around 10k times).
For now the algorithm I'm using is: I compute the barycentric coordinates of the point in each triangle. If the first coefficient is between 0 and 1, I continue; if it's not, I stop. Then I compute the second coefficient with the same idea, and the third, and I do this for every triangle.
I can't think of a way to use the fact that I'm working on a disc (or the fact that I have a Euclidean distance to help me "target" the right triangles directly). If I try to compute the distance from my point to every "center" of a triangle:
1) it's already more operations than what I'm doing when I brute-force it with barycentric coordinates;
2) I would have to sort a vector containing the Euclidean distances from all triangles to my point;
3) I have absolutely no guarantee that the closest triangle to my point is the right triangle.
I feel like I'm missing something, and that I could pre-compute something to help me spot the right "area" before starting the brute-force part.
The algorithm is already parallelised (using OpenMP): I'm calling the function below inside a parallel for loop.
bool Triangle2D::is_in_triangle(Vector2d Point) {
    double denominator = ((Tri2D(1, 1) - Tri2D(2, 1)) * (Tri2D(0, 0) - Tri2D(2, 0))
                        + (Tri2D(2, 0) - Tri2D(1, 0)) * (Tri2D(0, 1) - Tri2D(2, 1)));
    // Computing the first coefficient
    double a = ((Tri2D(1, 1) - Tri2D(2, 1)) * (Point(0) - Tri2D(2, 0))
              + (Tri2D(2, 0) - Tri2D(1, 0)) * (Point(1) - Tri2D(2, 1))) / denominator;
    if (a < 0 || a > 1) {
        return false;
    }
    // Computing the second coefficient
    double b = ((Tri2D(2, 1) - Tri2D(0, 1)) * (Point(0) - Tri2D(2, 0))
              + (Tri2D(0, 0) - Tri2D(2, 0)) * (Point(1) - Tri2D(2, 1))) / denominator;
    if (b < 0 || b > 1) {
        return false;
    }
    // Computing the third coefficient
    double c = 1 - a - b;
    if (c < 0 || c > 1) {
        return false;
    }
    return true;
}
Next step is probably to look at GPU parallelisation, but I need to make sure that the idea behind the code is good enough.
For now it takes approximately 2min30 for 75k triangles and 10k points, but this isn't fast enough.
Edit: Triangle2D uses an Eigen matrix to store the coordinates.
Answer 1:
All long-bearded HPC professionals, kindly permit a somewhat scholastically elaborated approach here, which may (in my honest opinion) prove interesting for our Community Members who feel a bit more junior than you professionally feel yourselves, and who may be interested in a deeper look into performance-motivated code design, performance tweaking, and the other parallel-code risks and benefits that you already know so well and so deeply from your own hard-core HPC experience. Thank you.
a) The ALGORITHM (as-is) can get a ~2X speedup as a low-hanging fruit, + more surprises yet to come
b) An OTHER ALGORITHM may get a ~40X-80X speedup boost due to the geometry
c) TIPS FOR THE BEST PARALLEL CODE + ULTIMATE PERFORMANCE
GOAL : A target runtime for 10k points in 300k triangles would be 2-3 min on a computer with an i5 8500, 3 GHz, 6 cores, and an NVIDIA Quadro P400 (have to try GPU computing, not even sure if it's worth it)
While this may seem like a long journey, the problem is nice and deserves a closer look, so please bear with me during the flow of utmost-performance-motivated thinking.
a) ALGORITHM (as-is) ANALYSIS:
The as-is use of the Barycentric coordinate system is a nice trick; a straight implementation of it costs a bit more than about (20 FLOPs + 16 MEM/REG-I/O-ops) in the best case and slightly above (30 FLOPs + 30 MEM/REG-I/O-ops) in the worst.
There are a few polishing touches that may reduce these execution costs outright, by preventing some expensive and not even important operations from ever taking place:
--------------------------------------------------------------------------------------
double denominator = ( ( Tri2D( 1, 1 )
- Tri2D( 2, 1 ) // -------------------------- 2x MEM + OP-1.SUB
) * ( Tri2D( 0, 0 ) //--------------------- + OP-3.MUL
- Tri2D( 2, 0 ) //--------------------- 2x MEM + OP-2.SUB
) + ( Tri2D( 2, 0 ) //--------------- + OP-7.ADD
- Tri2D( 1, 0 ) //--------------- 2x MEM + OP-4.SUB
) * ( Tri2D( 0, 1 ) //--------- + OP-6.MUL
- Tri2D( 2, 1 ) //--------- 2x MEM + OP-5.SUB
)
);
// Computing the first coefficient ------------------------------------------------------------------------------------------------------
double a = ( ( Tri2D( 1, 1 )
- Tri2D( 2, 1 ) //-------------------------- 2x MEM + OP-8.SUB
) * ( Point(0) //------------------------ + OP-A.MUL
- Tri2D( 2, 0 ) //--------------------- 2x MEM + OP-9.SUB
) + ( Tri2D( 2, 0 ) //--------------- + OP-E.ADD
- Tri2D( 1, 0 ) //--------------- 2x MEM + OP-B.SUB
) * ( Point(1) //-------------- + OP-D.MUL
- Tri2D( 2, 1 ) //--------- 2x MEM + OP-C.SUB
)
) / denominator; //-------------------------- 1x REG + OP-F.DIV //----------- MAY DEFER THE MOST EXPENSIVE DIVISION UNTIL a third coeff is indeed first needed, if ever------------[3]
if (a < 0 || a>1) { // ----------------------------------------------------------------------------- a < 0 ~~ ( sign( a ) * sign( denominator ) ) < 0
return(false); // ------------------------------------------------------------------------------ a > 1 ~~ || a > denominator
}
// Computing the second coefficient
double b = ( ( Tri2D( 2, 1 ) - Tri2D( 0, 1 ) ) //--------- 2x MEM + OP-16.SUB
* ( Point(0) - Tri2D( 2, 0 ) ) //--------- 2x MEM + OP-17.SUB + OP-18.MUL
+ ( Tri2D( 0, 0 ) - Tri2D( 2, 0 ) ) //--------- 2x MEM + OP-19.SUB + OP-22.ADD
* ( Point(1) - Tri2D( 2, 1 ) ) //--------- 2x MEM + OP-20.SUB + OP-21.MUL
) / denominator; //-------------------------- 1x REG + OP-23.DIV //---------- MAY DEFER THE MOST EXPENSIVE DIVISION UNTIL a third coeff is indeed first needed, if ever -----------[3]
if (b < 0 || b>1) { // ----------------------------------------------------------------------------- b < 0 ~~ ( sign( b ) * sign( denominator ) ) < 0
return(false); // ------------------------------------------------------------------------------ b > 1 ~~ || b > denominator
}
// Computing the third coefficient
double c = 1 - a - b; // ------------------------------------------- 2x REG + OP-24.SUB + OP-25.SUB
// c = 1 - ( enumer_of_a + enumer_of_b ) / denominator; //--------------------------------------- MAY DEFER THE MOST EXPENSIVE DIVISION TO BE EXECUTED BUT HERE, IFF INDEED FIRST NEEDED <---HERE <----------[3]
Repeated re-evaluations that appear in the original may be explicitly crafted out by manual assignment/re-use, yet there is a chance a good optimising compiler may evict these under the -O3 optimisation flag. Do not hesitate to profile even this lowest-hanging fruit, to polish the most expensive parts.
//------------------------------------------------------------------
double Tri2D_11_sub_21 = ( Tri2D( 1, 1 )
- Tri2D( 2, 1 )
), //====================================================== 2x MEM + OP-a.SUB (REG re-used 2x)
Tri2D_20_sub_10 = ( Tri2D( 2, 0 )
- Tri2D( 1, 0 )
), //====================================================== 2x MEM + OP-b.SUB (REG re-used 2x)
Tri2D_00_sub_20 = ( Tri2D( 0, 0 )
- Tri2D( 2, 0 )
); //====================================================== 2x MEM + OP-c.SUB (REG re-used 1~2x)
//-----------------------
double denominator = ( ( /*
Tri2D( 1, 1 )
- Tri2D( 2, 1 ) // -------------------------- 2x MEM + OP-1.SUB (avoided by re-use) */
Tri2D_11_sub_21 //=========================================== 1x REG + OP-d.MUL
) * ( /*
Tri2D( 0, 0 ) //--------------------- + OP-3.MUL
- Tri2D( 2, 0 ) //--------------------- 2x MEM + OP-2.SUB (avoided by re-use) */
Tri2D_00_sub_20 //===================================== 1x REG + OP-f.ADD
) + ( /*
Tri2D( 2, 0 ) //--------------- + OP-7.ADD
- Tri2D( 1, 0 ) //--------------- 2x MEM + OP-4.SUB (avoided by re-use) */
Tri2D_20_sub_10 //=============================== 1x REG + OP-e.MUL
) * ( Tri2D( 0, 1 ) //--------- + OP-6.MUL
- Tri2D( 2, 1 ) //--------- 2x MEM + OP-5.SUB
)
);
//\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
//
// Computing the first coefficient ------------------------------------------------------------------------------------------------------
//
double enumer_of_a = ( ( /*
Tri2D( 1, 1 )
- Tri2D( 2, 1 ) //-------------------------- 2x MEM + OP-8.SUB (avoided by re-use) */
Tri2D_11_sub_21 //=========================================== 1x REG + OP-g.MUL
) * ( Point(0) //------------------------------------------ + OP-i.MUL
- Tri2D( 2, 0 ) //--------------------------------------- 2x MEM + OP-h.SUB
) + ( /*
Tri2D( 2, 0 ) //--------------- + OP-E.ADD
- Tri2D( 1, 0 ) //--------------- 2x MEM + OP-B.SUB (avoided by re-use) */
Tri2D_20_sub_10 //=============================== 1x REG + OP-l.ADD
) * ( Point(1) //-------------------------------- + OP-k.MUL
- Tri2D( 2, 1 ) //--------------------------- 2x MEM + OP-j.SUB
)
);/*denominator; *///------------------------ 1x REG + OP-F.DIV (avoided by DEFERRING THE MOST EXPENSIVE DIVISION UNTIL a third coeff is indeed first needed, if ever)-----------[3]
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~J.I.T./non-MISRA-C-RET-->
// TEST CONDITIONS FOR A CHEAPEST EVER J.I.T./non-MISRA-C-RET-->
//
if ( enumer_of_a > denominator // in a > 1, THE SIZE DECIDES: a / denominator > 1 iff enumer_of_a > denominator (given a consistent winding keeps denominator > 0), so the rather expensive .FDIV is avoided altogether
|| enumer_of_a * denominator < 0 ) return(false); // in a < 0, THE SIGN DECIDES, not the VALUE, so a cheaper .FMUL is used instead of the rather expensive .FDIV ~~ ( sign( a ) * sign( denominator ) ) < 0
//\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
//
// Computing the second coefficient
//
double enumer_of_b = ( ( Tri2D( 2, 1 ) - Tri2D( 0, 1 ) ) //---------------------------------------- 2x MEM + OP-m.SUB
* ( Point(0) - Tri2D( 2, 0 ) ) //---------------------------------------- 2x MEM + OP-n.SUB + OP-o.MUL
+ ( /*
Tri2D( 0, 0 ) - Tri2D( 2, 0 ) //--------- 2x MEM + OP-19.SUB + OP-22.ADD (avoided by re-use) */
Tri2D_00_sub_20 //======================================================== 1x REG + OP-p.ADD
)
* ( Point(1) - Tri2D( 2, 1 ) ) //---------------------------------------- 2x MEM + OP-r.SUB + OP-q.MUL
);/*denominator; *///------------------------ 1x REG + OP-23.DIV (avoided by DEFERRING THE MOST EXPENSIVE DIVISION UNTIL a third coeff is indeed first needed, if ever)-----------[3]
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~J.I.T./non-MISRA-C-RET-->
// TEST CONDITIONS FOR A 2nd CHEAPEST J.I.T./non-MISRA-C-RET-->
//
if ( enumer_of_b > denominator // in b > 1, THE SIZE DECIDES: b / denominator > 1 iff enumer_of_b > denominator (same winding assumption), so the rather expensive .FDIV is avoided altogether
|| enumer_of_b * denominator < 0 ) return(false); // in b < 0, THE SIGN DECIDES, not the VALUE, so a cheaper .FMUL is used instead of the rather expensive .FDIV ~~ ( sign( b ) * sign( denominator ) ) < 0
//\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
//
// Computing the third coefficient
//
double c = 1 - ( ( enumer_of_a
                 + enumer_of_b // NOTE: the numerators ADD here, since c = 1 - a - b
                 )
               / denominator
               ); // --------------------------------------------- 3x REG + OP-s.ADD + OP-t.FDIV + OP-u.SUB <---- THE MOST EXPENSIVE .FDIV EXECUTES BUT HERE, IFF INDEED FIRST NEEDED <---HERE <------------[3]
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~J.I.T./non-MISRA-C-RET-->
// TEST CONDITIONS FOR PRE-FINAL RET J.I.T./non-MISRA-C-RET-->
//
if ( c < 0 || c > 1 ) return( false );
return( true ); //~~~~~~~~~~~~~~~~~~~ "the-last-resort" RET-->
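To make the whole deferred-division idea easy to lift into a codebase, here is a compact, self-contained sketch of the test above. It is a sketch, not the answer's verbatim code: the free-function signature and the sign-safe fabs() form of the range checks are my additions (the fabs() form also tolerates triangles stored with either winding, i.e. a negative denominator).

#include <cmath>
#include <Eigen/Dense>

// Deferred-division point-in-triangle test: all three barycentric range
// checks run on numerators only, so the expensive FDIV never executes.
bool point_in_triangle(const Eigen::Vector2d& P,
                       const Eigen::Vector2d& V0,
                       const Eigen::Vector2d& V1,
                       const Eigen::Vector2d& V2)
{
    const double d  = (V1.y() - V2.y()) * (V0.x() - V2.x())
                    + (V2.x() - V1.x()) * (V0.y() - V2.y());   // 2x signed area
    const double na = (V1.y() - V2.y()) * (P.x() - V2.x())
                    + (V2.x() - V1.x()) * (P.y() - V2.y());    // numerator of a
    // a < 0 <=> signs differ;  a > 1 <=> |na| > |d|  -- no division needed
    if (na * d < 0.0 || std::fabs(na) > std::fabs(d)) return false;
    const double nb = (V2.y() - V0.y()) * (P.x() - V2.x())
                    + (V0.x() - V2.x()) * (P.y() - V2.y());    // numerator of b
    if (nb * d < 0.0 || std::fabs(nb) > std::fabs(d)) return false;
    // c = 1 - a - b = (d - na - nb) / d; with a, b already in [0,1],
    // only c >= 0 still needs checking -- again on the numerator only:
    return (d - na - nb) * d >= 0.0;
}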
b) OTHER APPROACH TO THE ALGORITHM:
Let's review another approach, which seems both faster and way cheaper, due to less data and fewer instructions, and which also promises high density once smart use of vectorised AVX-2 (or better, AVX-512) vector instructions gets harnessed per core. An ANIMATED, fully-INTERACTIVE explanation, together with an analytical re-formulation of the problem, is here.
A triple test of the point-to-line distance (each line being ax + by + c = 0) comes at the cheap cost of ~2 FMA3 instructions + a sign test (and even better if compact vector-4-in-1 AVX-2 / 8-in-1 AVX-512 VFMADD-s are used).
While it might be possible to decide "fast" whether a point makes sense to be tested against a respective triangle at all - say, by statically pre-"framing" each triangle in polar (R, Theta) coordinate space with a static, precomputed tuple of ( R_min, R_max, Theta_min, Theta_max ) and "fast"-discriminating each point that does not fit inside such a polar segment - the costs of doing this (the data-access-pattern costs + the costs of these "fast" instructions) will grow beyond any potentially "saved" instruction paths that need not take place (when a point is found outside a polar segment). Having achieved a performance of 24 point-in-triangle tests per ~9 CPU instructions per 6 CPU cores @ 3.0+ GHz, the polar-segment "pre-testing" suddenly becomes prohibitively expensive, not speaking of a second-order negative effect (way worse cache-hit/cache-miss ratios, given more data is to be stored and read for the "fast" pre-test: ~ +16B per triangle for the "framing" polar-segment tuple and +8B per point, with the worst impact on the cache hit/miss ratio).
That is clearly not a good direction for any further move, as the performance would get decreased, not increased - and increasing it is our global strategy here.
The Intel i5 8500 CPU can use only AVX-2, so the most compact use of 8 triangles per CPU clock tick per core is left for AVX-512 hardware, for achieving an even 2X higher performance, if needed.
TRIPLE-"point-above-line"-TEST per POINT per TRIANGLE:
---------------------------------------------------------------------------------
PRE-COMPUTE STATIC per TRIANGLE CONSTANTS:
LINE_1: C0__L1, C1__L1, C2__L1, bool_L1DistanceMustBePOSITIVE
LINE_2: C0__L2, C1__L2, C2__L2, bool_L2DistanceMustBePOSITIVE
LINE_3: C0__L3, C1__L3, C2__L3, bool_L3DistanceMustBePOSITIVE
TEST per TRIANGLE per POINT (Px,Py) - best executed in an AVX-vectorised fashion
LINE_1_______________________________________________________________
C0__L1 ( == pre-calc'd CONST = c1 / sqrt( a1^2 + b1^2 ) ) //
Px * C1__L1 ( == pre-calc'd CONST = a1 / sqrt( a1^2 + b1^2 ) ) // OP-1.FMA REG-Px,C1__L1,C0__L1
Py * C2__L1 ( == pre-calc'd CONST = b1 / sqrt( a1^2 + b1^2 ) ) // OP-2.FMA REG-Py,C2__L1, +
.GT./.LT. 0 // OP-3.SIG
LINE_2_______________________________________________________________
C0__L2 ( == pre-calc'd CONST = c2 / sqrt( a2^2 + b2^2 ) ) //
Px * C1__L2 ( == pre-calc'd CONST = a2 / sqrt( a2^2 + b2^2 ) ) // OP-4.FMA REG-Px,C1__L2,C0__L2
Py * C2__L2 ( == pre-calc'd CONST = b2 / sqrt( a2^2 + b2^2 ) ) // OP-5.FMA REG-Py,C2__L2, +
.GT./.LT. 0 // OP-6.SIG
LINE_3_______________________________________________________________
C0__L3 ( == pre-calc'd CONST = c3 / sqrt( a3^2 + b3^2 ) ) //
Px * C1__L3 ( == pre-calc'd CONST = a3 / sqrt( a3^2 + b3^2 ) ) // OP-7.FMA REG-Px,C1__L3,C0__L3
Py * C2__L3 ( == pre-calc'd CONST = b3 / sqrt( a3^2 + b3^2 ) ) // OP-8.FMA REG-Py,C2__L3, +
.GT./.LT. 0 // OP-9.SIG
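As a concrete (scalar, illustrative) realisation of the table above, a one-time pass can bake the three normalised line equations per triangle. The struct and function names below are mine, not a published API, and the bool_LxDistanceMustBePOSITIVE flags are folded away by flipping each line's signs once so that "inside" is always a non-negative distance:

#include <cmath>

// Each directed edge (x1,y1)->(x2,y2) yields a line a*x + b*y + c = 0;
// dividing by sqrt(a^2 + b^2) matches the pre-calc'd C0, C1, C2 constants.
struct EdgeConst { double C0, C1, C2; };   // signed distance = C1*x + C2*y + C0

EdgeConst precompute_edge(double x1, double y1,
                          double x2, double y2,
                          double x3, double y3)   // (x3,y3): the opposite vertex
{
    double a = y2 - y1, b = x1 - x2, c = x2 * y1 - x1 * y2;
    const double norm = std::sqrt(a * a + b * b);
    a /= norm;  b /= norm;  c /= norm;
    // Flip signs once so the opposite vertex (hence the interior) is positive:
    if (a * x3 + b * y3 + c < 0.0) { a = -a; b = -b; c = -c; }
    return { c, a, b };                    // C0 = c, C1 = a, C2 = b
}

// Per-point, per-triangle test: three FMA-shaped evaluations + sign tests.
inline bool inside(const EdgeConst& e1, const EdgeConst& e2, const EdgeConst& e3,
                   double Px, double Py)
{
    return e1.C1 * Px + e1.C2 * Py + e1.C0 >= 0.0
        && e2.C1 * Px + e2.C2 * Py + e2.C0 >= 0.0
        && e3.C1 * Px + e3.C2 * Py + e3.C0 >= 0.0;
}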
( using AVX-2 intrinsics or inlined assembler will deliver highest performance due to COMPACT 4-in-1 VFMADDs )
____________________________________________
| __________________________________________triangle A: C1__L1
| | ________________________________________triangle B: C1__L1
| | | ______________________________________triangle C: C1__L1
| | | | ____________________________________triangle D: C1__L1
| | | | |
| | | | | ______________________________
| | | | | | ____________________________triangle A: Px
| | | | | | | __________________________triangle B: Px
| | | | | | | | ________________________triangle C: Px
| | | | | | | | | ______________________triangle D: Px
| | | | | | | | | |
|1|2|3|4| | | | | |
| | | | | |1|2|3|4| ________________
| | | | | | | | | | | ______________triangle A: C0__L1
| | | | | | | | | | | | ____________triangle B: C0__L1
| | | | | | | | | | | | | __________triangle C: C0__L1
| | | | | | | | | | | | | | ________triangle D: C0__L1
| | | | | | | | | | | | | | |
|1|2|3|4| | | | | | | | | | |
| | | | | |1|2|3|4| | | | | |
| | | | | | | | | | |1|2|3|4|
(__m256d) __builtin_ia32_vfmaddpd256( (__v4df)__A, (__v4df)__B, (__v4df)__C ) // per CPU-core @ 3.0 GHz ( for actual uops durations check Agner's tables or the Intel CPU documentation )
can perform a  4-( point-in-triangle ) PARALLEL-test in just about ~9 ASSEMBLY INSTRUCTIONS / per CPU-core @ 3.0 GHz
           or 24-( point-in-triangle ) PARALLEL-tests in just about ~9 ASSEMBLY INSTRUCTIONS / per 6-core CPU
An AVX-512 empowered CPU can use 8-in-1 VFMADDs and
could perform an  8-( point-in-triangle ) PARALLEL-test in just about ~9 ASSEMBLY INSTRUCTIONS / per CPU-core @ 3.0 GHz
              or 48-( point-in-triangle ) PARALLEL-tests in just about ~9 ASSEMBLY INSTRUCTIONS / per 6-core CPU
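For readers who would rather not touch __builtin_ia32_vfmaddpd256 directly, here is a hedged sketch of the same 4-in-1 test using the portable <immintrin.h> FMA intrinsics (compile with -mavx2 -mfma; the SoA pointer layout mirrors the diagram above, and the function name is illustrative):

#include <immintrin.h>   // AVX2 + FMA intrinsics

// Pointers follow the structure-of-arrays layout from the diagram:
// e.g. C1_L1[0..3] hold the C1__L1 constants of triangles A..D.
// Returns a 4-bit mask; bit i is set when the point lies inside triangle i.
static inline int four_triangles_one_point(
        const double* C0_L1, const double* C1_L1, const double* C2_L1,
        const double* C0_L2, const double* C1_L2, const double* C2_L2,
        const double* C0_L3, const double* C1_L3, const double* C2_L3,
        double Px, double Py)
{
    const __m256d px   = _mm256_set1_pd(Px);
    const __m256d py   = _mm256_set1_pd(Py);
    const __m256d zero = _mm256_setzero_pd();
    // d = C1*Px + C2*Py + C0 for each edge-line, four triangles at a time:
    const __m256d d1 = _mm256_fmadd_pd(px, _mm256_loadu_pd(C1_L1),
                       _mm256_fmadd_pd(py, _mm256_loadu_pd(C2_L1),
                                           _mm256_loadu_pd(C0_L1)));
    const __m256d d2 = _mm256_fmadd_pd(px, _mm256_loadu_pd(C1_L2),
                       _mm256_fmadd_pd(py, _mm256_loadu_pd(C2_L2),
                                           _mm256_loadu_pd(C0_L2)));
    const __m256d d3 = _mm256_fmadd_pd(px, _mm256_loadu_pd(C1_L3),
                       _mm256_fmadd_pd(py, _mm256_loadu_pd(C2_L3),
                                           _mm256_loadu_pd(C0_L3)));
    // inside <=> all three signed distances are non-negative, per lane:
    const __m256d ok = _mm256_and_pd(_mm256_cmp_pd(d1, zero, _CMP_GE_OQ),
                       _mm256_and_pd(_mm256_cmp_pd(d2, zero, _CMP_GE_OQ),
                                     _mm256_cmp_pd(d3, zero, _CMP_GE_OQ)));
    return _mm256_movemask_pd(ok);
}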
c) TIPS FOR THE BEST PARALLEL CODE + ULTIMATE PERFORMANCE:
Step -1: GPU / CUDA costs v/s benefits
If your PhD mentor, professor, boss or Project Manager indeed insists that you develop a GPU-computing C++/CUDA code solution for this very problem, the best next step is to ask for a GPU card better suited for such a task than the one you posted.
Your indicated card, the Quadro P400 GPU, has just 2 SMX units (48KB L1-cache each) and is not quite fit for seriously meant parallel CUDA computing, having about 30X fewer processing devices to actually do any SIMD-thread-block computing, not mentioning its small memory and tiny L1/L2 on-SMX caches. So, after all the CUDA-related design and optimisation efforts, there will be no more than a single (!) pair of GPU-SMX warp32-wide thread-block executions in the SIMD threads, so there is no big circus to be expected from this GP107-based device, while there are 30+X better-equipped devices for some indeed highly performing parallel processing available COTS.
Step 0: pre-compute and pre-arrange data ( MAXIMISE cache-line COHERENCY ):
Here it makes sense to optimise the best macroscopic looping of the algorithm, so that you "benefit" most from cache hits (i.e. the best re-use of "fast" data that was already pre-fetched).
So, one may test whether it is cache-wise faster to distribute the work over N concurrent workers, who process disjunct working blocks of triangles and who each loop over the smallest memory area --- all the points (~ 10k * 2 * 4B ~ 80 kB) --- before moving on to the next triangle in the working block. Ensuring row-first array alignment in the memory area is vital (FORTRAN folks can tell a lot about the costs/benefits of just this trick for HPC-fast, compact/vectorised vector algebra and matrix processing).
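A minimal sketch of that loop ordering (illustrative names; EdgeConst and inside() are the scalar helpers sketched in section (b)): each worker owns a disjunct, contiguous block of triangles and sweeps ALL points, which stay cache-resident, before touching the next triangle's constants.

#include <cstddef>
#include <vector>

void map_points_to_triangles(const std::vector<EdgeConst>& edges,  // 3 per triangle
                             const std::vector<double>& Px,
                             const std::vector<double>& Py,
                             std::vector<int>& result)  // per point, pre-filled with -1
{
    const std::ptrdiff_t nTri = (std::ptrdiff_t)(edges.size() / 3);
    const std::size_t    nPts = Px.size();
    #pragma omp parallel for schedule(static)  // static split = disjunct blocks
    for (std::ptrdiff_t t = 0; t < nTri; ++t) {
        const EdgeConst& e1 = edges[3 * t];
        const EdgeConst& e2 = edges[3 * t + 1];
        const EdgeConst& e3 = edges[3 * t + 2];
        for (std::size_t p = 0; p < nPts; ++p)      // the ~80 kB hot loop
            if (inside(e1, e2, e3, Px[p], Py[p]))
                result[p] = (int)t;  // writes never collide: one triangle per point
    }
}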
The benefit? The cached coefficients will be re-used ~10k times (at a cost of ~0.5~1.0 [ns], instead of re-fetching costs of +100~300 [ns] if these had to be re-read from RAM). The difference, about ~200X~600X, is worth the effort of best aligning the workflow, sub-ordinated to the data-access patterns and the cache-line resources.
Results are atomic - any point will belong to one and only one (non-overlapping) triangle.
This simplifies the a-priori non-colliding and relatively sparse writes into a result vector, into which any detected point-inside-triangle can freely report the index of the triangle found.
Using this vector of results to potentially avoid re-testing points for which a match has already been made (index is non-negative) is not very efficient, as the costs of re-reading such an indication and re-arranging the compact alignment of the 4-in-1 or 8-in-1 point-in-triangle tests would become adversely expensive compared to the potential "savings" of not re-testing an already mapped point.
So the 10k points over a block of 300k triangles may yield a workflow of:
a split of some 300k / 6 cores ~  50k triangles / 1 core
                               ~  50k * 10k point-in-triangle tests per core
                               ~ 500M point-in-triangle tests per core @ 3.0+ GHz
                               ~ 125M AVX-2 vector-4-in-1-compact test-executions per core
                               ~ 125M tests x ~10 uops-instructions @ 3.0 GHz
... which is ~1.25G uops @ 3.0+ GHz ... second(s)? Yes, there is a pretty strong motivation to go down here, towards this ultimate performance, and to direct the further work this way.
SO, HERE WE ARE:
The principally achievable HPC target is in the range of a few seconds for 300+k triangles / 10+k points on just a 6-core AVX-2 i5 8500 @ 3.0+ GHz
Worth an effort, isn't it?
Answer 2:
I think you could make an array with boundaries for each triangle: top, bottom, right and left extremes. Then, compare your point to these boundaries. If it falls within one, THEN see if it really is within the triangle. That way, the 99.9% case doesn't involve a double floating-point multiply and a number of additions - just comparisons. The computationally expensive operations are only undertaken if the point is within the rectilinear extremes of the triangle.
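A minimal sketch of this pre-filter (struct/function names are illustrative): precompute one axis-aligned box per triangle, and run the exact test only on box hits.

#include <algorithm>

// Four cheap comparisons reject almost every triangle before any
// barycentric arithmetic runs.
struct BBox { double xmin, xmax, ymin, ymax; };

BBox bbox_of(double x0, double y0, double x1, double y1, double x2, double y2)
{
    return { std::min({x0, x1, x2}), std::max({x0, x1, x2}),
             std::min({y0, y1, y2}), std::max({y0, y1, y2}) };
}

inline bool maybe_inside(const BBox& b, double px, double py)
{
    return px >= b.xmin && px <= b.xmax && py >= b.ymin && py <= b.ymax;
}
// Usage: if (maybe_inside(box[t], px, py) && tri[t].is_in_triangle(p)) { ... }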
This could be sped up further still by, e.g., sorting on the topmost extreme of the triangles and using a binary search: find the triangle whose topmost extreme is the highest one still below your point, and then only check the triangles above it. That way, you'd only have to check just over half as many. If there were an upper bound on the height of the extremes of the triangles, you could check far fewer still. Note that this latter strategy will make your source code more complex - so it will be a case of determining how much effort you wish to put into the optimization for how much result. The first part seems like it would be fairly easy, and help a lot. The sorted list: more effort for only almost halving your operations. I'd see if the first strategy was enough for you first.
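A sketch of the sorted-extremes refinement (again with illustrative names, and assuming the intended reading: a triangle whose topmost extreme lies below the query point can never contain it, so a binary search can skip all such triangles at once):

#include <algorithm>
#include <vector>

struct TopEntry { double ytop; int tri; };   // sorted ascending by ytop, built once

std::size_t first_candidate(const std::vector<TopEntry>& byTop, double py)
{
    // Index of the first triangle whose topmost extreme is >= py;
    // every earlier entry is guaranteed not to contain the point.
    return std::lower_bound(byTop.begin(), byTop.end(), py,
               [](const TopEntry& e, double y) { return e.ytop < y; })
           - byTop.begin();
}
// Candidates are byTop[first_candidate(byTop, py)] onwards; each still goes
// through the bounding-box test above and, if that passes, the exact test.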
Answer 3:
Quad trees are nice if you have to do lots of single queries, but if you've got 10k to do all at once there is an algorithm purpose-built for this: The sweep line. I'm getting query times of less than 5 seconds on similar data in a single thread.
We're going to take a vertical line and sweep it from left to right across your 2d plane, and we're going to keep track of which triangles that line is intersecting at any point in time. So, whenever your line intersects one of your query points as it's sweeping, you only ever have to check the triangles that your vertical line is overlapping. No other triangles can contain that point.
I kind of hate the name "sweepline" in this case because it gives the mental image of smoothly traversing the plane, which it doesn't do. It just jumps to the next position of interest in order from left to right.
Depending on the ratio of triangles to queries, you could also put the overlapping triangles in a segment tree or interval tree by their y-coordinates, but this was actually slower for my data (or my implementation was bad. Also possible). Definitely worth a try though, especially if you can put off balancing the tree until it's needed.
The setup time is just 3 sorts:
- Get a vector of pointers to your triangles sorted by the minimum x-value of the tri
- Get a vector of pointers to your triangles sorted by the maximum x-value of the tri
- Get a vector of pointers to your points sorted by their x-value
Then we sweep:
- Initialize the "current positions" for each of your 3 sorted lists to 0
- Look for the lowest x-value at the current position in your 3 sorted lists. This will be the next x-position of your sweepline
- If it's the leftmost point of a triangle, then add that triangle to your "overlapping" vector
- If it's a query point, then check that point against all the currently overlapping tris
- If it's the rightmost point of a triangle, then remove it from the overlapping vector.
- increment the current position for that list
- Loop until you're out of points, or out of tris.
Why is this better than a quadtree? Because we're always keeping track of the tris currently intersecting our sweepline. This means we're re-using data between queries, and quadtrees don't really do that. Quadtrees will have much better single-query performance, but this kind of bulk lookup is made for the sweepline.
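A condensed sketch of that sweep (illustrative types; in real code Tri would also carry its vertices for the exact containment test, which is only forward-declared here). The three sorts and the jump-to-next-event loop are the substance:

#include <algorithm>
#include <vector>

struct Tri   { double xmin, xmax; int id; };   // per-triangle x-extremes
struct Query { double x, y;       int id; };

bool contains(const Tri& t, double x, double y);  // exact test, assumed elsewhere

void sweep(const std::vector<Tri>& tris, std::vector<Query> pts,
           std::vector<int>& result)              // per point id, pre-filled with -1
{
    std::vector<const Tri*> byMin, byMax;
    for (const Tri& t : tris) { byMin.push_back(&t); byMax.push_back(&t); }
    std::sort(byMin.begin(), byMin.end(), [](auto a, auto b){ return a->xmin < b->xmin; });
    std::sort(byMax.begin(), byMax.end(), [](auto a, auto b){ return a->xmax < b->xmax; });
    std::sort(pts.begin(), pts.end(), [](const Query& a, const Query& b){ return a.x < b.x; });

    std::vector<const Tri*> active;               // tris overlapping the sweepline
    std::size_t i = 0, j = 0, k = 0;
    while (k < pts.size()) {                      // jump to the next event, left to right
        if (i < byMin.size() && byMin[i]->xmin <= pts[k].x) {
            active.push_back(byMin[i++]);         // sweepline enters a triangle
        } else if (j < byMax.size() && byMax[j]->xmax < pts[k].x) {
            active.erase(std::find(active.begin(), active.end(), byMax[j++]));
        } else {
            for (const Tri* t : active)           // only the overlapped triangles
                if (contains(*t, pts[k].x, pts[k].y)) { result[pts[k].id] = t->id; break; }
            ++k;
        }
    }
}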
Answer 4:
Using boost's quadratic rtree to find the closest triangles did the job perfectly. The algo now runs in less than a minute (for 75k triangles and 100k points (10 times more!)).
I ended up building a tree by putting every triangle in a box, looking for the kNN of a point, and testing the corresponding triangles :) I was expecting way more trouble going into a new domain like spatial databases, but Boost is really an insane library.
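For the record, a hedged sketch of what this can look like with Boost.Geometry's rtree (the k = 8 candidate count and the helper names are illustrative choices, not from the answer; Triangle2D::is_in_triangle is the exact test from the question):

#include <iterator>
#include <vector>
#include <Eigen/Dense>
#include <boost/geometry.hpp>
#include <boost/geometry/geometries/point_xy.hpp>
#include <boost/geometry/geometries/box.hpp>
#include <boost/geometry/index/rtree.hpp>

namespace bg  = boost::geometry;
namespace bgi = boost::geometry::index;

using Pt  = bg::model::d2::point_xy<double>;
using Box = bg::model::box<Pt>;
using Val = std::pair<Box, int>;   // triangle bounding box + triangle index

// Build once with the packing constructor, one box per triangle:
//   std::vector<Val> values;   // values[i] = { box_of_triangle_i, i }
//   bgi::rtree<Val, bgi::quadratic<16>> tree(values.begin(), values.end());

int locate(const bgi::rtree<Val, bgi::quadratic<16>>& tree,
           double px, double py,
           std::vector<Triangle2D>& tris)   // Triangle2D from the question
{
    std::vector<Val> hits;
    tree.query(bgi::nearest(Pt(px, py), 8), std::back_inserter(hits));
    for (const Val& v : hits)              // exact test on the few candidates
        if (tris[v.second].is_in_triangle(Eigen::Vector2d(px, py)))
            return v.second;               // index of the containing triangle
    return -1;                             // k too small, or point in no triangle
}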
Source: https://stackoverflow.com/questions/56937577/optimisation-tips-to-find-in-which-triangle-a-point-belongs