How to get the K most distant points, given their coordinates?

无人及你 2021-02-20 12:41

We have a boring CSV with 10000 rows of ages (float), titles (enum/int), scores (float), ....

  • We have N columns each with int/float values in a table.
5 answers
  •  再見小時候
    2021-02-20 13:26

    Bottom Line Up Front: Dealing with multiple equally distant points and the Curse of Dimensionality are going to be larger problems than just finding the points. Spoiler alert: There's a surprise ending.

    I think this is an interesting question, but I'm bewildered by some of the answers. I think this is, in part, due to the sketches provided. You've no doubt noticed the answers look similar -- 2D, with clusters -- even though you indicated a wider scope was needed. Because others will eventually see this, I'm going to step through my thinking a bit slowly, so bear with me for the early part.

    It makes sense to start with a simplified example to see if we can generalize a solution with data that's easy to grasp and a linear 2D model is easiest of the easy.

    We don't need to calculate all the distances though. We just need the ones at the extremes. So we can then take the top and bottom few values:

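    import seaborn as sns

    # lin_2_D is assumed to be a pandas DataFrame with 'x' and 'y' columns
    # holding the 100 sample points. Keep only the 8 points at each extreme of x.
    right = lin_2_D.nlargest(8, ['x'])
    left = lin_2_D.nsmallest(8, ['x'])

    # Plot every point faintly, then overlay the two sets of extremes
    graph = sns.scatterplot(x="x", y="y", data=lin_2_D, color='gray', marker='+', alpha=.4)
    sns.scatterplot(x=right['x'], y=right['y'], color='red')
    sns.scatterplot(x=left['x'], y=left['y'], color='green')

    # Resize the figure that holds the plot
    fig = graph.figure
    fig.set_size_inches(8, 3)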
    

    What we have so far: of 100 points, we've eliminated 84 from the distance calculations. We can cut the remaining work further by ordering the points on one side and checking their distances only against the points on the other side.
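    A minimal sketch of that pruning step, assuming synthetic data in place of lin_2_D (the seed, slope, and noise level below are made up for illustration). It keeps the 8 smallest-x and 8 largest-x points and compares only left against right:

```python
import numpy as np

rng = np.random.default_rng(0)              # assumed seed, only for repeatability
x = rng.uniform(0, 100, size=100)           # hypothetical stand-in for lin_2_D['x']
y = 0.5 * x + rng.normal(0, 2, size=100)    # roughly linear, with a little noise
pts = np.column_stack([x, y])

order = np.argsort(x)
left, right = pts[order[:8]], pts[order[-8:]]   # the 16 extreme candidates

# 8 x 8 = 64 distances instead of 100 * 99 / 2 = 4950 for every pair
diff = left[:, None, :] - right[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Position of the largest left-to-right distance
i, j = np.unravel_index(dist.argmax(), dist.shape)
farthest_pair = (left[i], right[j])
print(farthest_pair, dist[i, j])
```

    This is a shortcut, not a guarantee: it only finds the true farthest pair when that pair really does sit at the x extremes, which is the case for a tight linear trend like this one.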

    You can imagine a case where you have a couple of data points way off the trend line that could be captured by taking the greatest or least y values, and all that starts to look like Walter Tross's top diagram. Add in a couple of extra clusters and you get what looks like his bottom diagram, and it appears that we're sort of making the same point.

    The problem with stopping here is that you mentioned you need a solution that works for any number of dimensions.

    The unfortunate part is that we run into four challenges:

    Challenge 1: As you increase the dimensions you can run into a large number of cases where you have multiple solutions when seeking midpoints. So you're looking for the k furthest points but have a large number of equally valid possible solutions and no way of prioritizing them. Here are two super easy examples that illustrate this:

    A) Here we have just four points and in only two dimensions. You really can't get any easier than this, right? The distance from red to green is trivial. But try to find the next furthest point and you'll see both of the black points are equidistant from both the red and green points. Imagine you wanted the furthest six points using the first graphs: you might have 20 or more points that are all equidistant.

    edit: I just noticed the red and green dots are at the edges of their circles rather than at the center, I'll update later but the point is the same.

    B) This is super easy to imagine: Think of a D&D 4 sided die. Four points of data in a three-dimensional space, all equidistant so it's known as a triangle-based pyramid. If you're looking for the closest two points, which two? You have 4 choose 2 (aka, 6) combinations possible. Getting rid of valid solutions can be a bit of a problem because invariably you face questions such as "why did we get rid of these and not this one?"
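    A quick check of example B, as a sketch: the coordinates below are one common way to place a regular tetrahedron (alternate corners of a cube) and are my choice, not part of the original example. All 6 pairs come out at the same distance, so distance alone cannot rank them.

```python
from itertools import combinations
from math import dist, isclose

# Regular tetrahedron built from alternate corners of a cube (assumed layout)
tetra = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]

# Distance for every unordered pair of vertices: 4 choose 2 = 6 pairs
pair_dists = {pair: dist(*pair) for pair in combinations(tetra, 2)}

# Every pair differs in exactly two coordinates by 2, so each distance is 2 * sqrt(2)
all_equal = all(isclose(d, 2 * 2 ** 0.5) for d in pair_dists.values())
print(len(pair_dists), all_equal)
```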

    Challenge 2: The Curse of Dimensionality. Nuff Said.

    Challenge 3: Revenge of The Curse of Dimensionality. Because you're looking for the most distant points, you have to have the x, y, z ... n coordinates for each point or you have to impute them. Now your data set is much larger and slower.

    Challenge 4: Because you're looking for the most distant points, dimension reduction techniques such as ridge and lasso are not going to be useful.
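    For readers who want Challenge 2 spelled out: one common illustration of the Curse of Dimensionality is that distances concentrate, so the farthest and nearest points end up almost equally far away. The sketch below is my own illustration, not something from the answers here, and it assumes uniformly random points; the function name, sample size, and seed are hypothetical.

```python
import numpy as np

def relative_contrast(dim, n_points=500, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, dim))   # uniform points in the unit hypercube
    query = rng.uniform(size=dim)
    d = np.linalg.norm(points - query, axis=1)   # Euclidean distance to each point
    return (d.max() - d.min()) / d.min()

# The contrast shrinks as the number of dimensions grows
contrasts = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, c in contrasts.items():
    print(dim, round(c, 3))
```

    When the contrast is small, "most distant" is decided by tiny differences, which is why ties and near-ties dominate in high dimensions.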

    So, what to do about this?

    Nothing.

    Wait. What?!?

    Not truly, exactly, and literally nothing. But nothing crazy. Instead, rely on a simple heuristic that is understandable and computationally easy. Paul C. Kainen puts it well:

    Intuitively, when a situation is sufficiently complex or uncertain, only the simplest methods are valid. Surprisingly, however, common-sense heuristics based on these robustly applicable techniques can yield results which are almost surely optimal.

    In this case, you have not the Curse of Dimensionality but rather the Blessing of Dimensionality. It's true you have a lot of points and they'll scale linearly as you seek other equidistant points (k), but the total dimensional volume of space will increase to the power of the dimensions. The number k of furthest points you're seeking is insignificant next to the total number of points. Hell, even k^2 becomes insignificant as the number of dimensions increases.

    Now, if you had low dimensionality, I would go with the other answers as a solution (except the ones that use nested for loops ... in NumPy or Pandas).

    If I were in your position, I'd be thinking about how I've got code in these other answers that I could use as a basis, and maybe wondering why I should trust this answer other than that it lays out a framework for thinking through the topic. Certainly, there should be some math and maybe somebody important saying the same thing.

    Let me refer to chapter 18 of Computer Intensive Methods in Control and Signal Processing and an expanded argument by analogy with some heavy(-ish) math. You can see from the above (the graph with the colored dots at the edges) that the center is removed, particularly if you followed the idea of removing the extreme y values. It's as though you put a balloon in a box. You could do this with a sphere in a cube too. Raise that into multiple dimensions and you have a hypersphere in a hypercube. You can read more about that relationship here.
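    To put numbers on the balloon-in-a-box picture, here is a small sketch using the standard volume formula for a d-dimensional ball of radius 1, pi^(d/2) / Gamma(d/2 + 1), inscribed in a cube of side 2. The function name is mine.

```python
from math import gamma, pi

def ball_to_cube_ratio(d):
    """Fraction of a hypercube's volume taken up by its inscribed hypersphere."""
    ball = pi ** (d / 2) / gamma(d / 2 + 1)   # volume of the unit-radius d-ball
    cube = 2 ** d                             # volume of the enclosing cube, side 2
    return ball / cube

for d in (2, 3, 6, 10):
    print(d, ball_to_cube_ratio(d))
```

    The ratio is pi/4 (about 0.785) in 2D and pi/6 (about 0.524) in 3D, and drops to roughly a quarter of one percent by 10D. In other words, almost all of the box's volume ends up in the corners, outside the balloon.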

    Finally, let's get to a heuristic:

    • Select the points that have the most max or min values per dimension. When/if you run out of them, pick ones that are close to those values if there isn't one at the min/max. Essentially, you're choosing the corners of a box. For a 2D graph you have four points; for 3D you have the 8 corners of the box (2^3).

    More accurately this would be a 4d or 5d (depending on how you might assign the marker shape and color) projected down to 3d. But you can easily see how this data cloud gives you the full range of dimensions.
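    One possible reading of the min/max heuristic above, as a sketch: rescale every column to [0, 1] and score each row by how close it sits to a min or max in each dimension, then keep the k best-scoring rows. The scoring rule and both function names are my assumptions, not a definitive implementation.

```python
import numpy as np

def corner_scores(data):
    """Score rows by closeness to per-column extremes (hypothetical scoring rule)."""
    data = np.asarray(data, dtype=float)
    lo, hi = data.min(axis=0), data.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)     # avoid dividing by zero on constant columns
    unit = (data - lo) / span                  # each column rescaled to [0, 1]
    # Per dimension: 1 at a min or max, 0 at mid-range; summed across dimensions
    return np.abs(2 * unit - 1).sum(axis=1)

def pick_extremes(data, k):
    """Indices of the k rows with the highest corner score."""
    scores = corner_scores(data)
    return np.argsort(-scores, kind="stable")[:k]

# Four corners of the unit square plus its center: the corners win
square = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5]])
chosen = pick_extremes(square, 4)
print(chosen)
```

    Note that this does not force the chosen rows to come from different corners. If several points crowd the same corner, you would still need a tie-breaking step.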

    Here is a quick check on learning; for purposes of ease, ignore the color/shape aspect: It's easy to graphically intuit that you have no problem with up to k points short of deciding what might be slightly closer. And you can see how you might need to randomize your selection if you have k < 2^D. And if you added another point you can see it (k + 1) would be in a centroid. So here is the check: If you had more points, where would they be? I guess I have to put this at the bottom -- limitation of markdown.

    So for a 6D data cloud, the values of k less than 64 (really 65 as we'll see in just a moment) points are pretty easy. But...

    • If you don't have a data cloud but instead have data that has a linear relationship, you'll have 2^(D-1) points. So, for that linear 2D space, you have a line; for linear 3D space, you'd have a plane. Then a rhomboid, etc. This is true even if your shape is curved. Rather than do this graph myself, I'm using the one from an excellent post by Inversion Labs on Best-fit Surfaces for 3D Data

    • If the number of points, k, is less than 2^D you need a process to decide what you don't use. Linear discriminant analysis should be on your shortlist. That said, you can probably satisfice the solution by randomly picking one.

    • For a single additional point (k = 1 + 2^D), you're looking for one that is as close to the center of the bounding space.

    • When k > 2^D, the possible solutions will scale not geometrically but factorially. That may not seem intuitive so let's go back to the two circles. For 2D you have just two points that could be candidates for being equidistant. But if that were 3D space and you rotated the points about the line, any point in what is now a ring would suffice as a solution for k. For a 4D example, they would form a sphere. Hyperspheres (n-spheres) from there on. Again, 2^D scaling.
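    For the k = 1 + 2^D case in the list above, a minimal sketch, assuming "the center of the bounding space" means the center of the axis-aligned bounding box (that interpretation, the function name, and the sample points are mine):

```python
from itertools import product

import numpy as np

def nearest_to_center(data):
    """Index of the row closest to the center of the axis-aligned bounding box."""
    data = np.asarray(data, dtype=float)
    center = (data.min(axis=0) + data.max(axis=0)) / 2
    return int(np.linalg.norm(data - center, axis=1).argmin())

# The 2^3 = 8 corners of a unit cube, plus two interior points (made-up data)
corners = np.array(list(product([0.0, 1.0], repeat=3)))
cloud = np.vstack([corners, [[0.4, 0.5, 0.6]], [[0.9, 0.1, 0.2]]])

center_idx = nearest_to_center(cloud)
print(len(corners), center_idx, cloud[center_idx])
```

    Here the bounding-box center is (0.5, 0.5, 0.5), so the extra point picked is the row at index 8, the one at (0.4, 0.5, 0.6).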

    One last thing: You should seriously look at xarray if you're not already familiar with it.

    Hope all this helps and I also hope you'll read through the links. It'll be worth the time.

    *It would be the same shape, centrally located, with the vertices at the 1/3 mark. So like having 27 six-sided dice arranged as a giant cube. Each vertex (or the point nearest it) would fix the solution. Your original k+1 would have to be relocated too. So you would select 2 of the 8 vertices. Final question: would it be worth calculating the distances of those points against each other (remember the diagonal is slightly longer than the edge) and then comparing them to the original 2^D points? Bluntly, no. Satisfice the solution.
