Optimal algorithm for returning top k values from an array of length N

前端 未结 5 472
一个人的身影
一个人的身影 2020-11-27 03:22

I have an array of n floats, and I wish to return the top k (in my case n ~ 100, k ~ 10)

Is there a known optimal solution path for this problem?

Could some

相关标签:
5条回答
  • 2020-11-27 03:36

    if you got a fancy gpu, I can tell you how to compute the top huge k of huge n instances all at the same time, so spread em out on a texture, per instance, and add blend onto a texture with their "height" as the position along the texture.

    But note you have to guess an acceptable range or know it, or you wont spread to your max detail you could have had.

    you clone positions. (you should get a 2, if there is 2 on it, 10 if there is 10 on it.) across all instances. (just say its all on an 8192x8192 texture, 64x64 of these "height" boxes.) and you also skip slots with 0 counts.

    then do a mipped add hierarchy, except you do it like a binary tree, you only treat like its 1 dimension, so take the 2 previous numbers and add them together, and keep doing it for every binary mip.

    then we use these mips (which have collected counts) to discover the approximate location of the k, using all mips in the process, do this on a final thread, youll take huge chunks out of it, then slowly use the more detailed mips to find the per pixel value, that k sits at.

    it makes more sense to do this, if it were all instanced again, then its a thread per threshold discovery. (just say you were running an ANN 128x128 times at once, (translational invarience anyone?) then it makes perfect sense.

    and achieve the threshold height for that count, but its approximate... so you get an approximate k. for n lists.

    You can do a little more work to get the exact k, but in a similarity match, but if you can get away with it being approximate, like if it were getting the top ~k activations, then dont worry about it.

    0 讨论(0)
  • 2020-11-27 03:37

    Short answer: no.

    Longer answer: yes, several mutually incompatible optimal solutions are known. It depends on n, k and what properties of the array you can guarantee.

    If you know nothing about the array the lower bound of complexity is obviously O(n), because all elements of the source array must be examined to see if they fit in the top 10. If you know anything about the source array which allows elements to be safely skipped you should use that knowledge.

    Similarly the upper complexity bound is O(n.log(n)) because you could always choose to find the answer by sorting the array (O(n.log(n)) and returning the first 10 items (O(1)).

    A linear search comparing each item to the tenth-highest found so far and inserting it at the appropriate place in the list of highest-found-so-far items if needed has similar complexity for the average and best-case scenarios and has a worst-case of O(kn) which is significantly better than O(n-squared). For the sizes you estimate I expect this method to perform well.

    If n were much larger (~10000) and k were increased in the same ratio it would probably be worthwhile implementing the quickselect algorithm. Quickselect performs better the more elements you want. If, however, k were not increased in scale with n you should stick with linear searching. Quickselect & friends modify the original array, so are less suitable if you cannot do this in place because you need a bunch more storage and a lot of copying that the algorithm complexity does not include.

    If n is huge (~1e20) you would want to find the k largest from each of a number of partitions of the input array and then find the k-largest from the aggregate of those results, so that you were not trying to analyse more data than you can fit in memory at a time, and to allow the operation to be parallelised efficiently.

    0 讨论(0)
  • 2020-11-27 04:01

    Method 1

    Since k is small, you can use the tournament method to find the kth largest. This method is described in Knuth's Art of Programming, Volume 3, Page 212.

    First create a tournament on n-k+2 elements. Something like a knockout tennis tournament. First you split into pairs and compare the members of the pairs (as if those two played a match and one lost). Then the winners, you split in to pairs again and so on, till you have a winner. You can view it as a tree, with the winner at the top.

    This takes n-k+1 compares exactly.

    Now the winner of these n-k+2 cannot be your kth largest element. Consider its path P up the tournament.

    Of the remaining k-2 now pick one, and follow that up the path P which will give you a new largest. Basically you sort of redo the tournament with the previous winner being replaced by one of the k-2 elements. Let P be the path of the new winner. Now pick another from the k-3 and follow the new path up and so on.

    At the end after you exhaust the k-2, replace the largest with -infinity and the largest of the tournament will be the kth largest. The elements you have thrown away are the top k-1 elements.

    This takes at most n - k + (k-1) [log (n-k+2)] comparisons to find the top k. It uses O(n) memory though.

    In terms of number of compares this should likely beat any selection algorithms.

    Method 2

    As an alternative, you can maintain a min-heap of k elements.

    First insert k elements. Then for each element of array, if it is less than the min element of the heap, throw it away. Otherwise, delete-min of the heap and insert the element from the array.

    At the end, the heap will contain the top k elements. This will take O(n log k) compares.

    Of course, if n is small, just sorting the array should be good enough. The code will be simpler too.

    0 讨论(0)
  • 2020-11-27 04:03

    You can do this in O(n) by using a selection algorithm. Find the kth largest element with the partition algorithm, then all the elements after it will be larger than it, and those are your top k.

    If you need those top k in sorted order, you can then sort them in O(k log k).

    0 讨论(0)
  • 2020-11-27 04:03

    Following is a heap based elegant solution in Java with complexity O(nlogK). It's not the most efficient but i think it's easy enough to understand. You can change Integer to Float if you want a float based solution

    import java.util.Arrays;
    import java.util.PriorityQueue;
    
    public class FindKLargest {
    
    public static void find(int[] A, int k) {
    
        PriorityQueue<Integer> pq = new PriorityQueue<>(k);// Min heap because the element has to be greater
                                                            // than the smallest element in the heap in order
                                                            // to be qualified to be a member of top k elements.
        for (int i = 0; i < A.length; i++) {
            if (i < k) // add until heap is filled with k elements.
                pq.add(A[i]);
            else if (pq.peek() < A[i]) { // check if it's bigger than the
                                            // smallest element in the heap.
                pq.poll();
                pq.add(A[i]);
            }
        }
        int[] topK = new int[pq.size()];
        int index = 0;
        while (index != k)
            topK[index++] = pq.poll();
        System.out.println(Arrays.toString(topK));
    }
    
    public static void main(String[] args) {
        int[] arr = { 1, -2, -3, -4, -5 };
        find(arr, 4);
    }
    

    }

    0 讨论(0)
提交回复
热议问题