Counting occurrences of numbers in a CUDA array

前端 未结 4 829
醉梦人生
醉梦人生 2020-12-10 16:37

I have an array of unsigned integers stored on the GPU with CUDA (typically 1000000 elements). I would like to count the occurrence of every number in the array

相关标签:
4条回答
  • 2020-12-10 17:00

    I suppose you can find help in the CUDA examples, specifically the histogram examples. They are part of the GPU computing SDK. You can find it here http://developer.nvidia.com/cuda-cc-sdk-code-samples#histogram. They even have a whitepaper explaining the algorithms.

    0 讨论(0)
  • 2020-12-10 17:10

    As others have said, you can use the sort & reduce_by_key approach to count frequencies. In my case, I needed to get mode of an array (maximum frequency/occurrence) so here is my solution:

    1 - First, we create two new arrays, one containing a copy of input data and another filled with ones to later reduce it (sum):

    // Input: [1 3 3 3 2 2 3]
    // *(Temp) dev_keys: [1 3 3 3 2 2 3]
    // *(Temp) dev_ones: [1 1 1 1 1 1 1]
    
    // Copy input data
    thrust::device_vector<int> dev_keys(myptr, myptr+size);
    
    // Fill an array with ones
    thrust::fill(dev_ones.begin(), dev_ones.end(), 1);
    

    2 - Then, we sort the keys since the reduce_by_key function needs the array to be sorted.

    // Sort keys (see below why)
    thrust::sort(dev_keys.begin(), dev_keys.end());
    

    3 - Later, we create two output vectors, for the (unique) keys and their frequencies:

    thrust::device_vector<int> output_keys(N);
    thrust::device_vector<int> output_freqs(N);
    

    4 - Finally, we perform the reduction by key:

    // Reduce contiguous keys: [1 3 3 3 2 2 3] => [1 3 2 1] Vs. [1 3 3 3 3 2 2] => [1 4 2] 
    thrust::pair<thrust::device_vector<int>::iterator, thrust::device_vector<int>::iterator> new_end;
    new_end = thrust::reduce_by_key(dev_keys.begin(), dev_keys.end(), dev_ones.begin(), output_keys.begin(), output_freqs.begin());
    

    5 - ...and if we want, we can get the most frequent element

    // Get most frequent element
    // Get index of the maximum frequency
    int num_keys = new_end.first  - output_keys.begin();
    thrust::device_vector<int>::iterator iter = thrust::max_element(output_freqs.begin(), output_freqs.begin() + num_keys);
    unsigned int index = iter - output_freqs.begin();
    
    int most_frequent_key = output_keys[index];
    int most_frequent_val = output_freqs[index];  // Frequencies
    
    0 讨论(0)
  • 2020-12-10 17:11

    I'm comparing two approaches suggested at the duplicate question thrust count occurence, namely,

    1. Using thrust::counting_iterator and thrust::upper_bound, following the histogram Thrust example;
    2. Using thrust::unique_copy and thrust::upper_bound.

    Below, please find a fully worked example.

    #include <time.h>       // --- time
    #include <stdlib.h>     // --- srand, rand
    #include <iostream>
    
    #include <thrust\host_vector.h>
    #include <thrust\device_vector.h>
    #include <thrust\sort.h>
    #include <thrust\iterator\zip_iterator.h>
    #include <thrust\unique.h>
    #include <thrust/binary_search.h>
    #include <thrust\adjacent_difference.h>
    
    #include "Utilities.cuh"
    #include "TimingGPU.cuh"
    
    //#define VERBOSE
    #define NO_HISTOGRAM
    
    /********/
    /* MAIN */
    /********/
    int main() {
    
        const int N = 1048576;
        //const int N = 20;
        //const int N = 128;
    
        TimingGPU timerGPU;
    
        // --- Initialize random seed
        srand(time(NULL));
    
        thrust::host_vector<int> h_code(N);
    
        for (int k = 0; k < N; k++) {
            // --- Generate random numbers between 0 and 9
            h_code[k] = (rand() % 10);
        }
    
        thrust::device_vector<int> d_code(h_code);
        //thrust::device_vector<unsigned int> d_counting(N);
    
        thrust::sort(d_code.begin(), d_code.end());
    
        h_code = d_code;
    
        timerGPU.StartCounter();
    
    #ifdef NO_HISTOGRAM
        // --- The number of d_cumsum bins is equal to the maximum value plus one
        int num_bins = d_code.back() + 1;
    
        thrust::device_vector<int> d_code_unique(num_bins);
        thrust::unique_copy(d_code.begin(), d_code.end(), d_code_unique.begin());
        thrust::device_vector<int> d_counting(num_bins);
        thrust::upper_bound(d_code.begin(), d_code.end(), d_code_unique.begin(), d_code_unique.end(), d_counting.begin());  
    #else
        thrust::device_vector<int> d_cumsum;
    
        // --- The number of d_cumsum bins is equal to the maximum value plus one
        int num_bins = d_code.back() + 1;
    
        // --- Resize d_cumsum storage
        d_cumsum.resize(num_bins);
    
        // --- Find the end of each bin of values - Cumulative d_cumsum
        thrust::counting_iterator<int> search_begin(0);
        thrust::upper_bound(d_code.begin(), d_code.end(), search_begin, search_begin + num_bins, d_cumsum.begin());
    
        // --- Compute the histogram by taking differences of the cumulative d_cumsum
        //thrust::device_vector<int> d_counting(num_bins);
        //thrust::adjacent_difference(d_cumsum.begin(), d_cumsum.end(), d_counting.begin());
    #endif
    
        printf("Timing GPU = %f\n", timerGPU.GetCounter());
    
    #ifdef VERBOSE
        thrust::host_vector<int> h_counting(d_counting);
        printf("After\n");
        for (int k = 0; k < N; k++) printf("code = %i\n", h_code[k]);
    #ifndef NO_HISTOGRAM
        thrust::host_vector<int> h_cumsum(d_cumsum);
        printf("\nCounting\n");
        for (int k = 0; k < num_bins; k++) printf("element = %i; counting = %i; cumsum = %i\n", k, h_counting[k], h_cumsum[k]);
    #else
        thrust::host_vector<int> h_code_unique(d_code_unique);
    
        printf("\nCounting\n");
        for (int k = 0; k < N; k++) printf("element = %i; counting = %i\n", h_code_unique[k], h_counting[k]);
    #endif
    #endif
    }
    

    The first approach has shown to be the fastest. On an NVIDIA GTX 960 card, I have had the following timings for a number of N = 1048576 array elements:

    First approach: 2.35ms
    First approach without thrust::adjacent_difference: 1.52
    Second approach: 4.67ms
    

    Please, note that there is no strict need to calculate the adjacent difference explicitly, since this operation can be manually done during a kernel processing, if needed.

    0 讨论(0)
  • 2020-12-10 17:19

    You can implement a histogram by first sorting the numbers, and then doing a keyed reduction.

    The most straightforward method would be to use thrust::sort and then thrust::reduce_by_key. It's also often much faster than ad hoc binning based on atomics. Here's an example.

    0 讨论(0)
提交回复
热议问题