Question
I'm very new to CUDA. I've read a few chapters from books and a lot of tutorials online, and I have written my own implementations of vector addition and multiplication.
I would like to move a little further, so let's say we want to implement a function that takes a sorted array of integers as input.
Our goal is to find the frequency of each integer in the array.
Sequentially, we could scan the array once to produce the output; the time complexity would be O(n).
Since the groups are different, I guess it must be possible to take advantage of CUDA.
Suppose this is the array
1
1
1
1
2
2
3
3
5
5
6
7
In order to achieve full parallelism, each thread would have to know exactly which part of the array it has to scan. This could only be achieved with another array, say int dataPosPerThread[], where dataPosPerThread[threadId] holds the starting position in the initial array for thread threadId. That way, each thread would know where to start and where to finish.
However, this way we won't gain anything, because it would take O(n) time to find those positions in the first place. The total cost would then be O(n) + cost_to_transfer_the_data_to_the_gpu + O(c) + cost_to_transfer_the_results_from_the_gpu, where O(c) is the constant time it would take the threads to find the final output, assuming of course that we have many different integers inside the initial array.
I would like to avoid this extra O(n) cost.
What I've thought so far is this: given an array of size arraySize, we specify the total number of threads to be used, say totalAmountOfThreads, which means that each thread will have to scan arraySize/totalAmountOfThreads values.
The first thread (id 0) would scan from position 0 up to position arraySize/totalAmountOfThreads - 1, the second thread would start from arraySize/totalAmountOfThreads, and so on.
The problem, though, is that a thread might end up working across different integer groups, or with a group whose values are split across other threads. For instance, in the above example, if we suppose we have 6 threads, each thread will take 2 integers of the array, so we will have something like this:
1 <-------- thread 0
1
1 <-------- thread 1
1
2 <-------- thread 2
2
3 <-------- thread 3
3
5 <-------- thread 4
5
6 <-------- thread 5
7
As you can see, thread 0 only has 1 values, but there are other 1 values being processed by thread 1. In order to achieve parallelism, though, these threads have to be working on unrelated data. Assuming we use this logic, each thread will compute the following results:
thread 0 => {value=1, total=2}
thread 1 => {value=1, total=2}
thread 2 => {value=2, total=2}
thread 3 => {value=3, total=2}
thread 4 => {value=5, total=2}
thread 5 => {{value=6, total=1}, {value=7, total=1}}
Having this result, what can be further achieved? Someone could suggest using an extra hash map, like unordered_map, which could efficiently update the total for each value computed by a single thread. However:
- unordered_map is not supported by the CUDA compiler.
- The threads would not be able to take advantage of shared memory, because two threads from different blocks could be working with the same values, so the hash map would have to live in global memory.
- Even if the above two weren't a problem, we would still have race conditions between threads when updating the hash map.
What would be a good way to approach this problem?
Thank you in advance.
Answer 1:
As @tera has already pointed out, what you're describing is a histogram.
You may be interested in the thrust histogram sample code. If we refer to the dense_histogram()
routine as an example, you'll note the first step is to sort the data.
So, yes, the fact that your data is sorted will save you a step.
In a nutshell we are:
- sorting the data
- marking the boundaries of different elements within the data
- computing the distance between the boundaries.
As shown in the sample code, thrust can do each of the above steps in a single function. Since your data is sorted you can effectively skip the first step.
Source: https://stackoverflow.com/questions/15914569/is-it-possible-to-use-cuda-in-order-to-compute-the-frequency-of-elements-inside