I have several blocks where each block executes on a separate part of an integer array. As an example: block one from array[0] to array[9] and block two from array[10] to array
This will not help the original poster, but for anyone who lands on this page looking for an answer: I second the recommendation to use Thrust, which already provides thrust::max_element for exactly this. It returns an iterator to the largest element, and subtracting the begin iterator from it gives the index. min_element and minmax_element are also provided. See the Thrust documentation for details.
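As a rough illustration of the iterator-to-index step, here is a host-only sketch using std::max_element from the C++ standard library, whose calling shape thrust::max_element follows. The helper name index_of_max is hypothetical, and swapping in thrust::device_vector and thrust::max_element is an assumption left to the reader; this sketch does not touch the GPU.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper: zero-based index of the largest element.
// Returns 0 for an empty vector, because max_element then returns end() == begin().
std::size_t index_of_max(const std::vector<int>& values) {
    // max_element returns an iterator to the largest element, not an index.
    auto it = std::max_element(values.begin(), values.end());
    // The distance from begin() converts the iterator into an index.
    return static_cast<std::size_t>(std::distance(values.begin(), it));
}
```

With the standard algorithm, ties resolve to the first of the equal maxima.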
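As well as the suggestion to use Thrust, you could also use the cuBLAS function cublasIsamax. Note that it operates on single-precision floats and returns the 1-based index of the element with the largest absolute value, so an integer array would have to be converted to float first.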
If I understand correctly, what you want is the index of the maximum value in the array A.
If so, I would suggest using the Thrust library.
Here is how you would do it:
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/tuple.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <iostream>
using namespace thrust;
// Return the larger of two (value, index) tuples. Tuples compare
// lexicographically, so equal values are ordered by index.
template <class T>
struct bigger_tuple {
    __device__ __host__
    tuple<T,int> operator()(const tuple<T,int> &a, const tuple<T,int> &b) const
    {
        if (a > b) return a;
        else return b;
    }
};

// Return the index of the largest element of vec (vec must not be empty).
template <class T>
int max_index(device_vector<T>& vec) {
    // create implicit index sequence [0, 1, 2, ... )
    counting_iterator<int> begin(0); counting_iterator<int> end(vec.size());
    // seed the reduction with the first element and its index
    T first = vec[0];
    tuple<T,int> init(first, 0);
    tuple<T,int> largest;
    largest = reduce(make_zip_iterator(make_tuple(vec.begin(), begin)),
                     make_zip_iterator(make_tuple(vec.end(), end)),
                     init, bigger_tuple<T>());
    return get<1>(largest);
}
int main(){
    thrust::host_vector<int> h_vec(1024);
    thrust::sequence(h_vec.begin(), h_vec.end()); // values = indices
    // transfer data to the device
    thrust::device_vector<int> d_vec = h_vec;
    int index = max_index(d_vec);
    std::cout << "Max index is:" << index <<std::endl;
    std::cout << "Value is: " << h_vec[index] <<std::endl;
    return 0;
}
The size of your array in comparison to shared memory is almost irrelevant, since the number of threads in each block is the limiting factor rather than the size of the array. One solution is to have each thread block work on a chunk of the array the same size as the thread block. That is, if you have 512 threads, then block n looks at array[512*n] through array[512*n + 511]. Each block does a reduction to find the highest member in its portion of the array. Then you bring the max of each section (together with its index, if you need the position) back to the host and do a simple linear search to locate the highest value in the overall array. Each reduction on the GPU shrinks the linear search by a factor of 512. Depending on the size of the array, you might want to do more reductions before you bring the data back. (If your array is 3*512^10 in size, you might want to do 10 reductions on the GPU and have the host search through the 3 remaining data points.)
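The two-stage scheme can be sketched on the host to make the chunk boundaries concrete. This is a plain C++ emulation under stated assumptions, not a CUDA kernel: the names MaxEntry, reduce_per_block and max_index_two_stage are hypothetical, each loop iteration stands in for one thread block, and the array is assumed non-empty with block_size greater than zero.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One (value, index) candidate for the maximum.
struct MaxEntry {
    int value;
    std::size_t index;
};

// Stage 1: emulate one reduction pass. "Block" n covers elements
// [n * block_size, n * block_size + block_size - 1], clipped to the array end.
std::vector<MaxEntry> reduce_per_block(const std::vector<int>& data,
                                       std::size_t block_size) {
    std::vector<MaxEntry> partial;
    for (std::size_t start = 0; start < data.size(); start += block_size) {
        std::size_t stop = std::min(start + block_size, data.size());
        MaxEntry best{data[start], start};
        for (std::size_t i = start + 1; i < stop; ++i) {
            if (data[i] > best.value) best = MaxEntry{data[i], i};
        }
        partial.push_back(best);  // one result per block
    }
    return partial;
}

// Stage 2: the "host" scans the per-block results linearly.
std::size_t max_index_two_stage(const std::vector<int>& data,
                                std::size_t block_size = 512) {
    std::vector<MaxEntry> partial = reduce_per_block(data, block_size);
    MaxEntry best = partial.front();
    for (const MaxEntry& candidate : partial) {
        if (candidate.value > best.value) best = candidate;
    }
    return best.index;
}
```

For 2000 elements and a block size of 512, stage 1 leaves 4 candidates for the final linear search. On a real GPU the inner loop would be a parallel reduction in shared memory rather than a sequential scan.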
One thing to watch out for when doing a max-value-plus-index reduction is what happens when the maximum occurs more than once in your array, i.e. in your example, if there were 2 or more values equal to 56. The returned index would then not be unique and could differ from run to run, because the order in which threads execute on the GPU is not deterministic.
To get around this kind of problem you can use a unique ordering index such as threadid + threadsperblock * blockid, or else the element index location if that is unique. Then the max test is along these lines:
if (a > max_so_far || (a == max_so_far && order_a > order_max_so_far))
{
    max_so_far = a;
    index_max_so_far = index_a;
    order_max_so_far = order_a;
}
(index and order can be the same variable, depending on the application.)
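To see why the extra ordering test makes the result independent of arrival order, here is a small host-side sketch. The names Candidate, beats and pick_max are hypothetical, the order field is assumed to be unique per element, and the arrival list is assumed non-empty.

```cpp
#include <vector>

// A value together with its unique ordering key
// (for example, the element's position in the array).
struct Candidate {
    int value;
    int order;
};

// True when `a` should replace `best`: a larger value wins, and on
// equal values the larger order wins, matching the test in the text.
bool beats(const Candidate& a, const Candidate& best) {
    return a.value > best.value ||
           (a.value == best.value && a.order > best.order);
}

// Fold the candidates in whatever order they happen to arrive.
Candidate pick_max(const std::vector<Candidate>& arrivals) {
    Candidate best = arrivals.front();
    for (const Candidate& c : arrivals) {
        if (beats(c, best)) best = c;
    }
    return best;
}
```

Feeding the same candidates in a different order yields the same winner, which is the property a nondeterministic thread schedule needs.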