What is the most efficient algorithm, in terms of speed, to solve the following problem?
Given 6 arrays, D1, D2, D3, D4, D5 and D6, each containing 6 numbers like:
D1[0
With only 36 values to compare, the most efficient approach would be to not use CUDA at all.
Just use a CPU loop.
If you change your inputs, I'll change my answer.
I was a little confused by your question, but I think I understand it well enough to help you get started.
#define ROW 6
#define COL 6
int D[ROW][COL]; // This is all of your D arrays in one 2 dimensional array.
Next you should probably use nested for loops; each loop will correspond to a dimension of D. Remember that the order of your indexes matters. The easiest way to keep it straight in C is to remember that D[i] is a valid expression even when D has more than one dimension (it evaluates to a pointer to a row, i.e. a sub-array).
If you can't change the independent D arrays into one multidimensional array, you can easily make a pointer array whose members point to the heads of each of those arrays and achieve the same effect.
Then you can use the break statement to break out of the inner loop after you have determined that the current D[i] doesn't match the ans value.
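A minimal sketch of what that could look like, reusing the ROW/COL defines above (contains_all is just a name I made up for illustration):
int contains_all(int D[ROW][COL], int ans[6])
{
    for (int a = 0; a < 6; a++) {
        int found = 0;
        for (int i = 0; i < ROW && !found; i++) {
            for (int j = 0; j < COL; j++) {
                if (D[i][j] == ans[a]) {
                    found = 1;
                    break; /* leave the inner loop as soon as we match */
                }
            }
        }
        if (!found)
            return 0; /* ans[a] is not anywhere in D */
    }
    return 1; /* every value of ans was found */
}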
I did a direct, trivial C implementation of the algorithm provided by the original poster. It is here.
As others proposed, the first thing to do is to roll up the code. Unrolling is not really good for speed, as it leads to code cache misses. I began by rolling up the inner loops and got this. Then I rolled up the outer loop, removed the now-useless gotos, and got the code below.
EDIT: I changed the C code several times because, even as simple as it is, there seem to be problems when JIT compiling or executing it with CUDA (and CUDA does not seem to be very verbose about errors). That is why the piece of code below uses globals... and it is just a trivial implementation. We are not yet going for speed. It says much about premature optimization: why bother making it fast if we can't even make it work? I guess there are still issues, as CUDA seems to impose many restrictions on the code it will run, if I believe the Wikipedia article. Also, maybe we should use float instead of int?
#include <stdio.h>

int D1[6] = {3, 4, 5, 6, 7, 8};
int D2[6] = {3, 4, 5, 6, 7, 8};
int D3[6] = {3, 4, 5, 6, 7, 8};
int D4[6] = {3, 4, 5, 6, 7, 8};
int D5[6] = {3, 4, 5, 6, 7, 8};
int D6[6] = {3, 4, 5, 6, 7, 9};
int ST1[1] = {6};
int ans[6] = {1, 4, 5, 6, 7, 9};

int * D[6] = { D1, D2, D3, D4, D5, D6 };

/* beware D is passed through globals */
int algo(int * ans, int ELM){
    int a, e, p;
    for (a = 0 ; a < 6 ; a++){
        for (e = 0 ; e < ELM ; e++){
            for (p = 0 ; p < 6 ; p++){
                if (D[p][e] == ans[a]){
                    goto cont; // ans[a] found, go check the next one
                }
            }
        }
        return 0; // ans[a] was not found in any D array
    cont:;
    }
    return 1; // every value of ans was found
}

int main(){
    int res;
    res = algo(ans, ST1[0]);
    printf("algo returned %d\n", res);
    return 0;
}
Now that is interesting, because we can understand what the code is doing. By the way, while doing this packing job I corrected several oddities in the original question. I believe they were typos, as they made no sense in the global context:
- the goto always jumped to two (it should have progressed)
- the last test checked ans[0] instead of ans[5]
Please, Mark, correct me if I'm wrong in the above assumptions about what the original code should do and whether your original algorithm is typo-free.
What the code does is, for each value in ans, check that it is present in a two-dimensional array. If any number is missing it returns 0. If all numbers are found it returns 1.
To get really fast code, what I would do is not implement it in C but in another language like Python (or C++), where a set is a basic data structure provided by the standard library. Then I would build a set with all the values of the arrays (that is O(n)) and check whether each searched number is present in the set or not (that is O(1) per lookup). The final implementation should be faster than the existing code, at least from an algorithmic point of view.
A Python example is below, as it is really trivial (it prints True/False instead of 1/0, but you get the idea):
ans_set = set(ans)
print(len(set(D1 + D2 + D3 + D4 + D5 + D6).intersection(ans_set)) == len(ans_set))
Here is a possible C++ implementation using sets:
#include <iostream>
#include <set>

int algo(int * D1, int * D2, int * D3, int * D4, int * D5, int * D6, int * ans, int ELM){
    int e, p;
    int * D[6] = { D1, D2, D3, D4, D5, D6 };
    std::set<int> ans_set(ans, ans+6);
    int lg = ans_set.size();

    for (e = 0 ; e < ELM ; e++){
        for (p = 0 ; p < 6 ; p++){
            if (0 == (lg -= ans_set.erase(D[p][e]))){
                // we found all elements of ans_set
                return 1;
            }
        }
    }
    return 0; // some items in ans are missing
}

int main(){
    int D1[6] = {3, 4, 5, 6, 7, 8};
    int D2[6] = {3, 4, 5, 6, 7, 8};
    int D3[6] = {3, 4, 5, 6, 7, 8};
    int D4[6] = {3, 4, 5, 6, 7, 8};
    int D5[6] = {3, 4, 5, 6, 7, 8};
    int D6[6] = {3, 4, 5, 6, 7, 1};
    int ST1[1] = {6};
    int ans[] = {1, 4, 5, 6, 7, 8};
    int res = algo(D1, D2, D3, D4, D5, D6, ans, ST1[0]);
    std::cout << "algo returned " << res << "\n";
}
We make some performance hypotheses: the content of ans either should already be sorted or we construct a set from it, and we suppose that the content of D1..D6 will change between calls to algo. Hence we do not bother constructing a set for them (as set construction is O(n) anyway, we wouldn't gain anything if D1..D6 keep changing). But if we call algo several times with the same D1..D6 and it is ans that changes, we should do the opposite and transform D1..D6 into one larger set that we keep available.
If I stick to C, I could do it as follows:
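A rough sketch of how that could look in plain C, standing in for a real set with a sorted scratch array plus the standard qsort()/bsearch() (the names below are made up for illustration). The same sorted array could also be built once and reused across calls when D1..D6 do not change, as discussed above:
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    /* three-way compare without overflow */
    return (*(const int *)a > *(const int *)b) - (*(const int *)a < *(const int *)b);
}

/* beware: D is still passed through globals, as in the code above; assumes ELM <= 6 */
int algo_set_like(int *ans, int ELM)
{
    int pool[36], k = 0;

    /* gather all values of D1..D6, then sort them: a poor man's set */
    for (int p = 0; p < 6; p++)
        for (int e = 0; e < ELM; e++)
            pool[k++] = D[p][e];
    qsort(pool, k, sizeof(int), cmp_int);

    /* each membership test is now a binary search, O(log n) */
    for (int a = 0; a < 6; a++)
        if (bsearch(&ans[a], pool, k, sizeof(int), cmp_int) == NULL)
            return 0; /* ans[a] is missing from D1..D6 */
    return 1;
}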
As the data sizes are really small here, we could also try going for micro-optimizations. They could pay off better here; I don't know for sure.
EDIT2: there are hard restrictions on the subset of C supported by CUDA. The most restrictive one is that we shouldn't use pointers to main memory. That will have to be taken into account, and it explains why the current code does not work. The simplest change is probably to call the check for every array D1..D6 in turn. To keep it short and avoid function-call cost, we may use a macro or an inline function. I will post a proposal.
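This is not that proposal, but just to sketch the "one call per array" shape described above (the names in_row and algo2 are invented here, and this ignores the CUDA pointer restriction itself; it only shows the per-array calling structure), it might look like:
/* check one array, referenced directly rather than via the pointer array D */
static inline int in_row(const int *row, int ELM, int value)
{
    for (int e = 0; e < ELM; e++)
        if (row[e] == value)
            return 1;
    return 0;
}

int algo2(int *ans, int ELM)
{
    for (int a = 0; a < 6; a++) {
        /* D1..D6 are the globals defined above, checked one at a time */
        if (!in_row(D1, ELM, ans[a]) && !in_row(D2, ELM, ans[a])
         && !in_row(D3, ELM, ans[a]) && !in_row(D4, ELM, ans[a])
         && !in_row(D5, ELM, ans[a]) && !in_row(D6, ELM, ans[a]))
            return 0; /* ans[a] not found in any of the six arrays */
    }
    return 1;
}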
In case the range of the numbers is limited, it would probably be easier to make a bit mask, like this:
#include <stdint.h>
#include <assert.h>

int IsPresent(int arrays[][6], int ans[6], int ST1)
{
    uint32_t bit_mask = 0;
    // make a "list" of numbers that we have
    for(int i = 0; i < 6; ++ i) {
        for(int j = 0; j < ST1; ++ j) {
            assert(arrays[i][j] >= 0 && arrays[i][j] < 32); // range is limited
            bit_mask |= 1 << arrays[i][j];
        }
    }
    // check that every number in ans is covered by the mask
    for(int i = 0; i < 6; ++ i) {
        if(((bit_mask >> ans[i]) & 1) == 0)
            return 0; // in ans, there is a number that is not present in arrays
    }
    return 1; // all of the numbers were found
}
This will always run in O(6 * ST1 + 6). Now, this has the disadvantage of first going through up to 36 array elements and then checking against six values. If there is a strong precondition that the numbers will mostly be present, it is possible to reverse the test and provide an early exit:
int IsPresent(int arrays[][6], int ans[6], int ST1)
{
    uint32_t bit_mask = 0;
    // make a "list" of numbers that we need to find
    for(int i = 0; i < 6; ++ i) {
        assert(ans[i] >= 0 && ans[i] < 32); // range is limited
        bit_mask |= 1 << ans[i];
    }
    for(int i = 0; i < 6; ++ i) {
        for(int j = 0; j < ST1; ++ j)
            bit_mask &= ~(1 << arrays[i][j]); // clear bits of the mask
        if(!bit_mask) // check if we have them all
            return 1; // all of the numbers were found
    }
    assert(bit_mask != 0);
    return 0; // there are some numbers remaining yet to be found
}
This will run at most in O(6 * ST1 + 6), and at best in O(6 + 1) if the first number in the first array covers all of ans (i.e. ans is six times the same number). Note that the test for the bit mask being zero can be done either after each array (as it is now) or after each element (that involves more checking but also an earlier cutoff once all the numbers are found). In the context of CUDA, the first version of the algorithm would likely be faster, as it involves fewer branches and most of the loops (except the one over ST1) can be automatically unrolled.
However, if the range of the numbers is unlimited, we could do something else. Since there are only up to 7 * 6 = 42 different numbers in ans and all the arrays together, it would be possible to map those to 42 small indices and use a 64-bit integer as a bit mask. But arguably this mapping of numbers to indices would already be enough for the test, and it would be possible to skip the mask test altogether.
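A rough sketch of that mapping idea (map_bit and IsPresent64 are names invented here; at most 6 + 36 = 42 distinct values can occur, so every one of them gets a bit index below 64):
#include <stdint.h>

/* return the bit index assigned to value, allocating a new one if unseen */
static int map_bit(int seen[42], int *n_seen, int value)
{
    for (int i = 0; i < *n_seen; ++ i)
        if (seen[i] == value)
            return i;
    seen[*n_seen] = value;
    return (*n_seen) ++;
}

int IsPresent64(int arrays[][6], int ans[6], int ST1)
{
    int seen[42], n_seen = 0;
    uint64_t need = 0, have = 0;
    for (int i = 0; i < 6; ++ i)
        need |= (uint64_t)1 << map_bit(seen, &n_seen, ans[i]);
    for (int i = 0; i < 6; ++ i)
        for (int j = 0; j < ST1; ++ j)
            have |= (uint64_t)1 << map_bit(seen, &n_seen, arrays[i][j]);
    return (need & ~have) == 0; /* every needed bit was covered */
}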
Another way to do it would be to sort the arrays and simply count coverage of the individual numbers:
#include <cstring>
#include <algorithm>

int IsPresent(int arrays[][6], int ans[6], int ST1)
{
    int all_numbers[36], n = ST1 * 6;
    // copy all of the numbers into a contiguous array
    for(int i = 0; i < 6; ++ i)
        memcpy(&all_numbers[i * ST1], &arrays[i], ST1 * sizeof(int));
    std::sort(all_numbers, all_numbers + n);
    // or use the "C" standard library qsort() or a bitonic sorting network on GPU
    // alternatively, sort each array of 6 separately and then merge the sorted
    // arrays (can also be done in parallel, to some level)
    n = std::unique(all_numbers, all_numbers + n) - all_numbers;
    // this way, we can also remove duplicate numbers, if they are
    // expected to occur frequently, and make the next test faster.
    // std::unique() actually moves the duplicates to the end of the list
    // and returns an iterator (a pointer in this case) to one past
    // the last unique element of the list - that gives us the number of
    // unique items.
    for(int i = 0; i < 6; ++ i) {
        int *p = std::lower_bound(all_numbers, all_numbers + n, ans[i]);
        // use binary search to find the number in question
        // or use the "C" standard library bsearch()
        // or implement binary search yourself on GPU
        if(p == all_numbers + n)
            return 0; // not found
        // alternatively, make all_numbers an array of 37 and write
        // all_numbers[n] = -1; before this loop. that will act
        // as a sentinel and will save this one comparison (assuming
        // that there is a value that is guaranteed not to occur in ans)
        if(*p != ans[i])
            return 0; // another number found, not ans[i]
        // std::lower_bound looks for the given number, or for one that
        // is greater than it, so if the number was to be inserted there
        // (before the bigger one), the sequence would remain ordered.
    }
    return 1; // all the numbers were found
}
This will run in O(n) for copying, O(36 log 36) for sorting, optionally O(n) for unique (where n is 6 * ST1) and O(6 log n) for the searches (where n can be less than 6 * ST1 if unique is employed). The whole algorithm therefore runs in linearithmic time. Note that this does not involve any dynamic memory allocation, and as such it is suitable even for GPU platforms (one would have to implement sorting and port std::unique() and std::lower_bound(), but all those are fairly simple functions).