What is the most efficient for speed algorithm to solve the following problem?
Given 6 arrays, D1,D2,D3,D4,D5 and D6 each containing 6 numbers like:
D1[0
I did a direct trivial C implementation of the algorithm provided by the original poster. It is here
As other proposed the first thing to do is to roll up the code. Unrolling is not really good for speed as it lead to code cache misses. I began by rolling inner loops and got this. Then I rolled the outer loop and removed the now useless gotos and got code below.
EDIT: I changed several time the C code because even as simple as it is there seems to be problems when JIT compiling or executing it with CUDA (and CUDA seems not to be very verbose about errors). That is why the piece of code below use globals... and that is just trivial implementation. We are not yet going for speed. It says much about premature optimization. Why bother to make it fast if we can't even make it work ? I guess there is still issues as CUDA seems to impose many restrictions on the code you can make work if I believe the Wikipedia article. Also maybe we should use float instead of int ?
#include
int D1[6] = {3, 4, 5, 6, 7, 8};
int D2[6] = {3, 4, 5, 6, 7, 8};
int D3[6] = {3, 4, 5, 6, 7, 8};
int D4[6] = {3, 4, 5, 6, 7, 8};
int D5[6] = {3, 4, 5, 6, 7, 8};
int D6[6] = {3, 4, 5, 6, 7, 9};
int ST1[1] = {6};
int ans[6] = {1, 4, 5, 6, 7, 9};
int * D[6] = { D1, D2, D3, D4, D5, D6 };
/* beware D is passed through globals */
int algo(int * ans, int ELM){
int a, e, p;
for (a = 0 ; a < 6 ; a++){
for (e = 0 ; e < ELM ; e++){
for (p = 0 ; p < 6 ; p++){
if (D[p][e] == ans[a]){
goto cont;
}
}
}
return 0; //bad row of numbers found
cont:;
}
return 1;
}
int main(){
int res;
res = algo(ans, ST1[0]);
printf("algo returned %d\n", res);
}
Now that is interesting, because we can understand what code is doing. By the way doing this packing job I corrected several oddities in the original question. I believe it was typos as it wasn't logical at all in the global context. - goto always jump to two (it should have progressed) - the last test checked ans[0] instead of ans[5]
please Mark, correct me if I'm wrong in the above asumptions on what the original code should do and your original algorithm is typo free.
What the code is doing it for each value in ans check it is present in a two dimentional array. If any number miss it returns 0. If all numbers are found it returns 1.
What I would do to get a real fast code is not to implement it in C but in another language like python (or C++) where set is a basic data structure provided by standard libraries. Then I would build a set with all the values of the arrays (that is O(n)) and check if numbers searched are present in set or not (that is O(1)). The final implementation should be faster than the existing code at least from an algorithmic point of view.
Python example is below as it is really trivial (print true/false instead of 1/0 but you get the idea):
ans_set = set(ans)
print len(set(D1+D2+D3+D4+D5+D6).intersection(ans_set)) == len(ans_set)
Here is a possible C++ implementation using sets:
#include
#include
int algo(int * D1, int * D2, int * D3, int * D4, int * D5, int * D6, int * ans, int ELM){
int e, p;
int * D[6] = { D1, D2, D3, D4, D5, D6 };
std::set ans_set(ans, ans+6);
int lg = ans_set.size();
for (e = 0 ; e < ELM ; e++){
for (p = 0 ; p < 6 ; p++){
if (0 == (lg -= ans_set.erase(D[p][e]))){
// we found all elements of ans_set
return 1;
}
}
}
return 0; // some items in ans are missing
}
int main(){
int D1[6] = {3, 4, 5, 6, 7, 8};
int D2[6] = {3, 4, 5, 6, 7, 8};
int D3[6] = {3, 4, 5, 6, 7, 8};
int D4[6] = {3, 4, 5, 6, 7, 8};
int D5[6] = {3, 4, 5, 6, 7, 8};
int D6[6] = {3, 4, 5, 6, 7, 1};
int ST1[1] = {6};
int ans[] = {1, 4, 5, 6, 7, 8};
int res = algo(D1, D2, D3, D4, D5, D6, ans, ST1[0]);
std::cout << "algo returned " << res << "\n";
}
We make some performance hypothesis : the content of ans should be sorted or we should construct it otherwise, we suppose that content of D1..D6 will change between calls to algo. Hence we do not bother constructing a set for it (as set construction is O(n) anyway we wouldn't gain anything if D1..D6 are changing). But if we call several times algo with the same D1..D6 and that is ans that change we should do the opposite and transform D1..D6 into one larger set that we keep available.
If I stick to C I could do it as follow:
As data size are really small here , we could also try going for micro optimizations. It could pay better here. Don't know for sure.
EDIT2: there is hard restrictions on the subset of C supported by CUDA. The most restrictive one is that we shouldn't use pointers to main memory. That will have to be taken into account. It explains why the current code does not work. The simplest change is probably to call it for every array D1..D6 in turn. To keep it short and avoid function call cost we may use a macro or an inline function. I will post a proposal.