I recently wrote a small number-crunching program that basically loops over an N-dimensional grid and performs some calculation at each point.
for (int i1 =
If you never coded a multithread application, I bare you to begin with OpenMP:
In your example, you should just have to add this pragma:
#pragma omp parallel shared(histogram)
{
for (int i1 = 0; i1 < N; i1++)
for (int i2 = 0; i2 < N; i2++)
for (int i3 = 0; i3 < N; i3++)
for (int i4 = 0; i4 < N; i4++)
histogram[bin_index(i1, i2, i3, i4)] += 1;
}
With this pragma, the compiler will add some instruction to create threads, launch them, add some mutexes around accesses to the histogram
variable etc... There are a lot of options, but well defined pragma do all the work for you. Basically, the simplicity depends on the data dependency.
Of course, the result should not be optimal as if you coded all by hand. But if you don't have load balancing problem, you maybe could approach a 2x speed up. Actually this is only write in matrix with no spacial dependency in it.
The first approach is simple. It is also sufficient if you expect that the load will be balanced evenly over the threads. In some cases, especially if the complexity of bin_index is very dependant on the parameter values, one of the threads could end up with a much heavier task than the rest. Remember: the task is finished when the last threads finishes.
The second approach is a bit more complicated, but balances the load more evenly if the tasks are finegrained enough (the number of tasks is much larger than the number of threads).
Note that you may have issues putting the calculations in separate threads. Make sure that bin_index works correctly when multiple threads execute it simultaneously. Beware of the use of global or static variables for intermediate results.
Also, "histogram[bin_index(i1, i2, i3, i4)] += 1" could be interrupted by another thread, causing the result to be incorrect (if the assignment fetches the value, increments it and stores the resulting value in the array). You could introduce a local histogram for each thread and combine the results to a single histogram when all threads have finished. You could also make sure that only one thread is modifying the histogram at the same time, but that may cause the threads to block each other most of the time.
I would do something like this:
void HistogramThread(int i1, Action<int[]> HandleResults)
{
int[] histogram = new int[HistogramSize];
for (int i2 = 0; i2 < N; i2++)
for (int i3 = 0; i3 < N; i3++)
for (int i4 = 0; i4 < N; i4++)
histogram[bin_index(i1, i2, i3, i4)] += 1;
HandleResults(histogram);
}
int[] CalculateHistogram()
{
int[] histogram = new int[HistogramSize];
ThreadPool pool; // I don't know syntax off the top of my head
for (int i1=0; i1<N; i1++)
{
pool.AddNewThread(HistogramThread, i1, delegate(int[] h)
{
lock (histogram)
{
for (int i=0; i<HistogramSize; i++)
histogram[i] += h[i];
}
});
}
pool.WaitForAllThreadsToFinish();
return histogram;
}
This way you don't need to share any memory, until the end.
I agree with Sharptooth that your first approach seems like the only plausible one.
Your single threaded app is continuously assigning to memory. To get any speedup, your several threads would need to also be continuously assigning to memory. If only one thread is assigning at a time, you would get no speedup at all. So if your assignments are guarded, the whole exercise would fail.
This would be a dangerous approach since you assigning to shared memory without a guard. But it seems to be worth the danger (if a x2 speedup matters). If you can be sure that all the values of bin_index(i1, i2, i3, i4) are different in your division of the loop, then it should work since the array assignment would be to a different locations in your shared memory. Still, one always should look and hard at approaches like this.
I assume you would also produce a test routine to compare the results of the two versions.
Looking at your bin_index(i1, i2, i3, i4), I suspect your process could not be parallelized without considerable effort.
The only way to divide up the work of calculation in your loop is, again, to be sure that your threads will access the same areas in memory. However, it looks like bin_index(i1, i2, i3, i4) will likely repeat values quite often. You might divide up the iteration into the conditions where bin_index is higher than a cutoff and where it is lower than a cut-off. Or you could divide it arbitrarily and see whether increment is implemented atomically. But any complex threading approach looks unlikely to provide improvement if you can only have two cores to work with to start with.
As I understand it, OpenMP was made just for what you are trying to do, although I have to admit I have not used it yet myself. Basically it seems to boil down to just including a header and adding a pragma clause.
You could probably also use Intel's Thread Building Blocks Library.
If you ever do it in .NET, use the Parallel Extensions.