Question
I tried implementing matrix multiplication using the CyclicDist module.
When I test with one locale vs. two locales, the one-locale run is much faster. Is it because the time to communicate between the two Jetson Nano boards is really big, or is it because my implementation does not take advantage of the way CyclicDist works?
Here is my code:
use Random, Time, CyclicDist;

var t: Timer;
t.start();

config const size = 10;
const Space = {1..size, 1..size};

const gridSpace = Space dmapped Cyclic(startIdx=Space.low);
var grid: [gridSpace] real;
fillRandom(grid);

const gridSpace2 = Space dmapped Cyclic(startIdx=Space.low);
var grid2: [gridSpace2] real;
fillRandom(grid2);

const gridSpace3 = Space dmapped Cyclic(startIdx=Space.low);
var grid3: [gridSpace3] real;

// note: the innermost forall makes the += updates to grid3[i,j] a data race;
// a serial inner k-loop (or a reduce intent) would be needed for a correct result
forall i in 1..size {
  forall j in 1..size {
    forall k in 1..size {
      grid3[i,j] += grid[i,k] * grid2[k,j];
    }
  }
}

t.stop();
writeln("Done!");
writeln(t.elapsed(), " seconds");
writeln("Size of matrix was: ", size);
t.clear();
I know my implementation is not optimal for distributed memory systems.
Answer 1:
Probably the main reason that this program is not scaling is that the computation never uses any locales other than the initial one. Specifically, forall loops over ranges, like the ones in your code:
forall i in 1..size do
always run all of their iterations using tasks executing on the current locale. This is because ranges are not distributed values in Chapel, so their parallel iterators don't distribute work across locales. Consequently, all size**3 executions of the loop body:
grid3[i,j] += grid[i,k] * grid2[k,j];
will run on locale 0 and none of them will run on locale 1. You can see that this is the case by putting the following into the innermost loop's body:
writeln("locale ", here.id, " running ", (i,j,k));
(where here.id prints the ID of the locale where the current task is running). This will show that locale 0 is running all iterations:
locale 0 running (9, 1, 1)
locale 0 running (1, 1, 1)
locale 0 running (1, 1, 2)
locale 0 running (9, 1, 2)
locale 0 running (1, 1, 3)
locale 0 running (9, 1, 3)
locale 0 running (1, 1, 4)
locale 0 running (1, 1, 5)
locale 0 running (1, 1, 6)
locale 0 running (1, 1, 7)
locale 0 running (1, 1, 8)
locale 0 running (1, 1, 9)
locale 0 running (6, 1, 1)
...
Contrast this with running a forall loop over a distributed domain like gridSpace:
forall (i,j) in gridSpace do
writeln("locale ", here.id, " running ", (i,j));
where the iterations will be distributed between the locales:
locale 0 running (1, 1)
locale 0 running (9, 1)
locale 0 running (1, 2)
locale 0 running (9, 2)
locale 0 running (1, 3)
locale 0 running (9, 3)
locale 0 running (1, 4)
locale 1 running (8, 1)
locale 1 running (10, 1)
locale 1 running (8, 2)
locale 1 running (2, 1)
locale 1 running (8, 3)
locale 1 running (10, 2)
...
Since all of the computation runs on locale 0 while half of the data is located on locale 1 (due to the arrays being distributed), a lot of communication is generated to fetch remote values from locale 1's memory to locale 0's in order to compute on them.
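Given that, one way to get the iterations themselves distributed is to drive the loop from the distributed domain instead of from ranges. A minimal sketch, reusing the declarations from the question (the inner k-loop is kept serial so the accumulation is race-free):

// Each (i,j) iteration runs on the locale that owns grid3[i,j] under the
// Cyclic distribution, so the writes to grid3 are locale-local.
forall (i,j) in gridSpace3 {
  var acc = 0.0;
  for k in 1..size do              // serial inner loop: no race on acc
    acc += grid[i,k] * grid2[k,j];
  grid3[i,j] = acc;
}

This only moves the computation to where grid3 lives; the grid[i,k] and grid2[k,j] reads can still be remote under a cyclic layout, so communication costs do not disappear entirely.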
Answer 2:
Q: Is it because the time to communicate (1) between the two Jetson nano boards is really big or is my implementation (2) not taking advantage of the way CyclicDist works?

The second option is a sure bet: roughly 100x worse performance was measured on CyclicDist-mapped data for small sizes.
The documentation explicitly warns about this, saying:
Cyclic distribution maps indices to locales in a round-robin pattern starting at a given index.
...
Limitations
This distribution has not been tuned for performance.
The adverse impact on processing efficiency was demonstrable even on a single-locale platform, where all data resides in the locale-local memory space, so no NUMA inter-board communication add-on costs are ever incurred. Even there, roughly 100x worse performance was measured, compared to Vass' single-forall D3-iterated sum-product.

(I had not noticed until now that Vass, motivated by performance, had changed the original forall-in-D3-do-{} into a revised forall-in-D2-do-for{} tandem-iterated form. So far, small-size tests run with --fast --ccflags -O3 show up to roughly half again worse performance for the forall-in-D2-do-for{} iterator-in-iterator results, even worse than the O/P's original triple-forall{} proposal, except for sizes under 512x512 and only after -O3 optimisation took place. For the smallest size, 128x128, the highest performance achieved was ~850 [ns] per result cell, by the original Vass-D3 solo iterator, surprisingly without --ccflags -O3; that might obviously change for larger --size={ 1024 | 2048 | 4096 | 8192 } data layouts, the more so once wider-NUMA multi-locale and higher-parallelism devices are put into the race.)
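For reference, a hypothetical reconstruction of the two loop shapes being compared (the domain names D3 and D2 are assumptions for illustration, not code from the thread; the arrays are those from the question):

// "Vass-D3" shape: a single forall over the whole 3-D index space
const D3 = {1..size, 1..size, 1..size};
forall (i,j,k) in D3 do
  grid3[i,j] += grid[i,k] * grid2[k,j];   // note: += races across the k dimension

// "Vass-D2-k" shape: forall over the 2-D index space, serial inner k-loop
const D2 = {1..size, 1..size};
forall (i,j) in D2 do
  for k in 1..size do
    grid3[i,j] += grid[i,k] * grid2[k,j]; // race-free: one task owns each (i,j)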
The TiO.run platform uses numLocales == 1, with 2 physical CPU cores accessible (numPUs) and a parallelism limit of maxTaskPar == 2.
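These figures can be checked from inside a Chapel program; a minimal sketch using the standard locale queries (assuming a reasonably recent Chapel):

// Report the execution platform's parallelism-related figures:
writeln("numLocales = ", numLocales);                 // locales in this run
writeln("numPUs     = ", here.numPUs(logical=false)); // physical cores on this locale
writeln("maxTaskPar = ", here.maxTaskPar);            // max useful task parallelism here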
The use of CyclicDist affects the data-to-memory layout, doesn't it? This was validated by measurements on small sizes --size={ 128 | 256 | 512 | 640 }, with and without the (minor) --ccflags -O3 effect:
// --------------------------------------------------------------------------------------------------------------------------------
// --fast
// ------
//
// For grid{1,2,3}[ 128, 128] the tested forall sum-product over dmapped Cyclic Space took 255818 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product took 3075 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the Vass-D2-k ver sum-product took 3040 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product took 2198 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the Vass-D3 orig sum-product took 1974 [us] excl. fillRandom()-ops <-- 127x SLOWER with CyclicDist dmapped DATA
// For grid{1,2,3}[ 128, 128] the Vass-D2-k ver sum-product took 2122 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product over dmapped Cyclic Space took 252439 [us] excl. fillRandom()-ops
//
// For grid{1,2,3}[ 256, 256] the tested forall sum-product over dmapped Cyclic Space took 2141444 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product took 27095 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the Vass-D2-k ver sum-product took 25339 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product took 23493 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the Vass-D3 orig sum-product took 21631 [us] excl. fillRandom()-ops <-- 98x SLOWER than w/o CyclicDist dmapped data
// For grid{1,2,3}[ 256, 256] the Vass-D2-k ver sum-product took 21971 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product over dmapped Cyclic Space took 2122417 [us] excl. fillRandom()-ops
//
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 16988685 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17448207 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product took 268111 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the Vass-D2-k ver sum-product took 270289 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product took 250896 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the Vass-D3 orig sum-product took 239898 [us] excl. fillRandom()-ops <-- 71x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 512, 512] the Vass-D2-k ver sum-product took 257479 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17391049 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 16932503 [us] excl. fillRandom()-ops <~~ ~2e5 [us] faster without --ccflags -O3
//
// For grid{1,2,3}[ 640, 640] the tested forall sum-product over dmapped Cyclic Space took 35136377 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product took 362205 [us] incl. fillRandom()-ops <-- 97x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 640, 640] the Vass-D2-k ver sum-product took 367651 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product took 345865 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the Vass-D3 orig sum-product took 337896 [us] excl. fillRandom()-ops <-- 103x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 640, 640] the Vass-D2-k ver sum-product took 351101 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product over dmapped Cyclic Space took 35052849 [us] excl. fillRandom()-ops <~~ ~3e4 [us] faster without --ccflags -O3
//
// --------------------------------------------------------------------------------------------------------------------------------
// --fast --ccflags -O3
// --------------------
//
// For grid{1,2,3}[ 128, 128] the tested forall sum-product over dmapped Cyclic Space took 250372 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product took 3189 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the Vass-D2-k ver sum-product took 2966 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product took 2284 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the Vass-D3 orig sum-product took 1949 [us] excl. fillRandom()-ops <-- 126x FASTER than with dmapped CyclicDist DATA
// For grid{1,2,3}[ 128, 128] the Vass-D2-k ver sum-product took 2072 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 128, 128] the tested forall sum-product over dmapped Cyclic Space took 246965 [us] excl. fillRandom()-ops
//
// For grid{1,2,3}[ 256, 256] the tested forall sum-product over dmapped Cyclic Space took 2114615 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product took 37775 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the Vass-D2-k ver sum-product took 38866 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product took 32384 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the Vass-D3 orig sum-product took 29264 [us] excl. fillRandom()-ops <-- 71x FASTER than with dmapped CyclicDist DATA
// For grid{1,2,3}[ 256, 256] the Vass-D2-k ver sum-product took 33973 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 256, 256] the tested forall sum-product over dmapped Cyclic Space took 2098344 [us] excl. fillRandom()-ops
//
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17136826 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17081273 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product took 251786 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the Vass-D2-k ver sum-product took 266766 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product took 239301 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the Vass-D3 orig sum-product took 233003 [us] excl. fillRandom()-ops <~~ ~6e3 [us] faster with --ccflags -O3
// For grid{1,2,3}[ 512, 512] the Vass-D2-k ver sum-product took 253642 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17025339 [us] excl. fillRandom()-ops
// For grid{1,2,3}[ 512, 512] the tested forall sum-product over dmapped Cyclic Space took 17081352 [us] excl. fillRandom()-ops <~~ ~2e5 [us] slower with --ccflags -O3
//
// For grid{1,2,3}[ 640, 640] the tested forall sum-product over dmapped Cyclic Space took 35164630 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product took 363060 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the Vass-D2-k ver sum-product took 489529 [us] incl. fillRandom()-ops
// For grid{1,2,3}[ 640, 640] the tested forall sum-product took 345742 [us] excl. fillRandom()-ops <-- 104x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 640, 640] the Vass-D3 orig sum-product took 353353 [us] excl. fillRandom()-ops <-- 102x SLOWER with dmapped CyclicDist DATA
// For grid{1,2,3}[ 640, 640] the Vass-D2-k ver sum-product took 471213 [us] excl. fillRandom()-ops <~~~12e5 [us] slower with --ccflags -O3
// For grid{1,2,3}[ 640, 640] the tested forall sum-product over dmapped Cyclic Space took 35075435 [us] excl. fillRandom()-ops
In any case, the Chapel team's insights (both design-wise and testing-wise) are important. @Brad was kindly asked to provide similar test coverage and comparisons for substantially larger sizes --size={ 1024 | 2048 | 4096 | 8192 | ... } and for the "way-wider"-NUMA multi-locale and many-locale platforms available at Cray for the Chapel team's R&D, which do not suffer from the hardware limits and the ~60 [s] execution limit of the public, sponsored, shared TiO.run platform.
Source: https://stackoverflow.com/questions/59332086/cyclicdist-goes-slower-on-multiple-locales