Question
I am very much a newbie in this subject and need help parallelizing my code. I have a large 1D array that in reality describes a 3D volume of 21x21x21 single-precision values. I have 3 computers that I want to engage in the computation. The operation performed on each cell in the grid (volume) is identical for all cells: the program takes in some data, performs some simple arithmetic on it, and assigns the return value to the grid cell.
My non-parallelized code is:
    float zg, yg, xg;
    stack_result = new float[Nz*Ny*Nx];
    // StRMtrx[8] is the vertical step size, StRMtrx[6] is the vertical starting point
    for (int iz = 0; iz < Nz; iz++) {
        zg = iz*StRMtrx[8]+StRMtrx[6]; // find the vertical position in meters
        // StRMtrx[5] is the crossline step size, StRMtrx[3] is the crossline starting point
        for (int iy = 0; iy < Ny; iy++) {
            yg = iy*StRMtrx[5]+StRMtrx[3]; // find the crossline position
            // StRMtrx[2] is the inline step size, StRMtrx[0] is the inline starting point
            for (int ix = 0; ix < Nx; ix++) {
                xg = ix*StRMtrx[2]+StRMtrx[0]; // find the inline position
                // do stacking on each grid cell
                // "Geoph" is the geophone ids, "Ngeo" is the number of geophones involved,
                // "phase_use" is the wave type, "EnvMtrx" is the input data common to all
                // cells, "Mdata" is the length of input data
                stack_result[ix+Nx*iy+Nx*Ny*iz] =
                    stack_for_qds(Geoph, Ngeo, phase_use, xg, yg, zg, EnvMtrx, Mdata);
            }
        }
    }
Now I bring in 3 computers and divide the volume into 3 vertical segments, so I would then have 3 sub-volumes of 21x21x7 cells each (note that the volume is traversed in z, y, x order). The variable "stack_result" is the complete volume. My parallelized version (which utterly fails; I only get one of the sub-volumes back) is:
    MPI_Status status;
    int rank, numProcs, rootProcess;
    ierr = MPI_Init(&argc, &argv);
    ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    ierr = MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
    int rowsInZ = Nz/numProcs; // 7 cells in Z (vertical)
    int chunkSize = Nx*Ny*rowsInZ;
    float *stack_result = new float[Nz*Ny*Nx];
    float zg, yg, xg;
    rootProcess = 0;
    if (rank == rootProcess) {
        offset = 0;
        for (int n = 1; n < numProcs; n++) {
            // send rank
            MPI_Send(&n, 1, MPI_INT, n, 2, MPI_COMM_WORLD);
            // send the offset in array
            MPI_Send(&offset, 1, MPI_INT, n, 2, MPI_COMM_WORLD);
            // send volume, now only filled with zeros,
            MPI_Send(&stack_result[offset], chunkSize, MPI_FLOAT, n, 1, MPI_COMM_WORLD);
            offset = offset+chunkSize;
        }
        // receive results
        for (int n = 1; n < numProcs; n++) {
            int source = n;
            MPI_Recv(&offset, 1, MPI_INT, source, 2, MPI_COMM_WORLD, &status);
            MPI_Recv(&stack_result[offset], chunkSize, MPI_FLOAT, source, 1, MPI_COMM_WORLD, &status);
        }
    } else {
        int rank;
        int source = 0;
        int ierr = MPI_Recv(&rank, 1, MPI_INT, source, 2, MPI_COMM_WORLD, &status);
        ierr = MPI_Recv(&offset, 1, MPI_INT, source, 2, MPI_COMM_WORLD, &status);
        ierr = MPI_Recv(&stack_result[offset], chunkSize, MPI_FLOAT, source, 1, MPI_COMM_WORLD, &status);
        int nz = rowsInZ; // sub-volume vertical length
        int startZ = (rank-1)*rowsInZ;
        for (int iz = startZ; iz < startZ+nz; iz++) {
            zg = iz*StRMtrx[8]+StRMtrx[6];
            for (int iy = 0; iy < Ny; iy++) {
                yg = iy*StRMtrx[5]+StRMtrx[3];
                for (int ix = 0; ix < Nx; ix++) {
                    xg = ix*StRMtrx[2]+StRMtrx[0];
                    stack_result[offset+ix+Nx*iy+Nx*Ny*iz] =
                        stack_for_qds(Geoph, Ngeo, phase_use, xg, yg, zg, EnvMtrx, Mdata);
                } // x-loop
            } // y-loop
        } // z-loop
        MPI_Send(&offset, 1, MPI_INT, source, 2, MPI_COMM_WORLD);
        MPI_Send(&stack_result[offset], chunkSize, MPI_FLOAT, source, 1, MPI_COMM_WORLD);
    } // else
    write("stackresult.dat", stack_result);
    delete [] stack_result;
    MPI_Finalize();
Thanks in advance for your patience.
Answer 1:
You are calling write("stackresult.dat", stack_result); in all MPI ranks. As a result, they all write into, and thus overwrite, the same file, and what you see is the content written by the last MPI process to execute that statement. You should move the write into the body of the if (rank == rootProcess) conditional so that only the root process writes.
As a side note, sending the value of the rank is redundant: MPI already assigns each process a rank that ranges from 0 to #processes - 1. That also makes sending the offset redundant, since each MPI process can easily compute the offset on its own from its rank.
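To illustrate the point about computing the offset from the rank, here is a minimal sketch in plain C++, deliberately without MPI calls so it compiles anywhere. The names Slab, slabForRank, cellValue, computeSlab and assembleVolume are hypothetical and not from the question, and cellValue is only a stand-in for stack_for_qds. The sketch also assumes that every rank, including the root, computes one slab. In the posted code, with 3 processes only ranks 1 and 2 compute, so just 14 of the 21 z-layers are ever filled. The posted worker loop also indexes with offset + Nx*Ny*iz while iz already starts at startZ, so for rank 2 the slab start is counted twice and the chunk that is sent back was never written. Indexing a local buffer with (iz - begin) avoids that.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Half-open range of z-layers [begin, end) owned by one rank.
struct Slab {
    int begin;
    int end;
};

// Split Nz layers over numProcs ranks. The first (Nz % numProcs) ranks get
// one extra layer, so every layer is assigned exactly once even when Nz is
// not a multiple of numProcs.
Slab slabForRank(int rank, int numProcs, int Nz) {
    int base = Nz / numProcs;
    int extra = Nz % numProcs;
    int begin = rank * base + std::min(rank, extra);
    int count = base + (rank < extra ? 1 : 0);
    return {begin, begin + count};
}

// Hypothetical stand-in for stack_for_qds(): returns the global flat index
// of the cell so that the assembled volume is easy to check.
float cellValue(int ix, int iy, int iz, int Nx, int Ny) {
    return static_cast<float>(ix + Nx * iy + Nx * Ny * iz);
}

// Work done by one rank: fill a buffer that holds only its own slab.
// iz is the global layer (needed for the coordinates), while the buffer is
// indexed with the local layer (iz - slab.begin).
std::vector<float> computeSlab(Slab slab, int Nx, int Ny) {
    std::vector<float> local(
        static_cast<std::size_t>(Nx) * Ny * (slab.end - slab.begin));
    for (int iz = slab.begin; iz < slab.end; ++iz)
        for (int iy = 0; iy < Ny; ++iy)
            for (int ix = 0; ix < Nx; ++ix)
                local[ix + Nx * iy + Nx * Ny * (iz - slab.begin)] =
                    cellValue(ix, iy, iz, Nx, Ny);
    return local;
}

// Serial stand-in for the gather step: copy each rank's slab to its offset
// in the full volume. Cells start at -1 so a missing slab would be visible.
std::vector<float> assembleVolume(int numProcs, int Nx, int Ny, int Nz) {
    std::vector<float> volume(static_cast<std::size_t>(Nx) * Ny * Nz, -1.0f);
    for (int rank = 0; rank < numProcs; ++rank) {
        Slab slab = slabForRank(rank, numProcs, Nz);
        std::vector<float> local = computeSlab(slab, Nx, Ny);
        std::ptrdiff_t offset =
            static_cast<std::ptrdiff_t>(Nx) * Ny * slab.begin;
        std::copy(local.begin(), local.end(), volume.begin() + offset);
    }
    return volume;
}
```

In an actual MPI program the loop over ranks in assembleVolume would disappear: each process would call slabForRank with its own rank and compute only its slab, and the root could collect the pieces with MPI_Gatherv, passing the per-rank element counts and offsets as recvcounts and displs. Only the root would then call write.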
Source: https://stackoverflow.com/questions/31649447/mpi-help-on-how-to-parallelize-my-code