MPI_ERR_BUFFER when performing MPI I/O


Question


I am testing MPI I/O.

  subroutine save_vtk
    integer :: filetype, fh, unit
    integer(MPI_OFFSET_KIND) :: pos
    real(RP),allocatable :: buffer(:,:,:)
    integer :: ie

    if (master) then
      open(newunit=unit,file="out.vtk", &
           access='stream',status='replace',form="unformatted",action="write")
      ! write the header
      close(unit)
    end if

    call MPI_Barrier(mpi_comm,ie)

    call MPI_File_open(mpi_comm,"out.vtk", MPI_MODE_APPEND + MPI_MODE_WRONLY, MPI_INFO_NULL, fh, ie)

    call MPI_Type_create_subarray(3, int(ng), int(nxyz), int(off), &
       MPI_ORDER_FORTRAN, MPI_RP, filetype, ie)

    call MPI_Type_commit(filetype, ie)

    call MPI_Barrier(mpi_comm,ie)
    call MPI_File_get_position(fh, pos, ie)
    call MPI_Barrier(mpi_comm,ie)

    call MPI_File_set_view(fh, pos, MPI_RP, filetype, "native", MPI_INFO_NULL, ie)

    buffer = BigEnd(Phi(1:nx,1:ny,1:nz))

    call MPI_File_write_all(fh, buffer, nx*ny*nz, MPI_RP, MPI_STATUS_IGNORE, ie)

    call MPI_File_close(fh, ie)

  end subroutine
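
Roughly, the host association provides declarations like the following sketch; the exact names come from the calls above, but the kinds and shapes here are illustrative assumptions only, not my actual source:

  ! Illustrative only -- the real declarations live in the host scope.
  use mpi
  integer, parameter :: RP = kind(1.0d0)   ! working precision of Phi
  integer :: MPI_RP                        ! matching MPI datatype, e.g. MPI_DOUBLE_PRECISION
  integer :: mpi_comm                      ! communicator used by the solver
  logical :: master                        ! .true. on the rank that writes the header
  integer :: nx, ny, nz                    ! local interior extents
  integer :: ng(3), nxyz(3), off(3)        ! global sizes, local sizes, offsets
                                           ! (a wider integer kind in the real code,
                                           !  hence the int() casts above)
  real(RP), allocatable :: Phi(:,:,:)      ! the field to be written
  ! BigEnd() is assumed to be an elemental byte-swapping function that
  ! produces big-endian values for the legacy VTK format.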

The variables not declared here come from host association, and some error checking has been removed. I receive this error when running it on a national academic cluster:

*** An error occurred in MPI_Isend
*** reported by process [3941400577,18036219417246826496]
*** on communicator MPI COMMUNICATOR 20 DUP FROM 0
*** MPI_ERR_BUFFER: invalid buffer pointer
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

The error is triggered by the call to MPI_File_write_all. I suspect it may be connected with the size of the buffer, which is the full nx*ny*nz (on the order of 10^5 to 10^6 elements), but I cannot rule out a programming error on my side, as I have no prior experience with MPI I/O.

The MPI implementation used is OpenMPI 1.8.0 with Intel Fortran 14.0.2.

Do you know how to make it work and write the file?

--- Edit2 ---

Testing a simplified version, the important code remains the same; the full source is here. Notice that it works with gfortran and fails with different MPI implementations when compiled with Intel. I wasn't able to compile it with PGI. I was also wrong that it fails only across different nodes; it fails even in a single-process run.

>module ad gcc-4.8.1
>module ad openmpi-1.8.0-gcc
>mpif90 save.f90
>./a.out 
 Trying to decompose in           1           1           1 process grid.
>mpirun a.out
 Trying to decompose in           1           1           2 process grid.

>module rm openmpi-1.8.0-gcc
>module ad openmpi-1.8.0-intel
>mpif90 save.f90
>./a.out 
 Trying to decompose in           1           1           1 process grid.
 ERROR write_all
 MPI_ERR_IO: input/output error                                                 



>module rm openmpi-1.8.0-intel
>module ad openmpi-1.6-intel
>mpif90 save.f90
>./a.out 
 Trying to decompose in           1           1           1 process grid.
 ERROR write_all
 MPI_ERR_IO: input/output error                                                 



[luna24.fzu.cz:24260] *** An error occurred in MPI_File_set_errhandler
[luna24.fzu.cz:24260] *** on a NULL communicator
[luna24.fzu.cz:24260] *** Unknown error
[luna24.fzu.cz:24260] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly.  You should
double check that everything has shut down cleanly.

  Reason:     After MPI_FINALIZE was invoked
  Local host: luna24.fzu.cz
  PID:        24260
--------------------------------------------------------------------------
>module rm openmpi-1.6-intel
>module ad mpich2-intel
>mpif90 save.f90
>./a.out 
 Trying to decompose in           1           1           1 process grid.
 ERROR write_all
 Other I/O error , error stack:
ADIOI_NFS_WRITECONTIG(70): Other I/O error Bad address

Answer 1:


In the line

 buffer = BigEnd(Phi(1:nx,1:ny,1:nz))

the array buffer should be allocated automatically to the shape of the right-hand side, according to the Fortran 2003 standard (this is not the case in Fortran 95). Intel Fortran as of version 14 does not do this by default; it requires the option

-assume realloc_lhs

to do that. This option is also included (along with other options) in

-standard-semantics

Because this option was not in effect when the code in the question was compiled, the program accessed an unallocated array, and the resulting undefined behavior led to the crash.
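
A minimal, self-contained sketch of the behavior (the program below is illustrative and not part of the original code): under Fortran 2003 semantics the assignment allocates buffer automatically, whereas Intel Fortran 14 without -assume realloc_lhs writes into an unallocated array.

  program realloc_demo
    ! Demonstrates (re)allocation on assignment, a Fortran 2003 feature.
    implicit none
    real, allocatable :: buffer(:,:,:)
    real :: phi(4,4,4)

    phi = 1.0

    ! buffer is unallocated here.  Under F2003 semantics this assignment
    ! allocates it to the shape of the right-hand side.  Intel Fortran 14
    ! only does this with -assume realloc_lhs (or -standard-semantics).
    buffer = 2.0*phi(1:4,1:4,1:4)

    print *, allocated(buffer), shape(buffer)
  end program realloc_demo

Compiling the code from the question with mpif90 -assume realloc_lhs (assuming mpif90 wraps ifort, as in the transcripts above) makes the assignment allocate buffer as intended; alternatively, an explicit allocate(buffer(nx,ny,nz)) before the assignment removes the dependence on the compiler option altogether.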



Source: https://stackoverflow.com/questions/23618034/mpi-err-buffer-when-performing-mpi-i-o
