[Pkg-openmpi-maintainers] Bug#851918: minimum example program + hack in fftw2 that hides the bug on s390x

Boud Roukema boud-debian at cosmo.torun.pl
Sat Jan 28 02:51:50 UTC 2017


DESCRIPTION:

The openmpi bug #851918 / mpgrafic bug #851923 on the s390x
architecture appears to be related to Fortran/C interfacing,
specifically to how pointers are referenced and dereferenced.
The minimal test program below, together with an example
compilation and run, illustrates the bug.

It looks like fftw-2.1.5/mpi/fftw_f77_mpi.h is designed to prefer that
the MPI implementation be responsible for handling this interface,
i.e. the FFTW_MPI_COMM_F2C( ) preprocessor macro is preferably
set to be MPI_COMM_F2C( ). This seems to work on most of the
official architectures, but on s390x the converted value is an
invalid communicator (for openmpi, it should be a pointer of type
struct ompi_communicator_t *).

A hack that works on s390x is to artificially turn off
HAVE_MPI_COMM_F2C in fftw/config.h. This is not a sustainable
solution, because config.h is normally regenerated by the
autoconf-generated configure script. The open question seems to
be whether this should be handled in openmpi or fftw2. The fact
that it can be solved by a hack in fftw2 doesn't imply that the
source of the bug is in fftw2 rather than openmpi.

VERSIONS:

fftw version: fftw_2.1.5-4.1
openmpi version: 2.0.2~git.20161225-9
architecture: s390x
distribution: sid

MINIMAL TEST PROGRAM minimal.f90:

program minimal
   implicit none
   integer, parameter :: fftw_estimate=0
   integer, parameter :: fftw_real_to_complex=-1
   integer, parameter :: i8b=selected_int_kind(18)
   integer :: ierr, nx = 32
   integer(i8b) :: plan
#include "mpif.h"

   call mpi_init(ierr)
   call mpi_barrier(mpi_comm_world,ierr)
   call rfftw3d_f77_mpi_create_plan(plan,mpi_comm_world,nx,nx,nx, &
        fftw_real_to_complex, fftw_estimate)
   call mpi_finalize(ierr)
   stop
end program minimal


Build-Depends: fftw-dev, gfortran, mpi-default-dev, mpi-default-bin

COMPILATION:
mpifort -cpp minimal.f90 -o ./minimal -lrfftw_mpi -lfftw_mpi -lrfftw -lfftw

RUN:
mpirun -n 1 --mca plm_rsh_agent sh ./minimal

RESULT:
- amd64/jessie - the (apparently incorrect) warning
"Deprecated parameter: plm_rsh_agent" is printed twice; no errors.

- s390x/sid -

[zelenka:21289] *** An error occurred in MPI_Comm_dup
[zelenka:21289] *** reported by process [3591634945,0]
[zelenka:21289] *** on communicator MPI_COMM_WORLD
[zelenka:21289] *** MPI_ERR_COMM: invalid communicator
[zelenka:21289] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[zelenka:21289] ***    and potentially your MPI job)


SOURCE:
In fftw-2.1.5 source directory:

# --disable-float means choose double precision
autoreconf -f -i && ./configure --disable-float --enable-mpi --enable-shared && make clean && make

mkdir -p ../usr/lib ../usr/include # temporary

cp -pv */.libs/lib*[^i] ../usr/lib/ && cp -pv */*.h ../usr/include/ # install

In the minimal.f90 source directory:

mpifort -L../usr/lib -I../usr/include -cpp minimal.f90 \
   -o ./minimal -lrfftw_mpi -lfftw_mpi -lrfftw -lfftw

LD_LIBRARY_PATH=../usr/lib mpirun -n 1 --mca plm_rsh_agent sh ./minimal

Result:
[zelenka:21289] *** An error occurred in MPI_Comm_dup
[zelenka:21289] *** reported by process [3591634945,0]
[zelenka:21289] *** on communicator MPI_COMM_WORLD
[zelenka:21289] *** MPI_ERR_COMM: invalid communicator
[zelenka:21289] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[zelenka:21289] ***    and potentially your MPI job)


HACK:
--- fftw/config.h.orig  2017-01-28 01:28:14.747221802 +0000
+++ fftw/config.h       2017-01-28 02:05:13.630034126 +0000
@@ -126,8 +126,8 @@
  /* Define if you have the MPI library. */
  #define HAVE_MPI 1

-/* desc */
-#define HAVE_MPI_COMM_F2C /**/
+/* See mpi/fftw_f77_mpi.h; undefine this for s390x */
+/* #define HAVE_MPI_COMM_F2C */

  /* Define if you have POSIX threads libraries and header files. */
  /* #undef HAVE_PTHREAD */

RECOMPILE:
make && cp -pv */.libs/lib*[^i] ../usr/lib/

In the minimal.f90 source directory, compile using this hacked version of fftw-2.1.5-4.1:

mpifort -L../usr/lib -I../usr/include -cpp minimal.f90 \
   -o ./minimal -lrfftw_mpi -lfftw_mpi -lrfftw -lfftw

LD_LIBRARY_PATH=../usr/lib mpirun -n 1 --mca plm_rsh_agent sh ./minimal

This runs correctly: no errors and no output.

This hack does not count as a long-term fix, but hopefully
it will help in finding one.



