[Pkg-openmpi-maintainers] Bug#851918: openmpi: MPI_Comm_dup error on s390x, ppc64, sparc64 builds of mpgrafic - maybe from openmpi

Boud Roukema boud-debian at cosmo.torun.pl
Thu Jan 19 23:46:10 UTC 2017


Source: openmpi
Version: openmpi-2.0.2~git.20161225-8
Severity: normal

Dear Maintainer,

In the Debian automated build of mpgrafic-0.3.7.6-2 (Debian
downstream version) on the s390x architecture, `make check' calls
`regression-test-0.3.7.sh', which runs mpgrafic with a standard
input file and expects a standard output file. Instead, the run
aborts with this fatal error:

* lines 878-884 of the html source of
https://buildd.debian.org/status/fetch.php?pkg=mpgrafic&arch=s390x&ver=0.3.7.6-2&stamp=1484854191&raw=0

    878  This looks like a debian openmpi system.
    879  [zandonai:4650] *** An error occurred in MPI_Comm_dup
    880  [zandonai:4650] *** reported by process [4180410369,0]
    881  [zandonai:4650] *** on communicator MPI_COMM_WORLD
    882  [zandonai:4650] *** MPI_ERR_COMM: invalid communicator
    883  [zandonai:4650] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
    884  [zandonai:4650] ***    and potentially your MPI job)

The ppc64 and sparc64 builds give similar messages, though ppc64 gives a longer
traceback, in which mpirun then segfaults inside orte_finalize:

[ookuninushi:9401] *** An error occurred in MPI_Comm_dup
[ookuninushi:9401] *** reported by process [1293680641,0]
[ookuninushi:9401] *** on communicator MPI_COMM_WORLD
[ookuninushi:9401] *** MPI_ERR_COMM: invalid communicator
[ookuninushi:9401] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ookuninushi:9401] ***    and potentially your MPI job)
[ookuninushi:09397] *** Process received signal ***
[ookuninushi:09397] Signal: Segmentation fault (11)
[ookuninushi:09397] Signal code: Address not mapped (1)
[ookuninushi:09397] Failing at address: 0x30
[ookuninushi:09397] [ 0] linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x3fffb1250478]
[ookuninushi:09397] [ 1] /usr/lib/powerpc64-linux-gnu/libpmix.so.0(+0x29e54)[0x3fffad789e54]
[ookuninushi:09397] [ 2] /usr/lib/powerpc64-linux-gnu/libpmix.so.0(+0x29c8c)[0x3fffad789c8c]
[ookuninushi:09397] [ 3] /usr/lib/powerpc64-linux-gnu/libpmix.so.0(+0x2a16c)[0x3fffad78a16c]
[ookuninushi:09397] [ 4] /usr/lib/powerpc64-linux-gnu/libpmix.so.0(pmix_rte_finalize-0x3bfb4)[0x3fffad7f1cac]
[ookuninushi:09397] [ 5] /usr/lib/powerpc64-linux-gnu/libpmix.so.0(OPAL_MCA_PMIX3X_PMIx_server_finalize-0x70d00)[0x3fffad7bb298]
[ookuninushi:09397] [ 6] /usr/lib/powerpc64-linux-gnu/openmpi/lib/openmpi/mca_pmix_pmix3x.so(pmix3x_server_finalize-0x1bf9c)[0x3fffad853a1c]
[ookuninushi:09397] [ 7] /usr/lib/powerpc64-linux-gnu/libopen-rte.so.20(pmix_server_finalize-0x7fa00)[0x3fffb11ab2c0]
[ookuninushi:09397] [ 8] /usr/lib/powerpc64-linux-gnu/openmpi/lib/openmpi/mca_ess_hnp.so(+0x4064)[0x3fffb09d4064]
[ookuninushi:09397] [ 9] /usr/lib/powerpc64-linux-gnu/libopen-rte.so.20(orte_finalize-0xbf7d8)[0x3fffb11697f0]
[ookuninushi:09397] [10] mpirun[0x10001730]
[ookuninushi:09397] [11] mpirun[0x10000f38]
[ookuninushi:09397] [12] /lib/powerpc64-linux-gnu/libc.so.6(+0x46388)[0x3fffb0c86388]
[ookuninushi:09397] [13] /lib/powerpc64-linux-gnu/libc.so.6(__libc_start_main-0x187a18)[0x3fffb0c865d8]
[ookuninushi:09397] *** End of error message ***
Segmentation fault


I suspect that this comes from openmpi or fftw2, since MPI_Comm_dup is not called
directly from mpgrafic.

https://anonscm.debian.org/git/debian-science/packages/fftw.git/tree/mpi/transpose_mpi.c

    102       /* create a new "clone" communicator so that transpose
    103          communications do not interfere with caller communications. */
    104       MPI_Comm_dup(transpose_comm, &comm);

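Openmpi's configure script already detects endianness through an
autoconf macro, which defines WORDS_BIGENDIAN on big-endian hosts and
so makes it possible to guard endianness-specific code: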

line 930 of openmpi-2.0.2~git.20161225/configure.ac is:
AC_C_BIGENDIAN
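
A minimal sketch of how C code typically consumes the WORDS_BIGENDIAN
symbol that AC_C_BIGENDIAN defines on big-endian hosts, next to an
independent runtime check. Both function names are hypothetical and are
not taken from openmpi; in a real source tree the macro would arrive via
the generated config header.

```c
#include <stdint.h>
#include <string.h>

/* Compile-time answer: 1 if the build defined WORDS_BIGENDIAN, else 0.
   Hypothetical helper, for illustration only. */
static int built_as_big_endian(void)
{
#ifdef WORDS_BIGENDIAN
    return 1;
#else
    return 0;
#endif
}

/* Runtime answer: look at the first byte in memory of a known
   32-bit value. On a big-endian host that byte is the most
   significant one (0x01). */
static int running_big_endian(void)
{
    const uint32_t probe = 0x01020304u;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 0x01;
}
```

If the two functions disagree on a given buildd, the configure result
did not reach the compiled code, which would be worth ruling out before
looking for a missing #ifdef.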

Since s390x, ppc64 and sparc64 are all big-endian architectures, my
hypothesis is that the bug is endianness-related: an #ifdef WORDS_BIGENDIAN
may be needed somewhere in ompi_comm_dup_with_info, or at least somewhere
in relation to lines 967-1025 of
openmpi-2.0.2~git.20161225/ompi/communicator/comm.c
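
To make the hypothesis concrete, here is a sketch of one general class
of bug that shows up only on 64-bit big-endian machines: a value stored
in a 64-bit slot is read back as if the slot were a 32-bit int. This is
an illustration only, not a claim about what ompi_comm_dup_with_info or
fftw actually does, and both function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Read only the first four bytes of a 64-bit slot, as code would if
   it mistook the slot for a 32-bit int. Result depends on the host. */
static uint32_t read_first_half(uint64_t slot)
{
    uint32_t first_half;
    memcpy(&first_half, &slot, sizeof first_half);
    return first_half;
}

/* The same read under an explicitly chosen byte order, so that both
   outcomes can be compared on any host. */
static uint32_t read_first_half_as(uint64_t value, int big_endian)
{
    unsigned char bytes[8];
    uint32_t out = 0;

    /* Lay the value out in memory in the requested byte order. */
    for (int i = 0; i < 8; i++) {
        int shift = big_endian ? 8 * (7 - i) : 8 * i;
        bytes[i] = (unsigned char)(value >> shift);
    }
    /* Reassemble the first four bytes in that same byte order. */
    for (int i = 0; i < 4; i++) {
        int shift = big_endian ? 8 * (3 - i) : 8 * i;
        out |= (uint32_t)bytes[i] << shift;
    }
    return out;
}
```

With a small value such as 5, the little-endian read still returns 5,
so the mistake goes unnoticed, while the big-endian read returns 0. A
handle corrupted in that way could plausibly be rejected as an invalid
communicator on exactly the architectures listed above.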

Thanks to James Clarke for help in the attempted bug trace above!

Cheers
Boud

-- System Information:
Debian Release: sid
Architecture: s390x



