[Pkg-openmpi-maintainers] Bug#851918: MPI_Comm_dup error still occurs on s390x, ppc64, sparc64 builds of mpgrafic-0.3.9-1

Boud Roukema boud-debian at cosmo.torun.pl
Mon Jan 23 11:30:22 UTC 2017


Just for the record: in mpgrafic-0.3.9-1, the symptoms of this
bug remain: mpgrafic-0.3.9-1 fails to build on s390x, ppc64 and sparc64:

https://buildd.debian.org/status/fetch.php?pkg=mpgrafic&arch=s390x&ver=0.3.9-1&stamp=1485163445&raw=0

FAIL: regression-test-0.3.7.9.sh
================================

Code regression test: does the standard input file generate an
output file identical to that produced by version 0.3.7.9
of mpgrafic? IEEE_UNDERFLOW_FLAG and IEEE_DENORMAL exceptions
are considered acceptable and ignored. Other minor warnings are
ignored too on some systems/implementations.

This looks like a debian openmpi system.
[zemlinsky:3690] *** An error occurred in MPI_Comm_dup
[zemlinsky:3690] *** reported by process [1324089345,0]
[zemlinsky:3690] *** on communicator MPI_COMM_WORLD
[zemlinsky:3690] *** MPI_ERR_COMM: invalid communicator
[zemlinsky:3690] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[zemlinsky:3690] ***    and potentially your MPI job)
Warning: A 32^3 mpi run of mpgrafic did not match the expected output.

Equivalent output is on:

https://buildd.debian.org/status/fetch.php?pkg=mpgrafic&arch=ppc64&ver=0.3.9-1&stamp=1485164619&raw=0

https://buildd.debian.org/status/fetch.php?pkg=mpgrafic&arch=sparc64&ver=0.3.9-1&stamp=1485164111&raw=0

with the process lines:

[ookuninushi:8387] *** reported by process [1226178561,0]

[landau:151008] *** reported by process [2975596545,0]


Possible clue: Does the line "*** reported by process [1324089345,0]" mean
that the PID is assumed to be 1324089345? That sounds extremely high
for a PID. Unsigned 32-bit integers go up to approx 4.3e9 (2^32 - 1). A build
of mpgrafic should not spawn a billion or so processes :P.
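
A quick arithmetic check on the three reported values may help here. The sketch below splits each number into its upper and lower 16 bits. The 16/16 split and the helper name split_16_16 are assumptions made for illustration; nothing in this report confirms how Open MPI builds this field.

```python
# First number from each "reported by process [N,0]" line, keyed by buildd host.
reported = {
    "zemlinsky": 1324089345,
    "ookuninushi": 1226178561,
    "landau": 2975596545,
}


def split_16_16(value):
    """Split a 32-bit value into (upper 16 bits, lower 16 bits).

    Hypothetical helper: the 16/16 layout is a guess, not a documented format.
    """
    return value >> 16, value & 0xFFFF


parts = {host: split_16_16(value) for host, value in reported.items()}

for host, (high, low) in parts.items():
    value = reported[host]
    print(f"{host}: {value} = 0x{value:08X} -> high={high}, low={low}")
```

All three values fit in 32 bits and all three have a lower half equal to 1, which would be an odd coincidence for operating-system PIDs. That hints that the first field may be a packed identifier of some kind rather than a PID, though this remains a guess.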

This looks like a report from lines 25-29 of ompi/errhandler/help-mpi-errors.txt

     25  [mpi_errors_are_fatal]
     26  %s *** An error occurred %s %s
     27  %s *** reported by process [%lu,%lu]
     28  %s *** on %s %s
     29  %s *** %s


Another possible clue: line 988 of ompi/communicator/comm.c

  988      rc =  ompi_comm_set ( &newcomp,                               /* new comm */

uses a pointer to a pointer for the new communicator. So a bug related
to Fortran-C interfacing (non-use of iso_c_binding?), combined with
endianness effects on a pointer to a pointer, might be the source of the bug.
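
To illustrate why endianness could matter, here is a generic sketch of a size mismatch across a language boundary: one side stores a 64-bit value, the other reads only the first 4 bytes as a 32-bit integer. It is not taken from the Open MPI or mpgrafic sources; the function name read_first_4_bytes and the sample value 3 are hypothetical, and whether such a mismatch actually occurs in this bug is unverified.

```python
import struct


def read_first_4_bytes(value64, byteorder):
    """Store value64 as a 64-bit integer in the given byte order, then
    reinterpret only the first 4 bytes in memory as a 32-bit integer.

    Hypothetical helper simulating a 64-bit/32-bit mismatch between caller
    and callee; the byte order is simulated, not taken from the host.
    """
    prefix = "<" if byteorder == "little" else ">"
    raw = struct.pack(prefix + "q", value64)
    return struct.unpack(prefix + "i", raw[:4])[0]


little = read_first_4_bytes(3, "little")
big = read_first_4_bytes(3, "big")

print("little-endian reader sees:", little)
print("big-endian reader sees:   ", big)
```

On a little-endian layout the truncated read still yields the small value, so such a mismatch can go unnoticed on amd64 and similar. On a big-endian layout the first 4 bytes are the high-order half, so the reader sees 0 instead. s390x, ppc64 and sparc64 are all 64-bit big-endian, which fits the pattern of failing builds.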



