[Pkg-openmpi-maintainers] Bug#851918: MPI_Comm_dup error still occurs on s390x, ppc64, sparc64 builds of mpgrafic-0.3.9-1
Boud Roukema
boud-debian at cosmo.torun.pl
Mon Jan 23 11:30:22 UTC 2017
Just for the record: in mpgrafic-0.3.9-1, the symptoms of this
bug remain. mpgrafic-0.3.9-1 fails to build on s390x, ppc64 and sparc64:
https://buildd.debian.org/status/fetch.php?pkg=mpgrafic&arch=s390x&ver=0.3.9-1&stamp=1485163445&raw=0
FAIL: regression-test-0.3.7.9.sh
================================
Code regression test: does the standard input file generate an
output file identical to that produced by version 0.3.7.9
of mpgrafic? IEEE_UNDERFLOW_FLAG and IEEE_DENORMAL exceptions
are considered acceptable and ignored. Other minor warnings are
ignored too on some systems/implementations.
This looks like a debian openmpi system.
[zemlinsky:3690] *** An error occurred in MPI_Comm_dup
[zemlinsky:3690] *** reported by process [1324089345,0]
[zemlinsky:3690] *** on communicator MPI_COMM_WORLD
[zemlinsky:3690] *** MPI_ERR_COMM: invalid communicator
[zemlinsky:3690] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[zemlinsky:3690] *** and potentially your MPI job)
Warning: A 32^3 mpi run of mpgrafic did not match the expected output.
Equivalent output is on:
https://buildd.debian.org/status/fetch.php?pkg=mpgrafic&arch=ppc64&ver=0.3.9-1&stamp=1485164619&raw=0
https://buildd.debian.org/status/fetch.php?pkg=mpgrafic&arch=sparc64&ver=0.3.9-1&stamp=1485164111&raw=0
with the process lines:
[ookuninushi:8387] *** reported by process [1226178561,0]
[landau:151008] *** reported by process [2975596545,0]
Possible clue: Does the line "*** reported by process [1324089345,0]" mean
that the PID is assumed to be 1324089345? That sounds extremely high
for a PID. Unsigned 32-bit integers go up to approx 4e9. A build of mpgrafic
should not spawn a billion or so processes :P.
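An alternative reading, sketched below, is that the first number is not a PID at all but a packed 32-bit job identifier. This is an assumption, not something confirmed by the build logs: the sketch assumes the upper 16 bits and lower 16 bits are two separate fields, and the function name split_jobid is made up for the illustration.

```python
# Assumption (not confirmed by the logs): the first number in
# "[1324089345,0]" is a packed 32-bit identifier, not a PID.
def split_jobid(jobid):
    """Return (upper 16 bits, lower 16 bits) of a 32-bit value."""
    return (jobid >> 16) & 0xFFFF, jobid & 0xFFFF

# First fields of the "reported by process" lines in the three build logs.
reported = {
    "zemlinsky": 1324089345,
    "ookuninushi": 1226178561,
    "landau": 2975596545,
}

for host, jobid in reported.items():
    upper, lower = split_jobid(jobid)
    print(f"{host}: 0x{jobid:08X} -> upper={upper}, lower={lower}")
```

On all three hosts the lower 16 bits come out as 1 and only the upper 16 bits differ, which fits a packed identifier better than a process count.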
This looks like a report from lines 25-29 of ompi/errhandler/help-mpi-errors.txt
25 [mpi_errors_are_fatal]
26 %s *** An error occurred %s %s
27 %s *** reported by process [%lu,%lu]
28 %s *** on %s %s
29 %s *** %s
Another possible clue: line 988 of ompi/communicator/comm.c
988 rc = ompi_comm_set ( &newcomp, /* new comm */
uses a pointer to a pointer for the new communicator. So a bug related
to Fortran-C interfacing (non-use of iso_C_binding?), combined with endianness
handling for a pointer to a pointer, might be the source of the bug.
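To show why endianness could matter here, the sketch below illustrates the general class of bug: one side of an interface stores an 8-byte value, and the other side reads only the first 4 bytes. It is not a diagnosis of Open MPI or mpgrafic; the handle value and the function name are hypothetical. s390x, ppc64 and sparc64 are all big-endian 64-bit architectures.

```python
import struct

def read_first_4_bytes(value64, byteorder):
    """Store value64 as 8 bytes in the given byte order, then read the
    first 4 bytes back as a 32-bit integer, as a mismatched
    4-byte/8-byte interface would."""
    prefix = "<" if byteorder == "little" else ">"
    raw = struct.pack(prefix + "q", value64)
    return struct.unpack(prefix + "i", raw[:4])[0]

handle = 3  # hypothetical small handle value, for illustration only

# Little-endian: the low-order bytes come first, so the value survives.
print("little-endian:", read_first_4_bytes(handle, "little"))
# Big-endian: the high-order (zero) bytes come first, so the value is lost.
print("big-endian:   ", read_first_4_bytes(handle, "big"))
```

A mismatch of this kind would go unnoticed on little-endian builds such as amd64 while producing an invalid handle on big-endian ones.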