[Debichem-devel] Bug#531419: mpicc segfaults when called by fakeroot

Steve M. Robbins steve at sumost.ca
Sun Jun 7 16:04:42 UTC 2009


Hi,

Thanks to Jeff's observations, I have a workaround that should suffice
for Debian.  I don't know what the right fix should be, but I hope the
openmpi, fakeroot, and libc6 folks can take it from here.


On Fri, Jun 05, 2009 at 02:39:27PM -0400, Jeff Squyres wrote:
> On Jun 2, 2009, at 4:25 PM, Manuel Prinz wrote:
>
>> I'm putting you in the loop since I'm quite lost here... It would be
>> great if you could throw in your thoughts!
>
> Sorry for the delay in replying; this week has been crazier than most.
>
>> mpicc segfaults when it's called via fakeroot.
>
> What is fakeroot?

A Debian binary package is technically built with root privileges.
Understandably, however, most Debian package maintainers prefer not to
run someone else's source build as root on their machines.  The
utility "fakeroot" uses "LD_PRELOAD and SYSV IPC (or TCP) trickery"
(http://fakeroot.alioth.debian.org/) to provide the illusion of
building with root privileges.

In particular, libc functions like stat() are faked out.
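
As a rough illustration of what "faked out" means here, the toy
wrapper below calls the real stat() and then rewrites the ownership
fields so that the caller sees root.  It is only a sketch under stated
assumptions: toy_faked_stat() is a hypothetical name, and the real
fakeroot interposes the libc symbols through LD_PRELOAD and keeps its
state over SYSV IPC or TCP, as noted above, rather than rewriting
fields blindly.

```c
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper, not part of fakeroot: report every file as
   owned by root (uid 0, gid 0). */
int toy_faked_stat(const char *path, struct stat *st)
{
    int rc = stat(path, st);   /* ask the real libc first */
    if (rc == 0) {
        st->st_uid = 0;        /* pretend the owner is root */
        st->st_gid = 0;        /* pretend the group is root */
    }
    return rc;                 /* failures are passed through unchanged */
}
```

The point of the sketch is that a faked stat() does extra work around
the real call, so it need not behave exactly like the libc function.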


> Ick... I have zero experience with eglibc; this *could* be a  
> compatibility issue...?
>
> In OMPI 1.3.2, we started using the __malloc_initialize_hook  
> functionality to get a function of ours called at the first time the  
> memory allocation subsystem is invoked in a process.  Specifically, we  
> do this:
>
> void (*__malloc_initialize_hook) (void) =
>     opal_memory_ptmalloc2_malloc_init_hook;

Based on this, I took a look at
opal_memory_ptmalloc2_malloc_init_hook() and found:

    /* Look for sentinel files (directories) to see if various network
       drivers are loaded (yes, I know, further abstraction
       violations...).

       * All OpenFabrics devices will have files in
         /sys/class/infiniband (even iWARP)
       * Open-MX doesn't currently use a reg cache, but it might
         someday.  So be conservative and check for /dev/open-mx.
       * MX will have one or more of /dev/myri[0-9].  Yuck.
     */
    if (0 == stat("/sys/class/infiniband", &st) ||
        0 == stat("/dev/open-mx", &st) ||
        0 == stat("/dev/myri0", &st) ||
    [...]

i.e. lots of stat calls.  I was able to avoid the segfault simply by
disabling this section with #if 0 (patch attached).  This should
suffice in the short term for Debian, on the theory that Open MPI
compatibility with fakeroot is more important than Open MPI
compatibility with OpenFabrics.
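
For readers without the Open MPI source at hand, the probe excerpted
above boils down to "does stat() succeed on any of these paths?".  A
minimal sketch of that pattern follows; any_sentinel_exists() is a
hypothetical helper, not a function in Open MPI.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper: true if stat() succeeds on any listed path. */
bool any_sentinel_exists(const char *const paths[], size_t count)
{
    struct stat st;
    for (size_t i = 0; i < count; i++) {
        if (0 == stat(paths[i], &st)) {
            return true;       /* first hit is enough */
        }
    }
    return false;              /* none of the sentinels exist */
}
```

Called with a list such as "/sys/class/infiniband", "/dev/open-mx"
and "/dev/myri0", it would play the role of the found_driver test.
Note that every one of those stat() calls still goes through
fakeroot's wrapper when run under fakeroot, so restructuring the probe
like this would not by itself avoid the segfault.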

However, there is clearly a bad interaction between this code, eglibc,
and fakeroot.  Hence the cc's to the various packages.

The backtrace from a system with most of the debug libs installed
(http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=531419) shows the
calls from ptmalloc_init() through something in libopen-pal to
__xstat() in fakeroot, which in turn goes through dlsym() into
calloc():

(gdb) bt
#0  __libc_calloc (n=1, elem_size=32) at malloc.c:3932
#1  0x00007f12ca8af380 in _dlerror_run (operate=0x7f12ca8af0c0 <dlsym_doit>, args=0x7fffd314af60) at dlerror.c:142
#2  0x00007f12ca8af07a in __dlsym (handle=<value optimized out>, name=<value optimized out>) at dlsym.c:71
#3  0x00007f12cad2b016 in load_library_symbols () from /usr/lib/libfakeroot/libfakeroot-sysv.so
#4  0x00007f12cad2bc11 in tmp___xstat () from /usr/lib/libfakeroot/libfakeroot-sysv.so
#5  0x00007f12cad2af8d in __xstat () from /usr/lib/libfakeroot/libfakeroot-sysv.so
#6  0x00007f12caaea6dd in ?? () from /usr/lib/libopen-pal.so.0
#7  0x00007f12c9d148e1 in ptmalloc_init () at arena.c:571
#8  0x00007f12c9d18997 in malloc_hook_ini (sz=1, caller=0x20) at hooks.c:37
#9  0x00007f12caac2a95 in opal_class_initialize () from /usr/lib/libopen-pal.so.0
#10 0x00007f12caae0de9 in opal_output_init () from /usr/lib/libopen-pal.so.0
#11 0x00007f12caac52b3 in opal_init_util () from /usr/lib/libopen-pal.so.0
#12 0x0000000000402122 in main (argc=1, argv=0x20) at ../../../../../opal/tools/wrappers/opal_wrapper.c:480
(gdb) 

I'm speculating that allocating memory from within the
__malloc_initialize_hook is a bad thing.  The stat() in fakeroot does
allocate (frames #0 to #5 above show it reaching calloc() via
dlsym()), whereas the regular stat() presumably does not, since this
code doesn't segfault in normal use.
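
The suspected failure mode can be modelled in a few lines.  The sketch
below is a toy allocator with an initialization hook; it is not glibc
or Open MPI code, every toy_* name is hypothetical, and it only
illustrates the speculation above: a hook that calls a plain stat()
is harmless, while a hook that calls a wrapper which allocates on
first use re-enters the allocator before initialization has finished.

```c
#include <stddef.h>
#include <string.h>

/* Toy model only: none of these names exist in glibc or Open MPI. */
enum toy_state { TOY_UNINIT, TOY_INITIALIZING, TOY_READY };

static enum toy_state toy_phase;
static void (*toy_init_hook)(void);  /* plays __malloc_initialize_hook */
static int toy_reentered;            /* set when the hook allocates */
static int toy_symbols_loaded;       /* the wrapper's lazy-setup flag */
static char toy_arena[256];
static size_t toy_used;

void *toy_calloc(size_t n, size_t size)
{
    if (toy_phase == TOY_INITIALIZING) {
        /* Called from inside our own init hook.  A real allocator
           might crash here; the model just records the event. */
        toy_reentered = 1;
        return NULL;
    }
    if (toy_phase == TOY_UNINIT) {
        toy_phase = TOY_INITIALIZING;
        if (toy_init_hook != NULL) {
            toy_init_hook();
        }
        toy_phase = TOY_READY;
    }
    if (size != 0 && n > (sizeof toy_arena - toy_used) / size) {
        return NULL;                 /* out of toy memory */
    }
    void *p = toy_arena + toy_used;
    toy_used += n * size;
    memset(p, 0, n * size);
    return p;
}

/* Stands in for the regular stat(): no allocation at all. */
static int toy_plain_stat(const char *path)
{
    (void)path;
    return -1;
}

/* Stands in for a wrapped stat() that resolves symbols on first use,
   which allocates (compare dlsym() -> calloc() in the backtrace). */
static int toy_wrapped_stat(const char *path)
{
    if (!toy_symbols_loaded) {
        toy_symbols_loaded = 1;
        (void)toy_calloc(1, 32);
    }
    return toy_plain_stat(path);
}

static void toy_hook_plain(void)
{
    (void)toy_plain_stat("/sys/class/infiniband");
}

static void toy_hook_wrapped(void)
{
    (void)toy_wrapped_stat("/sys/class/infiniband");
}

/* Returns 1 if the init hook re-entered the allocator, else 0. */
int toy_run(int use_wrapper)
{
    toy_phase = TOY_UNINIT;
    toy_reentered = 0;
    toy_symbols_loaded = 0;
    toy_used = 0;
    toy_init_hook = use_wrapper ? toy_hook_wrapped : toy_hook_plain;
    (void)toy_calloc(1, 32);         /* first allocation triggers init */
    return toy_reentered;
}
```

In this model toy_run(0) reports no re-entry and toy_run(1) reports
one.  Whether the real allocator misbehaves for the same reason is
exactly the open question.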

As I say, I hope the more knowledgeable folks can figure this out.
I'm in way over my head now. ;-)

Thanks,
-Steve

-------------- next part --------------
--- opal/mca/memory/ptmalloc2/hooks.c.orig	2009-06-07 11:02:46.000000000 -0500
+++ opal/mca/memory/ptmalloc2/hooks.c	2009-06-07 10:36:06.000000000 -0500
@@ -725,6 +725,7 @@
     check_result_t lpp = check("OMPI_MCA_mpi_leave_pinned_pipeline");
     bool want_rcache = false, found_driver = false;
 
+#if 0
     /* Look for sentinel files (directories) to see if various network
        drivers are loaded (yes, I know, further abstraction
        violations...).
@@ -749,6 +750,7 @@
         0 == stat("/dev/myri9", &st)) {
         found_driver = true;
     }
+#endif
     
     /* Simple combination of the results of these two environment
        variables (if both "yes" and "no" are specified, then be