[Debichem-devel] Bug#531419: mpicc segfaults when called by fakeroot

Jeff Squyres jsquyres at cisco.com
Fri Jun 5 18:39:27 UTC 2009


On Jun 2, 2009, at 4:25 PM, Manuel Prinz wrote:

> I'm putting you in the loop since I'm quite lost here... It would be
> great if you could throw in your thoughts!

Sorry for the delay in replying; this week has been crazier than most.

> mpicc segfaults when it's called via fakeroot.

What is fakeroot?

> Since this tool is needed in the build process of Debian packages,
> packages depending on Open MPI fail to build. This blocks our
> transition to 1.3.2. Nicholas was so kind to investigate the issue;
> his results are quoted below.
>
> As far as I can say, the problem appeared somewhere between Open MPI
> 1.3 and 1.3.2. I also successfully used mpicc with fakeroot before
> Debian switched to eglibc, so this might be the cause. (Though it
> should be fully compatible to glibc. At least they claim to be.)
>
> Do you have some idea what might go wrong?

Based on the call stack, yes.  Uccckkk!!  More below...

FWIW, debugging OMPI is easier if you tell OMPI to slurp all the
plugins into its libraries, so that there are no dlopen() calls and
all the plugins are physically located in libmpi.so (and friends).
You can get better call stacks this way from corefiles, etc.

Although I think a lot of those missing symbols are in glibc, not ompi.

>> Okay, it looks like the root cause is something that's appeared
>> recently in openmpi - it fails under 1.3.2-2, but works under 1.3-2.
>> Manuel, I'm cloning the bug for tracking purposes, but I'm certainly
>> not sure that it's actually an OpenMPI bug at heart.  Have you seen
>> anything else like this?
>>
>> % echo "int main(void) { return 0; }" > test.c
>> % mpicc.openmpi test.c ; echo $?
>> 0
>> % fakeroot mpicc.openmpi test.c ; echo $?
>> Segmentation fault
>> 139
>>
>> No failures with the other MPI implementations, nor with OpenMPI 1.3.
>> I can put it under gdb but am missing some debugging libraries in the
>> middle:
>>
>> % fakeroot gdb mpicc.openmpi
>> [...]
>> (gdb) run test.c
>> Starting program: /usr/bin/mpicc.openmpi conftest.c
>> [Thread debugging using libthread_db enabled]
>> [New Thread 0xb7dc46c0 (LWP 6958)]
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> [Switching to Thread 0xb7dc46c0 (LWP 6958)]
>> __libc_calloc (n=1, elem_size=20) at malloc.c:3932
>> 3932    malloc.c: No such file or directory.
>>        in malloc.c
>> (gdb) bt
>> #0  __libc_calloc (n=1, elem_size=20) at malloc.c:3932
>> #1  0xb7f83086 in _dlerror_run (operate=0xb7f82d90 <dlsym_doit>, args=0xbfd3002c) at dlerror.c:142
>> #2  0xb7f82d43 in __dlsym (handle=0xffffffff, name=0xb800f16a "open") at dlsym.c:71
>> #3  0xb800db73 in load_library_symbols () from /usr/lib/libfakeroot/libfakeroot-sysv.so
>> #4  0xb800e687 in tmp___xstat () from /usr/lib/libfakeroot/libfakeroot-sysv.so
>> #5  0xb800daa3 in __xstat () from /usr/lib/libfakeroot/libfakeroot-sysv.so
>> #6  0xb7fbcefc in ?? () from /usr/lib/libopen-pal.so.0
>> #7  0x00000003 in ?? ()
>> #8  0xb7fc948e in ?? () from /usr/lib/libopen-pal.so.0
>> #9  0xbfd300e4 in ?? ()
>> #10 0x00001b2e in ?? ()
>> #11 0xbfd300e4 in ?? ()
>> #12 0x00000003 in ?? ()
>> #13 0x00000003 in ?? ()
>> #14 0xbfd30164 in ?? ()
>> #15 0xb7f20ff4 in ?? () from /lib/i686/cmov/libc.so.6
>> #16 0x00000001 in ?? ()
>> #17 0xb7dc8d0c in ?? () from /lib/i686/cmov/libc.so.6
>> #18 0xbfd30148 in ?? ()
>> #19 0xb7ee4429 in *__GI__dl_addr (address=0xb7e34e70, info=0xbfd30184, mapp=0xbfd30194, symbolp=0xb7f2260c) at dl-addr.c:146
>> #20 0xb7e35096 in ptmalloc_init () at arena.c:571
>> #21 0xb7e386bc in malloc_hook_ini (sz=12, caller=0xb7f939ab) at hooks.c:37
>> #22 0xb7e38535 in *__GI___libc_malloc (bytes=12) at malloc.c:3546
>> #23 0xb7f939ab in opal_class_initialize () from /usr/lib/libopen-pal.so.0
>> #24 0xb7fb3227 in opal_output_init () from /usr/lib/libopen-pal.so.0
>> #25 0xb7f96205 in opal_init_util () from /usr/lib/libopen-pal.so.0
>> #26 0x08049b62 in main (argc=2, argv=0xbfd30404) at ../../../../../opal/tools/wrappers/opal_wrapper.c:480

Ick... I have zero experience with eglibc; this *could* be a  
compatibility issue...?

In OMPI 1.3.2, we started using the __malloc_initialize_hook
functionality to get a function of ours called the first time the
memory allocation subsystem is invoked in a process.  Specifically, we
do this:

void (*__malloc_initialize_hook) (void) =
     opal_memory_ptmalloc2_malloc_init_hook;

which sets up opal_memory_ptmalloc2_malloc_init_hook() to be invoked
during the memory subsystem's init (sometimes even pre-main).  This
function is in opal/mca/memory/ptmalloc2/hooks.c.  Note that this is a
*different* hooks.c than the one listed at #21 in the stack trace
above.  That one looks like the ptmalloc2 hooks.c in eglibc, and it is
calling the eglibc ptmalloc_init(), which should then call our init
hook function (opal_memory_ptmalloc2_malloc_init_hook).  Can you step
through and see what is happening there?

I wonder if there's a bug in eglibc such that when it's looking up this
symbol, it's trying to open libopen-pal.so to find that symbol, and
something is going bad in there...?

It's weird that the gdb #2 is a dlsym of -1 (self) and it's looking
for the symbol "open"...?  I don't know enough about how dlsym works
internally; perhaps that's normal...?

-- 
Jeff Squyres
Cisco Systems