This is sources Bugzilla
Bugzilla Version 2.17.5
Bugzilla Bug 654
Cancelling nptl thread on dlclose() leads to application hangup
Last modified: 2006-05-02 22:03
Bug#: 654
Reporter: Alexei Khlebnikov <alexei.khlebnikov@datacon.at>
Status: NEW
Assigned To: Ulrich Drepper <drepper@redhat.com>

Attachment                                          Description                     Type                      Created
dlopen-and-thread-bug-testcase-2005-01-12.tar.bz2   Testcase for the bug.           application/octet-stream  2005-01-12 10:49
glibc-dlclose-unlock.patch                          Proposed patch, the first try.  patch                     2005-01-17 12:47

Description:   Opened: 2005-01-12 10:47
Overview Description:
The program loads a module (a .so library) using dlopen(). During loading, a
global C++ object is created. It does not matter how it is created, either as a
global variable or as an object allocated with new from an
__attribute__ ((constructor)) function: the bug is triggered in both cases. The
constructor of this object spawns a thread. Then the program unloads the
dynamically loaded module. The destructor of the object is called, and it calls
a function that tries to cancel the spawned thread. The thread uses
PTHREAD_CANCEL_DEFERRED and periodically checks for cancellation with
pthread_testcancel(), so it receives the cancellation request. The main
thread calls pthread_join() to join the second thread, and the whole program
hangs. If the function that cancels the second thread is called explicitly
(not from the destructor) before the module is unloaded, the second thread is
cancelled and joined without problems.
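
For illustration, the thread handling described above can be sketched roughly
as follows. This is not the code from the attached testcase: the names worker
and start_cancel_join are hypothetical, and the sketch only shows the pattern
that works, where the cancel and the join are issued from ordinary code. The
hang appears when the same two calls are made from a destructor that runs
inside dlclose().

```cpp
#include <cassert>
#include <pthread.h>
#include <time.h>

// Hypothetical worker: loops until it is cancelled.
static void* worker(void*) {
    int old_type = 0;
    pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, &old_type);
    // Deferred cancellation is already the POSIX default for new threads.
    assert(old_type == PTHREAD_CANCEL_DEFERRED);
    for (;;) {
        pthread_testcancel();                 // explicit cancellation point
        timespec ts = {0, 10 * 1000 * 1000};  // sleep for 10 ms
        nanosleep(&ts, nullptr);              // also a cancellation point
    }
    return nullptr;
}

// Starts the worker, then cancels and joins it from the calling thread.
// Returns true if the worker terminated because of the cancellation request.
bool start_cancel_join() {
    pthread_t tid;
    if (pthread_create(&tid, nullptr, worker, nullptr) != 0)
        return false;
    if (pthread_cancel(tid) != 0)
        return false;
    void* result = nullptr;
    if (pthread_join(tid, &result) != 0)
        return false;
    return result == PTHREAD_CANCELED;
}
```

Calling start_cancel_join() from main() should return promptly, which matches
the behaviour of the explicit-shutdown case in this report.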


Steps to Reproduce:
1) Unpack the attached tarball. It is a trimmed-down testcase
taken from the actual, larger application.
2) Run "./compile" to compile the test program and the module.
3) Run "./run" to see messages and the program hangup.
4) Press Ctrl-C to reclaim the command prompt.
5) Run "./test ./libtestmod.so foo" to see the normal program behaviour
when the thread is cancelled explicitly.
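
The driver side of these steps presumably looks something like the sketch
below. It is an assumption, not the attached test program: load_and_unload and
its explicit_shutdown flag are hypothetical, and although modShutdown is the
name printed in the module's output, that it is exported with C linkage and
takes no arguments is only a guess.

```cpp
#include <cassert>
#include <cstdio>
#include <dlfcn.h>

// Loads the module at `path`, optionally calls its shutdown hook first,
// then unloads it. Returns false if the module cannot be loaded or unloaded.
bool load_and_unload(const char* path, bool explicit_shutdown) {
    assert(path != nullptr);
    void* handle = dlopen(path, RTLD_NOW);  // runs the module's constructors
    if (handle == nullptr) {
        const char* msg = dlerror();
        std::fprintf(stderr, "dlopen failed: %s\n", msg ? msg : "(unknown)");
        return false;
    }
    if (explicit_shutdown) {
        // Assumed symbol name and signature, see the note above.
        auto shutdown_fn =
            reinterpret_cast<void (*)()>(dlsym(handle, "modShutdown"));
        if (shutdown_fn != nullptr)
            shutdown_fn();                  // cancel and join before unloading
    }
    return dlclose(handle) == 0;            // runs the module's destructors
}
```

With explicit_shutdown set to false, the cancel and join happen only in the
destructor that dlclose() runs, which is the case that hangs in this report.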


Actual Results:
1) Output of running "./run" or "./test ./libtestmod.so".
---
$ ./run
loading ./libtestmod.so now
Constructor called
hi there, new thread is up and running, thread id is -1210377296
Constructor finished
pureShutdown::func(void*) called
= thread -1210377296 is still running...
= thread -1210377296 is still running...
= thread -1210377296 is still running...
= thread -1210377296 is still running...
unloading ./libtestmod.so now
Destructor called
modShutdown() called
bye, cancelling down thread -1210377296
running pthread_join(g_tid, &result) ...
---
(the program hangs here)

2) Output of running "./test ./libtestmod.so foo".
---
$ ./test ./libtestmod.so foo
loading ./libtestmod.so now
Constructor called
hi there, new thread is up and running, thread id is -1210377296
Constructor finished
pureShutdown::func(void*) called
= thread -1210377296 is still running...
= thread -1210377296 is still running...
= thread -1210377296 is still running...
= thread -1210377296 is still running...
modShutdown() called
bye, cancelling down thread -1210377296
running pthread_join(g_tid, &result) ...
returned from pthread_join(g_tid, &result) !
all's well that end's well
modShutdown() finished
unloading ./libtestmod.so now
Destructor called
modShutdown() called
Destructor finished
---
(the program exits with code 0 here)

3) GDB session of the first case (running "./run" or "./test ./libtestmod.so").
---
$ gdb
GNU gdb 6.1.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu".
(gdb) file ./test
Reading symbols from ./test...done.
Using host libthread_db library "/lib/libthread_db.so.1".
(gdb) run ./libtestmod.so
Starting program: /home/ses/test/test ./libtestmod.so
[Thread debugging using libthread_db enabled]
[New Thread -1210374480 (LWP 2594)]
loading ./libtestmod.so now
Constructor called
[New Thread -1210377296 (LWP 2597)]
hi there, new thread is up and running, thread id is -1210377296
Constructor finished
pureShutdown::func(void*) called
= thread -1210377296 is still running...
= thread -1210377296 is still running...
= thread -1210377296 is still running...
= thread -1210377296 is still running...
unloading ./libtestmod.so now
Destructor called
modShutdown() called
bye, cancelling down thread -1210377296
running pthread_join(g_tid, &result) ...

Program received signal SIG32, Real-time event 32.
[Switching to Thread -1210377296 (LWP 2597)]
0xffffe410 in ?? ()
(gdb) bt
#0  0xffffe410 in ?? ()
#1  0xb7db1468 in ?? ()
#2  0xb7fd6ff8 in ?? () from /lib/libpthread.so.0
#3  0x00000000 in ?? ()
#4  0xb7fd2cf6 in __nanosleep_nocancel () from /lib/libpthread.so.0
#5  0xb7fe3ddd in pureShutdown::func () at module.cpp:71
#6  0xb7fcd3c0 in start_thread () from /lib/libpthread.so.0
#7  0xb7e6c24e in clone () from /lib/libc.so.6
(gdb) kill
Kill the program being debugged? (y or n) y
(gdb) quit
$
---
kill -l did not print what SIG32 is. Google said that it is SIGTRAP.


Expected Results: in the first case the program should not hang. The
second thread should terminate correctly, the module should be unloaded
correctly, and the whole program should exit with code 0.


Build Date: 2005-01-12


System information:
Processor: Pentium III (Coppermine) 667.080 MHz
Distribution: Linux From Scratch 6.0 with RPM and some packages updated
Kernel version: 2.6.9, unpatched AFAIK

Glibc version: 
Snapshot of 2005-01-10 from
ftp://sources.redhat.com/pub/glibc/snapshots/glibc-20050110.tar.bz2:
---
GNU C Library development release version 2.3.90, by Roland McGrath et al.
Copyright (C) 2004 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 3.4.3.
Compiled on a Linux 2.6.9 system on 2005-01-11.
Available extensions:
        GNU libio by Per Bothner
        crypt add-on version 2.1 by Michael Glad and others
        Native POSIX Threads Library by Ulrich Drepper et al
        BIND-8.2.3-T5B
        NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
Thread-local storage support included.
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.
---
Sorry for not trying the latest CVS. I do not have access to outside CVS
servers from my corporate network. Judging from
[glibc]/libc/nptl/ChangeLog on CvsWeb, nothing has changed in nptl during the
last 2 days.

Glibc "./configure" switches (excluding "--*dir=" switches):
---
    --disable-profile \
    --enable-add-ons=nptl \
    --with-tls \
    --with-__thread \
    --enable-kernel=2.6.9 \
    --without-cvs \
    --with-headers=/usr/src/linux-2.6.9/include
---
Glibc was built into the rpm packages with the aid of rpm.

GCC version:
---
$ gcc -v
Reading specs from /usr/lib/gcc/i686-pc-linux-gnu/3.4.3/specs
Configured with: ../gcc-3.4.3/configure --host=i686-pc-linux-gnu
--build=i686-pc-linux-gnu --target=i686-pc-linux-gnu --prefix=/usr
--exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc
--datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib
--libexecdir=/usr/lib --localstatedir=/var --sharedstatedir=/usr/com
--mandir=/usr/share/man --infodir=/usr/share/info --enable-shared
--enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu
--enable-languages=c,c++
Thread model: posix
gcc version 3.4.3
---

Ld/Binutils version:
---
$ ld -v
GNU ld version 2.15.91.0.1 20040527
---

I hope all the provided information helps. If you need more info, please
feel free to ask. Also feel free to request additional testing/investigation.
Advice on how to write the patch myself would also be helpful.

------- Additional Comment #1 From Alexei Khlebnikov 2005-01-12 10:49 -------
Created an attachment (id=350)
Testcase for the bug.

------- Additional Comment #2 From Alexei Khlebnikov 2005-01-13 12:30 -------
I've tested the same testcase on another system, with kernel 2.4.20 and glibc
2.3.2 with linuxthreads. The program ran just fine. The test was conducted
today, 2005-01-13.

The output:
---
$ ./run
loading ./libtestmod.so now
Constructor called
pureShutdown::func(void*) called
hi there, new thread is up and running, thread id is 16386
Constructor finished
= thread 16386 is still running...
= thread 16386 is still running...
= thread 16386 is still running...
= thread 16386 is still running...
unloading ./libtestmod.so now
Destructor called
modShutdown() called
bye, cancelling down thread 16386
running pthread_join(g_tid, &result) ...
returned from pthread_join(g_tid, &result) !
all's well that end's well
modShutdown() finished
Destructor finished
$
---

System information:
CPU: Intel(R) Xeon(TM) CPU 2.80GHz
Distribution: SuSE Linux 8.2
Kernel: 2.4.20-64GB-SMP, from the SuSE distribution

Glibc version:
---
$ /lib/libc.so.6
GNU C Library stable release version 2.3.2, by Roland McGrath et al.
Copyright (C) 2003 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 3.3 20030226 (prerelease) (SuSE Linux).
Compiled on a Linux 2.4.20 system on 2003-03-13.
Available extensions:
        GNU libio by Per Bothner
        crypt add-on version 2.1 by Michael Glad and others
        linuxthreads-0.10 by Xavier Leroy
        NoVersion patch for broken glibc 2.0 binaries
        BIND-8.2.3-T5B
        libthread_db work sponsored by Alpha Processor Inc
        NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
Report bugs using the `glibcbug' script to <bugs@gnu.org>.
---

GCC version:
---
$ gcc -v
Reading specs from /usr/lib/gcc-lib/i486-suse-linux/3.3/specs
Configured with: ../configure --enable-threads=posix --prefix=/usr
--with-local-prefix=/usr/local --infodir=/usr/share/info --mandir=/usr/share/man
--libdir=/usr/lib --enable-languages=c,c++,f77,objc,java,ada --disable-checking
--enable-libgcj --with-gxx-include-dir=/usr/include/g++ --with-slibdir=/lib
--with-system-zlib --enable-shared --enable-__cxa_atexit i486-suse-linux
Thread model: posix
gcc version 3.3 20030226 (prerelease) (SuSE Linux)
---

Ld/Binutils version:
---
$ ld -v
GNU ld version 2.13.90.0.18 20030121 (SuSE Linux)
---

------- Additional Comment #3 From Alexei Khlebnikov 2005-01-13 12:52 -------
I've investigated the problem further.
I've found a (not very precise) place in the libc where the hangup takes place.
It's the file nptl/pthread_join.c, line 86, which looks like
---
  /* Wait for the child.  */
  lll_wait_tid (pd->tid);
---

lll_wait_tid is a macro with assembler code which I don't understand so far:
---
/* The kernel notifies a process with uses CLONE_CLEARTID via futex
   wakeup when the clone terminates.  The memory location contains the
   thread ID while the clone is running and is reset to zero
   afterwards.

   The macro parameter must not have any side effect.  */
#define lll_wait_tid(tid) \
  do {									      \
    int __ignore;							      \
    register __typeof (tid) _tid asm ("edx") = (tid);			      \
    if (_tid != 0)							      \
      __asm __volatile (LLL_EBX_LOAD					      \
			"1:\tmovl %1, %%eax\n\t"			      \
			LLL_ENTER_KERNEL				      \
			"cmpl $0, (%%ebx)\n\t"				      \
			"jne,pn 1b\n\t"					      \
			LLL_EBX_LOAD					      \
			: "=&a" (__ignore)				      \
			: "i" (SYS_futex), LLL_EBX_REG (&tid), "S" (0),	      \
			  "c" (FUTEX_WAIT), "d" (_tid),			      \
			  "i" (offsetof (tcbhead_t, sysinfo)));		      \
  } while (0)
---
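
As far as the assembler can be read, the macro loads the thread ID once and
then repeats a FUTEX_WAIT system call on its address until the kernel has
reset the value to zero. A rough C++ equivalent is sketched below. It is a
simplified reading, not glibc code: wait_tid is a hypothetical name, and the
call goes through the generic syscall() wrapper instead of the sysinfo entry
point that the macro uses.

```cpp
#include <cassert>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Blocks until the int at tid_addr becomes zero. The kernel clears that
// location and issues a futex wakeup when the thread exits.
void wait_tid(volatile int* tid_addr) {
    assert(tid_addr != nullptr);
    int expected = *tid_addr;            // read once, like the "d" (_tid) operand
    if (expected == 0)
        return;                          // thread has already terminated
    do {
        // Sleep only while *tid_addr still holds the expected thread ID.
        // The NULL timeout matches the "S" (0) operand of the macro.
        syscall(SYS_futex, tid_addr, FUTEX_WAIT, expected, nullptr);
    } while (*tid_addr != 0);            // "cmpl $0, (%ebx); jne 1b"
}
```

So pthread_join() stays in this loop for as long as the cancelled thread has
not actually exited, which is consistent with the hang being observed here.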

------- Additional Comment #4 From Jakub Jelinek 2005-01-13 13:15 -------
This is the same deadlock as has been fixed by:
2004-07-07  Ulrich Drepper  <drepper@redhat.com>

        * elf/dl-fini.c (_dl_fini): Move the unlock of the ld.so lock
        before the loop running the destructors.
for destructors that are run at exit time.
ATM ld.so holds dl_load_lock when running shared library destructors and the
same lock is used indirectly by libgcc_s.so when unwinding.  If you call
pthread_cancel in a shared library destructor that is run during dlclose,
dl_load_lock is held in the thread calling pthread_cancel, but the cancelled
thread needs to be unwound.  As you also call pthread_join in the same destructor
that waits for the cancelled thread and the cancelled thread is waiting until
dl_load_lock is released (this would happen when dlclose is about to return),
they are deadlocking.

The fix is to avoid running shared library destructors with dl_load_lock held,
but that's certainly not trivial.
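
The lock ordering described above can be modelled with an ordinary mutex. The
sketch below is a toy model only, not ld.so code: load_lock stands in for
dl_load_lock, unwinding_thread stands in for the cancelled thread, and both
names as well as cancelled_thread_can_unwind are hypothetical. The thread uses
pthread_mutex_trylock() so that the demonstration reports the conflict instead
of actually deadlocking.

```cpp
#include <cassert>
#include <pthread.h>

// Stand-in for ld.so's dl_load_lock (hypothetical model).
static pthread_mutex_t load_lock = PTHREAD_MUTEX_INITIALIZER;

// Stand-in for the cancelled thread: unwinding needs the loader lock.
static void* unwinding_thread(void* arg) {
    bool* acquired = static_cast<bool*>(arg);
    assert(acquired != nullptr);
    if (pthread_mutex_trylock(&load_lock) == 0) {
        *acquired = true;                    // unwinding could proceed
        pthread_mutex_unlock(&load_lock);
    }
    return nullptr;
}

// Models dlclose() running a destructor that waits for the other thread.
// Returns true if that thread could take the loader lock in the meantime.
bool cancelled_thread_can_unwind(bool unlock_around_destructor) {
    bool acquired = false;
    pthread_mutex_lock(&load_lock);          // dlclose() takes the lock
    if (unlock_around_destructor)
        pthread_mutex_unlock(&load_lock);    // drop it before the destructors

    pthread_t tid;                           // "destructor": wait for thread
    if (pthread_create(&tid, nullptr, unwinding_thread, &acquired) == 0)
        pthread_join(tid, nullptr);

    if (unlock_around_destructor)
        pthread_mutex_lock(&load_lock);      // retake it afterwards
    pthread_mutex_unlock(&load_lock);        // dlclose() returns
    return acquired;
}
```

In this model, cancelled_thread_can_unwind(false) reports that the thread
cannot get the lock while the destructor is waiting for it, and
cancelled_thread_can_unwind(true) reports that it can.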

------- Additional Comment #5 From Alexei Khlebnikov 2005-01-17 12:43 -------
I've constructed a patch that unlocks dl_load_lock just before running the
destructors and locks it again just after that. My testcase now runs properly,
but I don't know whether or not my patch has any side effects. So, dear glibc
developers, please review it and either confirm that the patch is correct or
point out where I am wrong. Thanks.

The patch:
---
--- glibc/elf/dl-close.c.orig   2005-01-09 09:27:52.000000000 +0100
+++ glibc/elf/dl-close.c        2005-01-17 15:04:52.000000000 +0100
@@ -265,6 +265,10 @@
       }
   assert (new_opencount[0] == 0);

+  /* Release dl_load_lock during running destructors,
+     like in dl-fini.c. */
+  __rtld_lock_unlock_recursive (GL(dl_load_lock));
+
   /* Call all termination functions at once.  */
 #ifdef SHARED
   bool do_audit = GLRO(dl_naudit) > 0 && !GL(dl_ns)[ns]._ns_loaded->l_auditing;
@@ -389,6 +393,9 @@
       assert (imap->l_type == lt_loaded || imap->l_opencount > 0);
     }

+  /* Destructors finished, acquire dl_load_lock again. */
+  __rtld_lock_lock_recursive (GL(dl_load_lock));
+
 #ifdef SHARED
   /* Auditing checkpoint: we will start deleting objects.  */
   if (__builtin_expect (do_audit, 0))
---

------- Additional Comment #6 From Alexei Khlebnikov 2005-01-17 12:47 -------
Created an attachment (id=369)
Proposed patch, the first try.

This is the same patch as listed in comment #5.

------- Additional Comment #7 From Ulrich Drepper 2006-05-02 22:03 -------
Nothing related to C++, exceptions, and dlopen can be critical.
