Bug 842 - Timeouts in multi-threaded apps can dead-lock the ORB
Status: RESOLVED FIXED
Alias: None
Product: TAO
Classification: Unclassified
Component: ORB
Version: 1.1.13
Hardware: All All
Importance: P3 blocker
Assignee: Michael Kircher
URL:
Depends on:
Blocks: 132
Reported: 2001-03-30 14:11 CST by Carlos O'Ryan
Modified: 2001-04-06 00:16 CDT

Description Carlos O'Ryan 2001-03-30 14:11:19 CST
A multi-threaded client using timeouts on each thread can dead-lock the ORB. A
regression test (tests/MT_Timeout) will be committed shortly.  Meanwhile, the
best clue is that the following internal error message is printed:

TAO (4185|3076) TAO_Wait_On_Leader_Follower::wait - remove_follower failed for
<80b86e0>

The regression test does not generate the message on every run, but it happens
often enough.  I think this should be a blocker because bug 132 depends on it
(and 132 is itself a blocker).

NOTE: Even in this case there is no excuse to print an error message if
ACE_debug_level is not set!
Comment 1 Carlos O'Ryan 2001-03-30 14:12:52 CST
Many of the changes needed to fix 132 cannot be made until this bug is addressed.
Comment 2 Carlos O'Ryan 2001-04-02 13:47:38 CDT
A few more details about this bug:  I believe TAO_Wait_On_Leader_Follower does
not consider the scenario where thread A is a follower thread waiting for a
reply with some timeout.  Meanwhile thread B, the leader, completes its work
and signals A to become the new leader.  However, before A realizes this, it
wakes up because of the timeout.

The remove_follower() operation fails because thread A has already been removed
from the follower set by thread B (as part of the promotion to the leader
role).  Even worse, thread A now simply returns without electing a new leader.

The ORB is now deadlocked unless another thread joins the leader/follower set.

This is just a guess based on a brief inspection of the code, so take it with a
grain of salt.  If this is true, probably the best fix is to use a guard idiom
to ensure that a new leader is elected no matter which exit path is used from
the wait() operation.
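
To illustrate the guard idiom suggested here, a minimal single-threaded sketch
follows.  It is not the TAO code: LeaderFollowerSet, ElectLeaderOnExit and
wait_for_reply are hypothetical names, and the locking that a real ORB would
need around this state is left out.  The only point is that the guard's
destructor runs on every exit path, including the early return on timeout.

```cpp
#include <cstddef>

// Hypothetical stand-in for the leader/follower bookkeeping; not TAO's class.
// Locking is omitted: real code would hold the leader/follower mutex here.
struct LeaderFollowerSet {
    bool leader_active = false;   // is some thread currently the leader?
    std::size_t followers = 0;    // threads still waiting as followers

    // Promote one waiting follower to leader, if there is one.
    void elect_new_leader() {
        if (!leader_active && followers > 0) {
            --followers;
            leader_active = true;
        }
    }
};

// Guard: whatever path leaves wait(), try to elect a leader on the way out.
class ElectLeaderOnExit {
public:
    explicit ElectLeaderOnExit(LeaderFollowerSet& set) : set_(set) {}
    ~ElectLeaderOnExit() { set_.elect_new_leader(); }
    ElectLeaderOnExit(const ElectLeaderOnExit&) = delete;
    ElectLeaderOnExit& operator=(const ElectLeaderOnExit&) = delete;
private:
    LeaderFollowerSet& set_;
};

// Simulates thread A's wait(): B has already removed A from the follower
// set and stepped down, then A's timeout may fire before A takes over.
int wait_for_reply(LeaderFollowerSet& set, bool timed_out) {
    ElectLeaderOnExit guard(set);
    if (timed_out) {
        return -1;                // early exit; the guard still runs
    }
    set.leader_active = true;     // A takes the leader role and gets its reply
    set.leader_active = false;    // ... then steps down when done
    return 0;                     // the guard elects the next leader here too
}
```

With two followers waiting and no active leader, calling
wait_for_reply(set, true) returns -1 but still leaves one follower promoted to
leader, which is the property the deadlock scenario above is missing.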
Comment 3 Carlos O'Ryan 2001-04-05 20:04:34 CDT
Assigned to Michael; he seems to have fixed it already.
Comment 4 Michael Kircher 2001-04-06 00:16:58 CDT
Carlos and I think we resolved this bug, though we were not able to reproduce
the deadlock without this fix.

Thu Apr  5 13:50:00 2001  Michael Kircher <Michael.Kircher@mchp.siemens.de>

        * tao/Wait_On_Leader_Follower.cpp:

          Carlos and I worked on a fix for [Bug 842]. This involves
          electing a new leader if a follower got elected but
          concurrently also received a timeout. In such cases
          a dead-lock could occur, but this fix prevents it
          from happening: the leaving follower elects a new
          leader, as it cannot fulfill that role itself.

          Such situations are detected by the condition that
          a timeout occurred, remove_follower fails, and reply_received
          equals 0. This fix applies exactly in such situations.
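
The detection condition in the ChangeLog entry can be written as a single
predicate.  This is only a sketch, not the code committed to
tao/Wait_On_Leader_Follower.cpp: the function name and parameter names are
hypothetical, and it assumes remove_follower reports failure as -1.

```cpp
// Hypothetical predicate; not taken from tao/Wait_On_Leader_Follower.cpp.
// Assumption: remove_follower() signals failure by returning -1, and
// reply_received is 0 while no reply has arrived yet.
inline bool leaving_follower_must_elect_leader(bool timeout_occurred,
                                               int remove_follower_result,
                                               int reply_received) {
    const bool remove_failed = (remove_follower_result == -1);
    // All three must hold: the thread timed out, it was already taken out
    // of the follower set (so it had been promoted), and it has no reply.
    return timeout_occurred && remove_failed && reply_received == 0;
}
```

Only when all three conditions hold does the leaving thread need to elect a
new leader.  A plain timeout where remove_follower succeeds means the thread
was never promoted, so there is no leadership to hand off.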