Please report new issues at https://github.com/DOCGroup
A multi-threaded client that uses timeouts on each thread can deadlock the ORB. A regression test (tests/MT_Timeout) will be committed shortly. Meanwhile, the best clue is that the following internal error message is printed:

TAO (4185|3076) TAO_Wait_On_Leader_Follower::wait - remove_follower failed for <80b86e0>

The regression test does not generate the message on every run, but it happens often enough. I think this should be a blocker because 132 depends on it (and 132 is itself a blocker). NOTE: Even in this case there is no excuse to print an error message if ACE_debug_level is not set!
Many of the changes needed to fix 132 cannot be made until this bug is addressed.
A few more details about this bug: I believe Wait_On_Leader_Follower does not consider the scenario where thread A is a follower waiting for a reply with some timeout. Meanwhile thread B, the leader, completes its work and signals A to become the new leader. However, before A notices the promotion, it wakes up because of the timeout. The remove_follower() operation fails because thread B has already removed A from the follower set (as part of promoting it to leader). Even worse, thread A then simply returns without electing a new leader. The ORB is now deadlocked unless another thread joins the leader/follower set. This is just a guess based on a brief inspection of the code, so take it with a grain of salt. If it is true, probably the best fix is to use a guard idiom to ensure that a new leader is elected no matter which exit path is taken from the wait() operation.
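The guard idiom suggested above could look roughly like the following. This is only a minimal, single-threaded sketch of the idea, not the TAO implementation: LeaderFollowerSet, ElectOnExit, wait_with_timeout and the integer thread ids are all hypothetical names made up for illustration, and real code would hold the leader/follower lock around these operations.

```cpp
#include <deque>

// Hypothetical, single-threaded model of a leader/follower set.
// None of these names come from TAO.
class LeaderFollowerSet {
public:
    static constexpr int kNone = -1;

    // Register a waiting follower, identified by an integer id.
    void add_follower(int id) { followers_.push_back(id); }

    // Returns false if the follower is no longer in the set,
    // for example because it was already promoted to leader.
    bool remove_follower(int id) {
        for (auto it = followers_.begin(); it != followers_.end(); ++it) {
            if (*it == id) { followers_.erase(it); return true; }
        }
        return false;
    }

    // If there is no leader, promote the oldest waiting follower (if any).
    void elect_new_leader() {
        if (leader_ == kNone && !followers_.empty()) {
            leader_ = followers_.front();
            followers_.pop_front();
        }
    }

    void set_leader(int id) { leader_ = id; }
    void resign(int id) { if (leader_ == id) leader_ = kNone; }
    int leader() const { return leader_; }

private:
    std::deque<int> followers_;
    int leader_ = kNone;
};

// Guard: on every exit path, give up leadership if this thread holds it
// and make sure a successor is elected.
class ElectOnExit {
public:
    ElectOnExit(LeaderFollowerSet& set, int self) : set_(set), self_(self) {}
    ~ElectOnExit() {
        set_.resign(self_);
        set_.elect_new_leader();
    }
    ElectOnExit(const ElectOnExit&) = delete;
    ElectOnExit& operator=(const ElectOnExit&) = delete;

private:
    LeaderFollowerSet& set_;
    int self_;
};

// Models thread A's wait(): returns false when the wait timed out.
bool wait_with_timeout(LeaderFollowerSet& set, int self, bool timed_out) {
    ElectOnExit guard(set, self);
    if (timed_out) {
        set.remove_follower(self);  // may fail if A was already promoted
        return false;               // early return; the guard still runs
    }
    return true;
}
```

In this model, if thread A (id 1) has already been promoted when its timeout fires, remove_follower() fails, but the guard's destructor still hands leadership to the next waiting follower, so the set is never left without a leader while followers are waiting.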
Assigned to Michael; he seems to have fixed it already.
Carlos and I think we resolved this bug, although we were not able to reproduce the deadlock even without this fix.

Thu Apr 5 13:50:00 2001 Michael Kircher <Michael.Kircher@mchp.siemens.de>

* tao/Wait_On_Leader_Follower.cpp: Carlos and I worked on a fix for [Bug 842]. The fix elects a new leader if a follower was elected leader but concurrently also received a timeout. In such cases a deadlock could occur. The fix prevents it because the leaving follower, which cannot fulfill the leader role itself, elects a new leader. Such situations are detected by the condition that a timeout occurred, remove_follower fails and reply_received equals 0. The fix applies exactly in such situations.
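The detection condition described in the ChangeLog entry can be written as a single predicate. The helper below is a hypothetical illustration, not code from tao/Wait_On_Leader_Follower.cpp; its name and its parameters are assumptions that simply mirror the three facts named in the entry.

```cpp
// Hypothetical helper; the names are illustrative and not taken from TAO.
// Returns true when a departing follower has to elect a new leader:
// it timed out, it was no longer in the follower set (so it had been
// promoted), and it had not received its reply.
inline bool follower_must_elect_leader(bool timeout_occurred,
                                       bool remove_follower_succeeded,
                                       int reply_received) {
    return timeout_occurred
        && !remove_follower_succeeded
        && reply_received == 0;
}
```

If any one of the three facts does not hold, the predicate is false and the follower leaves through the normal path.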