Please report new issues at https://github.com/DOCGroup
Client leader threads can't give up leadership via the usual TAO_Leader_Follower::set_upcall_thread call, because that method only works for event loop threads. As a result, servers can deadlock even if they try to be well behaved and call set_upcall_thread. I'm attaching a test case that illustrates the problem and a potential fix. I'd very much like comments on the proposed fix to make sure that I haven't missed anything.
Created attachment 1037 [details] Test case Simple test to illustrate the problem
Created attachment 1038 [details] Proposed patch Proposed patch - note, I originally developed this for ACE/TAO OCI X.3a_p10 and didn't look too hard at how Leader_Follower has changed since then, so don't be surprised if you see some dumb mistakes. It does seem to work though - with this patch the test I attached earlier passes.
When I run my regression for 3569 with this patch, the server of 3569 doesn't hang but loops forever with the log below. It seems the patch is not 100% yet.
Feb 11 13:49:11.812 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) enter reactor event loop
Feb 11 13:49:11.812 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) exit reactor event loop
Feb 11 13:49:11.812 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) enter reactor event loop
Feb 11 13:49:11.812 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) exit reactor event loop
Feb 11 13:49:11.812 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) enter reactor event loop
Feb 11 13:49:11.812 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) exit reactor event loop
Feb 11 13:49:11.812 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) enter reactor event loop
Feb 11 13:49:11.812 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) exit reactor event loop
Feb 11 13:49:11.813 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) enter reactor event loop
Feb 11 13:49:11.813 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) exit reactor event loop
Feb 11 13:49:11.813 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) enter reactor event loop
Feb 11 13:49:11.813 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) exit reactor event loop
Feb 11 13:49:11.813 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) enter reactor event loop
Feb 11 13:49:11.813 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) exit reactor event loop
Feb 11 13:49:11.813 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) enter reactor event loop
Feb 11 13:49:11.813 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) exit reactor event loop
Feb 11 13:49:11.813 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) enter reactor event loop
Feb 11 13:49:11.813 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) exit reactor event loop
Feb 11 13:49:11.813 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) enter reactor event loop
Feb 11 13:49:11.813 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) exit reactor event loop
Feb 11 13:49:11.813 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Foll [
Wed Feb 11 13:40:28 UTC 2009  Johnny Willemsen  <jwillemsen@remedy.nl>

        * tests/Bug_3531_Regression/*:
        * bin/tao_orb_tests.lst:
          Added bug 3531 regression. Thanks to Russell Mora
          <russell_mora at symantec dot com> for creating this test.
          This will fail, no fix integrated at this moment
Created attachment 1076 [details] New unit test - replaces previous test This is a new unit test that replaces the previous test case. It tests the TAO_Leader_Follower API directly and thus is more thorough, simpler and more reliable. Note: This has been written against ACE/TAO X.3a_p10, so it may need porting to the SVN trunk.
Created attachment 1077 [details] New patch - now works as advertised (mostly) This new patch replaces the previous one, which simply did not work as advertised. The only real differences between this patch and the previous one are that this one gets the scoping of the helper objects right, and that client leader threads now release the lock and call ACE_OS::thr_yield() when there are event loop threads waiting. The latter change was required because the event loop threads, which are waiting on a cond var, need a chance to acquire the lock and assume leadership after they return from the cond var wait. Otherwise the client leader thread loops around, sees that there are no leader threads, and assumes leadership again. Unfortunately the code, even in its current state, is not reliable: I've observed instances where it worked fine and other instances where the event loop threads did not acquire the lock fast enough, so the client leader thread remained leader. As such I'm not happy with the code as is and I want to continue to improve it. Comments are welcome though.
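To make the race described above concrete, here is a minimal standalone sketch in standard C++. It is not the attached patch and does not use any TAO or ACE code: HandoffState, attempt_handoff, run_trials and all field names are hypothetical, std::this_thread::yield() stands in for ACE_OS::thr_yield(), and a single std::condition_variable stands in for whatever the followers actually wait on. It only models the "release the lock, yield, then re-check" step and records which thread ends up as leader.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Toy model only: these names are hypothetical and are not TAO APIs.
struct HandoffState {
  std::mutex lock;
  std::condition_variable cond;
  bool leader_present = true;       // the client thread starts out as leader
  bool follower_parked = false;     // event loop thread is about to wait on cond
  bool done = false;                // lets the follower exit if it lost the race
  bool follower_took_over = false;
  bool client_retook = false;
};

enum class Winner { EventLoopThread, ClientThread };

// One hand-off attempt using "release the lock and yield".
Winner attempt_handoff() {
  HandoffState s;

  // Stand-in for an event loop thread waiting to become leader.
  std::thread follower([&s] {
    std::unique_lock<std::mutex> guard(s.lock);
    s.follower_parked = true;
    s.cond.notify_all();
    // Being signalled is not enough: wait() must re-acquire the lock
    // before this thread can observe that the leader slot is free.
    s.cond.wait(guard, [&s] { return !s.leader_present || s.done; });
    if (!s.leader_present) {
      s.leader_present = true;
      s.follower_took_over = true;
    }
  });

  {
    // Stand-in for the client leader thread.
    std::unique_lock<std::mutex> guard(s.lock);
    s.cond.wait(guard, [&s] { return s.follower_parked; });
    s.leader_present = false;       // give up leadership
    s.cond.notify_all();
    guard.unlock();
    std::this_thread::yield();      // only a hint to the scheduler
    guard.lock();
    if (!s.leader_present) {        // nobody took over yet: client leads again
      s.leader_present = true;
      s.client_retook = true;
    }
    s.done = true;                  // release the follower if it lost
    s.cond.notify_all();
  }
  follower.join();

  assert(s.follower_took_over != s.client_retook);  // exactly one winner
  return s.follower_took_over ? Winner::EventLoopThread : Winner::ClientThread;
}

struct Tally {
  int event_loop_wins = 0;
  int client_wins = 0;
};

// Repeat the attempt and count how often each side ends up as leader.
Tally run_trials(int n) {
  Tally t;
  for (int i = 0; i < n; ++i) {
    if (attempt_handoff() == Winner::EventLoopThread) {
      ++t.event_loop_wins;
    } else {
      ++t.client_wins;
    }
  }
  return t;
}
```

Calling run_trials(1000) and comparing the two counters shows how often the yield was enough on a given machine. The split depends on the scheduler and the platform, which is consistent with the intermittent behaviour reported above: a yield gives the waiting thread an opportunity to take the lock, not a guarantee.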
(In reply to comment #5)
> Created an attachment (id=1076) [details]
> New unit test - replaces previous test
>
> This is a new unit test that replaces the previous test case. It tests the
> TAO_Leader_Follower API directly and thus is more thorough, simpler and more
> reliable.
>
> Note: This has been written against ACE/TAO X.3a_p10, so it may need porting to
> the SVN trunk.

We should add this as a second test; the more tests, the better.
Sure, if you like.
I have the new unit test in my workspace
Russ, could you maybe recreate the patch for TAO 2.0.2?
Sorry, I don't have the time or knowledge to port this to TAO 2.0.
This has been fixed by a solution based on Russell's proposal, adjusted for and integrated with changes that also support proper handling of the MT_NOUPCALL policy. The changes have been committed in r94873 in the RemedyWork branch and will be merged into trunk before the release of x.0.6.