Bug 3531 - client leader threads can't give up leadership via set_upcall_thread
Summary: client leader threads can't give up leadership via set_upcall_thread
Status: RESOLVED FIXED
Alias: None
Product: TAO
Classification: Unclassified
Component: ORB (show other bugs)
Version: 1.6.7
Hardware: All Linux
: P3 normal
Assignee: DOC Center Support List (internal)
URL:
Depends on: 3569
Blocks:
  Show dependency tree
 
Reported: 2008-12-11 22:33 CST by Russell Mora
Modified: 2011-10-31 05:34 CDT (History)
1 user (show)

See Also:


Attachments
Test case (4.29 KB, application/octet-stream)
2008-12-11 22:36 CST, Russell Mora
Details
Proposed patch (2.73 KB, patch)
2008-12-11 22:40 CST, Russell Mora
Details
New unit test - replaces previous test (33.51 KB, text/plain)
2009-02-12 09:13 CST, Russell Mora
Details
New patch - now works as advertised (mostly) (8.84 KB, patch)
2009-02-12 09:25 CST, Russell Mora
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Russell Mora 2008-12-11 22:33:05 CST
Client leader threads can't give up leadership via the usual TAO_Leader_Follower::set_upcall_thread call - the TAO_Leader_Follower::set_upcall_thread method only works for event loop threads.  As such it is possible to for servers to deadlock, even if they try to be well behaved and call set_upcall_thread.

Attaching a test case the illustrates the problem and a potential fix.  I'd very much like comments on the proposed fix to make sure that I haven't missed anything.
Comment 1 Russell Mora 2008-12-11 22:36:23 CST
Created attachment 1037 [details]
Test case

Simple test to illustrate the problem
Comment 2 Russell Mora 2008-12-11 22:40:23 CST
Created attachment 1038 [details]
Proposed patch

Proposed patch - note, I originally developed this for ACE/TAO OCI X.3a_p10 and didn't look to hard at how Leader_Follower has changed since then, so don't be surprised if you see some dumb mistakes.  It does seem to work though - with this patch the test I attached earlier passes.
Comment 3 Johnny Willemsen 2009-02-11 06:51:28 CST
when I run my regression for 3569 with this patch the server of 3569 doesn't 
hang but loops forever with the log below. seems the patch is not 100% yet
Feb 11 13:49:11.812 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) enter reactor event loop
Feb 11 13:49:11.812 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) exit reactor event loop
Feb 11 13:49:11.812 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) enter reactor event loop
Feb 11 13:49:11.812 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) exit reactor event loop
Feb 11 13:49:11.812 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) enter reactor event loop
Feb 11 13:49:11.812 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) exit reactor event loop
Feb 11 13:49:11.812 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) enter reactor event loop
Feb 11 13:49:11.812 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) exit reactor event loop
Feb 11 13:49:11.813 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) enter reactor event loop
Feb 11 13:49:11.813 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) exit reactor event loop
Feb 11 13:49:11.813 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) enter reactor event loop
Feb 11 13:49:11.813 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) exit reactor event loop
Feb 11 13:49:11.813 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) enter reactor event loop
Feb 11 13:49:11.813 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) exit reactor event loop
Feb 11 13:49:11.813 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) enter reactor event loop
Feb 11 13:49:11.813 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) exit reactor event loop
Feb 11 13:49:11.813 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) enter reactor event loop
Feb 11 13:49:11.813 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) exit reactor event loop
Feb 11 13:49:11.813 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) enter reactor event loop
Feb 11 13:49:11.813 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) exit reactor event loop
Feb 11 13:49:11.813 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Foll
[
Comment 4 Johnny Willemsen 2009-02-11 07:41:54 CST
Wed Feb 11 13:40:28 UTC 2009  Johnny Willemsen  <jwillemsen@remedy.nl>

        * tests/Bug_3531_Regression/*:
        * bin/tao_orb_tests.lst:
          Added bug 3531 regression. Thanks to Russell Mora
          <russell_mora at symantec dot com> for creating this test. This
          will fail, no fix integrated at this moment
Comment 5 Russell Mora 2009-02-12 09:13:25 CST
Created attachment 1076 [details]
New unit test - replaces previous test

This is a new unit test that replaces the previous test case.  It tests the TAO_Leader_Follower API directly and thus is more thorough, simpler and more reliable.

Note: This has been written against ACE/TAO X.3a_p10, so it may need porting to the SVN trunk.
Comment 6 Russell Mora 2009-02-12 09:25:50 CST
Created attachment 1077 [details]
New patch - now works as advertised (mostly)

This new patch replaces the previous one which simply did not work as advertised.  The only real difference between this patch and the previous one is that this one gets the scoping of the helper objects right and also client leader threads now release the lock and do a ACE_OS::thr_yield() when there are event loop threads waiting.

This latter part was required because the event loop threads, which are waiting on a cond var, need a chance to acquire the lock and assume leadership after they return from the cond var wait - otherwise the client leader thread loops around, sees that there are no leader threads, and assumes leadership.  Unfortunately the code even its current state is not reliable - I've observed instances where it worked fine and other instances where the event loop threads did not acquire the lock fast enough and thus the client leader thread remained leader.

As such I'm not happy with the code as is and I want to continue to improve it.  Comments are welcome though.
Comment 7 Johnny Willemsen 2009-02-12 09:30:27 CST
(In reply to comment #5)
> Created an attachment (id=1076) [details]
> New unit test - replaces previous test
> 
> This is a new unit test that replaces the previous test case.  It tests the
> TAO_Leader_Follower API directly and thus is more thorough, simpler and more
> reliable.
> 
> Note: This has been written against ACE/TAO X.3a_p10, so it may need porting to
> the SVN trunk.

We should add this as a second test, the more tests, the better
Comment 8 Russell Mora 2009-02-12 09:31:42 CST
Sure, if you like.
Comment 9 Johnny Willemsen 2011-04-17 13:44:11 CDT
I have the new unit test in my workspace
Comment 10 Johnny Willemsen 2011-04-26 06:19:38 CDT
Russ, could you maybe recreate the patch for TAO 2.0.2?
Comment 11 Russell Mora 2011-04-26 14:27:08 CDT
Sorry, I don't have the time or knowledge to port this to TAO 2.0.
Comment 12 Martin Corino 2011-10-31 05:34:25 CDT
This has been fixed by a solution based on Russel's proposal but adjusted for and integrated with changes to also support proper handling of the MT_NOUPCALL policy.

Changes haven been committed in r94873 in the RemedyWork branch and will be merged into trunk before the release of x.0.6.