Bug 175 - ORB can dead-lock if leader thread does
Summary: ORB can dead-lock if leader thread does
Status: RESOLVED FIXED
Alias: None
Product: TAO
Classification: Unclassified
Component: ORB
Version: 1.1.3
Hardware: All All
Importance: P1 normal
Assignee: Carlos O'Ryan
URL:
Depends on: 575
Blocks: 266
Reported: 1999-07-29 00:53 CDT by Carlos O'Ryan
Modified: 2001-08-01 13:28 CDT
Description Carlos O'Ryan 1999-07-29 00:53:54 CDT
The ORB assumes that upcalls in the leader thread do not dead-lock and are short
lived; this may seem like a safe assumption, but there are very good use cases
where the dead-lock may happen:
Assume the user code creates a number of threads to do long lived computations,
potentially with some CORBA requests in them; the upcall creates these threads,
then aggregates the responses and finally sends them back to the client.
If for any reason this upcall happens in the leader thread (and this is likely
if several such requests are issued) the server will dead-lock: the user code
will not return until all the threads finish, but if any of those threads issues
another request, that thread will not make progress until the leader thread
returns to the reactor.
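
The scenario above can be reproduced in a toy model. The sketch below uses only
standard C++ threads and is not TAO code: LeaderToken, nested_request and
upcall_in_leader are hypothetical names, and the model assumes that a nested
request completes only if the issuing thread can take the leader role. A timeout
stands in for the dead-lock so that the program terminates.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Toy stand-in for the leader role: at most one thread holds it at a time.
class LeaderToken {
public:
    // Try to become leader; give up after `timeout` so the demo terminates.
    bool acquire_for(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        if (!cv_.wait_for(lock, timeout, [this] { return !held_; }))
            return false;
        held_ = true;
        return true;
    }

    void release() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            held_ = false;
        }
        cv_.notify_one();
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool held_ = false;
};

// A request issued from a worker thread. In this model it completes only if
// the worker can take the leader role and run the event loop itself.
bool nested_request(LeaderToken& token) {
    if (!token.acquire_for(std::chrono::milliseconds(200)))
        return false;  // in the scenario above this would block forever
    token.release();
    return true;
}

// An upcall running in the leader thread that spawns a worker and waits for
// it. Returns true if the worker's nested request completed.
bool upcall_in_leader(bool relinquish_during_upcall) {
    LeaderToken token;
    bool is_leader = token.acquire_for(std::chrono::milliseconds(0));
    assert(is_leader);
    (void)is_leader;

    if (relinquish_during_upcall)
        token.release();  // let another thread run the event loop

    bool completed = false;
    std::thread worker([&] { completed = nested_request(token); });
    worker.join();  // the upcall does not return until the worker finishes

    if (!relinquish_during_upcall)
        token.release();  // only now does the leader give up its role
    return completed;
}
```

Calling upcall_in_leader(false) models the reported behaviour: the worker never
gets the leader role while the upcall is waiting for it. Calling
upcall_in_leader(true) models an upcall that gives up the role first.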

	We (Irfan and Carlos) believe that we have a solution for this problem, but it
is involved and affects the internals of the ORB (and the already complex
client-side leader-follower); it is possible to fix it for the 1.0 deadline, but
we should think twice before rushing this solution in.
Comment 1 Carlos O'Ryan 1999-08-09 17:34:59 CDT
From Irfan:

There are some workarounds to your problem:

1. Creating a thread and immediately waiting for it doesn't make much
sense.  Can you try thread-per-connection or thread-pool?

2. Use the -ORBClientConnectionHandler RW option. This way the
Leader/Follower model will not get involved.
Comment 2 Irfan Pyarali 1999-09-09 17:36:59 CDT
This needs to be fixed soon.
Comment 3 Carlos O'Ryan 2000-06-06 17:05:07 CDT
Some more information about this bug:
- If the leader thread invokes an asynchronous reply handler and that handler
invokes a synchronous call, the ORB dead-locks.  The reason is that the leader
thread tries to acquire the leader/follower mutex, but it is held during the
upcall.
The same solution that applies for the server side should work in this case.
Comment 4 Carlos O'Ryan 2000-06-13 10:06:04 CDT
I will be working on this, Irfan has plenty of work to do fixing the POA ;-) ;-)
Comment 5 Carlos O'Ryan 2000-06-13 10:09:50 CDT
Accepting the bug.

Irfan points out that it is not trivial for event loop threads to drop their
leader role:

 > Consider the following scenario: Event loop thread is running the
 > event loop.  Client threads are waiting for replies.  Event loop
 > thread gives up ownership during upcall.  One of the client threads
 > becomes leader.  Event loop thread realizes that the event is really a
 > reply for a client thread.  It signals the client thread and waits to
 > run the event loop again.  Unfortunately, the reply belongs to the
 > client thread that was chosen as leader.  So the signal is "missed"
 > since the client thread is not there to receive it.  That is why we
 > don't allow client threads to become leaders while a server thread is
 > available.

	My current approach would be to only drop the leader role from the
event loop thread if the event is a long running process, like receiving an
upcall (server-side) or an asynchronous reply (client-side).
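
The rule proposed above can be written down as a small predicate. This is only
a sketch of the stated policy, not code from the ORB: the Event enumeration and
the drops_leader_role function are hypothetical names chosen for illustration.

```cpp
#include <cassert>

// Hypothetical classification of what the event loop thread just received.
enum class Event {
    SynchronousReply,   // reply for a client thread that is waiting
    ServerUpcall,       // server-side request that runs user code
    AsynchronousReply   // client-side reply handled by a user callback
};

// The event loop thread keeps the leader role for short events and drops it
// only for events that may run for a long time.
constexpr bool drops_leader_role(Event e) {
    return e == Event::ServerUpcall || e == Event::AsynchronousReply;
}
```

Under this rule a synchronous reply is still handed to its waiting client
thread while the event loop thread remains leader, which avoids the "missed"
signal in the scenario Irfan describes.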
Comment 6 Carlos O'Ryan 2000-06-14 11:20:03 CDT
More notes from our discussion with Irfan:

------------------------------------------------------------------------------
 > Shouldn't we model async replies like upcalls?  If we can do this,
 > we'll solve the above problem.

 But the low-level components in the ORB (i.e. the pluggable transport
thingies) don't know if it is an asynchronous reply or not.  Only
once you get to the Reply_Dispatcher layer you know what is going on.
The problem is that the mutex is acquired like five levels down the
stack, by the Waiting_Strategy, and only when using the
Leader/Followers approach (in other waiting strategies the mutex is
*not* acquired).  So I have to pass the Waiting_Strategy
from the bottom of the ORB all the way up to the Reply_Dispatcher
(this is just below the application).
      Then I have to make a callback into the Waiting_Strategy that
will release the mutex, and if I want to use one of those
Reverse_Locks then the waiting strategy must call back into the
Reply_Dispatcher.

	So we end up with another layer of double dispatching of
virtual functions, plus I have to change the interface to the
Pluggable Protocol framework (again), and the stack keeps growing.
------------------------------------------------------------------------------
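
For readers unfamiliar with the reverse-lock idea mentioned above, here is a
minimal sketch in standard C++. It is not the ACE class and says nothing about
how the ORB wires it in: ReverseGuard, lockable_from_other_thread and
dispatch_with_mutex_released are hypothetical names, and lf_mutex merely stands
in for the leader/follower mutex.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical helper: unlocks on construction, re-locks on destruction,
// i.e. the opposite of an ordinary scoped guard.
template <typename Lock>
class ReverseGuard {
public:
    explicit ReverseGuard(Lock& lock) : lock_(lock) { lock_.unlock(); }
    ~ReverseGuard() { lock_.lock(); }
    ReverseGuard(const ReverseGuard&) = delete;
    ReverseGuard& operator=(const ReverseGuard&) = delete;

private:
    Lock& lock_;
};

// Probe from a separate thread whether `m` can be taken right now.
bool lockable_from_other_thread(std::mutex& m) {
    bool ok = false;
    std::thread probe([&] {
        ok = m.try_lock();
        if (ok)
            m.unlock();
    });
    probe.join();
    return ok;
}

// Models a dispatch path that holds the mutex and releases it only for the
// duration of the user callback.
bool dispatch_with_mutex_released(std::mutex& lf_mutex) {
    std::unique_lock<std::mutex> held(lf_mutex);
    bool free_during_upcall = false;
    {
        ReverseGuard<std::unique_lock<std::mutex>> released(held);
        // The user callback would run here, with the mutex available to
        // other threads.
        free_during_upcall = lockable_from_other_thread(lf_mutex);
    }
    // Leaving the scope re-acquired the mutex for the rest of the dispatch.
    bool held_again = !lockable_from_other_thread(lf_mutex);
    return free_during_upcall && held_again;
}
```

The scoped form keeps the release and the re-acquire paired even if the
callback returns early, which is the appeal of the approach despite the extra
plumbing described above.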

 > BTW, the client round trip timeout is honored pretty well currently
 > where the client threads don't run the event loop.  This is not true,
 > of course, when there are no server threads and the client threads
 > have to handle upcalls.

 Right.

 > So we'll have to strategize this such that the
 > user can turn off the participation of the client threads as event
 > loop threads.

 Yikes!  You are probably right, but this darn thing is way too
complicated already, it is not going to be pretty if more strategies
are involved, oh well, I'll see what I can do.
------------------------------------------------------------------------------

	At this point I have the fix for the first kind of deadlock in my
workspace: the event loop thread relinquishes its leader role when performing
an upcall on the server side.  I'm fixing the AMI dead-locks now; after that the
other AMI/event loop deadlocks should become apparent and we can then fix those
too...

	I will probably do an intermediate commit once the repo is open again.
Comment 7 Carlos O'Ryan 2000-06-16 16:49:30 CDT
The bug as originally reported has been fixed.
There are some issues with long running AMI callbacks, but those are better
documented in some other place.
Comment 8 Carlos O'Ryan 2000-06-16 17:52:38 CDT
Actually the description of the original bug still applies.
If the leader thread invokes an AMI callback and the user code in that callback
dead-locks, or waits for replies in other threads to complete, the system will
dead-lock.
The solution is similar to what we did for synchronous replies, but it is
also related to bug #575, meaning that if 575 were fixed then this problem would
be fixed also.
Comment 9 Carlos O'Ryan 2000-06-16 17:53:18 CDT
Re-accept bug and add dependency to 575
Comment 10 Carlos O'Ryan 2001-08-01 13:28:40 CDT
Thanks to Bala, bug #575 has been fixed.  With his changes in, the leader thread
always selects a new follower before blocking for long periods of time, and the
handles are resumed before going into long upcalls.
Therefore I'm closing this bug too.