Please report new issues at https://github.com/DOCGroup
The ORB assumes that upcalls in the leader thread do not deadlock and are short lived; this may seem like a safe assumption, but there are very good use cases where the deadlock can happen. Assume the user code creates a number of threads to do long-lived computations, potentially with some CORBA requests in them; the upcall creates these threads, aggregates the responses, and finally sends them back to the client. If for any reason this upcall happens in the leader thread (and this is likely if several such requests are issued) the server will deadlock: the user code will not return until all the threads finish, but if any of those threads issues another request it will not make progress until the leader thread returns to the reactor. We (Irfan and Carlos) believe that we have a solution for this problem, but it is involved and affects the internals of the ORB (and the already complex client-side leader/follower); it is possible to fix it in time for the 1.0 deadline, but we should think twice before rushing this solution in.
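To make the scenario concrete, here is a minimal sketch using plain C++ threads and a stand-in function for the synchronous CORBA request (make_remote_request, do_computation and handle_upcall are made-up names; this is not TAO code):

    #include <thread>
    #include <vector>

    // Stand-in for a synchronous CORBA request issued from a worker thread;
    // its reply can only be dispatched once some thread runs the reactor.
    void make_remote_request ()
    {
      // ... twoway call to another server ...
    }

    void do_computation ()
    {
      // ... long-lived computation ...
      make_remote_request ();   // blocks until the reply is dispatched
    }

    // Body of the servant upcall: spawn the workers, then wait for all of them.
    void handle_upcall ()
    {
      std::vector<std::thread> workers;
      for (int i = 0; i != 4; ++i)
        workers.emplace_back (do_computation);

      for (std::thread &t : workers)
        t.join ();   // if this upcall runs in the leader thread the joins never
                     // complete: the workers' replies cannot be dispatched until
                     // this thread returns to the reactor
      // ... aggregate the results and send the reply to the client ...
    }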
From Irfan: There are some workarounds to your problem:
1. Creating a thread and immediately waiting for it doesn't make much sense. Can you try thread-per-connection or thread-pool?
2. Use the -ORBClientConnectionHandler RW option. This way the Leader/Follower model will not get involved.
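A note on workaround 2: the RW handler is normally selected through the client's svc.conf file, something like the directive below (the option spelling may differ slightly between TAO versions), with the client started with -ORBSvcConf pointing at that file:

    static Client_Strategy_Factory "-ORBClientConnectionHandler RW"

With the RW strategy the calling thread simply blocks reading its own reply, so the leader/follower machinery is not used at all; the trade-off is that such a client cannot service nested upcalls or other events while it waits. Thread-per-connection on the server side is selected the same way, through the Server_Strategy_Factory's -ORBConcurrency option.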
This needs to be fixed soon.
Some more information about this bug: if the leader thread invokes an asynchronous reply handler and that handler makes a synchronous call, the ORB deadlocks. The reason is that the leader thread tries to acquire the leader/follower mutex, but that mutex is already held during the upcall. The same solution that applies to the server side should work in this case.
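A stripped-down illustration of that failure mode, using a plain std::mutex in place of the leader/follower mutex (names are made up; this is not ORB code):

    #include <mutex>

    std::mutex leader_follower_lock;   // stands in for the leader/follower mutex

    // User reply handler: issues a nested synchronous request, which must
    // acquire the leader/follower mutex before it can wait for its reply.
    void reply_handler_callback ()
    {
      std::lock_guard<std::mutex> wait_for_nested_reply (leader_follower_lock);
      // ... never reached: this thread already holds the lock ...
    }

    // The leader thread dispatching the asynchronous reply while holding the lock.
    void dispatch_ami_reply ()
    {
      std::lock_guard<std::mutex> held_during_upcall (leader_follower_lock);
      reply_handler_callback ();   // self-deadlock
    }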
I will be working on this; Irfan has plenty of work to do fixing the POA ;-) ;-)
Accepting the bug. Irfan points out that it is not trivial for event loop threads to drop their leader role:

> Consider the following scenario: the event loop thread is running the
> event loop and client threads are waiting for replies. The event loop
> thread gives up ownership during an upcall, and one of the client
> threads becomes leader. The event loop thread then realizes that the
> event is really a reply for a client thread. It signals that client
> thread and waits to run the event loop again. Unfortunately, the reply
> belongs to the client thread that was chosen as leader, so the signal
> is "missed" since that client thread is not there to receive it. That
> is why we don't allow client threads to become leaders while a server
> thread is available.

My current approach would be to drop the leader role from the event loop thread only when the event is a long-running process, like receiving an upcall (server side) or an asynchronous reply (client side).
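A rough sketch of that policy, with entirely hypothetical names (Leader_Follower, elect_new_leader, become_follower), just to show the shape of the idea and not TAO's real interfaces:

    // Hypothetical illustration only.
    struct Leader_Follower
    {
      void elect_new_leader () { /* wake up a follower to run the reactor */ }
      void become_follower ()  { /* rejoin the thread set when done */ }
    };

    struct Event
    {
      bool long_running;   // true for a server upcall or an AMI reply
    };

    void handle_event (Leader_Follower &lf, const Event &e)
    {
      if (e.long_running)
        {
          lf.elect_new_leader ();   // hand off the lead before the long dispatch
          // ... perform the upcall or invoke the reply handler here ...
          lf.become_follower ();    // resume normal duties afterwards
        }
      else
        {
          // short event: dispatch it directly and keep the leader role
        }
    }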
More notes from our discussion with Irfan:
------------------------------------------------------------------------------
> Shouldn't we model async replies like upcalls? If we can do this,
> we'll solve the above problem.

But the low-level components in the ORB (i.e. the pluggable transport thingies) don't know whether it is an asynchronous reply or not. Only once you get to the Reply_Dispatcher layer do you know what is going on. The problem is that the mutex is acquired about five levels down the stack, by the Waiting_Strategy, and only when using the Leader/Followers approach (in other waiting strategies the mutex is *not* acquired). So I have to pass the Waiting_Strategy from the bottom of the ORB all the way up to the Reply_Dispatcher (this is just below the application). Then I have to make a callback into the Waiting_Strategy that will release the mutex, and if I want to use one of those Reverse_Locks then the waiting strategy must call back into the Reply_Dispatcher. So we end up with another layer of double dispatch of virtual functions, plus I have to change the interface to the Pluggable Protocol framework (again), and the stack keeps growing.
------------------------------------------------------------------------------
> BTW, the client round trip timeout is honored pretty well currently
> where the client threads don't run the event loop. This is not true,
> of course, when there are no server threads and the client threads
> have to handle upcalls.

Right.

> So we'll have to strategize this such that the user can turn off the
> participation of the client threads as event loop threads.

Yikes! You are probably right, but this darn thing is way too complicated already; it is not going to be pretty if more strategies are involved. Oh well, I'll see what I can do.
------------------------------------------------------------------------------
At this point I have the fix for the first kind of deadlock in my workspace: the event loop thread relinquishes its leader role when performing an upcall on the server side. I'm fixing the AMI deadlocks next; after that the other AMI/event loop deadlocks should become apparent and we can fix those too. I will probably do an intermediate commit once the repo is open again.
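For reference, the "reverse lock" idea mentioned above boils down to a guard that releases the mutex on entry and re-acquires it on exit, so the leader/follower mutex is not held while the reply handler or upcall runs. A minimal sketch in plain C++ (ACE's ACE_Reverse_Lock plays this role in the real code; the names below are made up):

    #include <mutex>

    // Releases the given mutex for the lifetime of the guard and re-acquires
    // it on destruction -- the opposite of a normal scoped lock.  The mutex
    // must be held by the calling thread when the guard is constructed.
    class Reverse_Guard
    {
    public:
      explicit Reverse_Guard (std::mutex &m) : mutex_ (m) { mutex_.unlock (); }
      ~Reverse_Guard () { mutex_.lock (); }
    private:
      std::mutex &mutex_;
    };

    std::mutex leader_follower_lock;

    void dispatch_reply ()   // called with leader_follower_lock held
    {
      Reverse_Guard release_around_upcall (leader_follower_lock);
      // ... invoke the user's reply handler here; it can now make
      //     synchronous calls without self-deadlocking on the mutex ...
    }   // the lock is re-acquired here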
The bug as originally reported has been fixed. There are some issues with long-running AMI callbacks, but those are better documented elsewhere.
Actually the description of the original bug still applies. If the leader thread invokes an AMI callback and the user code in that callback deadlocks, or waits for replies in other threads to complete, the system will deadlock. The solution is similar to what we did for synchronous replies, but it is also related to bug #575: if 575 were fixed, this problem would be fixed as well.
Re-accepting the bug and adding a dependency on bug #575.
Thanks to Bala, bug #575 has been fixed. With his changes in, the leader thread always selects a new follower before blocking for long periods of time, and the handles are resumed before going into long upcalls. Therefore I'm closing this bug too.