Summary: | Deadlock in wait_for_non_servant_upcall_to_complete() | ||
---|---|---|---|
Product: | TAO | Reporter: | klyk |
Component: | POA | Assignee: | klyk |
Status: | NEW --- | ||
Severity: | critical | CC: | bala |
Priority: | P3 | ||
Version: | 1.4.2 | ||
Hardware: | SPARC | ||
OS: | Solaris |
Description
klyk
2004-09-16 02:21:31 CDT
From: Balachandran Natarajan <bala@cs.wustl.edu>

> What is intriguing is that the first thread, which starts out as a
> regular "servant upcall", appears to mutate into a "non-servant upcall"
> once the response has been sent back to the client. I assume that this
> happens since we are dealing with etherealisation of the servant or
> because we are in POA code which is post-response. The second thread
> however is unable to proceed with the invocation on the second
> (unrelated) object because it requires that no "non-servant"
> upcall is in progress.

Right! The non-servant upcall in this case is the _remove_ref ()
operation. This deadlock is evil. Can you please stick this bug report
in the bugzilla? For the time being you could work around by using
a oneway call to the other object. Could you also make up a simple
test case. We will try to address this by x.4.4? Getting it into x.4.3
may be hard.

Another strange thing is that -- the call seems to be going remote
when you have the object in the same process space. Evil! Is this the
problem you reported before? This shouldn't be happening either!

> Is this logical that the first thread is mutating into a
> "non-servant upcall"?
>
> Maybe the new option MT_NOUPCALL in the next release will
> prevent this condition? Although I suspect what we are experiencing is
> unrelated to the function of this option.

I am not sure whether that is going to help!

Thanks
Bala

Please see my comments to Bala's response below:

> > What is intriguing is that the first thread, which starts out as a
> > regular "servant upcall", appears to mutate into a "non-servant upcall"
> > once the response has been sent back to the client. I assume that this
> > happens since we are dealing with etherealisation of the servant or
> > because we are in POA code which is post-response. The second thread
> > however is unable to proceed with the invocation on the second
> > (unrelated) object because it requires that no "non-servant"
> > upcall is in progress.
>
> Right! The non-servant upcall in this case is the _remove_ref ()
> operation. This deadlock is evil. Can you please stick this bug report
> in the bugzilla? For the time being you could work around by using
> a oneway call to the other object. Could you also make up a simple
> test case. We will try to address this by x.4.4? Getting it into x.4.3
> may be hard.

I agree that this deadlock is evil! I believe it is a result of
sinister incompatibilities between strategies used in different parts
of the TAO implementation. Recall that I am using the default values
for all TAO options, and I am using a LF thread-pool containing 10
threads for my ORB Reactor. (I elaborate on the incompatibilities
below.)

> Another strange thing is that -- the call seems to be going remote
> when you have the object in the same process space. Evil! Is this the
> problem you reported before? This shouldn't be happening either!

No, this isn't the problem I was having before. In the problem here
we are using the CollocationStrategy "thru_poa". Thus all requests
on all objects (irrespective of collocation) are treated in the same
manner (as I understand). This means that they pass through the POA,
and are subject to the same Reactor dispatch strategy (which assigns
incoming requests to threads from the LF thread pool), as well as
being subject to interceptors.

However, it is worth emphasising the various configurations that WILL
experience this situation, and those that WILL NOT experience it.

This problem WILL be experienced if:

1. CollocationStrategy "thru_poa", default ACE_TP_Reactor reactor with
   a thread-pool containing more than one thread.

2. CollocationStrategy "direct", default ACE_TP_Reactor reactor with
   a thread-pool containing more than one thread, and if the
   application makes use of the IMR. This is the problem I described
   before (see bugzilla bug 1919). Here the IMR prevents the
   collocation direct optimisation from taking effect, and calls on
   collocated objects continue to pass thru POA.

The problem WILL NOT be experienced if:

1. CollocationStrategy "thru_poa" is used, and a single threaded
   reactor is used. This is because the same thread will be reused to
   continue with the second request. This is the nature of the reactor
   which will reassign waiting threads to incoming requests. At the
   point where wait_for_non_servant_upcall_to_complete() occurs the
   check
   ACE_OS::thr_equal (this->non_servant_upcall_thread_, ACE_OS::thr_self ())
   will prevent deadlock from arising.

2. CollocationStrategy "direct", default ACE_TP_Reactor reactor with
   a thread-pool containing more than one thread, when IMR is NOT used.

Last night I pondered further on this problem and I would like to
share my thoughts:

As I see it, the problem is that a non-servant upcall (starting from
the _remove_ref()) is getting rescheduled to another thread, and worse
still in the second thread it is no longer considered as a non-servant
upcall! Hence the clash between servant and non-servant upcalls. If
direct collocation optimisation were in force only one thread would be
involved and the entire operation would be completed as a non-servant
upcall.

*** Thus we see a sinister inter-play between the servant/non-servant
logic and the LF reactor thread pool! ***

I can imagine various ways to tackle this problem:

1. If TAO recognises that the second thread is a continuation of the
   first, it thus should also be treated as a non-servant upcall. The
   check wait_for_non_servant_upcall_to_complete() then falls away.

2. If TAO makes the reactor more "clever" so that it reuses the first
   thread (which is waiting anyway) to handle the request on the
   second object, then the check
   ACE_OS::thr_equal (this->non_servant_upcall_thread_, ACE_OS::thr_self ())
   inside wait_for_non_servant_upcall_to_complete() will prevent
   deadlock from arising.

3. If the CollocationStrategy "direct" optimisation is fixed, so that
   it works even when the IMR is deployed (see bug 1919). However this
   is only a work-around solution and it is not really acceptable
   because the problem will still exist with the "thru_poa" strategy
   (what most people use).

> > Maybe the new option MT_NOUPCALL in the next release will
> > prevent this condition? Although I suspect what we are experiencing is
> > unrelated to the function of this option.
>
> I am not sure whether that is going to help!

I agree that this is unlikely to help, because as I understand,
MT_NOUPCALL means that waiting client threads will not be reused to
handle incoming requests that arrive prior to the arrival of the
response that the client thread is expecting. This certainly does not
imply that the waiting thread will be re-used in the way I described
above. On the contrary I would say that it rather implies the
opposite!

> Thanks
> Bala

To assist you, once the fix is ready, if you send me a patch I would
be very happy to run the test again in my environment, and provide you
with feedback.

Thanks,
Kostas.

Hi Kostas:
>> Another strange thing is that -- the call seems to be going remote
>> when you have the object in the same process space. Evil! Is this
>> the problem you reported before? This shouldn't be happening
>> either!
>
> No, this isn't the problem I was having before. In the problem here
> we are using the CollocationStrategy "thru_poa". Thus all requests
> on all objects (irrespective of collocation) are treated in the same
> manner (as I understand). This means that they pass through the POA,
> and are subject to the same Reactor dispatch strategy (which assigns
> incoming requests to threads from the LF thread pool), as well as
> being subject to interceptors.
Bala is correct - this should not happen. If the servant is
collocated with the client, the Reactor should not get involved. The
upcall thread should take the call all the way through to the servant.
If that happens, the non-servant upcall will not cause any problems
since it will be the same thread. Can you please find out why TAO
thinks that the push is a remote call? Is it because the IMR got
involved, i.e., first a remote call was made to the IMR and it got
forwarded to a collocated object but TAO still made a remote call to
it?
Thanks,
Irfan
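
The same-thread argument discussed in this thread can be illustrated with a small standalone model. This is only a sketch in standard C++ and is not TAO code: the class UpcallGate, its members, the two scenario functions, and the timeout parameter are all hypothetical names introduced for illustration. The timeout is an assumption added so that the blocked case returns instead of hanging; the report describes the real wait as a deadlock.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <future>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the POA's non-servant upcall bookkeeping.
class UpcallGate {
public:
  // Record that the calling thread is running a non-servant upcall.
  void begin_non_servant_upcall() {
    std::lock_guard<std::mutex> guard(lock_);
    assert(!in_progress_);
    in_progress_ = true;
    owner_ = std::this_thread::get_id();
  }

  // Clear the flag and wake any thread waiting to start a servant upcall.
  void end_non_servant_upcall() {
    {
      std::lock_guard<std::mutex> guard(lock_);
      in_progress_ = false;
    }
    done_.notify_all();
  }

  // Returns true if the caller may proceed, false if the timeout expired.
  // The owner comparison models the thr_equal check quoted in the report.
  bool wait_for_non_servant_upcall_to_complete(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> guard(lock_);
    return done_.wait_for(guard, timeout, [this] {
      return !in_progress_ || owner_ == std::this_thread::get_id();
    });
  }

private:
  std::mutex lock_;
  std::condition_variable done_;
  bool in_progress_ = false;
  std::thread::id owner_;
};

// Nested request handled on the thread that runs the non-servant upcall.
bool nested_call_on_same_thread(std::chrono::milliseconds timeout) {
  UpcallGate gate;
  gate.begin_non_servant_upcall();
  bool proceeded = gate.wait_for_non_servant_upcall_to_complete(timeout);
  gate.end_non_servant_upcall();
  return proceeded;
}

// Nested request handed to a second thread while the first thread
// blocks waiting for its reply.
bool nested_call_on_other_thread(std::chrono::milliseconds timeout) {
  UpcallGate gate;
  gate.begin_non_servant_upcall();
  std::future<bool> reply = std::async(std::launch::async, [&gate, timeout] {
    return gate.wait_for_non_servant_upcall_to_complete(timeout);
  });
  bool proceeded = reply.get();  // first thread cannot finish until the reply arrives
  gate.end_non_servant_upcall();
  return proceeded;
}
```

In this model, nested_call_on_same_thread returns true because the owner check lets the thread through, while nested_call_on_other_thread returns false once the timeout expires, since the second thread waits on a flag that the first thread can only clear after the second thread has replied. Without the timeout the second case would wait forever.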
to pool

Back to reporter, we need an automated regression test in order to analyze this with the current code base.