Please report new issues at https://github.com/DOCGroup
SYNOPSIS:
Consider two processes: a pure CORBA client and a CORBA server. The server uses a Leader-Followers thread pool to handle incoming requests. The default values for all options are used (i.e. -ORBClientConnectionHandler=MT and -ORBCollocationStrategy=thru_poa). The client makes a call on the server, which in turn makes a call on another object contained in the same server. The server's POA gets as far as sending the reply to the client (which gives the initial impression that the call succeeded). However, the server then deadlocks while completing the "tidy-up" POA-related work that follows from the operation (I elaborate on this below). The incoming request from the client results in two threads being conscripted, and deadlock results because the second thread waits for the first to complete while the first is waiting on the second.

DESCRIPTION:
The following is a more detailed description of the scenario that induces the deadlock. (I have also attached the full stack frames for the two threads.)

The client invokes an operation on the server instructing it to deactivate a CORBA object contained in the server; the deactivate operation is performed on the object that is to be deactivated. In the servant's destructor we raise an event (signifying that the object was deleted) by calling push_structured_event() on another object.

A thread from the LF thread pool (thread 1) services the incoming request. Deactivating the target object causes the POA to remember that the servant associated with the incoming request must be etherealized once a response has been sent to the client. As soon as the response is sent, the same thread proceeds to call the etherealize() method of the ServantActivator, which decrements the reference count on the servant and so causes its destructor to be called. As stated above, our servant's destructor generates an event by calling push_structured_event() on another object. This enlists a second thread from the LF thread pool (thread 2) to handle the call made on the second object. Deadlock results because the second thread waits for the first to complete while the first is waiting on the second: the first thread is waiting in TAO_Leader_Follower::wait_for_event(), and the second is waiting in TAO_Object_Adapter::wait_for_non_servant_upcalls_to_complete().

What is intriguing is that the first thread, which starts out as a regular "servant upcall", appears to mutate into a "non-servant upcall" once the response has been sent back to the client. I assume this happens because we are dealing with etherealization of the servant, or because we are in POA code that runs post-response. The second thread, however, is unable to proceed with the invocation on the second (unrelated) object because that requires that no "non-servant" upcall be in progress.

Is it logical that the first thread mutates into a "non-servant upcall"? Maybe the new option MT_NOUPCALL in the next release will prevent this condition, although I suspect what we are experiencing is unrelated to the function of that option.

REPEAT BY:
Refer to the description above.

SAMPLE FIX/WORKAROUND:
None.
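For clarity, here is a minimal sketch of the server-side pattern that triggers the deadlock. The class and interface names are illustrative only (our real code is in MObject_i/MO_Node_i and MOActivator, and our consumer is ChamNS::StructuredPushConsumer rather than the standard CosNotifyComm one); the essential point is that the servant's destructor, which runs inside the POA's etherealize (non-servant) upcall, makes a two-way CORBA call.

// Minimal sketch of the triggering pattern -- illustrative names only,
// assuming the standard CosNotifyComm consumer instead of our own
// ChamNS::StructuredPushConsumer.
#include "tao/corba.h"
#include "orbsvcs/CosNotificationC.h"
#include "orbsvcs/CosNotifyCommC.h"

class Node_i : public virtual POA_Cham::Node   // hypothetical generated skeleton
{
public:
  Node_i (CosNotifyComm::StructuredPushConsumer_ptr consumer)
    : consumer_ (CosNotifyComm::StructuredPushConsumer::_duplicate (consumer))
  {
  }

  virtual ~Node_i ()
  {
    // Runs when the ServantActivator's etherealize() drops the last
    // reference, i.e. while the POA is inside a non-servant upcall.
    // This two-way push gets dispatched to a *second* thread-pool
    // thread, which then blocks in
    // wait_for_non_servant_upcalls_to_complete() -> deadlock.
    CosNotification::StructuredEvent deleted;
    consumer_->push_structured_event (deleted);
  }

private:
  CosNotifyComm::StructuredPushConsumer_var consumer_;
};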
(gdb) thread 1
(gdb) where
#0  0xfd41a018 in _lwp_sema_wait () from /usr/lib/libc.so.1
#1  0xfd3497fc in _park () from /usr/lib/libthread.so.1
#2  0xfd3494d8 in _swtch () from /usr/lib/libthread.so.1
#3  0xfd34801c in cond_wait () from /usr/lib/libthread.so.1
#4  0xfd347f18 in pthread_cond_wait () from /usr/lib/libthread.so.1
#5  0xfde662dc in ACE_OS::cond_timedwait (cv=0xdcbc0, external_mutex=0x68080, timeout=0x0) at /tools/chameleon_tao_common/tao1.4.2/ACE_wrappers/ace/Time_Value.inl:201
#6  0xfe100dbc in TAO_Leader_Follower::wait_for_event (this=0x68078, event=0xfb507d68, transport=0xfe1027e8, max_wait_time=0x0) at /tools/chameleon_tao_common/tao1.4.2/ACE_wrappers/TAO/tao/LF_Follower.inl:11
#7  0xfe0c4d80 in TAO::Synch_Twoway_Invocation::wait_for_reply (this=0xfb508060, max_wait_time=0x0, rd=@0xfe1027e8, bd=@0xfb507d50) at Transport.inl:40
#8  0xfe0c4bb0 in TAO::Synch_Twoway_Invocation::remote_twoway (this=0xfb508060, max_wait_time=0x0) at Synch_Invocation.cpp:160
#9  0xfe0c2354 in TAO::Invocation_Adapter::invoke_twoway (this=0xfb508310, op=@0x1, effective_target=@0xfb5081d8, r=@0xfb508140, max_wait_time=@0xfb5081d4) at Invocation_Adapter.cpp:262
#10 0xfe0c2264 in TAO::Invocation_Adapter::invoke_remote_i (this=0xfb508310, stub=0xfe0c22bc, details=@0xfb508250, effective_target=@0xfb5081d8, max_wait_time=@0xfb5081d4) at Invocation_Adapter.cpp:229
#11 0xfe0c1e5c in TAO::Invocation_Adapter::invoke_i (this=0xfb508310, stub=0xdcec8, details=@0xfb508250) at Invocation_Adapter.cpp:83
#12 0xfe0c1d78 in TAO::Invocation_Adapter::invoke (this=0xfb508310, ex_data=0xfedef850, ex_count=1) at Invocation_Adapter.cpp:44
#13 0xfedc0e8c in ChamNS::StructuredPushConsumer::push_structured_event (this=0xdbc00, notification=@0xfb508460) at ChamNSC.cc:4106
#14 0xfef5431c in MO_EvEmitter_i::handleEvent (this=0xdbc00, ev=@0xfb508460) at MO_EvEmitter_i.cc:91
#15 0xfef6dbfc in EvManager::handleEvent (this=0x60990, ev=@0xf5890) at /tools/chameleon_tao_common/rw-ed6/rw_buildspace/rw/ev_cntnr.h:75
#16 0xfef48b78 in MObject_i::push (notification=@0xf5890) at MObject_i.cc:96
#17 0xfef4cab0 in MObject_i::announceDeletion (this=0xf5938) at MObject_i.cc:1234
#18 0xfef49a3c in MObject_i::~MObject_i (this=0xf5938, __vtt_parm=0xfd43c000) at MObject_i.cc:368
#19 0xfef51438 in MO_Node_i::~MO_Node_i (this=0xf5928) at MO_Node_i.cc:23
#20 0xfe3ded30 in TAO_RefCountServantBase::_remove_ref (this=0xf5990) at Servant_Base.cpp:338
#21 0xfef6cdcc in MOActivator::etherealize (this=0xaebe0, oid=@0xf59c8, servant=0xf592c, cleanup_in_progress=false, remaining_activations=48) at MOActivator.cc:228
#22 0xfe3ae390 in TAO_POA::cleanup_servant (this=0x6c838, active_object_map_entry=0xf59c8) at POA.cpp:1738
#23 0xfe39eda4 in TAO_Object_Adapter::Servant_Upcall::servant_cleanup (this=0xfb5089d8) at Object_Adapter.cpp:1738
#24 0xfe39ea58 in TAO_Object_Adapter::Servant_Upcall::upcall_cleanup (this=0xfb5089d8) at Object_Adapter.cpp:1592
#25 0xfe39e9a4 in TAO_Object_Adapter::Servant_Upcall::~Servant_Upcall (this=0xfb5089d8) at Object_Adapter.cpp:1556
#26 0xfe39bd30 in TAO_Object_Adapter::dispatch_servant (this=0x0, key=@0xfb508cd8, req=@0xfb508cd8, forward_to={ptr_ = @0xfb508cc8}) at Object_Adapter.cpp:344
#27 0xfe39c9dc in TAO_Object_Adapter::dispatch (this=0x69ae0, key=@0xfb508d44, request=@0xfb508cd8, forward_to={ptr_ = @0xfb508cc8}) at /tools/chameleon_tao_common/tao1.4.2/ACE_wrappers/TAO/tao/Pseudo_VarOut_T.inl:131
#28 0xfe12989c in TAO_Adapter_Registry::dispatch (this=0x577ec, key=@0xfb508d44, request=@0xfb508cd8, forward_to={ptr_ = @0xfb508cc8}) at /tools/chameleon_tao_common/tao1.4.2/ACE_wrappers/TAO/tao/Pseudo_VarOut_T.inl:131
#29 0xfe142fcc in TAO_Request_Dispatcher::dispatch (this=<incomplete type>, orb_core=0x57660, request=@0xfb508cd8, forward_to={ptr_ = @0x57660}) at /tools/chameleon_tao_common/tao1.4.2/ACE_wrappers/TAO/tao/Tagged_Profile.i:23
#30 0xfe16aa08 in TAO_GIOP_Message_Base::process_request (this=0xf4ad8, transport=0xf48d8, cdr=@0xfb508e58, output=@0xfb508e90, parser=0xf4b00) at /tools/chameleon_tao_common/tao1.4.2/ACE_wrappers/TAO/tao/Object.i:64
#31 0xfe16a4f8 in TAO_GIOP_Message_Base::process_request_message (this=0xf4ad8, transport=0xf48d8, qd=0x0) at GIOP_Message_Base.cpp:700
#32 0xfe14ae1c in TAO_Transport::process_parsed_messages (this=0xfe16a2fc, qd=0xfb5091f8, rh=@0xf4ad8) at Transport.cpp:1810
#33 0xfe149e6c in TAO_Transport::handle_input (this=0xf48d8, rh=@0xfb5096e0, max_wait_time=0x0) at Transport.cpp:1282
#34 0xfe150bc0 in TAO_Connection_Handler::handle_input_eh (this=0xf489c, h=20, eh=0xf4800) at /tools/chameleon_tao_common/tao1.4.2/ACE_wrappers/TAO/tao/Connection_Handler.inl:19
#35 0xfe161fd0 in TAO_IIOP_Connection_Handler::handle_input (this=0xf4800, h=20) at IIOP_Connection_Handler.cpp:168
#36 0xfde8d694 in ACE_TP_Reactor::dispatch_socket_event (this=0x680f8, dispatch_info=@0x1) at TP_Reactor.cpp:657
#37 0xfde8cfb0 in ACE_TP_Reactor::handle_socket_events (this=0x680f8, event_count=@0xfb5098d4, guard=@0x1) at TP_Reactor.cpp:499
#38 0xfde8cc64 in ACE_TP_Reactor::dispatch_i (this=0x680f8, max_wait_time=0xfb5098d4, guard=@0xfb509948) at TP_Reactor.cpp:266
#39 0xfde8caac in ACE_TP_Reactor::handle_events (this=0x680f8, max_wait_time=0x0) at TP_Reactor.cpp:171
#40 0xfde892e8 in ACE_Reactor::handle_events (this=0x680f8, max_wait_time=0x0) at /tools/chameleon_tao_common/tao1.4.2/ACE_wrappers/ace/Reactor.inl:166
#41 0xfe0f8eb4 in TAO_ORB_Core::run (this=0x57660, tv=0x0, perform_work=0) at ORB_Core.cpp:1893
#42 0xfec2e1e0 in CHAM::ORB::run_ () from /home/projects/chameleon/lib/libchamorb.so
#43 0xfec4334c in ThreadPoolManager::svc () from /home/projects/chameleon/lib/libchamorb.so
#44 0xfdebd480 in ACE_Task_Base::svc_run (args=0xffbedcd0) at Task.cpp:203
#45 0xfde70d38 in ACE_Thread_Adapter::invoke_i (this=0xe18f8) at Thread_Adapter.cpp:149
#46 0xfde70ca4 in ACE_Thread_Adapter::invoke (this=0xe18f8) at Thread_Adapter.cpp:93
#47 0xfde3b608 in ace_thread_adapter (args=0xe18f8) at Base_Thread_Adapter.cpp:131
#48 0xfd35bb3c in _thread_start () from /usr/lib/libthread.so.1
(gdb) thread 2
(gdb) where
#0  0xfd348014 in cond_wait () from /usr/lib/libthread.so.1
#1  0xfd347f18 in pthread_cond_wait () from /usr/lib/libthread.so.1
#2  0xfde66108 in ACE_Condition_Thread_Mutex::wait (this=0x69b30) at /tools/chameleon_tao_common/tao1.4.2/ACE_wrappers/ace/OS_NS_Thread.inl:484
#3  0xfe39eafc in TAO_Object_Adapter::wait_for_non_servant_upcalls_to_complete (this=0x69ae0, _ACE_CORBA_Environment_variable=@0x159c) at Object_Adapter.cpp:1636
#4  0xfe39eb5c in TAO_Object_Adapter::wait_for_non_servant_upcalls_to_complete (this=0x69ae0) at Object_Adapter.cpp:1649
#5  0xfe39e684 in TAO_Object_Adapter::Servant_Upcall::prepare_for_upcall_i (this=0xfcb049d8, key=@0xfcb04d44, operation=0xfcb052ac "push_structured_event", forward_to={ptr_ = @0xfcb04cc8}, wait_occurred_restart_call=@0xfcb04954) at Object_Adapter.cpp:1392
#6  0xfe39e60c in TAO_Object_Adapter::Servant_Upcall::prepare_for_upcall (this=0xfcb049d8, key=@0xfcb04d44, operation=0xfcb052ac "push_structured_event", forward_to={ptr_ = @0xfcb04cc8}) at /tools/chameleon_tao_common/tao1.4.2/ACE_wrappers/TAO/tao/Pseudo_VarOut_T.inl:131
#7  0xfe39bcec in TAO_Object_Adapter::dispatch_servant (this=0x69ae0, key=@0xfcb04d44, req=@0xfcb04cd8, forward_to={ptr_ = @0xfcb04cc8}) at /tools/chameleon_tao_common/tao1.4.2/ACE_wrappers/TAO/tao/Pseudo_VarOut_T.inl:131
#8  0xfe39c9dc in TAO_Object_Adapter::dispatch (this=0x69ae0, key=@0xfcb04d44, request=@0xfcb04cd8, forward_to={ptr_ = @0xfcb04cc8}) at /tools/chameleon_tao_common/tao1.4.2/ACE_wrappers/TAO/tao/Pseudo_VarOut_T.inl:131
#9  0xfe12989c in TAO_Adapter_Registry::dispatch (this=0x577ec, key=@0xfcb04d44, request=@0xfcb04cd8, forward_to={ptr_ = @0xba80f08}) at /tools/chameleon_tao_common/tao1.4.2/ACE_wrappers/TAO/tao/Pseudo_VarOut_T.inl:131
#10 0xfe142fcc in TAO_Request_Dispatcher::dispatch (this=<incomplete type>, orb_core=0x57660, request=@0xfcb04cd8, forward_to={ptr_ = @0x9a6a0a8}) at /tools/chameleon_tao_common/tao1.4.2/ACE_wrappers/TAO/tao/Tagged_Profile.i:23
#11 0xfe16aa08 in TAO_GIOP_Message_Base::process_request (this=0xb0ac0, transport=0xb0888, cdr=@0xfcb04e58, output=@0xfcb04e90, parser=0xb0ae8) at /tools/chameleon_tao_common/tao1.4.2/ACE_wrappers/TAO/tao/Object.i:64
#12 0xfe16a4f8 in TAO_GIOP_Message_Base::process_request_message (this=0xb0ac0, transport=0xb0888, qd=0x0) at GIOP_Message_Base.cpp:700
#13 0xfe14ae1c in TAO_Transport::process_parsed_messages (this=0xfe16a2fc, qd=0xfcb051f8, rh=@0xb0ac0) at Transport.cpp:1810
#14 0xfe149e6c in TAO_Transport::handle_input (this=0xb0888, rh=@0xfcb056e0, max_wait_time=0x0) at Transport.cpp:1282
#15 0xfe150bc0 in TAO_Connection_Handler::handle_input_eh (this=0xb080c, h=19, eh=0xb0770) at /tools/chameleon_tao_common/tao1.4.2/ACE_wrappers/TAO/tao/Connection_Handler.inl:19
#16 0xfe161fd0 in TAO_IIOP_Connection_Handler::handle_input (this=0xb0770, h=19) at IIOP_Connection_Handler.cpp:168
#17 0xfde8d694 in ACE_TP_Reactor::dispatch_socket_event (this=0x680f8, dispatch_info=@0x1) at TP_Reactor.cpp:657
#18 0xfde8cfb0 in ACE_TP_Reactor::handle_socket_events (this=0x680f8, event_count=@0xfcb058d4, guard=@0x1) at TP_Reactor.cpp:499
#19 0xfde8cc64 in ACE_TP_Reactor::dispatch_i (this=0x680f8, max_wait_time=0xfcb058d4, guard=@0xfcb05948) at TP_Reactor.cpp:266
#20 0xfde8caac in ACE_TP_Reactor::handle_events (this=0x680f8, max_wait_time=0x0) at TP_Reactor.cpp:171
#21 0xfde892e8 in ACE_Reactor::handle_events (this=0x680f8, max_wait_time=0x0) at /tools/chameleon_tao_common/tao1.4.2/ACE_wrappers/ace/Reactor.inl:166
#22 0xfe0f8eb4 in TAO_ORB_Core::run (this=0x57660, tv=0x0, perform_work=0) at ORB_Core.cpp:1893
#23 0xfec2e1e0 in CHAM::ORB::run_ () from /home/projects/chameleon/lib/libchamorb.so
#24 0xfec4334c in ThreadPoolManager::svc () from /home/projects/chameleon/lib/libchamorb.so
#25 0xfdebd480 in ACE_Task_Base::svc_run (args=0xffbedcd0) at Task.cpp:203
#26 0xfde70d38 in ACE_Thread_Adapter::invoke_i (this=0xdb7a0) at Thread_Adapter.cpp:149
#27 0xfde70ca4 in ACE_Thread_Adapter::invoke (this=0xdb7a0) at Thread_Adapter.cpp:93
#28 0xfde3b608 in ace_thread_adapter (args=0xdb7a0) at Base_Thread_Adapter.cpp:131
#29 0xfd35bb3c in _thread_start () from /usr/lib/libthread.so.1
From: Balachandran Natarajan <bala@cs.wustl.edu>

> What is intriguing is that the first thread, which starts out as a
> regular "servant upcall", appears to mutate into a "non-servant upcall"
> once the response has been sent back to the client. I assume this
> happens because we are dealing with etherealization of the servant,
> or because we are in POA code that runs post-response. The second
> thread, however, is unable to proceed with the invocation on the
> second (unrelated) object because that requires that no "non-servant"
> upcall be in progress.

Right! The non-servant upcall in this case is the _remove_ref () operation. This deadlock is evil. Can you please stick this bug report in the bugzilla? For the time being you could work around it by using a oneway call to the other object. Could you also make up a simple test case? We will try to address this by x.4.4; getting it into x.4.3 may be hard.

Another strange thing is that the call seems to be going remote when you have the object in the same process space. Evil! Is this the problem you reported before? This shouldn't be happening either!

> Is it logical that the first thread mutates into a
> "non-servant upcall"?
>
> Maybe the new option MT_NOUPCALL in the next release will
> prevent this condition, although I suspect what we are experiencing is
> unrelated to the function of that option.

I am not sure whether that is going to help!

Thanks
Bala
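To make Bala's suggested workaround concrete: if the notification is carried by an operation we control (rather than the standard two-way push_structured_event()), declaring it oneway means the thread performing the non-servant upcall never enters TAO_Leader_Follower::wait_for_event() waiting for a reply, so it cannot block against the second thread. The interface and operation names below are illustrative only, not our real IDL.

// Hypothetical IDL (illustrative names only):
//
//   interface DeletionSink {
//     oneway void object_deleted (in string oid);
//   };
//
// Use from the servant destructor -- the call returns as soon as the
// request is sent, without waiting for a reply:
sink_->object_deleted (oid_.in ());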
Please see my comments on Bala's response below:

> > What is intriguing is that the first thread, which starts out as a
> > regular "servant upcall", appears to mutate into a "non-servant upcall"
> > once the response has been sent back to the client. I assume this
> > happens because we are dealing with etherealization of the servant,
> > or because we are in POA code that runs post-response. The second
> > thread, however, is unable to proceed with the invocation on the
> > second (unrelated) object because that requires that no "non-servant"
> > upcall be in progress.
>
> Right! The non-servant upcall in this case is the _remove_ref ()
> operation. This deadlock is evil. Can you please stick this bug report
> in the bugzilla? For the time being you could work around it by using
> a oneway call to the other object. Could you also make up a simple
> test case? We will try to address this by x.4.4; getting it into x.4.3
> may be hard.

I agree that this deadlock is evil! I believe it is a result of sinister incompatibilities between strategies used in different parts of the TAO implementation. Recall that I am using the default values for all TAO options, and I am using an LF thread pool containing 10 threads for my ORB Reactor. (I elaborate on the incompatibilities below.)

> Another strange thing is that the call seems to be going remote
> when you have the object in the same process space. Evil! Is this the
> problem you reported before? This shouldn't be happening either!

No, this isn't the problem I was having before. In the problem here we are using the CollocationStrategy "thru_poa". Thus all requests on all objects (irrespective of collocation) are treated in the same manner (as I understand it). This means that they pass through the POA and are subject to the same Reactor dispatch strategy (which assigns incoming requests to threads from the LF thread pool), as well as being subject to interceptors.

However, it is worth emphasising which configurations WILL experience this situation and which WILL NOT.

The problem WILL be experienced if:

1. CollocationStrategy "thru_poa" is used, with the default ACE_TP_Reactor and a thread pool containing more than one thread.

2. CollocationStrategy "direct" is used, with the default ACE_TP_Reactor and a thread pool containing more than one thread, and the application makes use of the IMR. This is the problem I described before (see bugzilla bug 1919). Here the IMR prevents the direct collocation optimisation from taking effect, and calls on collocated objects continue to pass thru the POA.

The problem WILL NOT be experienced if:

1. CollocationStrategy "thru_poa" is used with a single-threaded reactor. This is because the same thread will be reused to continue with the second request; it is the nature of the reactor to reassign waiting threads to incoming requests. At the point where wait_for_non_servant_upcalls_to_complete() is reached, the check ACE_OS::thr_equal (this->non_servant_upcall_thread_, ACE_OS::thr_self ()) prevents deadlock from arising.

2. CollocationStrategy "direct" is used, with the default ACE_TP_Reactor and a thread pool containing more than one thread, and the IMR is NOT used.

Last night I pondered this problem further and would like to share my thoughts. As I see it, the problem is that a non-servant upcall (starting from _remove_ref()) gets rescheduled onto another thread, and worse still, in the second thread it is no longer considered a non-servant upcall! Hence the clash between servant and non-servant upcalls. If the direct collocation optimisation were in force, only one thread would be involved and the entire operation would complete as a non-servant upcall.

*** Thus we see a sinister interplay between the servant/non-servant logic and the LF reactor thread pool! ***

I can imagine various ways to tackle this problem (see also the sketch at the end of this message):

1. TAO could recognise that the second thread is a continuation of the first and therefore treat it as a non-servant upcall too. The check in wait_for_non_servant_upcalls_to_complete() then falls away.

2. TAO could make the reactor more "clever" so that it reuses the first thread (which is waiting anyway) to handle the request on the second object. The check ACE_OS::thr_equal (this->non_servant_upcall_thread_, ACE_OS::thr_self ()) inside wait_for_non_servant_upcalls_to_complete() would then prevent deadlock from arising.

3. The CollocationStrategy "direct" optimisation could be fixed so that it works even when the IMR is deployed (see bug 1919). However, this is only a workaround and not really acceptable, because the problem would still exist with the "thru_poa" strategy (which most people use).

> > Maybe the new option MT_NOUPCALL in the next release will
> > prevent this condition, although I suspect what we are experiencing is
> > unrelated to the function of that option.
>
> I am not sure whether that is going to help!

I agree that it is unlikely to help, because as I understand it, MT_NOUPCALL means that waiting client threads will not be reused to handle incoming requests that arrive before the reply the client thread is expecting. That certainly does not imply that the waiting thread will be reused in the way I described above; on the contrary, I would say it implies the opposite!

> Thanks
> Bala

To assist you: once the fix is ready, if you send me a patch I would be very happy to run the test again in my environment and provide you with feedback.

Thanks,
Kostas.
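For reference, the guard I am referring to is, as far as I can tell from the 1.4.2 sources, roughly the following. This is a paraphrased sketch of TAO_Object_Adapter::wait_for_non_servant_upcalls_to_complete() with member names approximated, not the verbatim implementation.

// Paraphrased sketch -- not the verbatim TAO implementation.
void
TAO_Object_Adapter::wait_for_non_servant_upcalls_to_complete (void)
{
  // Wait while some OTHER thread is performing a non-servant upcall.
  // If the non-servant upcall is being performed by this very thread,
  // skip the wait -- otherwise we would deadlock against ourselves.
  // In our scenario thread 2 is a different thread, so it blocks here
  // forever while thread 1 blocks in wait_for_event().
  while (this->non_servant_upcall_in_progress_ != 0 &&
         !ACE_OS::thr_equal (this->non_servant_upcall_thread_,
                             ACE_OS::thr_self ()))
    this->non_servant_upcall_condition_.wait ();
}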
Hi Kostas:

>> Another strange thing is that the call seems to be going remote
>> when you have the object in the same process space. Evil! Is this
>> the problem you reported before? This shouldn't be happening
>> either!
>
> No, this isn't the problem I was having before. In the problem here
> we are using the CollocationStrategy "thru_poa". Thus all requests
> on all objects (irrespective of collocation) are treated in the same
> manner (as I understand it). This means that they pass through the
> POA and are subject to the same Reactor dispatch strategy (which
> assigns incoming requests to threads from the LF thread pool), as
> well as being subject to interceptors.

Bala is correct: this should not happen. If the servant is collocated with the client, the Reactor should not get involved. The upcall thread should take the call all the way through to the servant. If that happens, the non-servant upcall will not cause any problems, since it will be the same thread.

Can you please find out why TAO thinks that the push is a remote call? Is it because the IMR got involved, i.e. first a remote call was made to the IMR, it got forwarded to a collocated object, but a remote call was still made to it?

Thanks,

Irfan
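One quick way to answer Irfan's question might be to check what TAO considers the consumer reference to be just before the push is made. The snippet below is only a sketch: it uses TAO's proprietary _is_collocated() hook on CORBA::Object, and the `consumer` parameter stands in for whatever reference the destructor actually invokes.

// Diagnostic sketch -- 'consumer' is a placeholder for the reference
// that push_structured_event() is invoked on.
#include "tao/corba.h"
#include "ace/Log_Msg.h"

void check_collocation (CORBA::Object_ptr consumer)
{
  // _is_collocated() is a TAO-specific extension on CORBA::Object.
  ACE_DEBUG ((LM_DEBUG,
              "push target is %s\n",
              consumer->_is_collocated ()
                ? "collocated (should stay in the calling thread)"
                : "treated as remote (will go via the Reactor)"));
}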
to pool
Back to reporter: we need an automated regression test in order to analyze this with the current code base.