Please report new issues at https://github.com/DOCGroup
Our server application is deadlocking while processing large octet sequences using the TAO_Leader_Follower_Flushing_Strategy. This seems to occur after the ORB reports the following message with -ORBDebugLevel set to 1:

Sep 18 11:37:51.953 2001@LM_ERROR@TAO: (7036|2380) TAO_ServerRequest::tao_send_reply: An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full. : cannot send reply

This cannot be reproduced with 1.1.17. I have modified the File_IO testcase and can reproduce the problem fairly easily; I will attach this testcase to the bug. If I set the orb option to -ORBFlushingStrategy full I cannot reproduce the problem.

Stack traces of all the active threads are included below:

Thread 1:
NTDLL! ZwWaitForSingleObject@12 + 11 bytes
MSAFD! SockWaitForSingleObject@16 + 306 bytes
MSAFD! WSPSelect@24 + 1021 bytes
WS2_32! select@20 + 203 bytes
ACE_Select_Reactor_T<ACE_Select_Reactor_Token_T<ACE_Token> >::wait_for_multiple_events (ACE_Select_Reactor_T<ACE_Select_Reactor_Token_T<ACE_Token> > * const 0x00000002, ACE_Select_Reactor_Handle_Set & {...}, ACE_Time_Value * 0x00000000) line 1013 + 54 bytes
ACE_TP_Reactor::get_event_for_dispatching(ACE_TP_Reactor * const 0x00000002, ACE_Time_Value * 0x00000000) line 462 + 15 bytes
ACE_TP_Reactor::dispatch_i(ACE_TP_Reactor * const 0x00000002, ACE_Time_Value * 0x00000000, ACE_TP_Token_Guard & {...}) line 152
ACE_TP_Reactor::handle_events(ACE_TP_Reactor * const 0x00000002, ACE_Time_Value * 0x00000000) line 134 + 14 bytes
ACE_Reactor::handle_events(ACE_Reactor * const 0x00000002, ACE_Time_Value * 0x00000000) line 158
TAO_Leader_Follower::wait_for_event(TAO_Leader_Follower * const 0x00000002, TAO_LF_Event * 0x0191a9a0, TAO_Transport * 0x01004680, ACE_Time_Value * 0x00000000) line 380
TAO_Leader_Follower_Flushing_Strategy::flush_message (TAO_Leader_Follower_Flushing_Strategy * const 0x00000002, TAO_Transport * 0x01004680, TAO_Queued_Message * 0x0191a9a0, ACE_Time_Value * 0x00000000) line 32 + 16 bytes
TAO_Transport::send_synchronous_message_i(TAO_Transport * const 0x00000002, const ACE_Message_Block * 0x010046a8, ACE_Time_Value * 0x00000000) line 481
TAO_Transport::send_message_i(TAO_Transport * const 0x00000002, TAO_Stub * 0x00000000, int 1, const ACE_Message_Block * 0x0191b060, ACE_Time_Value * 0x00000000) line 301 + 13 bytes
TAO_IIOP_Transport::send_message(TAO_IIOP_Transport * const 0x00000002, TAO_OutputCDR & {...}, TAO_Stub * 0x00000000, int 1, ACE_Time_Value * 0x00000000) line 185
TAO_ServerRequest::tao_send_reply(TAO_ServerRequest * const 0x00000002) line 232
TAO_ServantBase::synchronous_upcall_dispatch(TAO_ServantBase * const 0x00000002, TAO_ServerRequest & {...}, void * 0x0191ab10, void * 0x01b80110, CORBA_Environment & {...}) line 262
POA_DTSStudyFile::DTSStudy::_dispatch(POA_DTSStudyFile::DTSStudy * const 0x00000002, TAO_ServerRequest & {...}, void * 0x0191ab10, CORBA_Environment & {...}) line 3550
TAO_Default_Servant_Dispatcher::dispatch(TAO_Default_Servant_Dispatcher * const 0x00000002, TAO_Object_Adapter::Servant_Upcall & {...}, TAO_ServerRequest & {...}, CORBA_Environment & {...}) line 21
TAO_Object_Adapter::dispatch_servant(TAO_Object_Adapter * const 0x00000002, const TAO_ObjectKey & {...}, TAO_ServerRequest & {...}, CORBA_Object_out {...}, CORBA_Environment & {...}) line 323 + 16 bytes
TAO_Object_Adapter::dispatch(TAO_Object_Adapter * const 0x00000002, TAO_ObjectKey & {...}, TAO_ServerRequest & {...}, CORBA_Object_out {...}, CORBA_Environment & {...}) line 733
TAO_Adapter_Registry::dispatch(TAO_Adapter_Registry * const 0x00000002, TAO_ObjectKey & {...}, TAO_ServerRequest & {...}, CORBA_Object_out {...}, CORBA_Environment & {...}) line 120
TAO_GIOP_Message_Base::process_request(TAO_GIOP_Message_Base * const 0x00000002, TAO_Transport * 0x01004680, TAO_InputCDR & {...}, TAO_OutputCDR & {...}) line 778
TAO_GIOP_Message_Base::process_request_message(TAO_GIOP_Message_Base * const 0x00000002, TAO_Transport * 0x01004680, TAO_Queued_Data * 0x01b4f702) line 594 + 21 bytes
TAO_Transport::process_parsed_messages(TAO_Transport * const 0x00000002, TAO_Queued_Data * 0x01b4e3f0, TAO_Resume_Handle & {...}) line 1361 + 16 bytes
TAO_Transport::process_queue_head(TAO_Transport * const 0x00000002, TAO_Resume_Handle & {...}) line 1549 + 16 bytes
TAO_Transport::handle_input_i(TAO_Transport * const 0x00000002, TAO_Resume_Handle & {...}, ACE_Time_Value * 0xff335756, int -1660142279) line 782
df75e810()
0d05bcbc()

Thread 2:
NTDLL! ZwWaitForSingleObject@12 + 11 bytes
KERNEL32! WaitForSingleObjectEx@12 + 90 bytes
KERNEL32! WaitForSingleObject@8 + 15 bytes
ACE_OS::cond_wait(ACE_cond_t * 0x00000000, _RTL_CRITICAL_SECTION * 0x00f78e94) line 6313 + 11 bytes
ACE_OS::cond_timedwait(ACE_cond_t * 0x01bd95e4, _RTL_CRITICAL_SECTION * 0x00f78e94, ACE_Time_Value * 0x00000000) line 6205 + 11 bytes
ACE_Condition_Thread_Mutex::wait(ACE_Condition_Thread_Mutex * const 0x7ffd7000, const ACE_Time_Value * 0x00000000) line 642 + 14 bytes
TAO_Leader_Follower::wait_for_event(TAO_Leader_Follower * const 0x7ffd7000, TAO_LF_Event * 0x01a1f054, TAO_Transport * 0x01004680, ACE_Time_Value * 0x00000000) line 245 + 14 bytes
TAO_Leader_Follower_Flushing_Strategy::flush_message (TAO_Leader_Follower_Flushing_Strategy * const 0x7ffd7000, TAO_Transport * 0x01004680, TAO_Queued_Message * 0x01a1f054, ACE_Time_Value * 0x00000000) line 32 + 16 bytes
TAO_Transport::send_synchronous_message_i(TAO_Transport * const 0x7ffd7000, const ACE_Message_Block * 0x010046a8, ACE_Time_Value * 0x00000000) line 481
TAO_Transport::send_message_i(TAO_Transport * const 0x7ffd7000, TAO_Stub * 0x00000000, int 1, const ACE_Message_Block * 0x01a1f714, ACE_Time_Value * 0x00000000) line 301 + 13 bytes
TAO_IIOP_Transport::send_message(TAO_IIOP_Transport * const 0x7ffd7000, TAO_OutputCDR & {...}, TAO_Stub * 0x00000000, int 1, ACE_Time_Value * 0x00000000) line 185
TAO_ServerRequest::tao_send_reply(TAO_ServerRequest * const 0x7ffd7000) line 232
TAO_ServantBase::synchronous_upcall_dispatch(TAO_ServantBase * const 0x7ffd7000, TAO_ServerRequest & {...}, void * 0x01a1f1c4, void * 0x01b80110, CORBA_Environment & {...}) line 262
POA_DTSStudyFile::DTSStudy::_dispatch(POA_DTSStudyFile::DTSStudy * const 0x7ffd7000, TAO_ServerRequest & {...}, void * 0x01a1f1c4, CORBA_Environment & {...}) line 3550
TAO_Default_Servant_Dispatcher::dispatch(TAO_Default_Servant_Dispatcher * const 0x7ffd7000, TAO_Object_Adapter::Servant_Upcall & {...}, TAO_ServerRequest & {...}, CORBA_Environment & {...}) line 21
TAO_Object_Adapter::dispatch_servant(TAO_Object_Adapter * const 0x7ffd7000, const TAO_ObjectKey & {...}, TAO_ServerRequest & {...}, CORBA_Object_out {...}, CORBA_Environment & {...}) line 323 + 16 bytes
TAO_Object_Adapter::dispatch(TAO_Object_Adapter * const 0x7ffd7000, TAO_ObjectKey & {...}, TAO_ServerRequest & {...}, CORBA_Object_out {...}, CORBA_Environment & {...}) line 733
TAO_Adapter_Registry::dispatch(TAO_Adapter_Registry * const 0x7ffd7000, TAO_ObjectKey & {...}, TAO_ServerRequest & {...}, CORBA_Object_out {...}, CORBA_Environment & {...}) line 120
TAO_GIOP_Message_Base::process_request(TAO_GIOP_Message_Base * const 0x7ffd7000, TAO_Transport * 0x01004680, TAO_InputCDR & {...}, TAO_OutputCDR & {...}) line 778
TAO_GIOP_Message_Base::process_request_message(TAO_GIOP_Message_Base * const 0x7ffd7000, TAO_Transport * 0x01004680, TAO_Queued_Data * 0x01b4ff02) line 594 + 21 bytes
TAO_Transport::process_parsed_messages(TAO_Transport * const 0x7ffd7000, TAO_Queued_Data * 0x01d1af08, TAO_Resume_Handle & {...}) line 1361 + 16 bytes
TAO_Transport::process_queue_head(TAO_Transport * const 0x7ffd7000, TAO_Resume_Handle & {...}) line 1549 + 16 bytes
TAO_Transport::consolidate_extra_messages(TAO_Transport * const 0x7ffd7000, ACE_Message_Block & {...}, TAO_Resume_Handle & {...}) line 1327 + 10 bytes
TAO_Transport::parse_consolidate_messages(TAO_Transport * const 0x7ffd7000, ACE_Message_Block & {...}, TAO_Resume_Handle & {...}, ACE_Time_Value * 0x00000000) line 910 + 12 bytes
TAO_Transport::handle_input_i(TAO_Transport * const 0x7ffd7000, TAO_Resume_Handle & {...}, ACE_Time_Value * 0x00000000, int 0) line 859
TAO_IIOP_Connection_Handler::handle_input(TAO_IIOP_Connection_Handler * const 0x7ffd7000, void * 0x00000434) line 336
ACE_TP_Reactor::dispatch_socket_event(ACE_TP_Reactor * const 0x7ffd7000, ACE_EH_Dispatch_Info & {...}) line 569 + 6 bytes
ACE_TP_Reactor::handle_socket_events(ACE_TP_Reactor * const 0x7ffd7000, int & 0, ACE_TP_Token_Guard & {...}) line 375 + 13 bytes
ACE_TP_Reactor::dispatch_i(ACE_TP_Reactor * const 0x7ffd7000, ACE_Time_Value * 0x00000000, ACE_TP_Token_Guard & {...}) line 201 + 14 bytes
ACE_TP_Reactor::handle_events(ACE_TP_Reactor * const 0x7ffd7000, ACE_Time_Value * 0x00000000) line 134 + 14 bytes
ACE_Reactor::handle_events(ACE_Reactor * const 0x7ffd7000, ACE_Time_Value * 0x00000000) line 158
TAO_ORB_Core::run(TAO_ORB_Core * const 0x7ffd7000, ACE_Time_Value * 0x00000000, int 0, CORBA_Environment & {...}) line 2117
CORBA_ORB::run(CORBA_ORB * const 0x7ffd7000, ACE_Time_Value * 0x00000000, CORBA_Environment & {...}) line 248 + 18 bytes
CORBA_ORB::run(CORBA_ORB * const 0x7ffd7000, CORBA_Environment & {...}) line 233
TAO_ORB_Manager::run(TAO_ORB_Manager * const 0x7ffd7000, CORBA_Environment & {...}) line 319
DLL_ORB::svc(DLL_ORB * const 0x7ffd7000) line 60
ACE_Task_Base::svc_run(void * 0x00f6cc18) line 211
ACE_Thread_Hook::start(ACE_Thread_Hook * const 0x7ffd7000, void * (void *)* 0x1005496e ACE_Task_Base::svc_run(void *), void * 0x00f6cc18) line 11 + 8 bytes
Thread_Hook::start(Thread_Hook * const 0x7ffd7000, void * (void *)* 0x1005496e ACE_Task_Base::svc_run(void *), void * 0x00f6cc18) line 26
ACE_Thread_Adapter::invoke_i(ACE_Thread_Adapter * const 0x7ffd7000) line 142
ACE_Thread_Adapter::invoke(ACE_Thread_Adapter * const 0x7ffd7000) line 91 + 7 bytes
MSVCRT! 7800a3c0()
KERNEL32!

Thread 3:
NTDLL! ZwWaitForSingleObject@12 + 11 bytes
KERNEL32! WaitForSingleObjectEx@12 + 90 bytes
KERNEL32! WaitForSingleObject@8 + 15 bytes
ACE_OS::cond_wait(ACE_cond_t * 0x00000000, _RTL_CRITICAL_SECTION * 0x00f78e94) line 6313 + 11 bytes
ACE_OS::cond_timedwait(ACE_cond_t * 0x00f741a4, _RTL_CRITICAL_SECTION * 0x00f78e94, ACE_Time_Value * 0x00000000) line 6205 + 11 bytes
ACE_Condition_Thread_Mutex::wait(ACE_Condition_Thread_Mutex * const 0xffffffff, const ACE_Time_Value * 0x00000000) line 642 + 14 bytes
TAO_Leader_Follower::wait_for_event(TAO_Leader_Follower * const 0xffffffff, TAO_LF_Event * 0x0171f0a4, TAO_Transport * 0x01004680, ACE_Time_Value * 0x00000000) line 245 + 14 bytes
TAO_Leader_Follower_Flushing_Strategy::flush_message (TAO_Leader_Follower_Flushing_Strategy * const 0xffffffff, TAO_Transport * 0x01004680, TAO_Queued_Message * 0x0171f0a4, ACE_Time_Value * 0x00000000) line 32 + 16 bytes
TAO_Transport::send_synchronous_message_i(TAO_Transport * const 0xffffffff, const ACE_Message_Block * 0x010046a8, ACE_Time_Value * 0x00000000) line 481
TAO_Transport::send_message_i(TAO_Transport * const 0xffffffff, TAO_Stub * 0x00000000, int 1, const ACE_Message_Block * 0x0171f764, ACE_Time_Value * 0x00000000) line 301 + 13 bytes
TAO_IIOP_Transport::send_message(TAO_IIOP_Transport * const 0xffffffff, TAO_OutputCDR & {...}, TAO_Stub * 0x00000000, int 1, ACE_Time_Value * 0x00000000) line 185
TAO_ServerRequest::tao_send_reply(TAO_ServerRequest * const 0xffffffff) line 232
TAO_ServantBase::synchronous_upcall_dispatch(TAO_ServantBase * const 0xffffffff, TAO_ServerRequest & {...}, void * 0x0171f214, void * 0x01b80110, CORBA_Environment & {...}) line 262
POA_DTSStudyFile::DTSStudy::_dispatch(POA_DTSStudyFile::DTSStudy * const 0xffffffff, TAO_ServerRequest & {...}, void * 0x0171f214, CORBA_Environment & {...}) line 3550
TAO_Default_Servant_Dispatcher::dispatch(TAO_Default_Servant_Dispatcher * const 0xffffffff, TAO_Object_Adapter::Servant_Upcall & {...}, TAO_ServerRequest & {...}, CORBA_Environment & {...}) line 21
TAO_Object_Adapter::dispatch_servant(TAO_Object_Adapter * const 0xffffffff, const TAO_ObjectKey & {...}, TAO_ServerRequest & {...}, CORBA_Object_out {...}, CORBA_Environment & {...}) line 323 + 16 bytes
TAO_Object_Adapter::dispatch(TAO_Object_Adapter * const 0xffffffff, TAO_ObjectKey & {...}, TAO_ServerRequest & {...}, CORBA_Object_out {...}, CORBA_Environment & {...}) line 733
TAO_Adapter_Registry::dispatch(TAO_Adapter_Registry * const 0xffffffff, TAO_ObjectKey & {...}, TAO_ServerRequest & {...}, CORBA_Object_out {...}, CORBA_Environment & {...}) line 120
TAO_GIOP_Message_Base::process_request(TAO_GIOP_Message_Base * const 0xffffffff, TAO_Transport * 0x01004680, TAO_InputCDR & {...}, TAO_OutputCDR & {...}) line 778
TAO_GIOP_Message_Base::process_request_message(TAO_GIOP_Message_Base * const 0xffffffff, TAO_Transport * 0x01004680, TAO_Queued_Data * 0x01b4fc02) line 594 + 21 bytes
TAO_Transport::process_parsed_messages(TAO_Transport * const 0xffffffff, TAO_Queued_Data * 0x01d1adf8, TAO_Resume_Handle & {...}) line 1361 + 16 bytes
TAO_Transport::process_queue_head(TAO_Transport * const 0xffffffff, TAO_Resume_Handle & {...}) line 1549 + 16 bytes
TAO_Transport::handle_input_i(TAO_Transport * const 0xffffffff, TAO_Resume_Handle & {...}, ACE_Time_Value * 0xff335756, int -1660142279) line 782
df75e810()
0d05bcbc()

Thread 4:
NTDLL! ZwWaitForSingleObject@12 + 11 bytes
KERNEL32! WaitForSingleObjectEx@12 + 90 bytes
KERNEL32! WaitForSingleObject@8 + 15 bytes
ACE_OS::cond_wait(ACE_cond_t * 0x00000000, _RTL_CRITICAL_SECTION * 0x00f78e94) line 6313 + 11 bytes
ACE_OS::cond_timedwait(ACE_cond_t * 0x01b80dfc, _RTL_CRITICAL_SECTION * 0x00f78e94, ACE_Time_Value * 0x00000000) line 6205 + 11 bytes
ACE_Condition_Thread_Mutex::wait(ACE_Condition_Thread_Mutex * const 0x00000316, const ACE_Time_Value * 0x00000000) line 642 + 14 bytes
TAO_Leader_Follower::wait_for_event(TAO_Leader_Follower * const 0x00000316, TAO_LF_Event * 0x0181f0a4, TAO_Transport * 0x01004680, ACE_Time_Value * 0x00000000) line 245 + 14 bytes
TAO_Leader_Follower_Flushing_Strategy::flush_message (TAO_Leader_Follower_Flushing_Strategy * const 0x00000316, TAO_Transport * 0x01004680, TAO_Queued_Message * 0x0181f0a4, ACE_Time_Value * 0x00000000) line 32 + 16 bytes
TAO_Transport::send_synchronous_message_i(TAO_Transport * const 0x00000316, const ACE_Message_Block * 0x010046a8, ACE_Time_Value * 0x00000000) line 481
TAO_Transport::send_message_i(TAO_Transport * const 0x00000316, TAO_Stub * 0x00000000, int 1, const ACE_Message_Block * 0x0181f764, ACE_Time_Value * 0x00000000) line 301 + 13 bytes
TAO_IIOP_Transport::send_message(TAO_IIOP_Transport * const 0x00000316, TAO_OutputCDR & {...}, TAO_Stub * 0x00000000, int 1, ACE_Time_Value * 0x00000000) line 185
TAO_ServerRequest::tao_send_reply(TAO_ServerRequest * const 0x00000316) line 232
TAO_ServantBase::synchronous_upcall_dispatch(TAO_ServantBase * const 0x00000316, TAO_ServerRequest & {...}, void * 0x0181f214, void * 0x01b80110, CORBA_Environment & {...}) line 262
POA_DTSStudyFile::DTSStudy::_dispatch(POA_DTSStudyFile::DTSStudy * const 0x00000316, TAO_ServerRequest & {...}, void * 0x0181f214, CORBA_Environment & {...}) line 3550
TAO_Default_Servant_Dispatcher::dispatch(TAO_Default_Servant_Dispatcher * const 0x00000316, TAO_Object_Adapter::Servant_Upcall & {...}, TAO_ServerRequest & {...}, CORBA_Environment & {...}) line 21
TAO_Object_Adapter::dispatch_servant(TAO_Object_Adapter * const 0x00000316, const TAO_ObjectKey & {...}, TAO_ServerRequest & {...}, CORBA_Object_out {...}, CORBA_Environment & {...}) line 323 + 16 bytes
TAO_Object_Adapter::dispatch(TAO_Object_Adapter * const 0x00000316, TAO_ObjectKey & {...}, TAO_ServerRequest & {...}, CORBA_Object_out {...}, CORBA_Environment & {...}) line 733
TAO_Adapter_Registry::dispatch(TAO_Adapter_Registry * const 0x00000316, TAO_ObjectKey & {...}, TAO_ServerRequest & {...}, CORBA_Object_out {...}, CORBA_Environment & {...}) line 120
TAO_GIOP_Message_Base::process_request(TAO_GIOP_Message_Base * const 0x00000316, TAO_Transport * 0x01004680, TAO_InputCDR & {...}, TAO_OutputCDR & {...}) line 778
TAO_GIOP_Message_Base::process_request_message(TAO_GIOP_Message_Base * const 0x00000316, TAO_Transport * 0x01004680, TAO_Queued_Data * 0x0100e802) line 594 + 21 bytes
TAO_Transport::process_parsed_messages(TAO_Transport * const 0x00000316, TAO_Queued_Data * 0x01b66ff8, TAO_Resume_Handle & {...}) line 1361 + 16 bytes
TAO_Transport::process_queue_head(TAO_Transport * const 0x00000316, TAO_Resume_Handle & {...}) line 1549 + 16 bytes
TAO_Transport::handle_input_i(TAO_Transport * const 0x00000316, TAO_Resume_Handle & {...}, ACE_Time_Value * 0xff335756, int -1660142279) line 782
df75e810()
0d05bcbc()

Thread 5:
NTDLL! ZwWaitForSingleObject@12 + 11 bytes
KERNEL32! WaitForSingleObjectEx@12 + 90 bytes
KERNEL32! WaitForSingleObject@8 + 15 bytes
ACE_OS::cond_wait(ACE_cond_t * 0x00000000, _RTL_CRITICAL_SECTION * 0x00f78e94) line 6313 + 11 bytes
ACE_Condition_Thread_Mutex::wait(ACE_Condition_Thread_Mutex * const 0x00151388) line 625 + 9 bytes
TAO_Leader_Follower::wait_for_client_leader_to_complete(TAO_Leader_Follower * const 0x00151388, ACE_Time_Value * 0x0024d68c) line 79 + 9 bytes
TAO_Leader_Follower::set_event_loop_thread(TAO_Leader_Follower * const 0x00151388, ACE_Time_Value * 0x00000000) line 61
TAO_LF_Strategy_Complete::set_event_loop_thread(TAO_LF_Strategy_Complete * const 0x00151388, ACE_Time_Value * 0xffffffff, TAO_Leader_Follower & {...}) line 29 + 14 bytes
0014aa60()
00f86b98()
Created attachment 82: Deadlock after socket error testcase
I can reproduce the problem, in release mode only, with the following client/server settings:

client -b 5000000 -k server.ior
server -o server.ior

The client runs 10 threads; the server runs 20 ORB threads.
I forgot to mention that the first stack trace is of the thread that reported the insufficient buffer space socket error.
"If I set the orb option to -ORBFlushingStrategy full I cannot reproduce the problem." -- Jon, are you sure that your above observation is right? AFAIK & AFAICS, I dont see any "full" flushing strategy. I see only "leader_follower", "reactive" and "blocking". If you try to set it to a wrong value it should use the default, which is a "leader_follower". I am a bit confused on this. Moreover, I guess that the system has closed the connection on which the error occured. The reply was not sent back and the client thread who sent the request still tries to wait for its reply. I am just guessing as I have seen this before. Can you please try to take a peek at your client side? Are you able to see a leader thread in wait_for_event () in the Leader_Follower? Please let us know.
Hi Bala, sorry it has taken me this long to respond, but our network has been down due to a nasty virus. You are right, I had the wrong parameter set for -ORBFlushingStrategy; sorry about that. I am still seeing a problem, however, with the testcase I submitted. It seems to cause problems only on our faster machines. Interestingly, it is easier to reproduce if the number of ORB threads in the server.cpp file is changed from 5 to 20. I would appreciate it if you could try out the testcase, running the server with 20 threads and the client buffer request size set to 10MB, i.e.:

client -k server.ior -b 10000000

The client thread stacks look like this:

One thread in select
++++++++++++++++++++
NTDLL! ZwWaitForSingleObject@12 + 11 bytes
MSAFD! SockWaitForSingleObject@16 + 306 bytes
MSAFD! WSPSelect@24 + 1021 bytes
WS2_32! select@20 + 203 bytes
ACE_Select_Reactor_T<ACE_Select_Reactor_Token_T<ACE_Token> >::wait_for_multiple_events (ACE_Select_Reactor_T<ACE_Select_Reactor_Token_T<ACE_Token> > * const 0x00000210, ACE_Select_Reactor_Handle_Set & {...}, ACE_Time_Value * 0x00000000) line 1013 + 54 bytes
ACE_TP_Reactor::get_event_for_dispatching(ACE_TP_Reactor * const 0x00000210, ACE_Time_Value * 0x00000000) line 462 + 15 bytes
ACE_TP_Reactor::dispatch_i(ACE_TP_Reactor * const 0x00000210, ACE_Time_Value * 0x00000000, ACE_TP_Token_Guard & {...}) line 152
ACE_TP_Reactor::handle_events(ACE_TP_Reactor * const 0x00000210, ACE_Time_Value * 0x00000000) line 134 + 14 bytes
ACE_Reactor::handle_events(ACE_Reactor * const 0x00000210, ACE_Time_Value * 0x00000000) line 158
TAO_Leader_Follower::wait_for_event(TAO_Leader_Follower * const 0x00000210, TAO_LF_Event * 0x0132fac8, TAO_Transport * 0x009d29b0, ACE_Time_Value * 0x00000000) line 380
TAO_Wait_On_Leader_Follower::wait(TAO_Wait_On_Leader_Follower * const 0x00000210, ACE_Time_Value * 0x00000000, TAO_Synch_Reply_Dispatcher & {...}) line 57 + 30 bytes
TAO_GIOP_Synch_Invocation::invoke_i(TAO_GIOP_Synch_Invocation * const 0x00000210, unsigned char 0, CORBA_Environment & {...}) line 677 + 135 bytes
TAO_GIOP_Twoway_Invocation::invoke(TAO_GIOP_Twoway_Invocation * const 0x00000210, TAO_Exception_Data * 0x00413250, unsigned int 1, CORBA_Environment & {...}) line 825
CLIENT! 00402e62()
CLIENT! 00403f53()
ACE_Thread_Adapter::invoke_i(ACE_Thread_Adapter * const 0x00000210) line 148 + 3 bytes
ACE_Thread_Adapter::invoke(ACE_Thread_Adapter * const 0x00000210) line 91 + 7 bytes
MSVCRT! 7800a3c0()
KERNEL

4 Other threads in wait_for_event
+++++++++++++++++++++++++++++++++
NTDLL! ZwWaitForSingleObject@12 + 11 bytes
KERNEL32! WaitForSingleObjectEx@12 + 90 bytes
KERNEL32! WaitForSingleObject@8 + 15 bytes
ACE_OS::cond_wait(ACE_cond_t * 0x00000000, _RTL_CRITICAL_SECTION * 0x00992c6c) line 6313 + 11 bytes
ACE_OS::cond_timedwait(ACE_cond_t * 0x009cbe6c, _RTL_CRITICAL_SECTION * 0x00992c6c, ACE_Time_Value * 0x00000000) line 6205 + 11 bytes
ACE_Condition_Thread_Mutex::wait(ACE_Condition_Thread_Mutex * const 0x00000001, const ACE_Time_Value * 0x00000000) line 642 + 14 bytes
TAO_Leader_Follower::wait_for_event(TAO_Leader_Follower * const 0x00000001, TAO_LF_Event * 0x0112fac8, TAO_Transport * 0x009c7968, ACE_Time_Value * 0x00000000) line 245 + 14 bytes
TAO_Wait_On_Leader_Follower::wait(TAO_Wait_On_Leader_Follower * const 0x00000001, ACE_Time_Value * 0x00000000, TAO_Synch_Reply_Dispatcher & {...}) line 57 + 30 bytes
TAO_GIOP_Synch_Invocation::invoke_i(TAO_GIOP_Synch_Invocation * const 0x00000001, unsigned char 0, CORBA_Environment & {...}) line 677 + 135 bytes
TAO_GIOP_Twoway_Invocation::invoke(TAO_GIOP_Twoway_Invocation * const 0x00000001, TAO_Exception_Data * 0x00413250, unsigned int 1, CORBA_Environment & {...}) line 825
CLIENT! 00402e62()
CLIENT! 00403f53()
ACE_Thread_Adapter::invoke_i(ACE_Thread_Adapter * const 0x00000001) line 148 + 3 bytes
ACE_Thread_Adapter::invoke(ACE_Thread_Adapter * const 0x00000001) line 91 + 7 bytes
Hi Bala, if you have time to look at the testcase, could you also please monitor the memory usage? If you Ctrl-C the client several times during the transfer of a large buffer, the memory on the server seems to grow without bound.
FWIW, I haven't been able to reproduce this on a fast Linux box here. Just FYI.
Hi Bala, the patch below does not fix the problem. Please go ahead and defer the bug. As for the memory growth, it appears that if I increase the number of threads on the server to 20 and run 4 clients all accessing 10MB buffers, the server quickly grows to 250MB and spends most of its time swapping virtual memory in and out. It seems that you are keeping buffers around on a thread-specific basis and never clearing these buffers out after the transfer completes. Probably the next time that thread is used, the buffer is freed and replaced with a new buffer, so cumulative growth is not seen. Is this possible? Anyway, neither bug is a show-stopper; we can break up our large buffer requests into smaller ones if need be. Jonathan

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Patch supplied by Bala
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

int
TAO_IIOP_Transport::send_message (TAO_OutputCDR &stream,
                                  TAO_Stub *stub,
                                  int twoway,
                                  ACE_Time_Value *max_wait_time)
{
  // Format the message in the stream first
  if (this->messaging_object_->format_message (stream) != 0)
    return -1;

  // This guarantees to send all data (bytes) or return an error.
  ssize_t n = this->send_message_i (stub,
                                    twoway,
                                    stream.begin (),
                                    max_wait_time);

  if (n == -1)
    {
      if (TAO_debug_level)
        ACE_DEBUG ((LM_DEBUG,
                    ACE_TEXT ("TAO: (%P|%t|%N|%l) closing transport %d after fault %p\n"),
                    this->id (),
                    ACE_TEXT ("send_message ()\n")));

      this->tms_->connection_closed ();
      return -1;
    }

  return 1;
}
I have been able to reproduce the memory growth problem, and I have an idea of where the other problem that Jon originally reported could be stemming from. I am accepting the bug.
Assigning the bug to me
Accepting the bug.
Here is my analysis of why there is a memory hog. Below is the extract from my email to folks like Chris (C), Chad and Carlos.

bala@cs.wustl.edu writes:
--------------------
I think we have a semi-serious problem on hand. This gives rise to memory growth problems, and it can be serious. It can also lead to weird behaviors that I am not ready to think about. Here is how it occurs. Imagine a server with two threads and a client with, say, 10 threads. The server, as part of the reply, needs to send large data, say 10 MB. Below is the sequence of steps:

1. The connections are muxed, so there may be more than one message in a single read for a thread.
2. As a result of the above, we send notifies to the reactor, which unblocks the next thread and sends it to the same handler.
3. At step #2 the number of upcalls on the handler is 3. As there are two threads in the handler, this is fine.
4. The first thread tries to send a 10 MB reply. It is not able to, so it schedules output and waits on the reactor.
5. The same happens for thread 2. Note that at this point the number of upcalls hasn't dropped, as handle_input () is not complete for either thread.
6. As the threads have gone to the reactor, the reactor can potentially wake them up and send them to the same handler, as there could be more events.
7. Now the upcall count goes to 5 and both threads go to the reactor again.
8. After some time, if the client crashes, we just call handle_close (), which decrements the upcall count by 1. If the upcall count is not 0 we don't do anything. We expect all the threads (only 2 in our case) to unwind and decrement the upcall count.
9. That just doesn't happen, and there is a leak. Worse still, the transports are in the cache.
10. The above could happen for >= 1 connection in the above setup.
11. The worst part would start occurring if the servers were really making remote calls back to the client. They could try to call clients that are not available and possibly crash.

I have been able to reproduce whatever I have said (from 1-10). Any ideas? The memory hog comes from the lingering transports, the messages queued on them, and their own memory.
-------End mail-------------------------------------

I don't think we have time before 1.2/1.2.1 to fix these. We need to look into these once we start working on connections again, maybe when GIOP 1.3 comes out or when we work on the reactor or anything connected to this area.
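To make the accounting in steps 3-9 concrete, here is a minimal sketch of the upcall-count pattern being described. This is not TAO's actual code; the class, its members, and the destroy () helper are hypothetical, and locking is elided for brevity.

// Sketch of the upcall-count bookkeeping described above (hypothetical).
class Connection_Handler
{
public:
  Connection_Handler () : upcalls_ (0), closed_ (false) {}

  // A thread entering handle_input() bumps the count; a reactor
  // "notify" that sends an extra thread to the same handler (step 2)
  // bumps it again.
  void enter_upcall () { ++this->upcalls_; }

  // Each thread is expected to decrement the count when its
  // handle_input() unwinds. A thread that went back to the reactor to
  // wait for the large reply to flush (steps 4-7) never reaches this.
  void leave_upcall ()
  {
    if (--this->upcalls_ == 0 && this->closed_)
      this->destroy (); // last one out releases handler + transport
  }

  // Called once when the peer crashes (step 8): it only decrements.
  void handle_close ()
  {
    this->closed_ = true;
    if (--this->upcalls_ == 0)
      this->destroy ();
    // Otherwise we do nothing, relying on the in-flight threads to
    // unwind. Since they are parked in the reactor, the count never
    // reaches zero, and the handler, its transport, and the queued
    // messages all leak (step 9).
  }

private:
  void destroy () { delete this; }

  int upcalls_;
  bool closed_;
};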
Hi Bala, is setting -ORBFlushingStrategy to 'blocking' a workaround for the memory leak? If so, what difference in behavior would I see in my server? Jonathan
I have been able to detect the source of the deadlock problem that Jon reported (though I haven't been able to reproduce it). Here is the text from a few emails that I sent to Chris (C), Chad & Carlos.

-------------- Emails -------------------------
Hi,

The following piece of code smells foul to me:

void
TAO_ServerRequest::tao_send_reply (void)
{
  int result = this->transport_->send_message (*this->outgoing_);
  if (result == -1)
    {
      if (TAO_debug_level > 0)
        {
          // No exception but some kind of error, yet a response
          // is required.
          ACE_ERROR ((LM_ERROR,
                      ACE_TEXT ("TAO: (%P|%t) %p: cannot send reply\n"),
                      ACE_TEXT ("TAO_ServerRequest::tao_send_reply")));
        }
    }
}

The variations of this method, like tao_send_*, also smell fishy. The problem is that if the send fails we don't do anything. The client thread that waits for the reply keeps waiting for it. This could lead to deadlock for a TAO client, as it would be expecting a reply. IMHO, this is bad.

------------- In another email that followed I wrote -----------------

Looks like doing the right thing may not be that easy (else I would have done it by now :-)). OK, here is the trail of thought on this, and I need you guys to let me know whether I am thinking in the wrong direction.

What we need to do above is close the connection, i.e. the socket should be closed at the very least, so that the client's thread is not left waiting for a reply. But it looks like that is not easy. We have to clean up the cache and also remove the entry from the reactor map. We need to call remove_handler () to get it off the map. But the handle_close () that would be called will have zero effect, as the connection handler may not be closed (there could have been multiple threads that passed through the handler).

So we want the connection handler and transport to stay alive (so that all the threads can exit safely), but the socket closed (so that the client can detect connection closure), and we must also make sure that we don't close the socket again when we delete the connection handler. Basically we need to split the closing of the underlying socket from the deletion of the connection handler object, which seems to go against the abstraction of the Event/Svc Handler. Added to this, we need to lock the closing of the sockets, as we don't want another thread reading from a socket while it is being closed. We then need to drain the queues in the transport when all the threads exit so that we prevent any memory hogs (this is being done now, so it is not a problem).

Let me know where my understanding of this is wrong, or if the demands that I have placed are all wrong.
------End emails -----------------------------------------

The end result is that we do have a problem and we need to decide on a corrective action. When and how is open. I am adding a dependency on 1031. Maybe we should see whether we can cover this as part of 1031.
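A minimal sketch of the split described above (closing the socket without deleting the handler, and making sure the handle is closed exactly once). This is not TAO code; the class, its members, and the cleanup steps are purely illustrative.

// Hypothetical sketch: separate socket closure from handler deletion.
#include "ace/SOCK_Stream.h"
#include "ace/Synch.h"

class Faulty_Connection
{
public:
  Faulty_Connection () : socket_closed_ (false) {}

  // Called by the thread whose send failed (e.g. from tao_send_reply).
  // Only the underlying socket is closed; the peer detects the closure
  // and its waiting threads can raise an error instead of blocking
  // forever. The handler object itself stays alive.
  void close_socket ()
  {
    ACE_GUARD (ACE_Thread_Mutex, guard, this->lock_);
    if (!this->socket_closed_)
      {
        this->peer_.close ();
        this->socket_closed_ = true;
      }
  }

  // Called when the last in-flight upcall unwinds. Only now is it safe
  // to drain the queued messages and dispose of the handler; the
  // socket_closed_ flag prevents closing the same handle twice.
  void cleanup ()
  {
    ACE_GUARD (ACE_Thread_Mutex, guard, this->lock_);
    if (!this->socket_closed_)
      {
        this->peer_.close ();
        this->socket_closed_ = true;
      }
    // ... drain output queues, purge cache/reactor entries, delete ...
  }

private:
  ACE_SOCK_Stream peer_;   // the underlying socket
  ACE_Thread_Mutex lock_;  // serializes closing against readers
  bool socket_closed_;
};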
An additional problem that Bala has asked me to log along with this bug is that -ORBFlushingStrategy blocking does not seem to work properly. This is conjecture based on observations of the memory leaks. The memory that is being leaked is the copy of the buffer that the TP_Reactor makes when the socket becomes blocked and the thread is freed up to go service other things. If I set -ORBFlushingStrategy to blocking, I would expect that a buffer copy would NOT occur, since the thread blocks on the socket and is not allowed to return to the pool until the data is successfully sent to the client. Therefore, I should not see the memory leak reported in this bug if I run in that mode, but I do.

I have included a testcase that creates a global buffer (buffer_) that is reused for all requests to the client using:

return new File::Descriptor::DataBuffer (num_bytes, num_bytes, buffer_, 0);

This testcase is helpful in reproducing this bug and the original bug that initiated this report; it seems to produce the bug faster than the original testcase.
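For context, that four-argument form is the standard (maximum, length, buffer, release) constructor the IDL compiler generates for sequences; with release = 0 the sequence merely points at the caller's buffer rather than copying or owning it. A sketch of the pattern (the surrounding function is illustrative; num_bytes and buffer_ are from the testcase):

File::Descriptor::DataBuffer *
make_reply (CORBA::ULong num_bytes, CORBA::Octet *buffer_)
{
  // maximum = num_bytes, length = num_bytes, data = buffer_,
  // release = 0 -> the sequence does NOT own (and will not free)
  // buffer_, so the same global buffer is reused for every request.
  return new File::Descriptor::DataBuffer (num_bytes,
                                           num_bytes,
                                           buffer_,
                                           0);
}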
Created attachment 83: Testcase that creates the problem faster (at least on W2K)
Bugs relating to the 'Memory Hog' issue and '-ORBFlushingStrategy blocking' have been moved from here to bug 1025. This bug should only be updated to reflect changes to the "Deadlock on client after socket failure due to insufficient buffer space" issue. Thanks.
One of the recommendations from an older Microsoft Knowledge Base article (Q201213), for a different but related large-buffer sending problem on NT 4.0, is to: "Set the SO_SNDBUF socket option to 0 (zero) to allow the stack to send from your application buffer directly." This may work for this problem also, but I don't know what the tradeoff of having the OS use the application's buffer directly would be. Offhand, it sounds like a good fit for large buffers and the way the TP_Reactor works. I'm going to modify my local version of TAO_Connection_Handler::set_socket_option to see if this makes any difference.
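For reference, a minimal sketch of setting SO_SNDBUF to zero on an ACE socket. This is a standalone illustration, not the actual TAO_Connection_Handler::set_socket_option change; the function name is made up.

// Sketch: ask the stack to send directly from the application buffer
// (per KB Q201213) by setting the send buffer size to 0.
#include "ace/SOCK_Stream.h"

int
disable_send_buffering (ACE_SOCK_Stream &stream)
{
  int sndbuf = 0;
  // ACE_SOCK::set_option wraps setsockopt(); returns -1 on failure.
  return stream.set_option (SOL_SOCKET,
                            SO_SNDBUF,
                            &sndbuf,
                            sizeof sndbuf);
}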
I tried this: "Set the SO_SNDBUF socket option to 0 (zero) to allow the stack to send from your application buffer directly." and still produced the same "system lacked sufficient resources..." error.

BTW, I had to make a slight modification to the ORB_Core.cpp file to use -ORBSndSock 0. This is because of the check:

if (snd_sock_size != 0)
  this->orb_params ()->sock_sndbuf_size (snd_sock_size);

which should probably be checking for -1, since 0 has a specific meaning, i.e. use the application buffer space and not the OS's. So it's still a mystery to me why the OS freaks out. I'll try and investigate it further.
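Presumably the corrected check would look like this (a sketch; it assumes snd_sock_size is initialized to -1 when the option is not given on the command line):

// Treat -1 (option absent) as "leave the default alone", so that an
// explicit 0 is passed through to the OS.
if (snd_sock_size != -1)
  this->orb_params ()->sock_sndbuf_size (snd_sock_size);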