Bug 1038

Summary: Deadlock on client after socket failure due to insufficient buffer space or because queue was full
Product: TAO
Reporter: Nanbor Wang <reis>
Component: ORB
Assignee: DOC Center Support List (internal) <tao-support>
Status: NEW
Severity: normal
Priority: P3
Version: 1.1.20
Hardware: x86
OS: Windows 2000
Bug Depends on: 1031
Bug Blocks:
Attachments: Deadlock after socket error testcase
             Testcase that creates the problem faster (at least on W2K)

Description Nanbor Wang 2001-09-18 23:51:27 CDT
Our server application deadlocks while processing large octet sequences 
using the TAO_Leader_Follower_Flushing_Strategy. This seems to occur after the 
following message is reported by the ORB with -ORBDebugLevel set to 1:

Sep 18 11:37:51.953 2001@LM_ERROR@TAO: (7036|2380) TAO_ServerRequest::tao_send_r
eply: An operation on a socket could not be performed because the system lacked
sufficient buffer space or because a queue was full. : cannot send reply

This cannot be reproduced with 1.1.17.

I have modified the File_IO testcase and can reproduce the problem fairly 
easily. I will attach this testcase to the bug.

If I set the ORB option to -ORBFlushingStrategy full I cannot reproduce the 
problem.


Stack traces of all the active threads are included below:
NTDLL! ZwWaitForSingleObject@12 + 11 bytes
MSAFD! SockWaitForSingleObject@16 + 306 bytes
MSAFD! WSPSelect@24 + 1021 bytes
WS2_32! select@20 + 203 bytes
ACE_Select_Reactor_T<ACE_Select_Reactor_Token_T<ACE_Token> 
>::wait_for_multiple_events
(ACE_Select_Reactor_T<ACE_Select_Reactor_Token_T<ACE_Token> > * const 
0x00000002, ACE_Select_Reactor_Handle_Set & {...}, ACE_Time_Value * 0x00000000) 
line 1013 + 54 bytes
ACE_TP_Reactor::get_event_for_dispatching(ACE_TP_Reactor * const 0x00000002, 
ACE_Time_Value * 0x00000000) line 462 + 15 bytes
ACE_TP_Reactor::dispatch_i(ACE_TP_Reactor * const 0x00000002, ACE_Time_Value * 
0x00000000, ACE_TP_Token_Guard & {...}) line 152
ACE_TP_Reactor::handle_events(ACE_TP_Reactor * const 0x00000002, ACE_Time_Value 
* 0x00000000) line 134 + 14 bytes
ACE_Reactor::handle_events(ACE_Reactor * const 0x00000002, ACE_Time_Value * 
0x00000000) line 158
TAO_Leader_Follower::wait_for_event(TAO_Leader_Follower * const 0x00000002, 
TAO_LF_Event * 0x0191a9a0, TAO_Transport * 0x01004680, ACE_Time_Value * 
0x00000000) line 380
TAO_Leader_Follower_Flushing_Strategy::flush_message
(TAO_Leader_Follower_Flushing_Strategy * const 0x00000002, TAO_Transport * 
0x01004680, TAO_Queued_Message * 0x0191a9a0, ACE_Time_Value * 0x00000000) line 
32 + 16 bytes
TAO_Transport::send_synchronous_message_i(TAO_Transport * const 0x00000002, 
const ACE_Message_Block * 0x010046a8, ACE_Time_Value * 0x00000000) line 481
TAO_Transport::send_message_i(TAO_Transport * const 0x00000002, TAO_Stub * 
0x00000000, int 1, const ACE_Message_Block * 0x0191b060, ACE_Time_Value * 
0x00000000) line 301 + 13 bytes
TAO_IIOP_Transport::send_message(TAO_IIOP_Transport * const 0x00000002, 
TAO_OutputCDR & {...}, TAO_Stub * 0x00000000, int 1, ACE_Time_Value * 
0x00000000) line 185
TAO_ServerRequest::tao_send_reply(TAO_ServerRequest * const 0x00000002) line 232
TAO_ServantBase::synchronous_upcall_dispatch(TAO_ServantBase * const 
0x00000002, TAO_ServerRequest & {...}, void * 0x0191ab10, void * 0x01b80110, 
CORBA_Environment & {...}) line 262
POA_DTSStudyFile::DTSStudy::_dispatch(POA_DTSStudyFile::DTSStudy * const 
0x00000002, TAO_ServerRequest & {...}, void * 0x0191ab10, CORBA_Environment & 
{...}) line 3550
TAO_Default_Servant_Dispatcher::dispatch(TAO_Default_Servant_Dispatcher * const 
0x00000002, TAO_Object_Adapter::Servant_Upcall & {...}, TAO_ServerRequest & 
{...}, CORBA_Environment & {...}) line 21
TAO_Object_Adapter::dispatch_servant(TAO_Object_Adapter * const 0x00000002, 
const TAO_ObjectKey & {...}, TAO_ServerRequest & {...}, CORBA_Object_out {...}, 
CORBA_Environment & {...}) line 323 + 16 bytes
TAO_Object_Adapter::dispatch(TAO_Object_Adapter * const 0x00000002, 
TAO_ObjectKey & {...}, TAO_ServerRequest & {...}, CORBA_Object_out {...}, 
CORBA_Environment & {...}) line 733
TAO_Adapter_Registry::dispatch(TAO_Adapter_Registry * const 0x00000002, 
TAO_ObjectKey & {...}, TAO_ServerRequest & {...}, CORBA_Object_out {...}, 
CORBA_Environment & {...}) line 120
TAO_GIOP_Message_Base::process_request(TAO_GIOP_Message_Base * const 
0x00000002, TAO_Transport * 0x01004680, TAO_InputCDR & {...}, TAO_OutputCDR & 
{...}) line 778
TAO_GIOP_Message_Base::process_request_message(TAO_GIOP_Message_Base * const 
0x00000002, TAO_Transport * 0x01004680, TAO_Queued_Data * 0x01b4f702) line 594 
+ 21 bytes
TAO_Transport::process_parsed_messages(TAO_Transport * const 0x00000002, 
TAO_Queued_Data * 0x01b4e3f0, TAO_Resume_Handle & {...}) line 1361 + 16 bytes
TAO_Transport::process_queue_head(TAO_Transport * const 0x00000002, 
TAO_Resume_Handle & {...}) line 1549 + 16 bytes
TAO_Transport::handle_input_i(TAO_Transport * const 0x00000002, 
TAO_Resume_Handle & {...}, ACE_Time_Value * 0xff335756, int -1660142279) line 
782
df75e810()
0d05bcbc()


NTDLL! ZwWaitForSingleObject@12 + 11 bytes
KERNEL32! WaitForSingleObjectEx@12 + 90 bytes
KERNEL32! WaitForSingleObject@8 + 15 bytes
ACE_OS::cond_wait(ACE_cond_t * 0x00000000, _RTL_CRITICAL_SECTION * 0x00f78e94) 
line 6313 + 11 bytes
ACE_OS::cond_timedwait(ACE_cond_t * 0x01bd95e4, _RTL_CRITICAL_SECTION * 
0x00f78e94, ACE_Time_Value * 0x00000000) line 6205 + 11 bytes
ACE_Condition_Thread_Mutex::wait(ACE_Condition_Thread_Mutex * const 0x7ffd7000, 
const ACE_Time_Value * 0x00000000) line 642 + 14 bytes
TAO_Leader_Follower::wait_for_event(TAO_Leader_Follower * const 0x7ffd7000, 
TAO_LF_Event * 0x01a1f054, TAO_Transport * 0x01004680, ACE_Time_Value * 
0x00000000) line 245 + 14 bytes
TAO_Leader_Follower_Flushing_Strategy::flush_message
(TAO_Leader_Follower_Flushing_Strategy * const 0x7ffd7000, TAO_Transport * 
0x01004680, TAO_Queued_Message * 0x01a1f054, ACE_Time_Value * 0x00000000) line 
32 + 16 bytes
TAO_Transport::send_synchronous_message_i(TAO_Transport * const 0x7ffd7000, 
const ACE_Message_Block * 0x010046a8, ACE_Time_Value * 0x00000000) line 481
TAO_Transport::send_message_i(TAO_Transport * const 0x7ffd7000, TAO_Stub * 
0x00000000, int 1, const ACE_Message_Block * 0x01a1f714, ACE_Time_Value * 
0x00000000) line 301 + 13 bytes
TAO_IIOP_Transport::send_message(TAO_IIOP_Transport * const 0x7ffd7000, 
TAO_OutputCDR & {...}, TAO_Stub * 0x00000000, int 1, ACE_Time_Value * 
0x00000000) line 185
TAO_ServerRequest::tao_send_reply(TAO_ServerRequest * const 0x7ffd7000) line 232
TAO_ServantBase::synchronous_upcall_dispatch(TAO_ServantBase * const 
0x7ffd7000, TAO_ServerRequest & {...}, void * 0x01a1f1c4, void * 0x01b80110, 
CORBA_Environment & {...}) line 262
POA_DTSStudyFile::DTSStudy::_dispatch(POA_DTSStudyFile::DTSStudy * const 
0x7ffd7000, TAO_ServerRequest & {...}, void * 0x01a1f1c4, CORBA_Environment & 
{...}) line 3550
TAO_Default_Servant_Dispatcher::dispatch(TAO_Default_Servant_Dispatcher * const 
0x7ffd7000, TAO_Object_Adapter::Servant_Upcall & {...}, TAO_ServerRequest & 
{...}, CORBA_Environment & {...}) line 21
TAO_Object_Adapter::dispatch_servant(TAO_Object_Adapter * const 0x7ffd7000, 
const TAO_ObjectKey & {...}, TAO_ServerRequest & {...}, CORBA_Object_out {...}, 
CORBA_Environment & {...}) line 323 + 16 bytes
TAO_Object_Adapter::dispatch(TAO_Object_Adapter * const 0x7ffd7000, 
TAO_ObjectKey & {...}, TAO_ServerRequest & {...}, CORBA_Object_out {...}, 
CORBA_Environment & {...}) line 733
TAO_Adapter_Registry::dispatch(TAO_Adapter_Registry * const 0x7ffd7000, 
TAO_ObjectKey & {...}, TAO_ServerRequest & {...}, CORBA_Object_out {...}, 
CORBA_Environment & {...}) line 120
TAO_GIOP_Message_Base::process_request(TAO_GIOP_Message_Base * const 
0x7ffd7000, TAO_Transport * 0x01004680, TAO_InputCDR & {...}, TAO_OutputCDR & 
{...}) line 778
TAO_GIOP_Message_Base::process_request_message(TAO_GIOP_Message_Base * const 
0x7ffd7000, TAO_Transport * 0x01004680, TAO_Queued_Data * 0x01b4ff02) line 594 
+ 21 bytes
TAO_Transport::process_parsed_messages(TAO_Transport * const 0x7ffd7000, 
TAO_Queued_Data * 0x01d1af08, TAO_Resume_Handle & {...}) line 1361 + 16 bytes
TAO_Transport::process_queue_head(TAO_Transport * const 0x7ffd7000, 
TAO_Resume_Handle & {...}) line 1549 + 16 bytes
TAO_Transport::consolidate_extra_messages(TAO_Transport * const 0x7ffd7000, 
ACE_Message_Block & {...}, TAO_Resume_Handle & {...}) line 1327 + 10 bytes
TAO_Transport::parse_consolidate_messages(TAO_Transport * const 0x7ffd7000, 
ACE_Message_Block & {...}, TAO_Resume_Handle & {...}, ACE_Time_Value * 
0x00000000) line 910 + 12 bytes
TAO_Transport::handle_input_i(TAO_Transport * const 0x7ffd7000, 
TAO_Resume_Handle & {...}, ACE_Time_Value * 0x00000000, int 0) line 859
TAO_IIOP_Connection_Handler::handle_input(TAO_IIOP_Connection_Handler * const 
0x7ffd7000, void * 0x00000434) line 336
ACE_TP_Reactor::dispatch_socket_event(ACE_TP_Reactor * const 0x7ffd7000, 
ACE_EH_Dispatch_Info & {...}) line 569 + 6 bytes
ACE_TP_Reactor::handle_socket_events(ACE_TP_Reactor * const 0x7ffd7000, int & 
0, ACE_TP_Token_Guard & {...}) line 375 + 13 bytes
ACE_TP_Reactor::dispatch_i(ACE_TP_Reactor * const 0x7ffd7000, ACE_Time_Value * 
0x00000000, ACE_TP_Token_Guard & {...}) line 201 + 14 bytes
ACE_TP_Reactor::handle_events(ACE_TP_Reactor * const 0x7ffd7000, ACE_Time_Value 
* 0x00000000) line 134 + 14 bytes
ACE_Reactor::handle_events(ACE_Reactor * const 0x7ffd7000, ACE_Time_Value * 
0x00000000) line 158
TAO_ORB_Core::run(TAO_ORB_Core * const 0x7ffd7000, ACE_Time_Value * 0x00000000, 
int 0, CORBA_Environment & {...}) line 2117
CORBA_ORB::run(CORBA_ORB * const 0x7ffd7000, ACE_Time_Value * 0x00000000, 
CORBA_Environment & {...}) line 248 + 18 bytes
CORBA_ORB::run(CORBA_ORB * const 0x7ffd7000, CORBA_Environment & {...}) line 233
TAO_ORB_Manager::run(TAO_ORB_Manager * const 0x7ffd7000, CORBA_Environment & 
{...}) line 319
DLL_ORB::svc(DLL_ORB * const 0x7ffd7000) line 60
ACE_Task_Base::svc_run(void * 0x00f6cc18) line 211
ACE_Thread_Hook::start(ACE_Thread_Hook * const 0x7ffd7000, void * (void *)* 
0x1005496e ACE_Task_Base::svc_run(void *), void * 0x00f6cc18) line 11 + 8 bytes
Thread_Hook::start(Thread_Hook * const 0x7ffd7000, void * (void *)* 0x1005496e 
ACE_Task_Base::svc_run(void *), void * 0x00f6cc18) line 26
ACE_Thread_Adapter::invoke_i(ACE_Thread_Adapter * const 0x7ffd7000) line 142
ACE_Thread_Adapter::invoke(ACE_Thread_Adapter * const 0x7ffd7000) line 91 + 7 
bytes
MSVCRT! 7800a3c0()
KERNEL32! 


NTDLL! ZwWaitForSingleObject@12 + 11 bytes
KERNEL32! WaitForSingleObjectEx@12 + 90 bytes
KERNEL32! WaitForSingleObject@8 + 15 bytes
ACE_OS::cond_wait(ACE_cond_t * 0x00000000, _RTL_CRITICAL_SECTION * 0x00f78e94) 
line 6313 + 11 bytes
ACE_OS::cond_timedwait(ACE_cond_t * 0x00f741a4, _RTL_CRITICAL_SECTION * 
0x00f78e94, ACE_Time_Value * 0x00000000) line 6205 + 11 bytes
ACE_Condition_Thread_Mutex::wait(ACE_Condition_Thread_Mutex * const 0xffffffff, 
const ACE_Time_Value * 0x00000000) line 642 + 14 bytes
TAO_Leader_Follower::wait_for_event(TAO_Leader_Follower * const 0xffffffff, 
TAO_LF_Event * 0x0171f0a4, TAO_Transport * 0x01004680, ACE_Time_Value * 
0x00000000) line 245 + 14 bytes
TAO_Leader_Follower_Flushing_Strategy::flush_message
(TAO_Leader_Follower_Flushing_Strategy * const 0xffffffff, TAO_Transport * 
0x01004680, TAO_Queued_Message * 0x0171f0a4, ACE_Time_Value * 0x00000000) line 
32 + 16 bytes
TAO_Transport::send_synchronous_message_i(TAO_Transport * const 0xffffffff, 
const ACE_Message_Block * 0x010046a8, ACE_Time_Value * 0x00000000) line 481
TAO_Transport::send_message_i(TAO_Transport * const 0xffffffff, TAO_Stub * 
0x00000000, int 1, const ACE_Message_Block * 0x0171f764, ACE_Time_Value * 
0x00000000) line 301 + 13 bytes
TAO_IIOP_Transport::send_message(TAO_IIOP_Transport * const 0xffffffff, 
TAO_OutputCDR & {...}, TAO_Stub * 0x00000000, int 1, ACE_Time_Value * 
0x00000000) line 185
TAO_ServerRequest::tao_send_reply(TAO_ServerRequest * const 0xffffffff) line 232
TAO_ServantBase::synchronous_upcall_dispatch(TAO_ServantBase * const 
0xffffffff, TAO_ServerRequest & {...}, void * 0x0171f214, void * 0x01b80110, 
CORBA_Environment & {...}) line 262
POA_DTSStudyFile::DTSStudy::_dispatch(POA_DTSStudyFile::DTSStudy * const 
0xffffffff, TAO_ServerRequest & {...}, void * 0x0171f214, CORBA_Environment & 
{...}) line 3550
TAO_Default_Servant_Dispatcher::dispatch(TAO_Default_Servant_Dispatcher * const 
0xffffffff, TAO_Object_Adapter::Servant_Upcall & {...}, TAO_ServerRequest & 
{...}, CORBA_Environment & {...}) line 21
TAO_Object_Adapter::dispatch_servant(TAO_Object_Adapter * const 0xffffffff, 
const TAO_ObjectKey & {...}, TAO_ServerRequest & {...}, CORBA_Object_out {...}, 
CORBA_Environment & {...}) line 323 + 16 bytes
TAO_Object_Adapter::dispatch(TAO_Object_Adapter * const 0xffffffff, 
TAO_ObjectKey & {...}, TAO_ServerRequest & {...}, CORBA_Object_out {...}, 
CORBA_Environment & {...}) line 733
TAO_Adapter_Registry::dispatch(TAO_Adapter_Registry * const 0xffffffff, 
TAO_ObjectKey & {...}, TAO_ServerRequest & {...}, CORBA_Object_out {...}, 
CORBA_Environment & {...}) line 120
TAO_GIOP_Message_Base::process_request(TAO_GIOP_Message_Base * const 
0xffffffff, TAO_Transport * 0x01004680, TAO_InputCDR & {...}, TAO_OutputCDR & 
{...}) line 778
TAO_GIOP_Message_Base::process_request_message(TAO_GIOP_Message_Base * const 
0xffffffff, TAO_Transport * 0x01004680, TAO_Queued_Data * 0x01b4fc02) line 594 
+ 21 bytes
TAO_Transport::process_parsed_messages(TAO_Transport * const 0xffffffff, 
TAO_Queued_Data * 0x01d1adf8, TAO_Resume_Handle & {...}) line 1361 + 16 bytes
TAO_Transport::process_queue_head(TAO_Transport * const 0xffffffff, 
TAO_Resume_Handle & {...}) line 1549 + 16 bytes
TAO_Transport::handle_input_i(TAO_Transport * const 0xffffffff, 
TAO_Resume_Handle & {...}, ACE_Time_Value * 0xff335756, int -1660142279) line 
782
df75e810()
0d05bcbc()


NTDLL! ZwWaitForSingleObject@12 + 11 bytes
KERNEL32! WaitForSingleObjectEx@12 + 90 bytes
KERNEL32! WaitForSingleObject@8 + 15 bytes
ACE_OS::cond_wait(ACE_cond_t * 0x00000000, _RTL_CRITICAL_SECTION * 0x00f78e94) 
line 6313 + 11 bytes
ACE_OS::cond_timedwait(ACE_cond_t * 0x01b80dfc, _RTL_CRITICAL_SECTION * 
0x00f78e94, ACE_Time_Value * 0x00000000) line 6205 + 11 bytes
ACE_Condition_Thread_Mutex::wait(ACE_Condition_Thread_Mutex * const 0x00000316, 
const ACE_Time_Value * 0x00000000) line 642 + 14 bytes
TAO_Leader_Follower::wait_for_event(TAO_Leader_Follower * const 0x00000316, 
TAO_LF_Event * 0x0181f0a4, TAO_Transport * 0x01004680, ACE_Time_Value * 
0x00000000) line 245 + 14 bytes
TAO_Leader_Follower_Flushing_Strategy::flush_message
(TAO_Leader_Follower_Flushing_Strategy * const 0x00000316, TAO_Transport * 
0x01004680, TAO_Queued_Message * 0x0181f0a4, ACE_Time_Value * 0x00000000) line 
32 + 16 bytes
TAO_Transport::send_synchronous_message_i(TAO_Transport * const 0x00000316, 
const ACE_Message_Block * 0x010046a8, ACE_Time_Value * 0x00000000) line 481
TAO_Transport::send_message_i(TAO_Transport * const 0x00000316, TAO_Stub * 
0x00000000, int 1, const ACE_Message_Block * 0x0181f764, ACE_Time_Value * 
0x00000000) line 301 + 13 bytes
TAO_IIOP_Transport::send_message(TAO_IIOP_Transport * const 0x00000316, 
TAO_OutputCDR & {...}, TAO_Stub * 0x00000000, int 1, ACE_Time_Value * 
0x00000000) line 185
TAO_ServerRequest::tao_send_reply(TAO_ServerRequest * const 0x00000316) line 232
TAO_ServantBase::synchronous_upcall_dispatch(TAO_ServantBase * const 
0x00000316, TAO_ServerRequest & {...}, void * 0x0181f214, void * 0x01b80110, 
CORBA_Environment & {...}) line 262
POA_DTSStudyFile::DTSStudy::_dispatch(POA_DTSStudyFile::DTSStudy * const 
0x00000316, TAO_ServerRequest & {...}, void * 0x0181f214, CORBA_Environment & 
{...}) line 3550
TAO_Default_Servant_Dispatcher::dispatch(TAO_Default_Servant_Dispatcher * const 
0x00000316, TAO_Object_Adapter::Servant_Upcall & {...}, TAO_ServerRequest & 
{...}, CORBA_Environment & {...}) line 21
TAO_Object_Adapter::dispatch_servant(TAO_Object_Adapter * const 0x00000316, 
const TAO_ObjectKey & {...}, TAO_ServerRequest & {...}, CORBA_Object_out {...}, 
CORBA_Environment & {...}) line 323 + 16 bytes
TAO_Object_Adapter::dispatch(TAO_Object_Adapter * const 0x00000316, 
TAO_ObjectKey & {...}, TAO_ServerRequest & {...}, CORBA_Object_out {...}, 
CORBA_Environment & {...}) line 733
TAO_Adapter_Registry::dispatch(TAO_Adapter_Registry * const 0x00000316, 
TAO_ObjectKey & {...}, TAO_ServerRequest & {...}, CORBA_Object_out {...}, 
CORBA_Environment & {...}) line 120
TAO_GIOP_Message_Base::process_request(TAO_GIOP_Message_Base * const 
0x00000316, TAO_Transport * 0x01004680, TAO_InputCDR & {...}, TAO_OutputCDR & 
{...}) line 778
TAO_GIOP_Message_Base::process_request_message(TAO_GIOP_Message_Base * const 
0x00000316, TAO_Transport * 0x01004680, TAO_Queued_Data * 0x0100e802) line 594 
+ 21 bytes
TAO_Transport::process_parsed_messages(TAO_Transport * const 0x00000316, 
TAO_Queued_Data * 0x01b66ff8, TAO_Resume_Handle & {...}) line 1361 + 16 bytes
TAO_Transport::process_queue_head(TAO_Transport * const 0x00000316, 
TAO_Resume_Handle & {...}) line 1549 + 16 bytes
TAO_Transport::handle_input_i(TAO_Transport * const 0x00000316, 
TAO_Resume_Handle & {...}, ACE_Time_Value * 0xff335756, int -1660142279) line 
782
df75e810()
0d05bcbc()


NTDLL! ZwWaitForSingleObject@12 + 11 bytes
KERNEL32! WaitForSingleObjectEx@12 + 90 bytes
KERNEL32! WaitForSingleObject@8 + 15 bytes
ACE_OS::cond_wait(ACE_cond_t * 0x00000000, _RTL_CRITICAL_SECTION * 0x00f78e94) 
line 6313 + 11 bytes
ACE_Condition_Thread_Mutex::wait(ACE_Condition_Thread_Mutex * const 0x00151388) 
line 625 + 9 bytes
TAO_Leader_Follower::wait_for_client_leader_to_complete(TAO_Leader_Follower * 
const 0x00151388, ACE_Time_Value * 0x0024d68c) line 79 + 9 bytes
TAO_Leader_Follower::set_event_loop_thread(TAO_Leader_Follower * const 
0x00151388, ACE_Time_Value * 0x00000000) line 61
TAO_LF_Strategy_Complete::set_event_loop_thread(TAO_LF_Strategy_Complete * 
const 0x00151388, ACE_Time_Value * 0xffffffff, TAO_Leader_Follower & {...}) 
line 29 + 14 bytes
0014aa60()
00f86b98()
Comment 1 Nanbor Wang 2001-09-18 23:52:26 CDT
Created attachment 82 [details]
Deadlock after socket error testcase
Comment 2 Nanbor Wang 2001-09-18 23:54:12 CDT
I can reproduce the problem in release mode only, with the following 
client/server settings:

client -b 5000000 -k server.ior
server -o server.ior

The client runs 10 threads; the server runs 20 ORB threads.
Comment 3 Nanbor Wang 2001-09-18 23:58:18 CDT
I forgot to mention that the first stack trace is of the thread that reported 
the insufficient buffer space socket error.
Comment 4 Nanbor Wang 2001-09-19 07:03:04 CDT
"If I set the orb option to -ORBFlushingStrategy full I cannot reproduce the 
problem." -- Jon, are you sure that your above observation is right? AFAIK & 
AFAICS, I dont see any "full" flushing strategy. I see 
only "leader_follower", "reactive" and "blocking". If you try to set it to a 
wrong value it should use the default, which is a "leader_follower". I am a bit 
confused on this.  

Moreover, I guess that the system has closed the connection on which the error 
occured. The reply was not sent back and the client thread who sent the request 
still tries to wait for its reply. I am just guessing as I have seen this 
before. Can you please try to take a peek at your client side? Are you able to 
see a leader thread in wait_for_event () in the Leader_Follower?  Please let us 
know.
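
For reference, the flushing strategy is selected with a command-line ORB option 
using one of the values listed above. A hypothetical invocation of the testcase 
server (same IOR file used elsewhere in this report) would look like:

server -ORBFlushingStrategy blocking -o server.ior
server -ORBFlushingStrategy leader_follower -o server.ior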
Comment 5 Nanbor Wang 2001-09-20 16:51:44 CDT
Hi Bala, sorry it has taken me this long to respond, but our network has been 
down due to a nasty virus.

You are right, I had the wrong parameter set for -ORBFlushingStrategy; sorry 
about that.

I am still seeing a problem, however, with the testcase I submitted. It seems 
to only cause problems on our faster machines. Interestingly, it is easier to 
reproduce if the number of ORB threads in the server.cpp file is changed from 
5 to 20.

I would appreciate it if you could try the testcase out and run it with the 
server having 20 threads and the client buffer request size set to 10 MB, i.e.:
client -k server.ior -b 10000000

The client thread stacks look like this:

One thread in select
++++++++++++++++++++
NTDLL! ZwWaitForSingleObject@12 + 11 bytes
MSAFD! SockWaitForSingleObject@16 + 306 bytes
MSAFD! WSPSelect@24 + 1021 bytes
WS2_32! select@20 + 203 bytes
ACE_Select_Reactor_T<ACE_Select_Reactor_Token_T<ACE_Token> 
>::wait_for_multiple_events
(ACE_Select_Reactor_T<ACE_Select_Reactor_Token_T<ACE_Token> > * const 
0x00000210, ACE_Select_Reactor_Handle_Set & {...}, ACE_Time_Value * 0x00000000) 
line 1013 + 54 bytes
ACE_TP_Reactor::get_event_for_dispatching(ACE_TP_Reactor * const 0x00000210, 
ACE_Time_Value * 0x00000000) line 462 + 15 bytes
ACE_TP_Reactor::dispatch_i(ACE_TP_Reactor * const 0x00000210, ACE_Time_Value * 
0x00000000, ACE_TP_Token_Guard & {...}) line 152
ACE_TP_Reactor::handle_events(ACE_TP_Reactor * const 0x00000210, ACE_Time_Value 
* 0x00000000) line 134 + 14 bytes
ACE_Reactor::handle_events(ACE_Reactor * const 0x00000210, ACE_Time_Value * 
0x00000000) line 158
TAO_Leader_Follower::wait_for_event(TAO_Leader_Follower * const 0x00000210, 
TAO_LF_Event * 0x0132fac8, TAO_Transport * 0x009d29b0, ACE_Time_Value * 
0x00000000) line 380
TAO_Wait_On_Leader_Follower::wait(TAO_Wait_On_Leader_Follower * const 
0x00000210, ACE_Time_Value * 0x00000000, TAO_Synch_Reply_Dispatcher & {...}) 
line 57 + 30 bytes
TAO_GIOP_Synch_Invocation::invoke_i(TAO_GIOP_Synch_Invocation * const 
0x00000210, unsigned char 0, CORBA_Environment & {...}) line 677 + 135 bytes
TAO_GIOP_Twoway_Invocation::invoke(TAO_GIOP_Twoway_Invocation * const 
0x00000210, TAO_Exception_Data * 0x00413250, unsigned int 1, CORBA_Environment 
& {...}) line 825
CLIENT! 00402e62()
CLIENT! 00403f53()
ACE_Thread_Adapter::invoke_i(ACE_Thread_Adapter * const 0x00000210) line 148 + 
3 bytes
ACE_Thread_Adapter::invoke(ACE_Thread_Adapter * const 0x00000210) line 91 + 7 
bytes
MSVCRT! 7800a3c0()
KERNEL

4 Other threads in wait_for_event
+++++++++++++++++++++++++++++++++
NTDLL! ZwWaitForSingleObject@12 + 11 bytes
KERNEL32! WaitForSingleObjectEx@12 + 90 bytes
KERNEL32! WaitForSingleObject@8 + 15 bytes
ACE_OS::cond_wait(ACE_cond_t * 0x00000000, _RTL_CRITICAL_SECTION * 0x00992c6c) 
line 6313 + 11 bytes
ACE_OS::cond_timedwait(ACE_cond_t * 0x009cbe6c, _RTL_CRITICAL_SECTION * 
0x00992c6c, ACE_Time_Value * 0x00000000) line 6205 + 11 bytes
ACE_Condition_Thread_Mutex::wait(ACE_Condition_Thread_Mutex * const 0x00000001, 
const ACE_Time_Value * 0x00000000) line 642 + 14 bytes
TAO_Leader_Follower::wait_for_event(TAO_Leader_Follower * const 0x00000001, 
TAO_LF_Event * 0x0112fac8, TAO_Transport * 0x009c7968, ACE_Time_Value * 
0x00000000) line 245 + 14 bytes
TAO_Wait_On_Leader_Follower::wait(TAO_Wait_On_Leader_Follower * const 
0x00000001, ACE_Time_Value * 0x00000000, TAO_Synch_Reply_Dispatcher & {...}) 
line 57 + 30 bytes
TAO_GIOP_Synch_Invocation::invoke_i(TAO_GIOP_Synch_Invocation * const 
0x00000001, unsigned char 0, CORBA_Environment & {...}) line 677 + 135 bytes
TAO_GIOP_Twoway_Invocation::invoke(TAO_GIOP_Twoway_Invocation * const 
0x00000001, TAO_Exception_Data * 0x00413250, unsigned int 1, CORBA_Environment 
& {...}) line 825
CLIENT! 00402e62()
CLIENT! 00403f53()
ACE_Thread_Adapter::invoke_i(ACE_Thread_Adapter * const 0x00000001) line 148 + 
3 bytes
ACE_Thread_Adapter::invoke(ACE_Thread_Adapter * const 0x00000001) line 91 + 7 
bytes

Comment 6 Nanbor Wang 2001-09-20 18:33:27 CDT
Hi Bala, if you have time to look at the testcase, could you also please 
monitor the memory usage? If you Ctrl-C the client several times during the 
transfer of a large buffer, the memory on the server seems to grow without 
bound.
Comment 7 Nanbor Wang 2001-09-21 13:50:50 CDT
FWIW, I haven't been able to reproduce this on a fast Linux box here. Just FYI.

Comment 8 Nanbor Wang 2001-09-24 17:23:21 CDT
Hi Bala,
	the patch below does not fix the problem. Please go ahead and defer the 
bug.

As for the memory growth, it appears that if I increase the number of threads 
on the server to 20, and run 4 clients all accessing 10 MB buffers, the server 
quickly grows to 250 MB and spends most of its time swapping virtual memory in 
and out. It seems that you are keeping buffers around on a thread-specific 
basis and never clearing these buffers out after the transfer completes. 
Probably the next time that thread is used, the buffer is freed and replaced 
with a new buffer, so cumulative growth is not seen. Is this possible?

Anyway, neither bug is a show-stopper. We can break up our large buffer 
requests into smaller ones if need be.


Jonathan 

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Patch supplied by Bala
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
int
TAO_IIOP_Transport::send_message (TAO_OutputCDR &stream,
                                  TAO_Stub *stub,
                                  int twoway,
                                  ACE_Time_Value *max_wait_time)
{
  // Format the message in the stream first
  if (this->messaging_object_->format_message (stream) != 0)
    return -1;

  // This guarantees to send all data (bytes) or return an error.
  ssize_t n = this->send_message_i (stub,
                                    twoway,
                                    stream.begin (),
                                    max_wait_time);

  if (n == -1)
    {
      if (TAO_debug_level)
        ACE_DEBUG ((LM_DEBUG,
                    ACE_TEXT ("TAO: (%P|%t|%N|%l) closing transport %d after 
fault %p\n"),
                    this->id (),
                    ACE_TEXT ("send_message ()\n")));

      this->tms_->connection_closed ();

      return -1;
    }

  return 1;
}
Comment 9 Nanbor Wang 2001-09-24 17:50:04 CDT
I have been able to reproduce the memory growth problem. I have a rough idea of
where the other problem that Jon actually reported could be stemming from. I am
accepting the bug.
Comment 10 Nanbor Wang 2001-09-25 16:00:28 CDT
Assigning the bug to me
Comment 11 Nanbor Wang 2001-09-25 16:01:15 CDT
Accepting the bug.
Comment 12 Nanbor Wang 2001-09-26 09:19:27 CDT
Here is my analysis of why there is a memory hog. Below is an extract from my
email to folks like Chris (C), Chad and Carlos. 

bala@cs.wustl.edu writes:
--------------------
I think we have a semi-serious problem on hand. This gives rise to
memory growth problems and it can be serious. This can also lead to
weird behaviours which I am not ready to think about.

Here is how it occurs. Imagine a server with two threads and a
client with, say, 10 threads. The server, as part of the reply, needs to
send large data, say 10 MB. Below is the sequence of steps:

1. The connections are muxed. So there may be more than one message in
   a single read for a thread

2. As a result of the above we send notifies to the reactor, which
   unblocks the next thread and sends it to the same handler. 

3. At step #2 the number of upcalls on the handler is 3. As there are
   two threads in the handler this is fine. 

4. The first thread tries to send a 10 MB reply. It is not able to, and
   so schedules output and waits on the reactor.

5. The same happens for thread 2. Note that at this point the number of
   upcalls hasn't dropped, as handle_input () has not completed for
   either thread.

6. As the threads have gone to the reactor, the reactor can
   potentially wake them up and send them to the same handler, as there
   could be more events.

7. Now the upcall count goes to 5 and both threads go to the reactor
   again.

8. After some time, if the client crashes, we just call handle_close (),
   which decrements the upcall count by 1. If the upcall count is not
   0 we don't do anything. We expect all the threads (only 2 in our
   case) to unwind and decrement the upcall count.

9. That just doesn't happen, and there is a leak. Worse still -- the
   transports are still in the cache.

10. The above could happen for one or more connections in the above setup.

11. The worst part would start occurring if the servers were really
    making remote calls back to the client. They could try to call
    clients that are not available and possibly crash.

I have been able to reproduce everything I have described (steps 1-10). Any
ideas? The memory hog comes from the lingering transports, the messages queued
on them, and their own memory. 

-------End mail-------------------------------------
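
To make steps 3-9 above concrete, here is a minimal C++ sketch of the kind of 
pending-upcall counting being described. The class and member names are 
illustrative only and are not TAO's real connection-handler code; the point is 
simply that handle_close () performs one decrement, and if the threads parked 
in the reactor never unwind to perform theirs, the count never reaches zero and 
the handler (with its queued messages) is never released.

// Illustrative sketch only -- not TAO's actual connection-handler code.
class Connection_Handler
{
public:
  // One reference is held for the reactor registration itself.
  Connection_Handler (void) : pending_upcalls_ (1) {}

  void handle_input (void)
  {
    ++this->pending_upcalls_;        // a thread has entered the handler
    // ... read and dispatch; while flushing a large reply the thread can
    // park in the reactor and never reach the line below ...
    this->decr_pending_upcalls ();   // only runs if the thread unwinds
  }

  void handle_close (void)
  {
    // Client crash: called exactly once, so the count drops by one.
    this->decr_pending_upcalls ();
  }

private:
  void decr_pending_upcalls (void)
  {
    // The handler, its transport and the queued messages are released
    // only when the count reaches zero.  If the other threads never
    // unwind, it never does -- the leak described in step 9.
    if (--this->pending_upcalls_ == 0)
      delete this;
  }

  int pending_upcalls_;   // assume external locking; omitted for brevity
};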
I don't think we have time before 1.2/1.2.1 to fix these. We need to look into
them once we start working on connections again, maybe when GIOP 1.3 comes out
or when we work on the reactor or anything connected to this area.

Comment 13 Nanbor Wang 2001-09-26 10:22:12 CDT
Hi Bala,
       is setting the -ORBFlushingStrategy to 'blocking' a workaround for the 
memory leak? If so, what behavior difference would I see from my server? 

Jonathan
Comment 14 Nanbor Wang 2001-09-26 12:49:37 CDT
I have been able to detect the source of the deadlock problem that Jon reported
(though I haven't been able to reproduce it). Here is the text from a few emails
that I sent to Chris (C), Chad & Carlos.

-------------- Emails -------------------------
Hi,
The following piece of code smells foul to me:

void
TAO_ServerRequest::tao_send_reply (void)
{
  int result = this->transport_->send_message (*this->outgoing_);
  if (result == -1)
    {
      if (TAO_debug_level > 0)
        {
          // No exception but some kind of error, yet a response
          // is required.
          ACE_ERROR ((LM_ERROR,
                      ACE_TEXT ("TAO: (%P|%t) %p: cannot send reply\n"),
                      ACE_TEXT ("TAO_ServerRequest::tao_send_reply")));

        }
    }
}

The variations of this method, like the other tao_send_* calls, also smell
fishy. The problem is that if the send fails we don't do anything. The client
thread that is waiting for the reply keeps waiting for it. This could lead to
a deadlock for a TAO client, as it would still be expecting a reply.

IMHO, this is bad. 

------------- In another email that followed  I wrote -----------------
Looks like doing the right thing may not be that easy. (else I would have done
that by now :-)). Ok, here is the trail of thought on this and I need you guys
to let me know whether I am  thinking in the wrong direction.  

What we need to do above is to close the connection, i.e. the socket
should be closed, at the very least, so that the client's thread is
not left waiting for a reply. But it looks like that is not easy. We have to
clean up the cache and also remove the entry from the reactor
map. We need to call remove_handler () to get it off the map. But
the handle_close () that would be called will have zero effect, as the
connection handler may not be closed (there could have been multiple
threads that passed through the handler). So we want the connection
handler and transport to stay alive (so that all the threads can exit
safely), but the socket closed (so that the client can detect
connection closure), and we also need to make sure that we don't close the
socket again when we delete the connection handler (basically we need to
split the closing of the underlying socket from the deletion of the
connection handler object, which seems to go against the abstraction
of the Event/Svc Handler). Added to this, we need to make sure that we
lock around the closing of the socket, as we don't want another thread
reading from it while it is being closed. We then need to drain the queues
in the transport when all the threads exit so that we prevent any
memory hogs (this is being done now and so is not a problem).

Let me know where my understanding of this is wrong, or if the demands
that I have placed are all wrong.

------End emails -----------------------------------------
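
A rough sketch of the split described in the second email above, i.e. closing 
the underlying socket without destroying the connection handler, so that the 
waiting client sees the connection drop while in-flight server threads can 
still unwind safely. The class and method names are illustrative, not TAO's 
real API, and the locking follows the "lock the closing of the socket" point 
from the email:

// Illustrative only: separate "close the socket" from "destroy the handler".
class Connection_Handler
{
public:
  // Called when a send fails: close the socket so the waiting client
  // sees the connection drop, but keep this object alive for the
  // threads still executing inside it.
  void close_socket (void)
  {
    ACE_GUARD (ACE_Thread_Mutex, guard, this->lock_);  // no reads while closing
    if (this->handle_ != ACE_INVALID_HANDLE)
      {
        ACE_OS::closesocket (this->handle_);
        this->handle_ = ACE_INVALID_HANDLE;            // remember it is closed
      }
  }

  // Called when the last thread unwinds: drain the queued messages and
  // delete the handler, without closing the socket a second time.
  void destroy (void)
  {
    this->close_socket ();   // no-op if the socket is already closed
    // ... drain the transport's message queue here ...
    delete this;
  }

private:
  ACE_HANDLE handle_;
  ACE_Thread_Mutex lock_;
};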

The end result is that we do have a problem and we need to decide on a
corrective action. When and how is still open.


I am adding a dependency on 1031. Maybe we should see whether we can cover this
in 1031.
Comment 15 Nanbor Wang 2001-09-26 16:09:49 CDT
An additional problem that Bala has asked me to log along with this bug is 
that -ORBFlushingStrategy blocking does not seem to work properly. This is 
conjecture based on observations of the memory leaks. 

The memory that is being leaked is the copy of the buffer that the TP_Reactor 
makes when the socket becomes blocked and the thread is freed up to go 
service other things. If I set -ORBFlushingStrategy to blocking, I would 
expect that a buffer copy would NOT occur, since the thread blocks on the 
socket and is not allowed to return to the pool until the data is successfully 
sent to the client. Therefore, I should not see the memory leak reported in 
this bug if I run in that mode, but I do.

I have included a testcase that creates a global buffer (buffer_) that is 
reused for all requests to the client, using: 

// Arguments follow the standard CORBA sequence constructor
// (maximum, length, data, release); release = 0 means the sequence does
// not take ownership of buffer_, so one global buffer is shared by all
// requests.
return new File::Descriptor::DataBuffer (num_bytes,
                                         num_bytes,
                                         buffer_,
                                         0);

This testcase is helpful in reproducing this bug and the original bug that 
initiated this bug report. The new testcase seems to produce the bug faster 
than the original one.
Comment 16 Nanbor Wang 2001-09-26 16:10:39 CDT
Created attachment 83 [details]
Testcase that creates the problem faster (at least on W2K)
Comment 17 Nanbor Wang 2001-09-26 16:34:10 CDT
Bugs relating to the 'Memory Hog' issue and '-ORBFlushingStrategy blocking' 
have been moved from here to bug 1025. This bug should only be updated to 
reflect changes to the "Deadlock on client after socket failure due to 
insufficient buffer space" issue. Thanks.
Comment 18 Nanbor Wang 2001-09-27 04:40:57 CDT
One of the recommendations from an older Microsoft Knowledge Base article 
(Q201213), for a different but related large-buffer sending problem on NT 4.0, 
is to: "Set the SO_SNDBUF socket option to 0 (zero) to allow the stack to 
send from your application buffer directly." 

This may work for this problem also, but I don't know what the tradeoff of 
having the OS use the application's buffer directly would be. Offhand, it 
sounds like a good match for large buffers and the way the TP_Reactor works.

I'm going to modify my local version of 
TAO_Connection_Handler::set_socket_option to see if this makes any difference.
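
For what it's worth, a minimal sketch of that kind of local change, assuming a 
spot where the connected ACE_SOCK_Stream for the connection (called peer here) 
is available; the actual TAO method and member names are not reproduced, this 
only shows the SO_SNDBUF = 0 call the KB article suggests:

// Sketch only: force SO_SNDBUF to 0 so the stack sends directly from the
// application buffer (MS KB Q201213 suggestion). "peer" is assumed to be
// the connected ACE_SOCK_Stream for this connection.
int sndbuf = 0;
if (peer.set_option (SOL_SOCKET,
                     SO_SNDBUF,
                     &sndbuf,
                     sizeof (sndbuf)) == -1)
  ACE_ERROR ((LM_ERROR,
              ACE_TEXT ("(%P|%t) cannot set SO_SNDBUF to 0\n")));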
Comment 19 Nanbor Wang 2001-09-27 10:51:32 CDT
I tried this:

"Set the SO_SNDBUF socket option to 0 (zero) to allow the stack to 
send from your application buffer directly." 

and still produced the same "system lacked sufficient resources..." error.

BTW, I had to make a slight modification to the ORB_Core.cpp file to use 
-ORBSndSock 0. This is because of the check:

  if (snd_sock_size != 0)
    this->orb_params ()->sock_sndbuf_size (snd_sock_size);

which should probably be checking for -1, since 0 has a specific meaning, 
i.e. use the application buffer space and not the OS's.
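
A sketch of the change being suggested (untested, and assuming the "not set" 
default for snd_sock_size is -1, which the quoted code does not confirm):

  // Treat 0 as a legitimate value (use the application buffer directly)
  // and only skip the call when the option was not given at all.
  if (snd_sock_size != -1)
    this->orb_params ()->sock_sndbuf_size (snd_sock_size);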

So it's still a mystery to me why the OS freaks out. I'll try to investigate it 
further.

Comment 20 Johnny Willemsen 2007-09-20 03:08:58 CDT
to pool