Bug 1361

Summary: TP_Reactor -- Closed handle resumed leading to crashes
Product: ACE
Reporter: Nanbor Wang <bala>
Component: ACE Core
Assignee: DOC Center Support List (internal) <tao-support>
Status: RESOLVED FIXED
Severity: normal
CC: steve.vranyes
Priority: P3
Version: 5.2.5
Hardware: All
OS: All
Bug Depends on:
Bug Blocks: 1277
Attachments: Patches

Description Nanbor Wang 2002-11-12 06:23:21 CST
This one took 2.5 days, which gave me boils in my eyes (I mean literally :-)).

Anyway, here is the story. The TP_Reactor suspends a handle before dispatching 
it. Once the upcall returns, it resumes the handle. An application like TAO can 
instead take responsibility for resuming the handle itself.

During dispatch, if a thread detects anything wrong, it returns -1 to the 
reactor. The reactor invokes remove_handler () on itself, effectively removing 
the handle from its internal map. During remove_handler () the event handler 
associated with the handle can be destroyed and the handle closed. All of this 
is fine, standard C++NPv2 stuff.
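The protocol described above can be sketched as a simplified model (the class 
and member names here are illustrative, not the actual ACE_TP_Reactor code):

```cpp
#include <cassert>
#include <map>

// Simplified model of the TP_Reactor dispatch protocol: suspend the handle
// before the upcall, resume it afterwards unless the upcall asked for removal.
struct EventHandler {
    // Return -1 to make the reactor remove (and possibly destroy) us.
    virtual int handle_input(int handle) = 0;
    virtual ~EventHandler() {}
};

struct MiniReactor {
    std::map<int, EventHandler*> handlers;   // handle -> registered handler
    std::map<int, bool> suspended;           // handle -> suspended?

    void dispatch(int handle) {
        EventHandler* eh = handlers[handle];
        suspended[handle] = true;            // suspend so other threads skip it
        if (eh->handle_input(handle) == -1) {
            // remove_handler (): the handler may be destroyed, handle closed
            handlers.erase(handle);
            suspended.erase(handle);
            delete eh;
            return;                          // nothing left to resume
        }
        suspended[handle] = false;           // resume after the upcall returns
    }
};
```

In the real reactor the resumption can also be deferred to the application, as 
TAO does; the point here is only the suspend/dispatch/resume ordering.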

Now the crux of the problem. Sometimes, after dispatching, closing the handle, 
and deleting the handler, the suspended handle got resumed anyway, causing bad 
crashes. Here is the sequence of steps that occurs when I tested this with TAO 
in a heavily multi-threaded scenario:

- A handler is destroyed and its handle closed.
- As soon as the handle is closed, the kernel allocates it to another thread 
trying to establish a connection to a remote host. Say thread 1 closed handle 
10 while thread 2 is trying to establish a connection; the kernel allocates 
handle 10 to thread 2.
- Thread 2 registers handle 10 with the reactor, and the reactor dispatches a 
thread to the handler for handle 10. Naturally, handle 10 is suspended.
- At this point, thread 1 wakes up. It takes it upon itself to resume the 
handle, and does so. Everything goes wrong from here.
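The key kernel behavior behind this race is that a just-closed descriptor is 
immediately handed to the next caller. A tiny hypothetical allocator (not real 
kernel or ACE code) models it:

```cpp
#include <cassert>
#include <set>

// Tiny model of POSIX descriptor allocation: the kernel hands out the
// lowest-numbered free descriptor, so a handle closed by thread 1 can be
// returned to thread 2's connect() immediately afterwards.
struct HandleTable {
    std::set<int> free_handles;

    void close_fd(int h) { free_handles.insert(h); }

    int alloc_fd() {                         // lowest free handle wins
        int h = *free_handles.begin();
        free_handles.erase(free_handles.begin());
        return h;
    }
};
```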

Yes, we have checks that should prevent thread 1 from resuming a handle 
registered by someone else. What are those checks?

- We just check whether the handler used by thread 1 is the same as the one it 
is about to resume, i.e. we do a find () on the internal map, take the event 
handler, and do a pointer comparison. This works for most cases in TAO.
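That check amounts to something like the following sketch (the names are 
illustrative, not the actual reactor code):

```cpp
#include <cassert>
#include <map>

struct EventHandler { virtual ~EventHandler() {} };

typedef std::map<int, EventHandler*> HandlerMap;

// Thread 1 resumes only if the handler it dispatched is still the one
// registered for the handle. A pointer comparison is all we have.
bool safe_to_resume(const HandlerMap& reg, int handle,
                    EventHandler* dispatched) {
    HandlerMap::const_iterator i = reg.find(handle);
    return i != reg.end() && i->second == dispatched;
}
```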

BUT, what if the application registers the same handler for every handle? ;-). 
That is exactly what TAO does for nonblocking connects: we register the same 
Connector for every handle on which TAO waits for a connection. In fact, this 
is how ACE's nonblocking connects work. The Connector maintains an internal 
map of associations between the actual event handlers and the handles (AST's). 
On connection completion, the actual handler is registered with the Reactor 
for the handle on which the connect completed.
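This is why the pointer check is defeated. A minimal demonstration 
(hypothetical names; it merely models the shared-Connector registration 
pattern, not the ACE_Connector itself):

```cpp
#include <cassert>
#include <map>

struct EventHandler { virtual ~EventHandler() {} };

// One Connector instance is registered for *every* pending nonblocking
// connect, so two unrelated registrations of the same handle number point
// at the same object.
struct Connector : EventHandler {};

// Simulate the race: thread 1 dispatched `c` on handle 10, the handle was
// closed and reused, and thread 2 registered the same `c` for its new
// connect. Thread 1's stale pointer check still matches.
bool stale_check_passes(std::map<int, EventHandler*>& reg, Connector& c) {
    reg[10] = &c;                            // thread 2's fresh registration
    EventHandler* dispatched = &c;           // what thread 1 remembered
    std::map<int, EventHandler*>::iterator i = reg.find(10);
    return i != reg.end() && i->second == dispatched;
}
```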

When the above situation occurs in a multi-threaded scenario, two threads see 
the handle and both try dispatching to it (and closing its handler), leading 
to crashes. The stack traces were of the following nature:

--------------------- Cut Here-----------------------------------
#0  0x40bfc781 in kill () from /lib/libc.so.6
#1  0x40b4be5e in pthread_kill () from /lib/libpthread.so.0
#2  0x40b4c339 in raise () from /lib/libpthread.so.0
#3  0x40bfdbe1 in abort () from /lib/libc.so.6
#4  0x404b5258 in ACE_OS::abort ()
    at /project/charanga/bala/ACE_wrappers/ace/OS.i:1045
#5  0x4098af0a in ACE_Log_Msg::log (this=0x80c6c70, format_str=0x406371da "",
    log_priority=LM_ERROR, argp=0xbefff610) at Log_Msg.cpp:1625
#6  0x40989cf7 in ACE_Log_Msg::log (this=0x80c6c70, log_priority=LM_ERROR,
    format_str=0x406371a0 "ACE_ASSERT: file %N, line %l assertion failed for '%
s'.%a\n") at Log_Msg.cpp:862
#7  0x404be6e5 in TAO_Connection_Handler::close_connection_eh (this=0x81056e0,
    eh=0x8105680) at Connection_Handler.cpp:386
#8  0x404e0098 in TAO_IIOP_Connection_Handler::close_connection (
    this=0x8105680) at IIOP_Connection_Handler.cpp:201
#9  0x404bdde0 in TAO_Connection_Handler::handle_close_eh (this=0x81056e0,
    handle=-1, reactor_mask=511, eh=0x8105680) at Connection_Handler.cpp:216
#10 0x404e0147 in TAO_IIOP_Connection_Handler::handle_close (this=0x8105680,
    handle=-1, rm=511) at IIOP_Connection_Handler.cpp:220
#11 0x404d00d5 in ACE_Svc_Handler<ACE_SOCK_Stream, ACE_NULL_SYNCH>::close (
    this=0x8105680)
    at /project/charanga/bala/ACE_wrappers/ace/Svc_Handler.cpp:309
#12 0x404d4d9c in ACE_Connector<TAO_IIOP_Connection_Handler, ACE_SOCK_Connector>
::handle_output (this=0x80e227c, handle=10)
    at /project/charanga/bala/ACE_wrappers/ace/Connector.cpp:365
#13 0x409d6ab3 in ACE_TP_Reactor::dispatch_socket_event (this=0x80addd8,
    dispatch_info=@0xbefff83c) at TP_Reactor.cpp:796
#14 0x409d62e9 in ACE_TP_Reactor::handle_socket_events (this=0x80addd8,
    event_count=@0xbefff880, guard=@0xbefff8bc) at TP_Reactor.cpp:569
#15 0x409d5fb9 in ACE_TP_Reactor::dispatch_i (this=0x80addd8,
    max_wait_time=0x0, guard=@0xbefff8bc) at TP_Reactor.cpp:397
#16 0x409d5475 in ACE_TP_Reactor::handle_events (this=0x80addd8,
    max_wait_time=0x0) at TP_Reactor.cpp:174
#17 0x409d10cb in ACE_Reactor::handle_events (this=0x80addc8,
    max_wait_time=0x0) at /project/charanga/bala/ACE_wrappers/ace/Reactor.i:172
#18 0x405a5a34 in TAO_ORB_Core::run (this=0x809d2c0, tv=0x0, perform_work=0)
    at ORB_Core.cpp:1731
#19 0x40530618 in CORBA_ORB::run (this=0x80af5b0, tv=0x0) at ORB.cpp:237
#20 0x4053059e in CORBA_ORB::run (this=0x80af5b0) at ORB.cpp:221
#21 0x0805e1b0 in ORB_Task::svc (this=0xbffff668) at ORB_Task.cpp:24
#22 0x40a312df in ACE_Task_Base::svc_run (args=0xbffff668) at Task.cpp:203
#23 0x409a979c in ACE_Thread_Adapter::invoke_i (this=0x80c6c40)
    at Thread_Adapter.cpp:150
#24 0x409a968e in ACE_Thread_Adapter::invoke (this=0x80c6c40)
    at Thread_Adapter.cpp:93
#25 0x4094ec06 in ace_thread_adapter (args=0x80c6c40)
#26 0x40b490ba in pthread_start_thread () from /lib/libpthread.so.0

--------------------- Cut Here -------------------------------------
The crashes usually stemmed from Connector.cpp. 

We have a test case, provided by Steve Vranyes of Veritas, that reproduces the 
problem with unerring accuracy. The problem is consistently reproducible only 
after applying the patches from bug 1020. 

I have a patch that prevents thread 1 from doing the resumption. It may not be 
a good solution, but it was a quick hack that let me verify the actual 
problem. It will be attached shortly.
Comment 1 Nanbor Wang 2002-11-12 06:49:24 CST
Created attachment 161 [details]
Patches
Comment 2 Nanbor Wang 2002-11-12 06:55:49 CST
Patches in 
http://deuce.doc.wustl.edu/bugzilla/showattachment.cgi?attach_id=161
help TAO a lot. ACE should also benefit. But the patches will choke on the 
following ACE use case:

- A dispatch, say handle_input (), is called on a handler.
- The call returned -1, and remove_handler () is expected to remove only the 
READ_MASK for that handle without destroying the handler.

In that case the handle will not be resumed. I am not sure how to come up with 
a cleaner fix. 
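The problematic case can be modeled with illustrative mask constants (these 
names are hypothetical, not the real ACE API):

```cpp
#include <cassert>

// Illustrative model of partial mask removal after a failed upcall.
enum { READ_MASK = 1, WRITE_MASK = 2 };

struct Registration {
    int  masks;      // event types still wanted
    bool suspended;  // set before the upcall was dispatched
};

// remove_handler (handle, READ_MASK): only the read mask is stripped; the
// handler stays registered for writes. With the patch applied, the reactor
// skips resumption after remove_handler (), so the entry stays suspended
// and write events are never delivered again.
void remove_read_mask(Registration& r) {
    r.masks &= ~READ_MASK;
    // resumption intentionally skipped; r.suspended stays true
}
```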

For TAO this isn't a problem, since remove_handler () invariably removes all 
the masks from the reactor. Further, TAO resumes the handle on its own, and 
hence this problem wouldn't show up.
Comment 3 Nanbor Wang 2002-11-12 07:51:59 CST
Forgot to add this: the above patches will not help an ACE application if the 
application decides to call remove_handler () from another thread. 
Comment 4 Nanbor Wang 2002-11-12 14:27:08 CST
Added the patch to the main trunk. Here is the relevant ChangeLog entry:

Tue Nov 12 14:12:59 2002  Balachandran Natarajan 
<bala@isis-server.isis.vanderbilt.edu>

I am leaving the bug wide open for future reference.
Comment 5 Nanbor Wang 2002-11-12 14:27:31 CST
Accept it for tao-support
Comment 6 Nanbor Wang 2002-11-30 22:40:28 CST
This should now hopefully be fixed. Here is the relevant ChangeLog entry:

Wed Nov 27 22:38:15 2002  Balachandran Natarajan  <bala@isis-server.isis.vanderbilt.edu>