Please report new issues athttps://github.com/DOCGroup
This one just took 2.5 days which gave me boils in my eyes (I mean literally :- )). Anyway, here is the story. TP_Reactor suspends handle before dispatching. Once the upcall returns it resumes the handle. Application like TAO, takes the responsibility of resuming the handle. During dispatch, if a thread detects anything wrong, it returns a -1 to the Reactor. The Reactor invokes a remove_handler () on itself, effectively removing the handle from its internal map. During the process of remove_hander () the event handler associated with it could be destroyed and the handle closed. All this is fine and C++NPV2 stuff. The crux of the problem. Sometimes after dispatching, closing handles and deleting the handlers, the suspended handler looked like it got resumed and created bad crashes. Here is the sequence of steps that occurs when I tested it with TAO in a heavily multi-threaded scenario - A handler is destroyed and the handle closed. - As soon as the handle is closed, the handle is allocated to another thread trying to establish connection to a remote host. Say thread 1, closed handle 10 and thread 2 is trying to establish connection with a remote host. Handle 10, is allocated to thread 2 by the kernel. - Thread 2 registers handle 10 with the reactor and the reactor dispatches even a thread to the handler for handle 10. Naturally handle 10 is suspended. - At this point, thread 1 wakes up. It takes it upon itself the duty of resuming the handle and does so. Everything goes wrong from here. Yes, we have some checks that prevent thread 1 from resuming the handle registered by someone else. What are the checks that we have? - We just check whether the handler used by thread 1 is the same as that it is going to resume ie. we do a find () on the internal map, take the event handler and do a pointer check. This works for most of the cases in TAO. BUT, if the application registers the same handler for every handle? ;-). That is what TAO does for nonblocking connects. We register the same Connector for every handle on which TAO waits for a connection. In fact this is how ACE's nonblocking connects work. The Connector maintains an internal map having associations between the actual event handlers and the handle (AST's). On connection completion the actual handlers are registered with the Reactor for that handle in which connects were completed. When above situations occurs in a MT-Threaded scenario, two threads see the handle and try dispatching (and closing handlers) to it leading to crashes. The stack trace were of the following nature --------------------- Cut Here----------------------------------- #0 0x40bfc781 in kill () from /lib/libc.so.6 #1 0x40b4be5e in pthread_kill () from /lib/libpthread.so.0 #2 0x40b4c339 in raise () from /lib/libpthread.so.0 #3 0x40bfdbe1 in abort () from /lib/libc.so.6 #4 0x404b5258 in ACE_OS::abort () at /project/charanga/bala/ACE_wrappers/ace/OS.i:1045 #5 0x4098af0a in ACE_Log_Msg::log (this=0x80c6c70, format_str=0x406371da "", log_priority=LM_ERROR, argp=0xbefff610) at Log_Msg.cpp:1625 #6 0x40989cf7 in ACE_Log_Msg::log (this=0x80c6c70, log_priority=LM_ERROR, format_str=0x406371a0 "ACE_ASSERT: file %N, line %l assertion failed for '% s'.%a\n") at Log_Msg.cpp:862 #7 0x404be6e5 in TAO_Connection_Handler::close_connection_eh (this=0x81056e0, eh=0x8105680) at Connection_Handler.cpp:386 #8 0x404e0098 in TAO_IIOP_Connection_Handler::close_connection ( this=0x8105680) at IIOP_Connection_Handler.cpp:201 #9 0x404bdde0 in TAO_Connection_Handler::handle_close_eh (this=0x81056e0, handle=-1, reactor_mask=511, eh=0x8105680) at Connection_Handler.cpp:216 #10 0x404e0147 in TAO_IIOP_Connection_Handler::handle_close (this=0x8105680, handle=-1, rm=511) at IIOP_Connection_Handler.cpp:220 #11 0x404d00d5 in ACE_Svc_Handler<ACE_SOCK_Stream, ACE_NULL_SYNCH>::close ( this=0x8105680) at /project/charanga/bala/ACE_wrappers/ace/Svc_Handler.cpp:309 #12 0x404d4d9c in ACE_Connector<TAO_IIOP_Connection_Handler, ACE_SOCK_Connector> ::handle_output (this=0x80e227c, handle=10) at /project/charanga/bala/ACE_wrappers/ace/Connector.cpp:365 #13 0x409d6ab3 in ACE_TP_Reactor::dispatch_socket_event (this=0x80addd8, dispatch_info=@0xbefff83c) at TP_Reactor.cpp:796 #14 0x409d62e9 in ACE_TP_Reactor::handle_socket_events (this=0x80addd8, event_count=@0xbefff880, guard=@0xbefff8bc) at TP_Reactor.cpp:569 #15 0x409d5fb9 in ACE_TP_Reactor::dispatch_i (this=0x80addd8, max_wait_time=0x0, guard=@0xbefff8bc) at TP_Reactor.cpp:397 #16 0x409d5475 in ACE_TP_Reactor::handle_events (this=0x80addd8, max_wait_time=0x0) at TP_Reactor.cpp:174 #17 0x409d10cb in ACE_Reactor::handle_events (this=0x80addc8, max_wait_time=0x0) at /project/charanga/bala/ACE_wrappers/ace/Reactor.i:172 #18 0x405a5a34 in TAO_ORB_Core::run (this=0x809d2c0, tv=0x0, perform_work=0) at ORB_Core.cpp:1731 #19 0x40530618 in CORBA_ORB::run (this=0x80af5b0, tv=0x0) at ORB.cpp:237 #20 0x4053059e in CORBA_ORB::run (this=0x80af5b0) at ORB.cpp:221 #21 0x0805e1b0 in ORB_Task::svc (this=0xbffff668) at ORB_Task.cpp:24 #22 0x40a312df in ACE_Task_Base::svc_run (args=0xbffff668) at Task.cpp:203 #23 0x409a979c in ACE_Thread_Adapter::invoke_i (this=0x80c6c40) at Thread_Adapter.cpp:150 #24 0x409a968e in ACE_Thread_Adapter::invoke (this=0x80c6c40) at Thread_Adapter.cpp:93 #25 0x4094ec06 in ace_thread_adapter (args=0x80c6c40) #26 0x40b490ba in pthread_start_thread () from /lib/libpthread.so.0 --------------------- Cut Here ------------------------------------- The crashes usually stemmed from the Connector.cpp. Have a test case provided by Steve Vranyes from Veritas that reproduces the problem with unerring accuracy. The problem is reproduceable consistently only after applying patches from bug 1020. I have a patch that prevent thread 1 from doing the resumption. May not be a good solution but it was a quick hack for me to verify the actual problem. They will be attached shortly.
Created attachment 161 [details] Patches
Patches in http://deuce.doc.wustl.edu/bugzilla/showattachment.cgi?attach_id=161 help TAO a lot. ACE should also benefit. But the following use case for ACE will choke - A dispatch, say handle_input () is called on a handler - The call retuned -1 and the remove_handler () is expected to remove only the READ_MASK for that handle without destroying the handler. For the above case the handle will not be resumed. Not sure how to come up with a more cleaner fix. For TAO, this isnt a problem since remove_handler () invariably removes all the masks from the reactor. Further, TAO resumes the handler on its own and hence this problem wouldnt show up.
Forgot to add this. The above patches will not help an ACE application if the application decides to call remove_handler () from another thread.
Added the patch to the main trunk.Here is the relevant ChangeLog entry Tue Nov 12 14:12:59 2002 Balachandran Natarajan <bala@isis-server.isis.vanderbilt.edu> I am leaving the bug wide open for future reference.
Accept it for tao-support
This should now hopefully be fixed. Here is the relevant ChangeLog entry Wed Nov 27 22:38:15 2002 Balachandran Natarajan <bala@isis-server.isis.vanderb ilt.edu>