Please report new issues at https://github.com/DOCGroup
There are serious bugs in the connection handling code (perhaps related to Bug 2654). These bugs (or very similar bugs) have been around for quite a while (i.e., they are NOT new to TAO 1.6.x). WE BELIEVE THESE PATCHES TREAT THE SYMPTOM(S), NOT THE DISEASE. However, in testing for our Beta.4-6.0.0 release of the TENA Middleware, we finally reached the state where we don't crash anymore. So that's a "fix", after a fashion.

Bugzilla 3658: 04-close_connection_eh-called-multiple-times.patch
Bugzilla 3659: 04-do-not-purge-entry-if-entry-is-null.patch
Bugzilla 3660: 04-do-not-return-unconnected-transport.patch

Without these patches, we can produce a crash with probability 0.98 in the time it takes to start 4 applications on 4 different computers. We haven't produced a TAO-only test, I'm afraid ... so we'll leave that as an exercise for a student!

What we do with TENA to create a crash (without those patches) is to start an executionManager application (which has an RTEC in it). Then start a couple of TENA applications (which produce events) and a TENA adminConsole application (which consumes events). All four applications run on separate computers (but I doubt that's a strict requirement to produce the crash). If I simultaneously Control-C one of the TENA applications and the TENA adminConsole ... the executionManager crashes.

What happens is that the surviving TENA application sends "alert" event(s) (it can't talk to the dead app) to the RTEC in the executionManager. The RTEC in the executionManager tries to push the event(s) (with SYNC_TO_SERVER) to the TENA adminConsole. The socket to the adminConsole breaks (or was already closed). Regardless, the executionManager tries to push the event(s) multiple times ... which results in an attempt to open a new connection ... which is destined to fail ... but apparently a different thread handles the connection failure and cleans up the connection in the transport that was being used. Boom!
In addition, a crash (which we presume is directly related to the crashes we saw in our testing with the TENA Middleware) can be demonstrated using an unpatched TAO 1.6.9 with the Bug_2654_Regression test. The steps to reproduce are straightforward. I downloaded a stock TAO 1.6.9 distribution, built it on an f10-gcc43 system (including the tests), and ran the TAO/tests/Bug_2654_Regression test in a loop from a bash script 1000 times. The failure rate was 43 out of 1000 runs.

Here's a stack trace from the client app for one of the tests...

#0  TAO::Invocation_Adapter::invoke_remote_i (this=0xb36fd284, stub=0x81d5850, details=@0xb36fd1e8, effective_target=@0xb36fd198, max_wait_time=@0xb36fd194) at Invocation_Adapter.cpp:257
#1  0x0039496d in TAO::Invocation_Adapter::invoke_i (this=0xb36fd284, stub=0x81d5850, details=@0xb36fd1e8) at Invocation_Adapter.cpp:91
#2  0x00394118 in TAO::Invocation_Adapter::invoke (this=0xb36fd284, ex_data=0x0, ex_count=0) at Invocation_Adapter.cpp:50
#3  0x0804ffa6 in Test::Hello::method (this=<value optimized out>, count=203) at HelloC.cpp:362
#4  0x08050fe4 in Worker::svc (this=0xbfbed23c) at client.cpp:97
#5  0x0055c212 in ACE_Task_Base::svc_run (args=0xbfbed23c) at Task.cpp:275
#6  0x0055d6ad in ACE_Thread_Adapter::invoke_i (this=0xb4b00770) at Thread_Adapter.cpp:149
#7  0x0055d726 in ACE_Thread_Adapter::invoke (this=0xb4b00770) at Thread_Adapter.cpp:98
#8  0x004e6501 in ace_thread_adapter (args=0xb4b00770) at Base_Thread_Adapter.cpp:124
#9  0x009b751f in start_thread () from /lib/libpthread.so.0
#10 0x008ed04e in clone () from /lib/libc.so.6

Two different threads were stopped at the same point operating on the same object, so this may be a thread-safety issue. In another run, the client stopped here...
#0  0x0037572b in ~auto_ptr () at /usr/lib/gcc/i386-redhat-linux/4.3.2/../../../../include/c++/4.3.2/backward/auto_ptr.h:173
#1  ~TAO_GIOP_Message_Base (this=0xae70d3c8) at GIOP_Message_Base.cpp:64
#2  0x003f4fa1 in ~TAO_Transport (this=0xae70d308) at Transport.cpp:204
#3  0x00391f8d in ~TAO_IIOP_Transport (this=0xae70d308) at IIOP_Transport.cpp:36
#4  0x0038633b in ~TAO_IIOP_Connection_Handler (this=0xae709238) at IIOP_Connection_Handler.cpp:96
#5  0x00504bf8 in ACE_Event_Handler::remove_reference (this=0xae709238) at Event_Handler.cpp:210
#6  0x003f1699 in TAO_Transport::remove_reference (this=0xae70d308) at Transport.cpp:2622
#7  0x00362b03 in TAO_Connection_Handler::close_handler (this=0xae7092a4, flags=0) at Connection_Handler.cpp:456
#8  0x0038541c in TAO_IIOP_Connection_Handler::close (this=0xae709238, flags=0) at IIOP_Connection_Handler.cpp:449
#9  0x00389f18 in ACE_Connector<TAO_IIOP_Connection_Handler, ACE_SOCK_Connector>::initialize_svc_handler (this=0x8823e40, handle=12, svc_handler=0xae709238) at /home/jseward/TAO/1.6.9/ACE_wrappers/ace/Connector.cpp:622
#10 0x0038ac56 in ACE_NonBlocking_Connect_Handler<TAO_IIOP_Connection_Handler>::handle_output (this=0xae708f98, handle=12) at /home/jseward/TAO/1.6.9/ACE_wrappers/ace/Connector.cpp:165
#11 0x005671bc in ACE_TP_Reactor::dispatch_socket_event (this=0x8822730, dispatch_info=@0xb56fd170) at TP_Reactor.cpp:575
#12 0x00567bb4 in ACE_TP_Reactor::handle_socket_events (this=0x8822730, event_count=@0xb56fd1c8, guard=@0xb56fd214) at TP_Reactor.cpp:445
#13 0x00567ca8 in ACE_TP_Reactor::dispatch_i (this=0x8822730, max_wait_time=0x0, guard=@0xb56fd214) at TP_Reactor.cpp:244
#14 0x00567d81 in ACE_TP_Reactor::handle_events (this=0x8822730, max_wait_time=0x0) at TP_Reactor.cpp:173
#15 0x003b6453 in ACE_Reactor::handle_events () at /home/jseward/TAO/1.6.9/ACE_wrappers/ace/Reactor.inl:188
#16 TAO_ORB_Core::run (this=0x8820dd8, tv=0x0, perform_work=0) at ORB_Core.cpp:2166
#17 0x003b0bdc in CORBA::ORB::run (this=0x88230c8, tv=0x0) at ORB.cpp:193
#18 0x003b0c45 in CORBA::ORB::run (this=0x88230c8) at ORB.cpp:179
#19 0x08050edd in Worker::svc (this=0xbfc7eadc) at client.cpp:80
#20 0x0055c212 in ACE_Task_Base::svc_run (args=0xbfc7eadc) at Task.cpp:275
#21 0x0055d6ad in ACE_Thread_Adapter::invoke_i (this=0xb6b00988) at Thread_Adapter.cpp:149
#22 0x0055d726 in ACE_Thread_Adapter::invoke (this=0xb6b00988) at Thread_Adapter.cpp:98
#23 0x004e6501 in ace_thread_adapter (args=0xb6b00988) at Base_Thread_Adapter.cpp:124
#24 0x009b751f in start_thread () from /lib/libpthread.so.0
#25 0x008ed04e in clone () from /lib/libc.so.6

This looks more like the problem(s) we were seeing without our patches (referenced above).
Added dependency.
Performing the described test with TAO/tests/Bug_2654_Regression for 1000 iterations, as well as for 10000 iterations, did not reproduce the described application crashes. Replaying the TENA scenario involving the RTEC is currently not feasible. The referenced Bugzilla entries (3658, 3659, 3660) are handled and commented on separately.