Please report new issues at https://github.com/DOCGroup
Subscribers receive oneway method invocations at a high frequency. If the subscriber dies, the publisher does not notice it when SYNC_NONE is set for the ORB. After a few seconds the application crashes with a pure virtual function call in Transport.cpp when the subscriber is invoked again. My guess is that the ORB realizes after a few seconds that the transport to the subscriber is broken and destroys it. A workaround might be to query the reference to see whether it is still alive. Example: the svc method of the worker thread of the publisher; subscriber->onData is oneway and subscriber->isAlive is not. The complete exploit has been emailed to the tao-user mailing list.

int Publisher_impl::Worker::svc (void)
{
  double data = 0.0;
  unsigned long now = 0;
  unsigned long nextAliveCheck = GetTickCount () + 30000;

  while (!terminated)
    {
      data += 0.01;
      {
        ACE_Guard<ACE_Process_Mutex> guard (mutex, 1, 1);
        for (vector<_Subscriber>::iterator iter = subscribers.begin ();
             iter != subscribers.end ();
             ++iter)
          {
            if (!iter->unsubscribed)
              {
                try
                  {
                    now = GetTickCount ();
                    iter->subscriber->onData (data);
                  }
                catch (...)
                  {
                    printf ("Caught exception after %d msecs.\n", GetTickCount () - now);
                    iter->unsubscribed = true;
                    printf ("Unsubscribed subscriber.\n");
                  }
              }
          }
      }

      if (nextAliveCheck < GetTickCount ())
        {
          ACE_Guard<ACE_Process_Mutex> guard (mutex, 1, 1);
          for (vector<_Subscriber>::iterator iter = subscribers.begin ();
               iter != subscribers.end ();
               ++iter)
            {
              if (!iter->unsubscribed)
                {
                  try
                    {
                      now = GetTickCount ();
                      iter->subscriber->isAlive ();
                    }
                  catch (...)
                    {
                      printf ("Caught exception after %d msecs.\n", GetTickCount () - now);
                      iter->unsubscribed = true;
                      printf ("Unsubscribed subscriber because isAlive failed.\n");
                    }
                }
            }
          nextAliveCheck = GetTickCount () + 30000;
        }

      Sleep (10);
    }
  return 0;
}
The regression runs on my system and gives the same error as reported. Running it in the debugger doesn't reveal the reason; there the program also just ends.
Maybe a method is invoked on an object that has already been destructed? Any reason why a process mutex is used instead of a thread mutex? The process mutex is much heavier; is synchronization between processes really required?
I don't use ACE threads in my application, so I just copied some code for the mutex; there is no particular reason why I used a process mutex. Synchronization itself is necessary, since the vector must not change during the iterations, so it has to be synchronized with adding subscribers.
I tried to narrow down the problem. The reason is that in the destructor of TAO_IIOP_Transport the connection_handler_ (which is returned by event_handler_i()) is destructed. After the members of TAO_IIOP_Transport have been destroyed, the destructor of TAO_Transport runs, and in the method purge_entry the reference to the event handler is removed, although it has already been destructed (and with it the virtual function event_handler_i()). Please have a look at the call stack:

publisher.exe!_purecall() Line 53 + 0x7 C
publisher.exe!TAO_Transport::remove_reference() Line 2424 + 0xd C++
publisher.exe!TAO::Cache_IntId::~Cache_IntId() Line 31 C++
[..snip..]
publisher.exe!TAO_Transport::purge_entry() Line 394 C++
publisher.exe!TAO_Transport::~TAO_Transport() Line 181 C++
publisher.exe!TAO_IIOP_Transport::~TAO_IIOP_Transport() Line 36 + 0xf C++
publisher.exe!TAO_IIOP_Transport::`scalar deleting destructor'() + 0x16 C++
publisher.exe!TAO_IIOP_Connection_Handler::~TAO_IIOP_Connection_Handler() Line 59 + 0x28 C++
publisher.exe!TAO_IIOP_Connection_Handler::`scalar deleting destructor'() + 0x16 C++

I have created a simple workaround using a flag is_destructing_, which is set to true at the beginning of the TAO_Transport::~TAO_Transport destructor. The methods TAO_Transport::add_reference and TAO_Transport::remove_reference check the flag, and if it is true they return 0 instead of invoking event_handler_i(). That works, but is probably not the best thing to do. I can think of different alternatives:

1. Invocation of a oneway method fails with an exception in case the transport is already terminated.
2. I can query the subscriber to check whether the transport is still valid.
3. Or is it more a question of cleaning up unfinished messages?

Hope this helps.
The problem seems to come from a thread that has an asynchronous connect outstanding, which apparently gets cancelled and then crashes things:

005F3AD2 TAO_Transport::~TAO_Transport(this=:00DAA5EC)
005A49DA TAO_IIOP_Transport::~TAO_IIOP_Transport(this=:00DAA5EC)
0059BFDD TAO_IIOP_Connection_Handler::~TAO_IIOP_Connection_Handler(this=:00DAA52C)
00456CC3 ACE_Event_Handler::remove_reference(this=:00DAA52C)
005F7676 TAO_Transport::remove_reference(this=:00DAA5EC)
0056BF80 TAO_Connection_Handler::close_handler(this=:00DAA598)
0059C8A3 TAO_IIOP_Connection_Handler::close(this=:00DAA52C, =0)
0059EE1A ACE_Connector<TAO_IIOP_Connection_Handler, ACE_SOCK_Connector>::initialize_svc_handler(this=:00DA9988, handle=:00000668, svc_handler=:00DAA52C)
005A0566 ACE_NonBlocking_Connect_Handler<TAO_IIOP_Connection_Handler>::handle_output(this=:00DAAA64, handle=:00000668)
005A057F ACE_NonBlocking_Connect_Handler<TAO_IIOP_Connection_Handler>::handle_exception(this=:00DAAA64, h=:00000668)
004C7390 ACE_TP_Reactor::dispatch_socket_event(this=:00D79454, dispatch_info=:0167FC94)
004C6FBF ACE_TP_Reactor::handle_socket_events(this=:00D79454, event_count=:0167FCD4, guard=:0167FD14)
004C6D01 ACE_TP_Reactor::dispatch_i(this=:00D79454, max_wait_time=NULL, guard=:0167FD14)
004C6B3D ACE_TP_Reactor::handle_events(this=:00D79454, max_wait_time=NULL)
0049FBA8 ACE_Reactor::handle_events(this=:00D79444, max_wait_time=NULL)
005C7D74 TAO_ORB_Core::run(this=:00D6B988, tv=NULL, perform_work=0)
The transport is created at the moment the server has detected that the client is gone. We then close the transports, but keep invoking methods, so we try to reconnect to the client. Because the calls are oneways, we do an asynchronous connect; this later fails with the call stack in the previous message. We then try to clean up the transport we are still trying to connect on, and that crashes:

005F386C TAO_Transport::TAO_Transport(this=:00DAA614, tag=0, orb_core=:00D6B988)
005A48C7 TAO_IIOP_Transport::TAO_IIOP_Transport(this=:00DAA614, handler=:00DAA554, orb_core=:00D6B988, =false)
0059BE6A TAO_IIOP_Connection_Handler::TAO_IIOP_Connection_Handler(this=:00DAA554, orb_core=:00D6B988, flag=false)
0059EC54 TAO_Connect_Creation_Strategy<TAO_IIOP_Connection_Handler>::make_svc_handler(this=:00DA9A38, sh=:0137F994)
005A00DA ACE_Strategy_Connector<TAO_IIOP_Connection_Handler, ACE_SOCK_Connector>::make_svc_handler(this=:00DA9988, sh=:0137F994)
0059F87C ACE_Connector<TAO_IIOP_Connection_Handler, ACE_SOCK_Connector>::connect_i(this=:00DA9988, sh=:0137F994, sh_copy=NULL, remote_addr=:00DA9B08, synch_options=:0137F94C, local_addr=:0137F9E4, reuse_addr=0, flags=2, perms=0)
0059EE8F ACE_Connector<TAO_IIOP_Connection_Handler, ACE_SOCK_Connector>::connect(this=:00DA9988, sh=:0137F994, remote_addr=:00DA9B08, synch_options=:0137F94C, local_addr=:0137F9E4, reuse_addr=0, flags=2, perms=0)
0059DF46 TAO_IIOP_Connector::make_connection(this=:00DA9960, r=:0137FC04, desc=:0137FB30, timeout=:0137F99C)
005FB070 TAO_Connector::connect(this=:00DA9960, r=:0137FC04, desc=:0137FB30, timeout=NULL)
005D7BDF TAO::Profile_Transport_Resolver::try_connect(this=:0137FC04, desc=:0137FB30, max_time_value=NULL)
005ADDFB TAO_Default_Endpoint_Selector::select_endpoint(this=:00D72218, r=:0137FC04, max_wait_time=NULL)
005D79D8 TAO::Profile_Transport_Resolver::resolve(this=:0137FC04, max_time_val=NULL)
005ACF9A TAO::Invocation_Adapter::invoke_remote_i(this=:0137FD60, stub=:00DA9D24, details=:0137FCC0, effective_target=:0137FC74, max_wait_time=:0137FC48)
005ACC01 TAO::Invocation_Adapter::invoke_i(this=:0137FD60, stub=:00DA9D24, details=:0137FCC0)
005ACAAD TAO::Invocation_Adapter::invoke(this=:0137FD60, ex_data=NULL, ex_count=0)
00408CC6 Subscriber::onData(this=:00DA9E88, value=4.69999999999994)
00401884 Publisher_impl::Worker::svc(this=:00D9CAE0)
We have a double delete; see the stacks of the following two threads:

005A49C3 TAO_IIOP_Transport::~TAO_IIOP_Transport(this=:00DAAC70)
0059BFDD TAO_IIOP_Connection_Handler::~TAO_IIOP_Connection_Handler(this=:00DAABB0)
00456CC3 ACE_Event_Handler::remove_reference(this=:00DAABB0)
005F7676 TAO_Transport::remove_reference(this=:00DAAC70)
0056BF80 TAO_Connection_Handler::close_handler(this=:00DAAC1C)
0059C8A3 TAO_IIOP_Connection_Handler::close(this=:00DAABB0, =0)
0059EE1A ACE_Connector<TAO_IIOP_Connection_Handler, ACE_SOCK_Connector>::initialize_svc_handler(this=:00DA9FE4, handle=:00000678, svc_handler=:00DAABB0)
005A0566 ACE_NonBlocking_Connect_Handler<TAO_IIOP_Connection_Handler>::handle_output(this=:00DAB0E8, handle=:00000678)
005A057F ACE_NonBlocking_Connect_Handler<TAO_IIOP_Connection_Handler>::handle_exception(this=:00DAB0E8, h=:00000678)
004C7390 ACE_TP_Reactor::dispatch_socket_event(this=:00D79454, dispatch_info=:0167FC94)
004C6FBF ACE_TP_Reactor::handle_socket_events(this=:00D79454, event_count=:0167FCD4, guard=:0167FD14)
004C6D01 ACE_TP_Reactor::dispatch_i(this=:00D79454, max_wait_time=NULL, guard=:0167FD14)
004C6B3D ACE_TP_Reactor::handle_events(this=:00D79454, max_wait_time=NULL)
0049FBA8 ACE_Reactor::handle_events(this=:00D79444, max_wait_time=NULL)
005C7D74 TAO_ORB_Core::run(this=:00D6B988, tv=NULL, perform_work=0)

and

005A496C TAO_IIOP_Transport::~TAO_IIOP_Transport(this=:00DAAC70)
0059BFDD TAO_IIOP_Connection_Handler::~TAO_IIOP_Connection_Handler(this=:00DAABB0)
00456CC3 ACE_Event_Handler::remove_reference(this=:00DAABB0)
005F7676 TAO_Transport::remove_reference(this=:00DAAC70)
005D78F0 TAO::Profile_Transport_Resolver::~Profile_Transport_Resolver(this=:0137FC04)
005ACFFF TAO::Invocation_Adapter::invoke_remote_i(this=:0137FD60, stub=:00DAA380, details=:0137FCC0, effective_target=:0137FC74, max_wait_time=:0137FC48)
005ACC01 TAO::Invocation_Adapter::invoke_i(this=:0137FD60, stub=:00DAA380, details=:0137FCC0)
005ACAAD TAO::Invocation_Adapter::invoke(this=:0137FD60, ex_data=NULL, ex_count=0)
00408CC6 Subscriber::onData(this=:00DAA4E4, value=3.97999999999996)
update summary
When the remove_reference in the method below is removed, things also work, but I have my doubts about the correctness at this moment. This method reduces the refcount on the transport to 0, meaning it gets deleted; after that, the profile transport resolver gets the same transport.

int
TAO_Connection_Handler::close_handler (void)
{
  this->state_changed (TAO_LF_Event::LFS_CONNECTION_CLOSED,
                       this->orb_core_->leader_follower ());

  this->transport ()->remove_reference ();

  return 0;
}

0056BF57 TAO_Connection_Handler::close_handler(this=:00DAA558)
0059C8C3 TAO_IIOP_Connection_Handler::close(this=:00DAA4EC, =0)
0059EE3A ACE_Connector<TAO_IIOP_Connection_Handler, ACE_SOCK_Connector>::initialize_svc_handler(this=:00DA9948, handle=:00000668, svc_handler=:00DAA4EC)
005A0586 ACE_NonBlocking_Connect_Handler<TAO_IIOP_Connection_Handler>::handle_output(this=:00DAAA24, handle=:00000668)
005A059F ACE_NonBlocking_Connect_Handler<TAO_IIOP_Connection_Handler>::handle_exception(this=:00DAAA24, h=:00000668)
004C7390 ACE_TP_Reactor::dispatch_socket_event(this=:00D79414, dispatch_info=:0187FC94)
004C6FBF ACE_TP_Reactor::handle_socket_events(this=:00D79414, event_count=:0187FCD4, guard=:0187FD14)
004C6D01 ACE_TP_Reactor::dispatch_i(this=:00D79414, max_wait_time=NULL, guard=:0187FD14)
004C6B3D ACE_TP_Reactor::handle_events(this=:00D79414, max_wait_time=NULL)
0049FBA8 ACE_Reactor::handle_events(this=:00D79404, max_wait_time=NULL)
005C7D94 TAO_ORB_Core::run(this=:00D6B948, tv=NULL, perform_work=0)
I went back in CVS; handle_close got introduced by the check-in below. Strange that it says no refcount is needed on the transport, yet handle_close does decrement it. As far as I can tell now, this should be gone, but I have to double-check that with others.

Mon Jul 07 18:00:38 2003  Irfan Pyarali  <irfan@oomworks.com>

  * tao/IIOP_Connection_Handler:

    - Added close() method that will be called by the Acceptor or
      Connector when connection establishment fails.

    - No need to reference count the transport. Handlers and
      transports now have the same lifespan. Transport is created in
      the constructor and deleted in the destructor.

    - handle_input() and handle_output() intercept any errors from
      handle_input_eh() and handle_output_eh() respectively and call
      close_connection(). We no longer rely on handle_close().

    - Assert that handle_close() no longer gets called.

    - Added handle_timeout(). We don't use this upcall for I/O. This
      is only used by the Connector to indicate that the connection
      timed out. Therefore, we should call close().

    - The unused constructor should assert(0).
In TAO_IIOP_Connector::make_connection we end up with a transport that comes from the cache, but it has already been closed from below. The refcount is then 1; further up the call stack the profile transport resolver lets the refcount go to 0, so the transport gets destructed at that moment.
New proposed fix. The transport connector seems to be the best place to check whether we really got a correct transport back. The exact if-condition should be reviewed in detail; more testing is necessary, especially the queueing behaviour of SYNC_NONE should be retested.

===================================================================
RCS file: /project/cvs-repository/ACE_wrappers-repository/TAO/tao/Transport_Connector.cpp,v
retrieving revision 1.28
diff -u -u -w -b -r1.28 Transport_Connector.cpp
--- Transport_Connector.cpp	2 Nov 2005 11:03:27 -0000	1.28
+++ Transport_Connector.cpp	28 Feb 2006 16:01:32 -0000
@@ -399,6 +399,29 @@
   if (result == -1)
     {
+      // We are waiting for a non blocking connection
+      // but our connection handler is not connecting anymore,
+      // force close the connection
+      if (!r->blocked_connect () &&
+          !transport->connection_handler ()->is_connecting ())
+        {
+          if (TAO_debug_level > 2)
+            ACE_DEBUG ((LM_DEBUG,
+                        "TAO (%P|%t) - Transport_Connector::"
+                        "wait_for_connection_completion, "
+                        "waiting for a non blocking connection but our, "
+                        "connection handler is not connecting anymore.\n"));
+
+          // Forget the return value. We are busted anyway. Try our best
+          // here.
+          (void) this->cancel_svc_handler (transport->connection_handler ());
+
+          // Set transport to zero, it is not usable
+          transport = 0;
+
+          return false;
+        }
+
       if (!r->blocked_connect () && errno == ETIME)
         {
           // If we did a non blocking connect, just ignore
The regression is in CVS:

Mon Mar 06 15:13:12 2006  Johnny Willemsen  <jwillemsen@remedy.nl>

  * tests/Bug_2417_Regression/*:
    New regression for bug 2417. Thanks to Jan Ohlenburg
    <jan dot ohlenburg at fit dot fraunhofer dot de> for delivering
    this test. This bug has not been fixed yet.
What I want to know is whether the following two lines of the patch are correct, or should perhaps be updated:

+      if (!r->blocked_connect () &&
+          !transport->connection_handler ()->is_connecting ())
I've only browsed what's in this ticket, i.e., I haven't even pulled up the code and looked at the patch in context. However, I'm wondering if it might be a reference-count problem where the count is not being increased properly, and that's why there's a double delete. Also, a double delete makes it sound like there should be a lock somewhere that either (a) isn't getting locked or (b) doesn't exist yet. Finally, in some other work I've done, I've wondered whether the LF_Event might need another state, CONNECTION_CLOSING, that sits between IDLE/ACTIVE and CONNECTION_CLOSED. This would allow multiple threads to make it through a closing-down connection. At first glance, I would envision something like this: when a thread requests that the transport close, the state transitions to CONNECTION_CLOSING and the refcount decrements. Any operation on a transport in state CONNECTION_CLOSING would do nothing, but would return a "sensible value". The last decrement of the reference count would then transition to CONNECTION_CLOSED and delete the transport.
Thanks for the feedback. As far as I can tell, the closing state is not relevant here: in this use case we are still connecting, then detect that we can't establish the connection and have to give up this transport. The refcounting itself seems OK to me; the problem is that we hand the transport to the caller, which we shouldn't do.
Wed Apr 19 07:48:12 UTC 2006 Johnny Willemsen <jwillemsen@remedy.nl>
Reopening; the fix doesn't work for all builds.

Fri Apr 21 09:25:12 UTC 2006  Johnny Willemsen  <jwillemsen@remedy.nl>

  * tao/Transport_Connector.cpp:
    Removed the fix for bug 2417; according to the test stats things
    didn't get fixed.
I've committed a change that I think fixes this bug. The solution is actually in the protocol-specific (just IIOP for now) connector. Perhaps if this solution is borne out by the scoreboard, it can be refactored to cover other protocols besides IIOP. Here's the effective change:

Index: IIOP_Connector.cpp
===================================================================
RCS file: /project/cvs-repository/ACE_wrappers-repository/TAO/tao/IIOP_Connector.cpp,v
retrieving revision 1.151
diff -u -b -r1.151 IIOP_Connector.cpp
--- IIOP_Connector.cpp	2 May 2006 04:08:43 -0000	1.151
+++ IIOP_Connector.cpp	16 May 2006 05:59:10 -0000
@@ -439,6 +439,10 @@
       if (transport == tlist[i])
         {
           svc_handler = sh_list[i];
+          if (transport->connection_handler ()->keep_waiting ())
+            {
+              svc_handler->add_reference ();
+            }
           iiop_endpoint = ep_list[i];
           break;
         }

And here's my changelog entry explaining why I did this:

Tue May 16 05:22:15 UTC 2006  Phil Mesnier  <mesnier_p@ociweb.com>

  * tao/IIOP_Connector.cpp:
    This is a potential fix for the Bug 2417 flaw. The problem is
    that when using nonblocking connects in conjunction with asynch
    invocations, it is possible for a transport to be returned by the
    connector even though the network connection has not completed.
    For asynchronous invocations using the SYNCH_NONE policy, this is
    appropriate, as request messages may be queued for delivery
    if/when the connection completes. Bug 2417 describes a scenario
    where such a nonblocking connection attempt fails, but the actual
    failure happens after the transport has already been returned to
    the caller. This causes a problem because the underlying ACE
    connector framework relies on "borrowing" the reference to the
    connection handler during the time it is waiting for connections
    to complete or fail. For blocked connects this is fine, because
    either the transport will be returned to the caller associated
    with a completely established connection, or a failure will occur.
The issue for nonblocking connects is that when a transport is returned associated with a pending connection, the transport connector and the protocol-specific connector end up associating two referrers with the same connection handler without incrementing the reference count. The two referrers are the transport being returned and the ACE_NonBlocking_Connect_Handler that is actually registered with the reactor, waiting for success or failure of the pending connection. When a connection completes OK, the NBCH surrenders its reference to the connection handler, restoring parity, as the transport and/or cache entry will still hold the remaining references, and the count is OK. But when the connection fails, the base connector ends up calling close() on the connection handler, which in turn decrements the reference count. This sets the stage for a later crash from an apparent double delete.
Wed Aug 30 13:20:38 UTC 2006  Phil Mesnier  <mesnier_p@ociweb.com>

  * tao/IIOP_Connection_Handler.h:
  * tao/IIOP_Connection_Handler.cpp:
  * tao/IIOP_Connector.cpp:
    Revised the fix for bug 2417. That fix added a reference count to
    IIOP Connection Handlers associated with pending asynch
    connections, to accommodate a reference held by an ACE component
    that does not participate in reference counting. However, blindly
    adding a reference caused the Connection_Handler to not get
    cleaned up as part of ORB destruction. This then caused trouble on
    Windows platforms, where any sockets still open at program
    termination were aborted with an RST and undelivered stream
    content was purged. This fix gives control of the additional
    reference to the connection handler itself, so that upon
    successful completion of connection establishment, that reference
    is removed and no resource is leaked.
The test runs stable on head.