Bug 2417 - Double delete on Transport when using oneways with sync_none
Summary: Double delete on Transport when using oneways with sync_none
Status: RESOLVED FIXED
Alias: None
Product: TAO
Classification: Unclassified
Component: ORB
Version: 1.5
Hardware: x86 Windows XP
Importance: P1 critical
Assignee: DOC Center Support List (internal)
URL:
Depends on:
Blocks: 2355
Reported: 2006-02-23 04:42 CST by jan.ohlenburg
Modified: 2008-03-31 04:51 CDT

Description jan.ohlenburg 2006-02-23 04:42:24 CST
Subscribers receive oneway method invocations at a high frequency. If the
subscriber dies, the publisher does not notice it when SYNC_NONE is set for the
ORB. After a few seconds the application crashes with a pure virtual function
call in Transport.cpp when the subscriber is invoked again.

My guess is that the ORB realizes after a few seconds that the transport to the
subscriber is broken and destroys it. A workaround might be to query the
reference to check whether it is still alive.

Example: the svc method of the publisher's worker thread.
subscriber->onData is oneway and subscriber->isAlive is not. The complete
reproducer has been emailed to the tao-user mailing list.


int Publisher_impl::Worker::svc (void)
{
    double data = 0.0;
    unsigned long now = 0;
    unsigned long nextAliveCheck = GetTickCount () + 30000;
    while (!terminated)
    {
        data += 0.01;
        {
            ACE_Guard<ACE_Process_Mutex> guard (mutex, 1, 1);
            for (vector<_Subscriber>::iterator iter = subscribers.begin ();
                 iter != subscribers.end ();
                 ++iter)
            {
                if (!iter->unsubscribed)
                {
                    try
                    {
                        now = GetTickCount ();
                        iter->subscriber->onData (data);
                    }
                    catch (...)
                    {
                        printf ("Caught exception after %lu msecs.\n", GetTickCount () - now);
                        iter->unsubscribed = true;
                        printf ("Unsubscribed subscriber.\n");
                    }
                }
            }
        }
        if (nextAliveCheck < GetTickCount ())
        {
            ACE_Guard<ACE_Process_Mutex> guard (mutex, 1, 1);
            for (vector<_Subscriber>::iterator iter = subscribers.begin ();
                 iter != subscribers.end ();
                 ++iter)
            {
                if (!iter->unsubscribed)
                {
                    try
                    {
                        now = GetTickCount ();
                        iter->subscriber->isAlive ();
                    }
                    catch (...)
                    {
                        printf ("Caught exception after %lu msecs.\n", GetTickCount () - now);
                        iter->unsubscribed = true;
                        printf ("Unsubscribed subscriber because isAlive failed.\n");
                    }
                }
            }
            nextAliveCheck = GetTickCount () + 30000;
        }
        Sleep (10);
    }
    return 0;
}
Comment 1 Johnny Willemsen 2006-02-24 13:51:20 CST
The regression runs on my system and gives the same error as reported. Running
it in the debugger doesn't reveal the reason; there the program also just ends.
Comment 2 Johnny Willemsen 2006-02-26 08:48:42 CST
Maybe a method is invoked on an object that has already been destroyed? Is
there any reason why a process mutex is used instead of a thread mutex? The
process mutex is really much heavier; is synchronisation between processes
required?
Comment 3 jan.ohlenburg 2006-02-26 09:10:43 CST
I don't use ACE threads in my application, so I just copied some code for the
mutex. There is no particular reason why I used a process mutex.
Synchronization is necessary, since the vector may not change during
iteration; it therefore has to be synchronized with adding subscribers.
Comment 4 jan.ohlenburg 2006-02-27 07:38:46 CST
I tried to narrow down the problem. The reason is that in the destructor of
TAO_IIOP_Transport the connection_handler_ (which is returned by
event_handler_i()) is destroyed. After the members of TAO_IIOP_Transport have
been destroyed, the destructor of TAO_Transport runs; in the method
purge_entry it removes the reference to the event handler, which has already
been destroyed (as has the virtual function event_handler_i()). Please have a
look at the call stack:

publisher.exe!_purecall()  Line 53 + 0x7	C
publisher.exe!TAO_Transport::remove_reference()  Line 2424 + 0xd	C++
publisher.exe!TAO::Cache_IntId::~Cache_IntId()  Line 31	C++
[..snip..]
publisher.exe!TAO_Transport::purge_entry()  Line 394	C++
publisher.exe!TAO_Transport::~TAO_Transport()  Line 181	C++
publisher.exe!TAO_IIOP_Transport::~TAO_IIOP_Transport()  Line 36 + 0xf	C++
publisher.exe!TAO_IIOP_Transport::`scalar deleting destructor'()  + 0x16	C++
publisher.exe!TAO_IIOP_Connection_Handler::~TAO_IIOP_Connection_Handler()  Line
59 + 0x28	C++
publisher.exe!TAO_IIOP_Connection_Handler::`scalar deleting destructor'()  +
0x16	C++

I have created a simple workaround using a flag, is_destructing_, which is set
to true at the beginning of the TAO_Transport::~TAO_Transport destructor. In
the methods

TAO_Transport::add_reference
and
TAO_Transport::remove_reference

the flag is checked; if it is true, they return 0 instead of invoking the
method event_handler_i().

That works, but it is probably not the best thing to do. I can think of
several alternatives:

1. Invocation of a oneway method fails with an exception if the transport has
already been terminated.
2. I can query the subscriber to check whether the transport is still valid.
3. Or is it more a question of cleaning up unfinished messages?

Hope this helps.
Comment 5 Johnny Willemsen 2006-02-27 09:15:36 CST
The problem seems to come from a thread that has an asynchronous connect
outstanding, which gets cancelled and then crashes things:

005F3AD2 TAO_Transport::~TAO_Transport(this=:00DAA5EC)
005A49DA TAO_IIOP_Transport::~TAO_IIOP_Transport(this=:00DAA5EC)
0059BFDD TAO_IIOP_Connection_Handler::~TAO_IIOP_Connection_Handler(this=:00DAA52C)
00456CC3 ACE_Event_Handler::remove_reference(this=:00DAA52C)
005F7676 TAO_Transport::remove_reference(this=:00DAA5EC)
0056BF80 TAO_Connection_Handler::close_handler(this=:00DAA598)
0059C8A3 TAO_IIOP_Connection_Handler::close(this=:00DAA52C,  =0)
0059EE1A ACE_Connector<TAO_IIOP_Connection_Handler,
ACE_SOCK_Connector>::initialize_svc_handler(this=:00DA9988, handle=:00000668,
svc_handler=:00DAA52C)
005A0566
ACE_NonBlocking_Connect_Handler<TAO_IIOP_Connection_Handler>::handle_output(this=:00DAAA64,
handle=:00000668)
005A057F
ACE_NonBlocking_Connect_Handler<TAO_IIOP_Connection_Handler>::handle_exception(this=:00DAAA64,
h=:00000668)
004C7390 ACE_TP_Reactor::dispatch_socket_event(this=:00D79454,
dispatch_info=:0167FC94)
004C6FBF ACE_TP_Reactor::handle_socket_events(this=:00D79454,
event_count=:0167FCD4, guard=:0167FD14)
004C6D01 ACE_TP_Reactor::dispatch_i(this=:00D79454, max_wait_time=NULL,
guard=:0167FD14)
004C6B3D ACE_TP_Reactor::handle_events(this=:00D79454, max_wait_time=NULL)
0049FBA8 ACE_Reactor::handle_events(this=:00D79444, max_wait_time=NULL)
005C7D74 TAO_ORB_Core::run(this=:00D6B988, tv=NULL, perform_work=0)
Comment 6 Johnny Willemsen 2006-02-27 09:23:55 CST
The transport is created at the moment the server has detected that the client
is gone. We then close the transports, but we keep invoking methods, so we try
to reconnect to the client. Because these are oneways we do an asynchronous
connect; this later fails with the call stack in the previous comment. We then
try to clean up the transport we are still trying to connect on, and that
crashes:

005F386C TAO_Transport::TAO_Transport(this=:00DAA614, tag=0, orb_core=:00D6B988)
005A48C7 TAO_IIOP_Transport::TAO_IIOP_Transport(this=:00DAA614,
handler=:00DAA554, orb_core=:00D6B988,  =false)
0059BE6A
TAO_IIOP_Connection_Handler::TAO_IIOP_Connection_Handler(this=:00DAA554,
orb_core=:00D6B988, flag=false)
0059EC54
TAO_Connect_Creation_Strategy<TAO_IIOP_Connection_Handler>::make_svc_handler(this=:00DA9A38,
sh=:0137F994)
005A00DA ACE_Strategy_Connector<TAO_IIOP_Connection_Handler,
ACE_SOCK_Connector>::make_svc_handler(this=:00DA9988, sh=:0137F994)
0059F87C ACE_Connector<TAO_IIOP_Connection_Handler,
ACE_SOCK_Connector>::connect_i(this=:00DA9988, sh=:0137F994, sh_copy=NULL,
remote_addr=:00DA9B08, synch_options=:0137F94C, local_addr=:0137F9E4,
reuse_addr=0, flags=2, perms=0)
0059EE8F ACE_Connector<TAO_IIOP_Connection_Handler,
ACE_SOCK_Connector>::connect(this=:00DA9988, sh=:0137F994,
remote_addr=:00DA9B08, synch_options=:0137F94C, local_addr=:0137F9E4,
reuse_addr=0, flags=2, perms=0)
0059DF46 TAO_IIOP_Connector::make_connection(this=:00DA9960, r=:0137FC04,
desc=:0137FB30, timeout=:0137F99C)
005FB070 TAO_Connector::connect(this=:00DA9960, r=:0137FC04, desc=:0137FB30,
timeout=NULL)
005D7BDF TAO::Profile_Transport_Resolver::try_connect(this=:0137FC04,
desc=:0137FB30, max_time_value=NULL)
005ADDFB TAO_Default_Endpoint_Selector::select_endpoint(this=:00D72218,
r=:0137FC04, max_wait_time=NULL)
005D79D8 TAO::Profile_Transport_Resolver::resolve(this=:0137FC04, max_time_val=NULL)
005ACF9A TAO::Invocation_Adapter::invoke_remote_i(this=:0137FD60,
stub=:00DA9D24, details=:0137FCC0, effective_target=:0137FC74,
max_wait_time=:0137FC48)
005ACC01 TAO::Invocation_Adapter::invoke_i(this=:0137FD60, stub=:00DA9D24,
details=:0137FCC0)
005ACAAD TAO::Invocation_Adapter::invoke(this=:0137FD60, ex_data=NULL, ex_count=0)
00408CC6 Subscriber::onData(this=:00DA9E88, value=4.69999999999994)
00401884 Publisher_impl::Worker::svc(this=:00D9CAE0)
Comment 7 Johnny Willemsen 2006-02-27 10:02:02 CST
We have a double delete; see the stacks of the following two threads:

005A49C3 TAO_IIOP_Transport::~TAO_IIOP_Transport(this=:00DAAC70)
0059BFDD TAO_IIOP_Connection_Handler::~TAO_IIOP_Connection_Handler(this=:00DAABB0)
00456CC3 ACE_Event_Handler::remove_reference(this=:00DAABB0)
005F7676 TAO_Transport::remove_reference(this=:00DAAC70)
0056BF80 TAO_Connection_Handler::close_handler(this=:00DAAC1C)
0059C8A3 TAO_IIOP_Connection_Handler::close(this=:00DAABB0,  =0)
0059EE1A ACE_Connector<TAO_IIOP_Connection_Handler,
ACE_SOCK_Connector>::initialize_svc_handler(this=:00DA9FE4, handle=:00000678,
svc_handler=:00DAABB0)
005A0566
ACE_NonBlocking_Connect_Handler<TAO_IIOP_Connection_Handler>::handle_output(this=:00DAB0E8,
handle=:00000678)
005A057F
ACE_NonBlocking_Connect_Handler<TAO_IIOP_Connection_Handler>::handle_exception(this=:00DAB0E8,
h=:00000678)
004C7390 ACE_TP_Reactor::dispatch_socket_event(this=:00D79454,
dispatch_info=:0167FC94)
004C6FBF ACE_TP_Reactor::handle_socket_events(this=:00D79454,
event_count=:0167FCD4, guard=:0167FD14)
004C6D01 ACE_TP_Reactor::dispatch_i(this=:00D79454, max_wait_time=NULL,
guard=:0167FD14)
004C6B3D ACE_TP_Reactor::handle_events(this=:00D79454, max_wait_time=NULL)
0049FBA8 ACE_Reactor::handle_events(this=:00D79444, max_wait_time=NULL)
005C7D74 TAO_ORB_Core::run(this=:00D6B988, tv=NULL, perform_work=0)

and 

005A496C TAO_IIOP_Transport::~TAO_IIOP_Transport(this=:00DAAC70)
0059BFDD TAO_IIOP_Connection_Handler::~TAO_IIOP_Connection_Handler(this=:00DAABB0)
00456CC3 ACE_Event_Handler::remove_reference(this=:00DAABB0)
005F7676 TAO_Transport::remove_reference(this=:00DAAC70)
005D78F0
TAO::Profile_Transport_Resolver::~Profile_Transport_Resolver(this=:0137FC04)
005ACFFF TAO::Invocation_Adapter::invoke_remote_i(this=:0137FD60,
stub=:00DAA380, details=:0137FCC0, effective_target=:0137FC74,
max_wait_time=:0137FC48)
005ACC01 TAO::Invocation_Adapter::invoke_i(this=:0137FD60, stub=:00DAA380,
details=:0137FCC0)
005ACAAD TAO::Invocation_Adapter::invoke(this=:0137FD60, ex_data=NULL, ex_count=0)
00408CC6 Subscriber::onData(this=:00DAA4E4, value=3.97999999999996)
Comment 8 Johnny Willemsen 2006-02-27 13:48:32 CST
update summary
Comment 9 Johnny Willemsen 2006-02-27 14:15:40 CST
When the remove_reference in the method below is removed, things also work,
but I have my doubts about the correctness at this moment. This method reduces
the refcount on the transport to 0, meaning it gets deleted; after that, the
Profile_Transport_Resolver gets the same transport.

int
TAO_Connection_Handler::close_handler (void)
{
  this->state_changed (TAO_LF_Event::LFS_CONNECTION_CLOSED,
	               this->orb_core_->leader_follower ());
  this->transport ()->remove_reference ();
  return 0;
}


0056BF57 TAO_Connection_Handler::close_handler(this=:00DAA558)
0059C8C3 TAO_IIOP_Connection_Handler::close(this=:00DAA4EC,  =0)
0059EE3A ACE_Connector<TAO_IIOP_Connection_Handler,
ACE_SOCK_Connector>::initialize_svc_handler(this=:00DA9948, handle=:00000668,
svc_handler=:00DAA4EC)
005A0586
ACE_NonBlocking_Connect_Handler<TAO_IIOP_Connection_Handler>::handle_output(this=:00DAAA24,
handle=:00000668)
005A059F
ACE_NonBlocking_Connect_Handler<TAO_IIOP_Connection_Handler>::handle_exception(this=:00DAAA24,
h=:00000668)
004C7390 ACE_TP_Reactor::dispatch_socket_event(this=:00D79414,
dispatch_info=:0187FC94)
004C6FBF ACE_TP_Reactor::handle_socket_events(this=:00D79414,
event_count=:0187FCD4, guard=:0187FD14)
004C6D01 ACE_TP_Reactor::dispatch_i(this=:00D79414, max_wait_time=NULL,
guard=:0187FD14)
004C6B3D ACE_TP_Reactor::handle_events(this=:00D79414, max_wait_time=NULL)
0049FBA8 ACE_Reactor::handle_events(this=:00D79404, max_wait_time=NULL)
005C7D94 TAO_ORB_Core::run(this=:00D6B948, tv=NULL, perform_work=0)
Comment 10 Johnny Willemsen 2006-02-27 14:39:17 CST
Went back in CVS; handle_close was introduced by the check-in below. It is
strange that the check-in says no refcount is needed on the transport, yet
handle_close does decrement it. As far as I can tell now this should be gone,
but I have to double-check that with others.

Mon Jul 07 18:00:38 2003  Irfan Pyarali  <irfan@oomworks.com>

        * tao/IIOP_Connection_Handler:

          - Added close() method that will be called by the Acceptor or
          Connector when connection establishment fails.

          - No need to reference count the transport.  Handlers and
          transports now have the same lifespan.  Transport is created in
          the constructor and deleted in the destructor.

          - handle_input() and handle_output() intercept any errors from
          handle_input_eh() and handle_output_eh() respectively and call
          close_connection().  We no longer rely on handle_close().

          - Assert that handle_close() no longer gets called.

          - Added handle_timeout(). We don't use this upcall for I/O.
          This is only used by the Connector to indicate that the
          connection timed out.  Therefore, we should call close().

          - The unused constructor should assert(0).
Comment 11 Johnny Willemsen 2006-02-28 08:25:44 CST
In TAO_IIOP_Connector::make_connection we end up with a transport that comes
from the cache, but it has already been closed from below. The refcount is
then 1; further up the call stack the Profile_Transport_Resolver lets the
refcount go to 0, so the transport gets destroyed at that moment.
Comment 12 Johnny Willemsen 2006-02-28 10:04:55 CST
New proposed fix. The transport connector seems to be the best place to check
whether we really got a correct transport back. The exact condition should be
reviewed in detail, and more testing is necessary, especially of the queueing
behaviour of sync_none.

===================================================================
RCS file:
/project/cvs-repository/ACE_wrappers-repository/TAO/tao/Transport_Connector.cpp,v
retrieving revision 1.28
diff -u -u -w -b -r1.28 Transport_Connector.cpp
--- Transport_Connector.cpp	2 Nov 2005 11:03:27 -0000	1.28
+++ Transport_Connector.cpp	28 Feb 2006 16:01:32 -0000
@@ -399,6 +399,29 @@
 
   if (result == -1)
     {
+      // We are waiting for a non blocking connection
+      // but our connection handler is not connecting anymore,
+      // force close the connection
+      if (!r->blocked_connect () &&
+          !transport->connection_handler ()->is_connecting())
+        {
+          if (TAO_debug_level > 2)
+            ACE_DEBUG ((LM_DEBUG,
+                        "TAO (%P|%t) - Transport_Connector::"
+                        "wait_for_connection_completion, "
+                        "waiting for a non blocking connection but our, "
+                        "connection handler is not connecting anymore.\n"));
+
+          // Forget the return value. We are busted anyway. Try our best
+          // here.
+          (void) this->cancel_svc_handler (transport->connection_handler ());
+
+          // Set transport to zero, it is not usable
+          transport = 0;
+
+          return false;
+        }
+
       if (!r->blocked_connect () && errno == ETIME)
         {
           // If we did a non blocking connect, just ignore
Comment 13 Johnny Willemsen 2006-03-06 09:15:46 CST
The regression is in CVS.

Mon Mar 06 15:13:12 2006  Johnny Willemsen  <jwillemsen@remedy.nl>

        * tests/Bug_2417_Regression/*:
          New regression for bug 2417. Thanks to Jan Ohlenburg
          <jan dot ohlenburg at fit dot fraunhofer dot de> for
          delivering this test. This bug has not been fixed yet.
Comment 14 Johnny Willemsen 2006-03-22 08:45:07 CST
What I want to know is whether the following two lines of the patch are
correct or should be updated:

+      if (!r->blocked_connect () &&
+          !transport->connection_handler ()->is_connecting())
+
Comment 15 Chris Cleeland 2006-03-22 10:29:11 CST
I've only browsed what's in this ticket, i.e., I haven't even pulled up the
code and looked at the patch in context. However, I'm wondering if it might be
a reference-count problem where the count is not being incremented properly,
and that's why there's an issue with the double delete.

Also, a double delete makes it sound like there should be a lock somewhere
that either (a) isn't getting locked or (b) doesn't exist yet.

Finally, in some other work I've done/been doing, I've wondered whether the
LF_Event might need another state, CONNECTION_CLOSING, that sits between
IDLE/ACTIVE and CONNECTION_CLOSED. This would allow multiple threads to make
it through a closing-down connection. At first glance, I would envision
something like this: when a thread requests that the transport close, the
state transitions to CONNECTION_CLOSING and the refcount decrements. Any
operation on a transport in state CONNECTION_CLOSING would do nothing, but
would return a "sensible value". Finally, the last decrement of the reference
count would transition the state to CONNECTION_CLOSED and delete the
transport.
Comment 16 Johnny Willemsen 2006-03-22 10:41:36 CST
Thanks for the feedback. The closing is not related to this as far as I can
tell: in this use case we are still connecting, then we detect that we can't
establish the connection and have to give up the transport. The refcounting
seems ok to me; the problem is that we hand the transport to the caller, which
we shouldn't do.
Comment 17 Johnny Willemsen 2006-04-19 07:55:44 CDT
Wed Apr 19 07:48:12 UTC 2006  Johnny Willemsen  <jwillemsen@remedy.nl>
Comment 18 Johnny Willemsen 2006-04-21 04:26:02 CDT
reopening, fix doesn't work for all builds

Fri Apr 21 09:25:12 UTC 2006  Johnny Willemsen  <jwillemsen@remedy.nl>

        * tao/Transport_Connector.cpp:
          Removed the fix for bug 2417; according to the test stats, things
          didn't get fixed.
Comment 19 Phil Mesnier 2006-05-16 01:15:51 CDT
I've committed a change that I think fixes this bug. The solution is actually in
the protocol-specific (just IIOP for now) connector. Perhaps if this solution is
borne out by the scoreboard, it can be refactored to cover other protocols
besides IIOP.

Here's the effective change:
Index: IIOP_Connector.cpp
===================================================================
RCS file:
/project/cvs-repository/ACE_wrappers-repository/TAO/tao/IIOP_Connector.cpp,v
retrieving revision 1.151
diff -u -b -r1.151 IIOP_Connector.cpp
--- IIOP_Connector.cpp  2 May 2006 04:08:43 -0000       1.151
+++ IIOP_Connector.cpp  16 May 2006 05:59:10 -0000
@@ -439,6 +439,10 @@
           if (transport == tlist[i])
             {
               svc_handler = sh_list[i];
+              if (transport->connection_handler()->keep_waiting())
+                {
+                  svc_handler->add_reference();
+                }
               iiop_endpoint = ep_list[i];
               break;
         }

And here's my changelog entry explaining why I did this:
Tue May 16 05:22:15 UTC 2006  Phil Mesnier  <mesnier_p@ociweb.com>

        * tao/IIOP_Connector.cpp:

          This is a potential fix for the Bug 2417 flaw. The problem is
          that when using nonblocking connects, in conjunction with asynch
          invocations, it is possible for a transport to be returned by
          the connector even though the network connection has not
          completed. For asynchronous invocations using the SYNCH_NONE
          policy, this is appropriate, as request messages may be queued
          for delivery if/when the connection completes.

          Bug 2417 describes a scenario where such a nonblocking
          connection attempt fails, but the actual failure happens after
          the transport has already been returned to the caller. This
          causes a problem because the underlying ACE connector framework
          relies on "borrowing" the reference to the connection handler
          during the time it is waiting for connections to complete or
          fail. For blocked connects this is fine because either the
          transport will be returned to the caller associated with a
          completely established connection, or a failure will occur.

          The issue for nonblocking connects is that when a transport is
          returned associated with a pending connection, the existing
          transport connector and protocol-specific connector end up as
          two referrers to the same connection handler, without
          incrementing the reference count. The two are the transport
          being returned and the ACE_NonBlock_Connection_Handler that is
          actually registered with the reactor waiting for success or
          failure on the pending connection.

          When a connection completes OK, the NBCH surrenders its
          reference to the connection handler, thus restoring parity, as
          the transport and/or cache entry will still hold the remaining
          references, and the count is OK. But when the connection fails,
          the base connector ends up calling close() on the connection
          handler which in turn decrements the reference count. This then
          sets the stage for a later crash from an apparent double delete.
Comment 20 Johnny Willemsen 2006-08-30 09:13:39 CDT
Wed Aug 30 13:20:38 UTC 2006  Phil Mesnier  <mesnier_p@ociweb.com>

        * tao/IIOP_Connection_Handler.h:
        * tao/IIOP_Connection_Handler.cpp:
        * tao/IIOP_Connector.cpp:

          Revised the fix for bug 2417. That fix added a reference count
          to IIOP Connection Handlers associated with pending asynch
          connections to accommodate a reference held by an ACE component
          that does not participate in reference counting. However,
          blindly adding a reference caused the Connection_Handler to not
          get cleaned up as part of ORB destruction. This then caused
          trouble on windows platforms where any sockets that were open at
          program termination were aborted with a RST and undelivered
          stream content purged.

          This fix gives control of this additional reference to the
          connection handler itself, so that upon successful completion of
          connection establishment, that reference is removed and no
          resource is leaked.
Comment 21 Johnny Willemsen 2008-03-31 04:51:49 CDT
The test runs stable on head.