Bug 2654

Summary: Connection closure during oneway send causes core
Product: TAO Reporter: Phil Mesnier <mesnierp>
Component: ORBAssignee: Phil Mesnier <mesnierp>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P3    
Version: 1.5.2   
Hardware: All   
OS: All   
Bug Depends on:    
Bug Blocks: 3680    
Attachments: This is my proposed fix for this bug
This is my revised patch, superceding the previous one

Description Phil Mesnier 2006-09-20 12:55:57 CDT
When a multithreaded application wants to send a oneway message, and the
connection to the target peer goes away, it is possible for asserts in
TAO_Transport dtor to fail, causing an abort. The asserts were added in support
of bug 2494 regression testing. 

The scenario I observed ends with the Profile_Transport_Resolver used during an
invocation being the last reference holder on a transport. At some point during
the invocation, the request is queued and at the end of the the invocation the
PTR goes out of scope, removes the last reference from the transport causing it
to delete itself. But the transport still believes it is connected and thus hits
the assert which pops.

I have been able to reduce the incidence of this failure by adding a method,
TAO_Transport::pre_close() which is called by the connection handler in place of
purge_connection. pre_close in this case resets the Transport's is_connected
flag and then calls purge_connection. 

Unfortunately this has not entirely solved the problem, but it has narrowed the
window of failure, making the error much less common.
Comment 1 Phil Mesnier 2006-09-20 12:56:31 CDT
Created attachment 605 [details]
This is my proposed fix for this bug
Comment 2 Johnny Willemsen 2006-09-25 05:05:06 CDT
to reported, Phil, I don't think we should make asserts dependent on a debug
level. Furthermore, shouldn't ACE_ERROR be used instead of ACE_ERROR. Is the
path you changed the only way a transport can be closed, with the new method we
need to make sure that this is the only path, because when there is another
path, that also has to be changed. We should get a 100% fix.
Comment 3 Phil Mesnier 2006-09-25 17:19:19 CDT
Hi Johnny,

The reason I put the asserts in the debug level was that they were only added
specifically for validating the Bug_2494_Regression test. The condition that
would trip the assert should not be fatal in a running application, thus my
proposed change to simply flush the buffer at that point. I also intend to
modify the run_test script for Bug_2494_Regression so that it runs the processes
at an appropriate debug level to trigger an assert.

Regarding the use of ACE_ERROR instead of ACE_DEBUG, you are right.

Regarding the path to closure, the change I made is in the only place I can
detect closure happening. The unfortunate nature of this test is that the event
that seems to close the connection and set the stage for failure happens on
another thread sometime before the fatal event. I cannot think of a way to
detect that. The new method is required to set the private is_connected_ flag
back to false, but this also presents the possiblity that some other method want
to call it in the future. 

Given that the failure occurs exactly when the reference count on the transport
goes to zero and the transport believes it is connected and it has data queued.
It might be possible that some other set of circumstances are causing the
reference count to go to zero without the connection being closed. I can't think
of any, but that would also cause this scenario to happen.

All I know for certain is that the complex test scenario I have that
occasionally exhibits this bug stops exhibiting it after I apply the proposed
patch. 

As with bug 2653, I have no idea how to write a regression test for this bug.

Race conditions suck.

Comment 4 Phil Mesnier 2006-10-29 08:32:26 CST
I now belive the characterization of this bug is somewhat incorrect, and this
will lead to a different patch. The problem is still a race, but somewhat
different than what I originally described. 

The race at close is still there, with the potential for queuing a message just
as the transport is being closed. Therefore, for symmetry at least, I believe
the pre_close() method is still necessary. 

However the symptom of an assert failure in the Transport dtor was still
occuring, but now the problem was that the transports cache_entry_ was not null,
as it should be. This is due to an incomplete fix for bug 2417. 

The rest of this analysis focuses on IIOP, the solution should be applied to all
protocols that may be used asynchronously.

When a non-blocking connection is attempted that is destined to fail, the
failure may be detected during the initial call to connect in which case it is
handled immediately, or it may happen some time later, in which another thread
is involved in handling the failure. If another thread handles the clean up, it
may occur while the first thread is still within the connector's make_connection
call or after. If it happens after make_connection returns, the existing code
handled it. The problem case is when the connect failure is handled by another
thread while the first thread is still in make_connection().

make_connection() is expected to return a valid reference to a transport, of
which the caller assumes ownership. During make_connection() (which calls
complete_connection in IIOP to accomodate parallel connects) there are some
initial checks for validity after which the transport is assumed to be OK, and
is placed in the cache and returned to the caller. If a second thread handles a
connection failure during this time it can lead to either a stolen reference or
to a closed connection in the cache. 

The first condition is fatal, the second thread handling the connect failure
removes a reference to the transport via Connection_Handler::close_handler().
But then this transport is cached, and returned to the profile transport
resolver, which assumes ownership of a now stale reference. When the PTR goes
out of scope it removes the last reference from the transport, which deletes
itself even though the cache still has an entry, thus triggering a segv when the
cache is cleaned up or that transport is otherwise referenced via the cache
entry, or an abort if the ASSERT in the transport's dtor is compiled in.

The second condition is not fatal, but it consumes a slot in the transport cache
for an otherwise unusable transport.

The solution is to add additional checks during the make_connection() method to
ensure that late-failing connections are caught and addressed. This boils down
to two additional checks for transport->connection_handler()->error_detected() 

Since this is fundamentally a race condition, I have added to my patch some
optional time delays that can be compiled into IIOP_Connector.cpp. Find #define
BUG_2654_A, #define BUG_2654_B, #define BUG_2654_C, and #define BUG_2654_D.
These can be used in conjunction with the new tests/Bug_2654_Regression test. To
see the failure in action, also comment out the two new error_detected checks in
IIOP_Connector.cpp.
Comment 5 Phil Mesnier 2006-10-29 08:42:31 CST
Created attachment 617 [details]
This is my revised patch, superceding the previous one
Comment 6 Phil Mesnier 2006-11-03 09:27:56 CST
The fix is committed, and applied where appropriate to non-IIOP connectors.