Summary: | Connection closure during oneway send causes core | ||
---|---|---|---|
Product: | TAO | Reporter: | Phil Mesnier <mesnierp> |
Component: | ORB | Assignee: | Phil Mesnier <mesnierp> |
Status: | RESOLVED FIXED | ||
Severity: | normal | ||
Priority: | P3 | ||
Version: | 1.5.2 | ||
Hardware: | All | ||
OS: | All | ||
Bug Depends on: | |||
Bug Blocks: | 3680 | ||
Attachments: |
This is my proposed fix for this bug
This is my revised patch, superceding the previous one |
Description
Phil Mesnier
2006-09-20 12:55:57 CDT
Created attachment 605 [details]
This is my proposed fix for this bug
to reported, Phil, I don't think we should make asserts dependent on a debug level. Furthermore, shouldn't ACE_ERROR be used instead of ACE_ERROR. Is the path you changed the only way a transport can be closed, with the new method we need to make sure that this is the only path, because when there is another path, that also has to be changed. We should get a 100% fix. Hi Johnny, The reason I put the asserts in the debug level was that they were only added specifically for validating the Bug_2494_Regression test. The condition that would trip the assert should not be fatal in a running application, thus my proposed change to simply flush the buffer at that point. I also intend to modify the run_test script for Bug_2494_Regression so that it runs the processes at an appropriate debug level to trigger an assert. Regarding the use of ACE_ERROR instead of ACE_DEBUG, you are right. Regarding the path to closure, the change I made is in the only place I can detect closure happening. The unfortunate nature of this test is that the event that seems to close the connection and set the stage for failure happens on another thread sometime before the fatal event. I cannot think of a way to detect that. The new method is required to set the private is_connected_ flag back to false, but this also presents the possiblity that some other method want to call it in the future. Given that the failure occurs exactly when the reference count on the transport goes to zero and the transport believes it is connected and it has data queued. It might be possible that some other set of circumstances are causing the reference count to go to zero without the connection being closed. I can't think of any, but that would also cause this scenario to happen. All I know for certain is that the complex test scenario I have that occasionally exhibits this bug stops exhibiting it after I apply the proposed patch. As with bug 2653, I have no idea how to write a regression test for this bug. Race conditions suck. I now belive the characterization of this bug is somewhat incorrect, and this will lead to a different patch. The problem is still a race, but somewhat different than what I originally described. The race at close is still there, with the potential for queuing a message just as the transport is being closed. Therefore, for symmetry at least, I believe the pre_close() method is still necessary. However the symptom of an assert failure in the Transport dtor was still occuring, but now the problem was that the transports cache_entry_ was not null, as it should be. This is due to an incomplete fix for bug 2417. The rest of this analysis focuses on IIOP, the solution should be applied to all protocols that may be used asynchronously. When a non-blocking connection is attempted that is destined to fail, the failure may be detected during the initial call to connect in which case it is handled immediately, or it may happen some time later, in which another thread is involved in handling the failure. If another thread handles the clean up, it may occur while the first thread is still within the connector's make_connection call or after. If it happens after make_connection returns, the existing code handled it. The problem case is when the connect failure is handled by another thread while the first thread is still in make_connection(). make_connection() is expected to return a valid reference to a transport, of which the caller assumes ownership. During make_connection() (which calls complete_connection in IIOP to accomodate parallel connects) there are some initial checks for validity after which the transport is assumed to be OK, and is placed in the cache and returned to the caller. If a second thread handles a connection failure during this time it can lead to either a stolen reference or to a closed connection in the cache. The first condition is fatal, the second thread handling the connect failure removes a reference to the transport via Connection_Handler::close_handler(). But then this transport is cached, and returned to the profile transport resolver, which assumes ownership of a now stale reference. When the PTR goes out of scope it removes the last reference from the transport, which deletes itself even though the cache still has an entry, thus triggering a segv when the cache is cleaned up or that transport is otherwise referenced via the cache entry, or an abort if the ASSERT in the transport's dtor is compiled in. The second condition is not fatal, but it consumes a slot in the transport cache for an otherwise unusable transport. The solution is to add additional checks during the make_connection() method to ensure that late-failing connections are caught and addressed. This boils down to two additional checks for transport->connection_handler()->error_detected() Since this is fundamentally a race condition, I have added to my patch some optional time delays that can be compiled into IIOP_Connector.cpp. Find #define BUG_2654_A, #define BUG_2654_B, #define BUG_2654_C, and #define BUG_2654_D. These can be used in conjunction with the new tests/Bug_2654_Regression test. To see the failure in action, also comment out the two new error_detected checks in IIOP_Connector.cpp. Created attachment 617 [details]
This is my revised patch, superceding the previous one
The fix is committed, and applied where appropriate to non-IIOP connectors. |