Summary: | Race condition in TP reactor causes spin or core. | ||
---|---|---|---|
Product: | ACE | Reporter: | Phil Mesnier <mesnierp> |
Component: | ACE Core | Assignee: | Phil Mesnier <mesnierp> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | richard.rausch |
Priority: | P3 | ||
Version: | 5.5.2 | ||
Hardware: | All | ||
OS: | All | ||
Attachments: |
This is my proposed fix for this bug
This is a revised patch, addressing the inadvertent bool/int changes. A tweak to avoid problems with unbind attempts with invalid handles |
Description
Phil Mesnier
2006-09-20 10:01:46 CDT
Created attachment 604 [details]
This is my proposed fix for this bug
Reassing to reporter, Phil, you seem to revert some changes done in head, this seems to need some more work. Created attachment 609 [details]
This is a revised patch, addressing the inadvertent bool/int changes.
accepting Phil, the code still says it is a workaround, any ideas for the real fix and a regression test? You are correct. I suspect this bug is related to other TP Reactor related race conditions, but my original sponsor has reached his spending limit so I am out of time digging into this. My workaround presented here definitely addresses the side-effect of the race, but it doesn't solve the race. It might be the case that my patch is in fact the correct fix, but again I don't know enough about the TP Reactor internals to make such a declarative statement. As for a unit test, a thought I have is that I could make a pair of processes, one the sender, with many threads, and an event handler registering for write events. The other process would randomly close the connection, or abort the connection to try and replicate the error. The thing I don't understand is that for the exact failure condition to be met, the event handler for the connection must be removed from the reactor, but associated handle not removed from the reactor's wait_set_.wr_mask_. For that to happen, something else must be able to detect the socket closure and unregister the handler from the reactor. Maybe the scenario has to have the event handler registered for reading and writing, the peer closes the connection which causes the reactor to dispatch a read (eof) event. Since this is the TP reactor, the thread taking the read event removes the handle from the rd_mask to prevent the following thread to retrigger on the same event, thus leaving only the write. The read handling thread gets the eof and unregisters the event handler, but since the following thread is already in wait_for_multiple_events (either before, within, or just following) the call to select, its wait_set_ cannot be refreshed thus setting the stage for failure. Phil, did your customer had the problem on 1.4a or 1.5.2 or head? Some race conditions did get fixed already in the reactor. Did you retest on 1.4a or on head, I think we should only fix bugs when we can reproduce them on head, maybe other things made the problem already go away. Actually, my customer started out with 1.4a, then moved to 1.5.2, and finally tried DOC HEAD. I don't think any other recent changes affect this problem. My customer's goal is to settle on a stable/supported version of TAO, but for the time being they are using something locally patched so they can meet an internal deadline. Created attachment 613 [details]
A tweak to avoid problems with unbind attempts with invalid handles
Would be nice to have a regression to reproduce this. Before a final commit any code that is obsolete should be removed at least I agree completely about the regression test. I think we may actually be seeing another symptom of the underlying bug, where the reactor gets an error, enters the handler registry's unbind, gets a valid event handler and calls handle_close on it. When this handler happens to be an IIOP_Connection_Handler, this results in a core since the connection_handler's handle_close() simply contains ACE_ASSERT(0). I believe this is really Scenario 3 for my initial bug report. In this case, the thread closing down the socket does so before Thread 1 enters select, but then gets interrupted before it finish unregistering its handler, thus Thread 1 triggers the error condition. All these symptoms point back to races between threads manipulating the state of sockets and event handlers and the thread preparing a list of handles and calling select. Perhaps there is no way to address this race without severely impacting performance, putting in more locking for instance. Maybe the right way to treat this is to simply address the symptoms as they arise. In this third case, that would be removing the ASSERT(0) and perhaps replacing it with a debug output. As for a test case, perhaps a heavily threaded client that opens lots of connections for writing and randomly closes them, or writes to them might do the trick. It turns out the Oct 13 report was related to an application error, nothing in ACE or TAO. Regarding the larger "unknown reason," I now believe this is due to the use of schedule_wakeup(). By modifying the Reactor_Dispatch_Order_Test slightly I was able to simulate the error states introduced in this bug. In TAO, schedule_wakeup() is used in conjuction with asynch oneways. While Transport::schedule_output_i() tries to avoid scheduling with an event handler that is otherwise unregistered, but there is still a window after the test but before scheduling during which another thread can unregister the handler thus setting up a spin-deadlock or a crash. I think that my patch is appropriate, I will clean up the comments that refer to an unknown condition. I'm still trying to construct a TAO test to replicate this error. Please make sure you have a regression that reproduces the problem and that is in svn. With your remark of the 13th in mind we should make sure we don't change things without knowing whether it solves the problem. As I said yesterday, the issue of the 13th is not related. There was an application bug that was incorrectly calling ::close() on a socket owned by the ORB. There's no longer a mystery surrounding this issue. I have prepared an ACE regression test for this bug, it explicitly sets the preconditions rather than trying to trigger the race, which is very rare. I've also had to modify my patch to accomodate some recent changes to the Select_Reactor_Handler_Repository. My changes are committed. SVN rev 75112 contains the regression test but not the fix, rev 75113 contains the fix. *** Bug 2213 has been marked as a duplicate of this bug. *** |