Please report new issues at https://github.com/DOCGroup
It is possible for a thread (Thread 1) to be preparing to enter select() with a fully prepared collection of fd_sets while another thread (Thread 2) closes down one of the sockets and cleans up the resources associated with it. In the case observed, the affected socket was always in just the write set.

Scenario 1: If this happens the moment before Thread 1 enters select(), then select() returns with errno = EBADF. This causes the reactor to invoke its error-handling code to remove the offending handle from the fd_set(s). This results in a call to Select_Reactor_T::check_handles, which calls remove_handle_i, which in turn calls ACE_Select_Reactor_Handler_Repository::unbind. At this point, unbind tries to look up the event handler associated with the specified socket descriptor and fails. Since it failed to find the handler, it does not go through the logic to remove the descriptor from the reactor's handle sets, and returns -1. The higher-level code does not detect the failure and returns normally; select() is retried with an unmodified handle set, fails, and loops again.

Scenario 2: If the original cleanup occurs the moment after Thread 1 exits from select() but before it can dispatch the event, a SEGV occurs. In this case, the reactor is trying to dispatch the write event and tries to look up the event handler associated with the socket descriptor. But since the handler has been removed from the repository, find() returns a null handler. This is then supplied to ACE_EH_Dispatch_Info::set, which cores when it tries to reference values in the event handler.

I have patches that address both of these scenarios, but I wonder if there is a better solution. Unfortunately these are race conditions and very hard to reproduce; I do not know how to write a test case.
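A minimal sketch of the kind of guard that Scenario 2 suggests is missing. This is not the actual ACE_TP_Reactor dispatch path; the repository parameter and the function name are hypothetical stand-ins used only to illustrate the failure mode:

```cpp
// Hedged sketch only -- illustrates the null-handler guard Scenario 2
// implies, not ACE's real dispatch code.  REPOSITORY stands in for the
// reactor's handler repository; wr_mask stands in for wait_set_.wr_mask_.
#include "ace/Event_Handler.h"
#include "ace/Handle_Set.h"

template <typename REPOSITORY>
int
dispatch_write_event (ACE_HANDLE handle,
                      REPOSITORY &handler_rep,
                      ACE_Handle_Set &wr_mask)
{
  ACE_Event_Handler *eh = handler_rep.find (handle);
  if (eh == 0)
    {
      // Another thread unbound the handler after select() returned.
      // Without this check the null pointer would be handed to
      // ACE_EH_Dispatch_Info::set and dereferenced.
      wr_mask.clr_bit (handle);   // drop the stale handle from the wait set
      return 0;                   // nothing to dispatch
    }
  return eh->handle_output (handle);
}
```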
Created attachment 604 [details] This is my proposed fix for this bug
Reassigning to reporter. Phil, your patch seems to revert some changes already made on HEAD; this needs some more work.
Created attachment 609 [details] This is a revised patch, addressing the inadvertent bool/int changes.
accepting
Phil, the code still says it is a workaround. Any ideas for the real fix and a regression test?
You are correct. I suspect this bug is related to other TP_Reactor-related race conditions, but my original sponsor has reached his spending limit, so I am out of time digging into this. My workaround presented here definitely addresses the side effect of the race, but it doesn't solve the race. It might be the case that my patch is in fact the correct fix, but again I don't know enough about the TP_Reactor internals to make such a declarative statement.

As for a unit test, one thought is to make a pair of processes: a sender, with many threads and an event handler registering for write events, and a peer that randomly closes or aborts the connection to try to replicate the error. The thing I don't understand is that for the exact failure condition to be met, the event handler for the connection must be removed from the reactor while the associated handle is not removed from the reactor's wait_set_.wr_mask_. For that to happen, something else must be able to detect the socket closure and unregister the handler from the reactor.

Maybe the scenario requires the event handler to be registered for both reading and writing: the peer closes the connection, which causes the reactor to dispatch a read (EOF) event. Since this is the TP Reactor, the thread taking the read event removes the handle from the rd_mask to prevent a following thread from retriggering on the same event, leaving only the write mask. The read-handling thread gets the EOF and unregisters the event handler, but since the following thread is already in wait_for_multiple_events (either before, within, or just after the call to select), its wait_set_ cannot be refreshed, setting the stage for failure.
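A sketch of the kind of handler such a test might use, assuming a peer process that randomly closes connections while several threads run the TP Reactor's event loop. The class name and self-deleting ownership are illustrative only, not part of the proposed patch:

```cpp
// Hypothetical test handler: registered for both read and write events.
// When the peer closes the connection, handle_input() sees EOF, returns -1,
// and the reactor calls handle_close(), which unregisters/destroys the
// handler.  A second thread already inside handle_events() at that moment
// still carries the old wr_mask_, which is the window described above.
#include "ace/Event_Handler.h"
#include "ace/OS_NS_sys_socket.h"

class Write_Heavy_Handler : public ACE_Event_Handler
{
public:
  Write_Heavy_Handler (ACE_HANDLE h) : handle_ (h) {}

  virtual ACE_HANDLE get_handle (void) const { return this->handle_; }

  virtual int handle_input (ACE_HANDLE)
  {
    char buf[1];
    ssize_t const n = ACE_OS::recv (this->handle_, buf, sizeof buf);
    return n <= 0 ? -1 : 0;     // EOF or error -> reactor calls handle_close
  }

  virtual int handle_output (ACE_HANDLE)
  {
    return 0;                   // pretend queued data was flushed
  }

  virtual int handle_close (ACE_HANDLE, ACE_Reactor_Mask)
  {
    ACE_OS::closesocket (this->handle_);
    delete this;                // handler owns itself in this sketch
    return 0;
  }

private:
  ACE_HANDLE handle_;
};

// Registration, assuming `reactor' is the shared ACE_TP_Reactor instance:
//   reactor->register_handler (new Write_Heavy_Handler (h),
//                              ACE_Event_Handler::READ_MASK |
//                              ACE_Event_Handler::WRITE_MASK);
```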
Phil, did your customer have the problem on 1.4a, 1.5.2, or HEAD? Some race conditions in the reactor have already been fixed. Did you retest on 1.4a or on HEAD? I think we should only fix bugs we can reproduce on HEAD; maybe other changes have already made the problem go away.
Actually, my customer started out with 1.4a, then moved to 1.5.2, and finally tried DOC HEAD. I don't think any other recent changes affect this problem. My customer's goal is to settle on a stable/supported version of TAO, but for the time being they are using something locally patched so they can meet an internal deadline.
Created attachment 613 [details] A tweak to avoid problems with unbind attempts with invalid handles
It would be nice to have a regression test that reproduces this. Before a final commit, any obsolete code should at least be removed.
I agree completely about the regression test. I think we may actually be seeing another symptom of the underlying bug, where the reactor gets an error, enters the handler repository's unbind, gets a valid event handler, and calls handle_close on it. When this handler happens to be an IIOP_Connection_Handler, this results in a core dump since the connection handler's handle_close() simply contains ACE_ASSERT(0). I believe this is really Scenario 3 for my initial bug report. In this case, the thread closing down the socket does so before Thread 1 enters select, but then gets interrupted before it finishes unregistering its handler, so Thread 1 triggers the error condition.

All these symptoms point back to races between the threads manipulating the state of sockets and event handlers and the thread preparing a list of handles and calling select. Perhaps there is no way to address this race without severely impacting performance, by putting in more locking for instance. Maybe the right way to treat this is to simply address the symptoms as they arise. In this third case, that would mean removing the ACE_ASSERT(0) and perhaps replacing it with a debug output, as sketched below. As for a test case, perhaps a heavily threaded client that opens lots of connections for writing and randomly closes or writes to them might do the trick.
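A sketch of that symptomatic change: log the unexpected close instead of asserting. The method body here is illustrative only, not the change that was eventually committed:

```cpp
// Sketch of the suggested fix for "Scenario 3": replace ACE_ASSERT (0)
// in the connection handler's handle_close() with a diagnostic so the
// race is visible without bringing the process down.
int
TAO_IIOP_Connection_Handler::handle_close (ACE_HANDLE handle,
                                           ACE_Reactor_Mask mask)
{
  ACE_DEBUG ((LM_DEBUG,
              ACE_TEXT ("TAO (%P|%t) - IIOP_Connection_Handler::handle_close, ")
              ACE_TEXT ("unexpected close, handle %d, mask %d\n"),
              (int) handle, (int) mask));
  return 0;
}
```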
It turns out the Oct 13 report was related to an application error, nothing in ACE or TAO. Regarding the larger "unknown reason," I now believe this is due to the use of schedule_wakeup(). By modifying the Reactor_Dispatch_Order_Test slightly I was able to simulate the error states introduced in this bug. In TAO, schedule_wakeup() is used in conjunction with asynch oneways. Transport::schedule_output_i() tries to avoid scheduling with an event handler that is otherwise unregistered, but there is still a window after that test and before scheduling during which another thread can unregister the handler, setting up a spin-deadlock or a crash. I think that my patch is appropriate; I will clean up the comments that refer to an unknown condition. I'm still trying to construct a TAO test to replicate this error.
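The window described above is a classic check-then-act race. A simplified sketch, not TAO's actual Transport::schedule_output_i() code; `still_registered` is a hypothetical predicate standing in for whatever test is performed before scheduling:

```cpp
// Hedged sketch of the race window around schedule_wakeup().
if (still_registered (eh))    // hypothetical check; passes ...
  {
    // Window: another thread may call reactor->remove_handler (eh, ...)
    // right here, after the check but before the wakeup is scheduled.
    // The wakeup then refers to a handler the repository no longer knows
    // about, producing the spin or crash described above.
    reactor->schedule_wakeup (eh, ACE_Event_Handler::WRITE_MASK);
  }
```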
Please make sure you have a regression test that reproduces the problem and that it is in svn. With your remark of the 13th in mind, we should make sure we don't change things without knowing whether the change solves the problem.
As I said yesterday, the issue of the 13th is not related: there was an application bug that was incorrectly calling ::close() on a socket owned by the ORB, so there's no longer a mystery surrounding that issue. I have prepared an ACE regression test for this bug; it explicitly sets the preconditions rather than trying to trigger the race, which is very rare. I've also had to modify my patch to accommodate some recent changes to the Select_Reactor_Handler_Repository.
My changes are committed. SVN rev 75112 contains the regression test but not the fix; rev 75113 contains the fix.
*** Bug 2213 has been marked as a duplicate of this bug. ***