Bug 2653

Summary:	Race condition in TP reactor causes spin or core.
Product:	ACE	Reporter:	Phil Mesnier <mesnierp>
Component:	ACE Core	Assignee:	Phil Mesnier <mesnierp>
Status:	RESOLVED FIXED
Severity:	normal	CC:	richard.rausch
Priority:	P3
Version:	5.5.2
Hardware:	All
OS:	All
Attachments:	This is my proposed fix for this bug This is a revised patch, addressing the inadvertent bool/int changes. A tweak to avoid problems with unbind attempts with invalid handles

Description Phil Mesnier 2006-09-20 10:01:46 CDT

It is possible for a thread (Thread 1) to be preparing to enter select with a
fully prepared collection of fd_sets while another thread (Thread 2) closes down
one of the sockets and cleans up the resources associated with it. In the case
observed, the affected socket was always in just the write set.

Scenario 1:
If this happens the moment before Thread 1 enters select, then select will
return with errno = EBADF. This causes the reactor to invoke its error handling
code to remove the offending handle from the fdset(s). This results in a call to
Select_Reactor_T::check_handles, which calls remove_handle_i, which in turn
calls ACE_Select_Reactor_Handler_Repository::unbind. 

At this point, unbind tries to look up the event handler associated with the
specified socket descriptor and fails. Since it failed to find the handler, it
does not go through the logic to remove the sd from the reactor's handle sets,
and returns -1. The higher level code does not detect the failure and returns
normally, and select is retried with an unmodified handle set, fails and loops
again.

Scenario 2:
If the original cleanup occurs the moment after thread 1 exits from select but
before it can dispatch the event, a segv occurs. In this case, the reactor is
trying to dispatch the write event and tries to look up the event handler
associated with the sd. But since the handler has been removed from the
repository, find returns a null handler. This is then supplied to
ACE_EH_Dispatch_Info::set which cores when it tries to reference values in the EH.

I have patches that address both of these scenarios, but I wonder if there is a
better solution. 

Unfortunately these are race results and very hard to reproduce. I do not know
how to write a test case.

Comment 1 Phil Mesnier 2006-09-20 10:04:06 CDT

Created attachment 604 [details]
This is my proposed fix for this bug

Comment 2 Johnny Willemsen 2006-09-25 05:00:05 CDT

Reassing to reporter, Phil, you seem to revert some changes done in head, this
seems to need some more work.

Comment 3 Phil Mesnier 2006-09-25 16:58:34 CDT

Created attachment 609 [details]
This is a revised patch, addressing the inadvertent bool/int changes.

Comment 4 Phil Mesnier 2006-09-25 17:00:08 CDT

accepting

Comment 5 Johnny Willemsen 2006-09-26 00:59:08 CDT

Phil, the code still says it is a workaround, any ideas for the real fix and a
regression test?

Comment 6 Phil Mesnier 2006-09-26 07:45:40 CDT

You are correct. I suspect this bug is related to other TP Reactor related race
conditions, but my original sponsor has reached his spending limit so I am out
of time digging into this. My workaround presented here definitely addresses the
side-effect of the race, but it doesn't solve the race.

It might be the case that my patch is in fact the correct fix, but again I don't
know enough about the TP Reactor internals to make such a declarative statement. 

As for a unit test, a thought I have is that I could make a pair of processes,
one the sender, with many threads, and an event handler registering for write
events. The other process would randomly close the connection, or abort the
connection to try and replicate the error.

The thing I don't understand is that for the exact failure condition to be met,
the event handler for the connection must be removed from the reactor, but
associated handle not removed from the reactor's wait_set_.wr_mask_. For that to
happen, something else must be able to detect the socket closure and unregister
the handler from the reactor.

Maybe the scenario has to have the event handler registered for reading and
writing, the peer closes the connection which causes the reactor to dispatch a
read (eof) event. Since this is the TP reactor, the thread taking the read event
removes the handle from the rd_mask to prevent the following thread to retrigger
on the same event, thus leaving only the write. The read handling thread gets
the eof and unregisters the event handler, but since the following thread is
already in wait_for_multiple_events (either before, within, or just following)
the call to select, its wait_set_ cannot be refreshed thus setting the stage for
failure.

Comment 7 Johnny Willemsen 2006-09-26 07:51:04 CDT

Phil, did your customer had the problem on 1.4a or 1.5.2 or head? Some race
conditions did get fixed already in the reactor. Did you retest on 1.4a or on
head, I think we should only fix bugs when we can reproduce them on head, maybe
other things made the problem already go away.

Comment 8 Phil Mesnier 2006-09-26 11:54:12 CDT

Actually, my customer started out with 1.4a, then moved to 1.5.2, and finally
tried DOC HEAD.

I don't think any other recent changes affect this problem. 

My customer's goal is to settle on a stable/supported version of TAO, but for
the time being they are using something locally patched so they can meet an
internal deadline.

Comment 9 Phil Mesnier 2006-10-12 15:53:40 CDT

Created attachment 613 [details]
A tweak to avoid problems with unbind attempts with invalid handles

Comment 10 Johnny Willemsen 2006-10-13 02:53:12 CDT

Would be nice to have a regression to reproduce this. Before a final commit any
code that is obsolete should be removed at least

Comment 11 Phil Mesnier 2006-10-13 07:48:00 CDT

I agree completely about the regression test. I think we may actually be seeing
another symptom of the underlying bug, where the reactor gets an error, enters
the handler registry's unbind, gets a valid event handler and calls handle_close
on it. When this handler happens to be an IIOP_Connection_Handler, this results
in a core since the connection_handler's handle_close() simply contains
ACE_ASSERT(0).

I believe this is really Scenario 3 for my initial bug report. In this case, the
thread closing down the socket does so before Thread 1 enters select, but then
gets interrupted before it finish unregistering its handler, thus Thread 1
triggers the error condition.

All these symptoms point back to races between threads manipulating the state of
sockets and event handlers and the thread preparing a list of handles and
calling select. Perhaps there is no way to address this race without severely
impacting performance, putting in more locking for instance. Maybe the right way
to treat this is to simply address the symptoms as they arise. In this third
case, that would be removing the ASSERT(0) and perhaps replacing it with a debug
output.

As for a test case, perhaps a heavily threaded client that opens lots of
connections for writing and randomly closes them, or writes to them might do the
trick.

Comment 12 Phil Mesnier 2006-10-26 21:01:35 CDT

It turns out the Oct 13 report was related to an application error, nothing in
ACE or TAO.

Regarding the larger "unknown reason," I now believe this is due to the use of
schedule_wakeup(). By modifying the Reactor_Dispatch_Order_Test slightly I was
able to simulate the error states introduced in this bug.

In TAO, schedule_wakeup() is used in conjuction with asynch oneways. While
Transport::schedule_output_i() tries to avoid scheduling with an event handler
that is otherwise unregistered, but there is still a window after the test but
before scheduling during which another thread can unregister the handler thus
setting up a spin-deadlock or a crash. 

I think that my patch is appropriate, I will clean up the comments that refer to
an unknown condition.

I'm still trying to construct a TAO test to replicate this error.

Comment 13 Johnny Willemsen 2006-10-27 13:52:46 CDT

Please make sure you have a regression that reproduces the problem and that is
in svn. With your remark of the 13th in mind we should make sure we don't change
things without knowing whether it solves the problem.

Comment 14 Phil Mesnier 2006-10-27 15:48:12 CDT

As I said yesterday, the issue of the 13th is not related. There was an
application bug that was incorrectly calling ::close() on a socket owned by the
ORB. There's no longer a mystery surrounding this issue.

I have prepared an ACE regression test for this bug, it explicitly sets the
preconditions rather than trying to trigger the race, which is very rare.

I've also had to modify my patch to accomodate some recent changes to the
Select_Reactor_Handler_Repository.

Comment 15 Phil Mesnier 2006-10-27 18:09:37 CDT

My changes are committed. SVN rev 75112 contains the regression test but not the
fix, rev 75113 contains the fix.

Comment 16 Richard Rausch 2006-12-05 11:03:41 CST

*** Bug 2213 has been marked as a duplicate of this bug. ***