Bug 1202 - ORB crashes if peer exits or dies while running nested event loops
Summary: ORB crashes if peer exits or dies while running nested event loops
Status: RESOLVED FIXED
Alias: None
Product: TAO
Classification: Unclassified
Component: ORB
Version: 1.2.2
Hardware: All All
Importance: P3 normal
Assignee: Nanbor Wang
URL:
Duplicates: 978
Depends on: 1020 1222 1230 1231 1233 1269 1270
Blocks: 1277
Reported: 2002-05-08 18:42 CDT by Carlos O'Ryan
Modified: 2002-11-30 22:48 CST

See Also:


Attachments
This is a tar file with the test. (7.04 KB, application/octet-stream)
2002-05-08 18:44 CDT, Carlos O'Ryan
A proposed patch for $TAO_ROOT/tao/Transport.cpp (2.92 KB, patch)
2002-05-08 18:48 CDT, Carlos O'Ryan
Valgrind output for the crash. (4.62 KB, text/plain)
2002-06-10 09:15 CDT, Carlos O'Ryan

Description Carlos O'Ryan 2002-05-08 18:42:53 CDT
OK, brace yourself because the description of this bug is obscure, and we are
still trying to track it down.

Basically we are seeing crashes when the ORB is running nested event loops and
its peer (something playing the client or server role, or both) exits or dies.
In other words, the ORB is running these multiply nested event loops and a
connection is closed; for some reason the connection is not cleaned up properly
and all hell breaks loose.

The problem is that we have a hard time reproducing this bug.  I'll attach a
test case (that will probably go into $TAO_ROOT/tests eventually), but the test
case needs multiple runs and (apparently) it is only a problem if the peer
crashes while the ORB is trying to write(), i.e. if the connection closure is
detected in handle_input() we have no problems (that should narrow the problem
down somewhat).

One issue seems to be that when a connection is closed,
TAO_Transport::close_connection() fails to communicate this fact to the
Transport_Mux_Strategy; therefore some of the nested event loops do not exit,
and the stack can grow forever.  That one is easy to fix (patches will be
attached too), and it is necessary to fix it first, otherwise you may find it
hard to find the next one:

Sometimes we see other problems, where the Reactor is left using a handle that
has already been closed.  In this case the Reactor::handle_events() method
returns -1 (uh oh), and sometimes we get crashes because the Reactor seems to
access a stale Event_Handler pointer.  This one we have no fix for.

That's the best of my recollection, Bala please add any comments that you find
relevant.
Comment 1 Carlos O'Ryan 2002-05-08 18:43:41 CDT
Well, changing a few things.  It is an execution bug, not a compilation problem.

And it may be related to bug 1020, though that's just a guess.
Comment 2 Carlos O'Ryan 2002-05-08 18:44:30 CDT
Created attachment 115 [details]
This is a tar file with the test.
Comment 3 Carlos O'Ryan 2002-05-08 18:46:52 CDT
Attached a regression test.  Basically there are two ways to run the test:

$ ./server
$ while /bin/true; do date; sleep 5; ./client; done

This first form crashes the server (for me) because the stack grows forever.
It may take a while and may depend on your platform and timing.  Run two client
loops to stress things even more.  Don't worry about the clients crashing; they
are supposed to (i.e. that is how we reproduce the problem).

For the next run use:

$ ./server -b 128000

and then run the same client loop; Bala reports that 256000 works better for him.
Comment 4 Carlos O'Ryan 2002-05-08 18:48:59 CDT
Created attachment 116 [details]
A proposed patch for $TAO_ROOT/tao/Transport.cpp
Comment 5 Carlos O'Ryan 2002-05-08 18:51:38 CDT
I attached a patch for $TAO_ROOT/tao/Transport.cpp; basically this patch tells
the Transport_Mux_Strategy about connection closures detected during write().

The patch also uses connection_closed_i() to systematically tell both the TMS
and all the pending messages about the connection closure(s).  However, I do not
know if the connections are closed from other points, say the Connection_Handler
or something.

I find it funny that there are so many places where tms_->connection_closed()
was called; IMHO there should be a central routine that performs all the
connection-closure cleanup.
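
For illustration only, such a central routine could look roughly like the
sketch below.  This is not the attached patch: the class layout and the
pending-message bookkeeping are made up, and only the connection_closed_i()
and tms_->connection_closed() names come from the discussion above.

// Hypothetical sketch, not the attached patch: one routine that every
// closure path (handle_input(), a failed write(), an explicit close)
// funnels through, so the TMS and any queued messages always hear about it.
class Transport_Mux_Strategy
{
public:
  // Wake up nested event loops waiting for replies on this connection.
  void connection_closed (void) { /* ... */ }
};

class TAO_Transport
{
public:
  void connection_closed_i (void)
  {
    // Tell the Transport_Mux_Strategy so nested event loops can unwind...
    this->tms_->connection_closed ();
    // ...and fail any messages still queued on this connection.
    this->cancel_pending_messages ();
  }

private:
  // Mark queued messages as failed so their callers stop waiting (made up).
  void cancel_pending_messages (void) { /* ... */ }

  Transport_Mux_Strategy *tms_;
};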
Comment 6 Nanbor Wang 2002-05-10 09:27:59 CDT
I am accepting the bug.  I will add my $0.02 on the details of what I see later.

Comment 7 Carlos O'Ryan 2002-05-23 03:01:49 CDT
FWIW, I have reproduced the problem with the Select_Reactor too, so TP_Reactor
is not fully at fault.  This was on a real application, not the regression test
attached to this bug report, so there is a small chance it could be the app
doing something silly, but I doubt it.

For the record though: the application does have sockets that are not controlled
by the ORB, but as far as I can tell it follows the "Reactor Rules" to manage them.

Comment 8 Nanbor Wang 2002-05-24 12:17:36 CDT
Just for the record, the TP_Reactor shares a bunch of code with the
Select_Reactor, especially the portions where the crash occurs, so I am not
surprised by what you saw.

Having said that, I still feel that something really goofy is going on in the
ORB that crashes the Reactor, i.e. the ORB does something funny to the Reactor
which makes it crash.
Comment 9 Carlos O'Ryan 2002-06-10 09:15:25 CDT
Created attachment 119 [details]
Valgrind output for the crash.
Comment 10 Carlos O'Ryan 2002-06-10 09:20:23 CDT
For the record: Kitty was able to reproduce the bug inside valgrind (a memory
debugging tool for Linux, very cool btw!).  The output messages are in the
attachment above.  The information is a bit hard to read, but basically an
IIOP_Connection_Handler is deleted and then the Reactor tries to use it, oops.

I was able to confirm this by hacking the code a bit: eliminate the call to
destroy() in the handle_close_i() method (yes, I'm intentionally creating a
memory leak, but just for debugging), and then check the value of the
transport() field on calls to handle_input().  At some point handle_input()
gets called with transport() == 0.  The transport is only reset in
handle_close_i(), right before that call to destroy() [and therefore right
before delete this; is called].
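
A stripped-down illustration of that hack (hypothetical code, not the real
IIOP_Connection_Handler):

// Illustration only: with destroy()/delete disabled, seeing transport() == 0
// inside handle_input() proves the Reactor dispatched a handler that had
// already been through handle_close_i().
class Debug_Connection_Handler
{
public:
  Debug_Connection_Handler (void *transport) : transport_ (transport) {}

  int handle_input (void)
  {
    if (this->transport_ == 0)
      return -1;            // stale dispatch: the handler outlived its connection
    /* ... normal request processing ... */
    return 0;
  }

  void handle_close_i (void)
  {
    this->transport_ = 0;   // the transport is only reset here...
    // this->destroy ();    // ...right before "delete this" (disabled for debugging)
  }

private:
  void *transport_;
};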

So, something is broken with the reference counting for this object.  The whole
reference counting business is weird.  I do not understand it fully, which
probably indicates that it needs more simplification (so we can understand it
and convince ourselves that it is doing the "Right Thing"[tm]).
Comment 11 Carlos O'Ryan 2002-06-11 14:32:16 CDT
Breaking news, it seems like it is not the reference counting after all.  The
problem seems to be around the notify() calls.  In my testing the ORB is doing
something like this:

1 Receive two simultaneous requests on one event handler.
2 Queue the second request using the notify() trick.
3 Send the response to the first request.
4 Ooops, the connection is broken (client died), close the connection.
5 Delete the event handler.
6 Dequeue the notification and try to use the event handler.
7 Ooooooooooops, the event handler in the notification pipe/queue is invalid
8 Crash.

My current solution is to purge the notification queue between steps 4 and 5.
Any other suggested solutions are welcome!
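
For reference, the purge boils down to something like the sketch below; the
wrapper function is made up for illustration, but
ACE_Reactor::purge_pending_notifications() is the real call.

#include "ace/Reactor.h"
#include "ace/Event_Handler.h"

// Sketch: after the connection is closed (step 4) but before the event
// handler is deleted (step 5), drop any queued notifications that still
// reference it, so the Reactor never dispatches a dangling handler.
void purge_stale_notifications (ACE_Reactor *reactor,
                                ACE_Event_Handler *handler)
{
  // Removes pending notifications for this handler; returns the number
  // purged, or -1 on error.
  (void) reactor->purge_pending_notifications (handler,
                                               ACE_Event_Handler::ALL_EVENTS_MASK);
}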
Comment 12 Carlos O'Ryan 2002-06-12 18:09:29 CDT
Adding dependency on bug 1222.  I'm trying to separate the multiple problems
reported in this bug so we can clean them up one at a time.
Comment 13 Nanbor Wang 2002-06-13 23:56:07 CDT
Hmm.. purging may be an idea if the notification queue is used in the Reactor.
It wouldn't work if a pipe is used to send notifications;
purge_pending_notifications() is a no-op when a pipe is used.  Been there, done
that, when trying to plug race conditions from the infamous 1020.

Looks like the diagnosis of the source of the problem is right.  We need to come
up with a better way to fix this.  If you can give me some time to think about
this (a day or two), I think I can fix this as soon as I am back.
Comment 14 Carlos O'Ryan 2002-06-14 09:43:54 CDT
Two suggestions:

- Increase the reference count every time the Connection_Handler is put through
the notification mechanism.  Then decrement the count in handle_input(), but
only if the ACE_HANDLE parameter is ACE_INVALID_HANDLE.  This only works if
notifications have an invalid handle while normal events have the right handle.
Reading the code confirms this, but I'm not sure if there is a "promise" that
this will be the case.  (A sketch of this idea appears at the end of this comment.)

- Instead of sending a notification to the TAO_Connection_Handler, create a
helper Event_Handler, something like:

class TAO_Notifier : public ACE_Event_Handler
{
public:
  TAO_Notifier (TAO_Connection_Handler *ch) : ch_ (ch)
  {
    // increase ch refcount
  }
  virtual ~TAO_Notifier (void)
  {
    // decrease ch refcount
  }
  virtual int handle_input (ACE_HANDLE h)
  {
    // Forward the notification to the real handler...
    (void) this->ch_->handle_input (h);
    return -1; // ...and tell the Reactor we are done with this helper.
  }
  virtual int handle_close (ACE_HANDLE, ACE_Reactor_Mask)
  {
    delete this; // Automatically commit suicide...
    return 0;
  }

private:
  TAO_Connection_Handler *ch_; // kept alive while the notification is queued
};

Thus a temporary object is created for each notification, and it
automatically maintains the reference count too.
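
For what it's worth, usage would presumably look something like this (the
notifier/reactor/handler variables are made up for illustration):

// Hypothetical usage of the TAO_Notifier sketch above: queue the helper
// through the Reactor instead of the connection handler itself.
TAO_Notifier *notifier = new TAO_Notifier (connection_handler);
reactor->notify (notifier, ACE_Event_Handler::READ_MASK);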

In both cases we need to ensure that all notifications are indeed executed;
otherwise we will leak resources at the end of the day.
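
Regarding the first suggestion, a minimal sketch of the idea follows.  The
class and its refcounting members are hypothetical, and it leans on the
unverified assumption above that notification dispatches call handle_input()
with ACE_INVALID_HANDLE while real socket events pass the actual handle.

#include "ace/Event_Handler.h"
#include "ace/Reactor.h"

// Hypothetical sketch of suggestion one, not real TAO code.
class Refcounted_Connection_Handler : public ACE_Event_Handler
{
public:
  Refcounted_Connection_Handler (void) : refcount_ (1) {}

  void queue_notification (ACE_Reactor *reactor)
  {
    ++this->refcount_;  // keep the handler alive while the notification is queued
    reactor->notify (this, ACE_Event_Handler::READ_MASK);
  }

  virtual int handle_input (ACE_HANDLE h)
  {
    if (h == ACE_INVALID_HANDLE)
      {
        // Dispatched from the notification queue: balance queue_notification().
        if (--this->refcount_ == 0)
          {
            delete this;
            return 0;
          }
      }
    /* ... normal request processing ... */
    return 0;
  }

private:
  long refcount_;
};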
 
Comment 15 Nanbor Wang 2002-06-18 13:41:23 CDT
Adding dependency
Comment 16 Carlos O'Ryan 2002-06-20 12:13:55 CDT
Adding 1233 to the list; probably a short-lived dependency, but important to
document anyway.
Comment 17 Nanbor Wang 2002-07-03 13:34:56 CDT
This problem should be fixed now.  This bug depended on all of the bugs listed
above except 1020, and since all of those have been fixed, this one should be
fixed as well.
Comment 18 Carlos O'Ryan 2002-08-06 10:46:23 CDT
Re-opening this bug and adding dependency on a new crash (with attached
regression test.)
Comment 19 Carlos O'Ryan 2002-08-10 21:54:39 CDT
*** Bug 978 has been marked as a duplicate of this bug. ***
Comment 20 Nanbor Wang 2002-11-30 22:48:36 CST
This bug should be closed since all the dependencies of this bug have been
satisfied.