OK, brace yourself, because the description of this bug is obscure and we are still trying to track it down. Basically we are seeing crashes when the ORB is running nested event loops and its peer (something playing the client role, the server role, or both) exits or dies. In other words, the ORB is running these multiply nested event loops, a connection is closed, and for some reason the connection is not cleaned up properly and all hell breaks loose.

The problem is that we have a hard time reproducing this bug. I'll attach a test case (which will probably go into $TAO_ROOT/tests eventually), but the test case needs multiple runs, and apparently it is only a problem if the peer crashes while the ORB is trying to write(), i.e. if the connection closure is detected in handle_input() we have no problems (that should narrow the problem down somewhat).

One issue seems to be that when a connection is closed, TAO_Transport::close_connection() fails to communicate this fact to the Transport_Mux_Strategy; therefore some of the nested event loops do not exit, and the stack can grow forever. That one is easy to fix (patches will be attached too), and it is necessary to fix it, otherwise you may find it hard to find the next one: sometimes we see other problems, where the Reactor is left using a handle that has already been closed. In this case the Reactor::handle_events() method returns -1 (uh oh), and sometimes we get crashes because the Reactor seems to access a stale Event_Handler pointer. This one we have no fix for.

That's the best of my recollection. Bala, please add any comments that you find relevant.
Well, changing a few things. It is an execution bug, not a compilation problem. And it may be related to 1020, though that's just a guess.
Created attachment 115 [details] This is a tar file with the test.
Attached a regression test. Basically there are two ways to run the test:

    $ ./server

    $ while /bin/true; do date; sleep 5; ./client; done

This first form crashes the server because the stack grows forever (at least for me). It may take a while and may depend on your platform and timing. Run two client loops to stress things even more. Don't worry about the clients crashing, they are supposed to (i.e. this is how we reproduce the problem). For the next run use:

    $ ./server -b 128000

and then the same client loop. Bala reports that 256000 works better for him.
Created attachment 116 [details] A proposed patch for $TAO_ROOT/tao/Transport.cpp
I attached a patch for $TAO_ROOT/tao/Transport.cpp. Basically this patch tells the Transport_Mux_Strategy about connection closures detected during write(). The patch also uses connection_closed_i() to systematically tell both the TMS and all the pending messages about the connection closure(s). However, I do not know if the connections are closed from other points, say the Connection_Handler or something. I find it funny that there are so many places where tms_->connection_closed() is called; IMHO there should be a central routine that performs all the connection closure cleanup.
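Not the actual patch, but roughly the shape of the central routine I have in mind; only the tms_->connection_closed() call is taken from the existing code, the rest is illustrative pseudocode:

    // Hypothetical sketch of a single cleanup entry point on
    // TAO_Transport.  Only tms_->connection_closed() appears in the
    // existing code; everything else here is made up for illustration.
    void
    TAO_Transport::connection_closed_i (void)
    {
      // Tell the Transport_Mux_Strategy, so nested event loops
      // waiting for replies on this connection can unwind instead
      // of growing the stack forever.
      this->tms_->connection_closed ();

      // Tell every message still pending on this connection that it
      // will never complete (pseudocode, the real interface may differ):
      //   for each pending message m:
      //     m->connection_closed ();
    }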
I am accepting the bug. I will add my $0.02 worth of details on what I see later.
FWIW, I have reproduced the problem with the Select_Reactor too, so TP_Reactor is not fully at fault. This was on a real application, not the regression test attached to this bug report, so there is a small chance it could be the app doing something silly, but I doubt it. For the record though: the application does have sockets that are not controlled by the ORB, but as far as I can tell it follows the "Reactor Rules" to manage them.
Just for the record, the TP_Reactor shares a bunch of code with the Select_Reactor, especially the portions where the crash occurs, so I am not surprised by what you saw. Having said that, I still feel that something really goofy is going on in the ORB that crashes the Reactor, i.e. the ORB does something funny to the reactor which makes it crash.
Created attachment 119 [details] Valgrind output for the crash.
For the record: Kitty was able to reproduce the bug inside valgrind (a memory debugging tool for Linux, very cool! btw). The output messages are in the attachment above. The information is a bit hard to read, but basically an IIOP_Connection_Handler is deleted and then the Reactor tries to use it, ooops. I was able to confirm this by hacking the code a bit: eliminate the call to destroy() on the handle_close_i() method, yeah, I'm intentionally creating a memory leak, but just for debugging. Now check the value of the transport() field on calls to handle_input(). At some point it gets called with transport() == 0. The transport is only reset in handle_close_i(), right before that call to destroy() [and therefore right before delete this; is called]. So, something is broken with the reference counting for this object. The whole reference counting business is weird. I do not understand it fully, which probably indicates that it needs more simplification (so we can understand it and convince ourselves that it is doing the "Right Thing"[tm]).
Breaking news, it seems like it is not the reference counting after all. The problem seems to be around the notify() calls. In my testing the ORB is doing something like this:

1. Receive two simultaneous requests on one event handler.
2. Queue the second request using the notify() trick.
3. Send the response to the first request.
4. Ooops, the connection is broken (client died), close the connection.
5. Delete the event handler.
6. Dequeue the notification and try to use the event handler.
7. Ooooooooooops, the event handler in the notification pipe/queue is invalid.
8. Crash.

My current solution is to purge the notification queue between steps 4 and 5. Any other suggested solutions are welcomed!
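A minimal sketch of what I mean by purging, assuming the purge happens in the handler's close path right before the handler goes away (the exact spot in the real code may differ):

    // Sketch only: before the event handler is destroyed, drop any
    // notifications that still reference it, so the reactor cannot
    // dispatch to a dangling pointer later (steps 6-8 above).
    this->reactor ()->purge_pending_notifications (this);
    // ... now it is safe to destroy()/delete the handler ...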
Adding dependency on bug 1222. I'm trying to separate the multiple problems reported in this bug so we can clean them up one at a time.
Hmm.. purging may be an idea if the notification queue is used in the reactor. It wouldn't work if a pipe is used to send notifications; purge_pending_notifications() is a no-op when the pipe is used. Been there, done that, when trying to plug race conditions from the infamous 1020. It looks like the diagnosis of the source of the problem is right, but we need to come up with a better way to fix this. If you can give me some time to think about this (a day or two), I think I can fix this as soon as I am back.
Two suggestions:

- Increase the reference count every time the Connection_Handler is put through the notification mechanism, then decrement the count in handle_input(), but only if the ACE_HANDLE parameter is ACE_INVALID_HANDLE. This only works if notifications carry an invalid handle while normal events carry the right handle. Reading the code confirms this, but I'm not sure if there is a "promise" that this will be the case. (See the sketch after this list.)

- Instead of sending a notification to the TAO_Connection_Handler, create a helper Event_Handler, something like:

      class TAO_Notifier : public ACE_Event_Handler
      {
      public:
        TAO_Notifier (TAO_Connection_Handler *ch)
          : ch_ (ch)
        {
          // Increase the ch refcount.
        }
        virtual ~TAO_Notifier (void)
        {
          // Decrease the ch refcount.
        }
        virtual int handle_input (ACE_HANDLE h)
        {
          (void) this->ch_->handle_input (h);
          return -1; // Automatically commit suicide...
        }
      private:
        TAO_Connection_Handler *ch_;
      };

  Thus a temporary object is created for each notification, and it automatically keeps the reference count too.

In both cases we need to ensure that all notifications are indeed executed, otherwise we will leak resources at the end of the day.
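For the first suggestion, a minimal sketch of the idea; incr_refcount()/decr_refcount() are placeholders for whatever reference-counting calls the Connection_Handler really uses:

    // Sketch of suggestion one.  When queueing the request via the
    // notification mechanism, bump the reference count first:
    this->incr_refcount ();                               // placeholder name
    this->reactor ()->notify (this, ACE_Event_Handler::READ_MASK);

    // Then, in handle_input(), drop the count only for dispatches
    // that came through notify() (they carry ACE_INVALID_HANDLE):
    int
    TAO_Connection_Handler::handle_input (ACE_HANDLE h)
    {
      if (h == ACE_INVALID_HANDLE)
        this->decr_refcount ();                           // placeholder name
      // ... normal processing ...
    }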
Adding dependency
Adding 1233 to the list, probably a short-lived dependency, but important to document anyways.
This problem should be fixed now. This bug depended on all of the bugs listed except 1020, and all of those have been fixed.
Re-opening this bug and adding dependency on a new crash (with attached regression test.)
*** Bug 978 has been marked as a duplicate of this bug. ***
This bug should be closed since all of its dependencies have been resolved.