Bug 2494

Summary: Server aborts when client disconnects
Product: TAO Reporter: Adam Mitz <mitza>
Component: ORBAssignee: Adam Mitz <mitza>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P3    
Version: 1.5   
Hardware: All   
OS: All   
Attachments: Regression test (see next comment)

Description Adam Mitz 2006-04-04 16:27:58 CDT
We (OCI on behalf of a client -- RT8431) observed two threads operating on the
same transport in a way that lead to an ASSERT in the Transport destructor (in
debug mode) and strange "hanging" behavior (in optimize mode).  Between 1.4.3
and 1.5.0 the ASSERT has been removed.  More on that below.  

The Problem:
Thread "A" is finishing an upcall, trying to send a reply.  Since Thread "A" is
in send_message_shared() it holds the handler_lock_.  It makes it all the way to
drain_queue_helper() and calls send().  Send completes sucessfully (returns a
positive # of bytes sent) but before getting to the ACE_DEBUG line of
drain_queue_helper(), this thread is scheduled out.

Thread "B" now takes over and runs handle_input() on this same Transport.  It
gets to parse_consolidate_messages() and then consolidate_message() but the call
to recv() fails because the remote process has been terminated.  Thus thread "B"
returns -1 all the way back to TAO_Connection_Handler::handle_input_eh() which
calls close_connection() on the handler.  This handler (and transport) are now
completely closed (not registered in the reactor).  None of the methods that
thread "B" executes required the handler_lock_.

Thread "A" now continues, logging the fact that it's in drain_queue_helper() and
returns to drain_queue_i(), logging that the helper returned 1.  Since the last
send() completed OK, this thread has no clue that the transport has been closed.
 It makes it back up to send_reply_message_i(), which enqueues the message and
calls the flushing strategy's schedule_output() method.  Schedule_output()
manipulates the reactor's write mask with respect to the transport's socket, but
that socket is no longer connected and its handler is no longer registered with
the reactor.  We believe this caused the strange "hanging" behavior - with the
CPU at 100% in fstat().  Also, the message that was just queued is leaked when
the Transport desructor runs (which caused an ASSERT in 1.4.3).

The fix:
Before this change, the return value from schedule_output was ignored.  With
this change, if -1 is returned, the message is dequeued and memory is reclaimed.
   Also with this change, the Transport's schedule_output_i (which is the
implmenentation of schedule_output for the default and reactive flushing
strategies) will attempt to locate this transport's connection handler in the
reactor.  If it's no longer registered, schedule_output_i will return -1.  The
Block flushing strategy already returns -1 if the connection is closed.
This fix has been checked-in to CVS with the ChangeLog entry:
Fri Mar 31 17:19:13 UTC 2006  Adam Mitz  <mitza@ociweb.com>

Testability (the ASSERT issue):
I've created a regression test for this bug.  I will be attaching the test to
this Bugzilla entry.  For the regression to fail (without the fix), test depends
on the server aborting due to the ASSERT -- as it did in 1.4.3.  Since both of
the ASSERTs in the Transport destructor has been removed, it isn't possible to
detect the bug.  The other problem with the ASSERT being removed is that memory
is being leaked when the ASSERT (would have) hit.

Questions:
1. Should the ASSERTs in the Transport destructor be re-instated?  If not, maybe
we could add code to the Transport destructor to check for this condition and
log it (with a given ORBDebugLevel) -- the regression test would parse the log
for this message.  Without one of these two options (or something else that
someone could suggest), we don't have an automated regression test.

2. Should code be added to the Transport destructor to delete its queued
messages?  Memory may leak if there are similar problems that are not solved by
the bugfix described above.

Thanks,
Adam
Comment 1 Adam Mitz 2006-04-04 16:45:08 CDT
Created attachment 514 [details]
Regression test (see next comment)
Comment 2 Adam Mitz 2006-04-04 16:45:22 CDT
Attached is the regression test, unzip in $TAO_ROOT.  The tarball contains a
diff file, "Transport_cpp.diff", that needs to be applied to Transport.cpp in
order for this test to work.  With the patch applied, normal compiliation of
Transport.cpp will give you the regression.  Defining the preprocessor macro
RT8431_FIX and recompiling Transport.cpp will show the test passing.
Comment 3 Simon McQueen 2006-04-05 01:13:11 CDT
Hi Adam, 
ACE_ASSERT compiles away to nothing on ACE_NDEBUG builds right ? Does this mean
your test would hang or just be a no-op on these builds ? I'd vote reinstate the
assert if you need it and know where it needs to go - as you probably know we
only removed the asserts because the coding guidelines said they shouldn't be
there. We offered to just put them back but then there was some talk of
replacing them with some sort of new uber assert but not until later on.
Comment 4 Adam Mitz 2006-04-05 09:55:06 CDT
Simon,
Thanks for your comments.  If we reinstate the assert I'll add a comment
pointing to the Bugzilla entry so that it's not removed again.
If this problem were to re-appear, the debug=0 builds would not fail the
regression test since, as you pointed out, the assert doesn't impact debug=0.  I
think this would be OK since the test failing every debug=1 build should be
large enough "red flag".
If there aren't any further comments I'll proceed with getting the test and the
patch to Transport.cpp (adding the ACE_ASSERT line) checked-in.

Thanks,
Adam
Comment 5 Johnny Willemsen 2006-04-05 13:22:16 CDT
I would also propose to add the asserts back and then the fix in. Btw, when the
asserts are there, could you also rerun the test on cvs head without your patch
to make sure that things wheren't fixed by accident in the doc_group version before.
Comment 6 Adam Mitz 2006-04-05 13:49:41 CDT
I have been testing against the DOC group CVS "head".  With the assert there
(and debug=1) the test fails consistently pre-patch and passes consistently
post-patch (where the "patch" is what I checked in last Friday).
Comment 7 Adam Mitz 2006-04-06 08:51:41 CDT
Regression test and change to Transport.cpp (added asserts) has been checked in.

Wed Apr  5 22:39:28 UTC 2006  Adam Mitz  <mitza@ociweb.com>

        * tests/Bug_2494_Regression/Bug_2494_Regression.mpc:
        * tests/Bug_2494_Regression/README:
        * tests/Bug_2494_Regression/client.cpp:
        * tests/Bug_2494_Regression/run_test.pl:
        * tests/Bug_2494_Regression/server.cpp:
        * tests/Bug_2494_Regression/test.idl:
        * tests/Bug_2494_Regression/test_i.h:
        * tests/Bug_2494_Regression/test_i.inl:
        * tests/Bug_2494_Regression/test_i.cpp:

          Added a regression test for Bugzilla bug #2494.

        * tao/Transport.cpp (~Transport):

          Added back the ACE_ASSERTS, needed for the regression test above.
          See Bugzilla bug #2494.
Comment 8 Adam Mitz 2006-04-12 16:38:55 CDT
scoreboard looks good
Comment 9 Johnny Willemsen 2006-04-26 09:20:17 CDT
I am reopening this bug, see the data from the test stats below, this test is
failing on the 12th on some linux builds, also the17th, but also today it is
failing already in some linux builds

2006_04_17      29       6  TAO                  Linux 
TAO/tests/Bug_2494_Regression
2006_04_17      13       0  TAO                Windows 
TAO/tests/Bug_2494_Regression
2006_04_17       7       0  TAO                Solaris 
TAO/tests/Bug_2494_Regression


2006_04_12      30       5  TAO                  Linux 
TAO/tests/Bug_2494_Regression
2006_04_12       4       0  TAO                 NetBSD 
TAO/tests/Bug_2494_Regression
2006_04_12      12       0  TAO                Windows 
TAO/tests/Bug_2494_Regression
2006_04_12       6       0  TAO                Solaris 
TAO/tests/Bug_2494_Regression
2006_04_12       1       0  TAO                  HP-UX 
TAO/tests/Bug_2494_Regression

2006_04_26      14       2  TAO                  Linux 
TAO/tests/Bug_2494_Regression
2006_04_26       3       0  TAO                Solaris 
TAO/tests/Bug_2494_Regression
2006_04_26       2       0  TAO                Windows 
TAO/tests/Bug_2494_Regression
Comment 10 Adam Mitz 2006-05-03 11:38:09 CDT
Here's what I see today (using the "clean builds" option):

   Date     #Runs   #Fails  Type  Group                 
==========  ======  ======  ====  ====================
2006_05_03      15       5  TAO                  Linux  
2006_05_03       4       1  TAO                Solaris  

It looks like it has improved since other bugs are being fixed, etc.
I'm a little confused about the Solaris results.  On the TAO scoreboard
(http://www.dre.vanderbilt.edu/scoreboard/tao.html) there is only one Solaris
build (Solaris10_Studio11_Debug) and it's passing this test.  Which Solaris
build is failing?

Looking at the test matrix I see failures on these builds:
FC4_x86_SCTP, Tiger_GCC4_Full
The other builds that it fails have red lines all the way down so they have
larger problems than this one test.

There may be a timing issue with the test.  We may need to adjust the timeout
values so that it works on more platforms. 
Comment 11 Adam Mitz 2006-06-13 09:17:36 CDT
I'm going to try to resolve this again now that I've checked-in a change that
should reduce the number of false negatives reported by the scoreboard.

Tue Jun 13 13:51:11 UTC 2006  Adam Mitz  <mitza@ociweb.com>

        * tests/Bug_2494_Regression/client.cpp:

          The scoreboard was marking this test as "failed" because it printed
          "EXCEPTION" (from the client process).  This isn't a failure as long
          as the server process still exits cleanly.  The test is now modified
          so that it doesn't print "EXCEPTION".