Please report new issues athttps://github.com/DOCGroup
We (OCI on behalf of a client -- RT8431) observed two threads operating on the same transport in a way that lead to an ASSERT in the Transport destructor (in debug mode) and strange "hanging" behavior (in optimize mode). Between 1.4.3 and 1.5.0 the ASSERT has been removed. More on that below. The Problem: Thread "A" is finishing an upcall, trying to send a reply. Since Thread "A" is in send_message_shared() it holds the handler_lock_. It makes it all the way to drain_queue_helper() and calls send(). Send completes sucessfully (returns a positive # of bytes sent) but before getting to the ACE_DEBUG line of drain_queue_helper(), this thread is scheduled out. Thread "B" now takes over and runs handle_input() on this same Transport. It gets to parse_consolidate_messages() and then consolidate_message() but the call to recv() fails because the remote process has been terminated. Thus thread "B" returns -1 all the way back to TAO_Connection_Handler::handle_input_eh() which calls close_connection() on the handler. This handler (and transport) are now completely closed (not registered in the reactor). None of the methods that thread "B" executes required the handler_lock_. Thread "A" now continues, logging the fact that it's in drain_queue_helper() and returns to drain_queue_i(), logging that the helper returned 1. Since the last send() completed OK, this thread has no clue that the transport has been closed. It makes it back up to send_reply_message_i(), which enqueues the message and calls the flushing strategy's schedule_output() method. Schedule_output() manipulates the reactor's write mask with respect to the transport's socket, but that socket is no longer connected and its handler is no longer registered with the reactor. We believe this caused the strange "hanging" behavior - with the CPU at 100% in fstat(). Also, the message that was just queued is leaked when the Transport desructor runs (which caused an ASSERT in 1.4.3). The fix: Before this change, the return value from schedule_output was ignored. With this change, if -1 is returned, the message is dequeued and memory is reclaimed. Also with this change, the Transport's schedule_output_i (which is the implmenentation of schedule_output for the default and reactive flushing strategies) will attempt to locate this transport's connection handler in the reactor. If it's no longer registered, schedule_output_i will return -1. The Block flushing strategy already returns -1 if the connection is closed. This fix has been checked-in to CVS with the ChangeLog entry: Fri Mar 31 17:19:13 UTC 2006 Adam Mitz <mitza@ociweb.com> Testability (the ASSERT issue): I've created a regression test for this bug. I will be attaching the test to this Bugzilla entry. For the regression to fail (without the fix), test depends on the server aborting due to the ASSERT -- as it did in 1.4.3. Since both of the ASSERTs in the Transport destructor has been removed, it isn't possible to detect the bug. The other problem with the ASSERT being removed is that memory is being leaked when the ASSERT (would have) hit. Questions: 1. Should the ASSERTs in the Transport destructor be re-instated? If not, maybe we could add code to the Transport destructor to check for this condition and log it (with a given ORBDebugLevel) -- the regression test would parse the log for this message. Without one of these two options (or something else that someone could suggest), we don't have an automated regression test. 2. Should code be added to the Transport destructor to delete its queued messages? Memory may leak if there are similar problems that are not solved by the bugfix described above. Thanks, Adam
Created attachment 514 [details] Regression test (see next comment)
Attached is the regression test, unzip in $TAO_ROOT. The tarball contains a diff file, "Transport_cpp.diff", that needs to be applied to Transport.cpp in order for this test to work. With the patch applied, normal compiliation of Transport.cpp will give you the regression. Defining the preprocessor macro RT8431_FIX and recompiling Transport.cpp will show the test passing.
Hi Adam, ACE_ASSERT compiles away to nothing on ACE_NDEBUG builds right ? Does this mean your test would hang or just be a no-op on these builds ? I'd vote reinstate the assert if you need it and know where it needs to go - as you probably know we only removed the asserts because the coding guidelines said they shouldn't be there. We offered to just put them back but then there was some talk of replacing them with some sort of new uber assert but not until later on.
Simon, Thanks for your comments. If we reinstate the assert I'll add a comment pointing to the Bugzilla entry so that it's not removed again. If this problem were to re-appear, the debug=0 builds would not fail the regression test since, as you pointed out, the assert doesn't impact debug=0. I think this would be OK since the test failing every debug=1 build should be large enough "red flag". If there aren't any further comments I'll proceed with getting the test and the patch to Transport.cpp (adding the ACE_ASSERT line) checked-in. Thanks, Adam
I would also propose to add the asserts back and then the fix in. Btw, when the asserts are there, could you also rerun the test on cvs head without your patch to make sure that things wheren't fixed by accident in the doc_group version before.
I have been testing against the DOC group CVS "head". With the assert there (and debug=1) the test fails consistently pre-patch and passes consistently post-patch (where the "patch" is what I checked in last Friday).
Regression test and change to Transport.cpp (added asserts) has been checked in. Wed Apr 5 22:39:28 UTC 2006 Adam Mitz <mitza@ociweb.com> * tests/Bug_2494_Regression/Bug_2494_Regression.mpc: * tests/Bug_2494_Regression/README: * tests/Bug_2494_Regression/client.cpp: * tests/Bug_2494_Regression/run_test.pl: * tests/Bug_2494_Regression/server.cpp: * tests/Bug_2494_Regression/test.idl: * tests/Bug_2494_Regression/test_i.h: * tests/Bug_2494_Regression/test_i.inl: * tests/Bug_2494_Regression/test_i.cpp: Added a regression test for Bugzilla bug #2494. * tao/Transport.cpp (~Transport): Added back the ACE_ASSERTS, needed for the regression test above. See Bugzilla bug #2494.
scoreboard looks good
I am reopening this bug, see the data from the test stats below, this test is failing on the 12th on some linux builds, also the17th, but also today it is failing already in some linux builds 2006_04_17 29 6 TAO Linux TAO/tests/Bug_2494_Regression 2006_04_17 13 0 TAO Windows TAO/tests/Bug_2494_Regression 2006_04_17 7 0 TAO Solaris TAO/tests/Bug_2494_Regression 2006_04_12 30 5 TAO Linux TAO/tests/Bug_2494_Regression 2006_04_12 4 0 TAO NetBSD TAO/tests/Bug_2494_Regression 2006_04_12 12 0 TAO Windows TAO/tests/Bug_2494_Regression 2006_04_12 6 0 TAO Solaris TAO/tests/Bug_2494_Regression 2006_04_12 1 0 TAO HP-UX TAO/tests/Bug_2494_Regression 2006_04_26 14 2 TAO Linux TAO/tests/Bug_2494_Regression 2006_04_26 3 0 TAO Solaris TAO/tests/Bug_2494_Regression 2006_04_26 2 0 TAO Windows TAO/tests/Bug_2494_Regression
Here's what I see today (using the "clean builds" option): Date #Runs #Fails Type Group ========== ====== ====== ==== ==================== 2006_05_03 15 5 TAO Linux 2006_05_03 4 1 TAO Solaris It looks like it has improved since other bugs are being fixed, etc. I'm a little confused about the Solaris results. On the TAO scoreboard (http://www.dre.vanderbilt.edu/scoreboard/tao.html) there is only one Solaris build (Solaris10_Studio11_Debug) and it's passing this test. Which Solaris build is failing? Looking at the test matrix I see failures on these builds: FC4_x86_SCTP, Tiger_GCC4_Full The other builds that it fails have red lines all the way down so they have larger problems than this one test. There may be a timing issue with the test. We may need to adjust the timeout values so that it works on more platforms.
I'm going to try to resolve this again now that I've checked-in a change that should reduce the number of false negatives reported by the scoreboard. Tue Jun 13 13:51:11 UTC 2006 Adam Mitz <mitza@ociweb.com> * tests/Bug_2494_Regression/client.cpp: The scoreboard was marking this test as "failed" because it printed "EXCEPTION" (from the client process). This isn't a failure as long as the server process still exits cleanly. The test is now modified so that it doesn't print "EXCEPTION".