Summary: | Assertion failure in Transport_Connector.cpp | ||
---|---|---|---|
Product: | TAO | Reporter: | hancockr |
Component: | ORB | Assignee: | DOC Center Support List (internal) <tao-support> |
Status: | RESOLVED FIXED | ||
Severity: | critical | CC: | bala, hancock |
Priority: | P3 | ||
Version: | 1.4.4 | ||
Hardware: | x86 | ||
OS: | Linux | ||
Attachments: | Test program to duplicate the assertion failure |
Description
hancockr
2005-03-26 17:07:46 CST
Whoops, forgot the actual assertion failure message: ACE_ASSERT: file Transport_Connector.cpp, line 484 assertion failed for 'connection_handler->is_connecting ()'.Aborting... Hi, could you please add an example client code that shows this problem for us to test this issue a bit more? Thanks bala I'll try and come up with some code and/or steps to reproduce this - it will likely be something like starting up a naming service, getting a reference to it, then stopping the naming servce, calling some operation on the naming service (which fails) and then trying to get a reference again. I'll have to look at what our code is doing to see if the failed call and the re-narrow are getting called on different threads - it's quite possible that's the case.. I have managed to create a small test program that more or less consistently shows the problem on my test system. It appears that in fact the problem only shows up when CORBA relative round-trip timeouts are in use (which our actual system is in fact using). Basically all it does is start up 10 threads which rapidly try to obtain and narrow a reference to a naming service on a machine which is not reachable (the 10.175.12.99 in the code can be replaced with any unreachable IP). I built the program with: g++ -g -I$ACE_ROOT -I$ACE_ROOT/TAO -I$ACE_ROOT/TAO/tao -I$ACE_ROOT/TAO/orbsvcs nstest.cpp -L$ACE_ROOT/lib -lACE -lTAO -lTAO_CosNaming -lTAO_Messaging -o nstest and when I run it I get output like this: Caught exception TRANSIENT (IDL:omg.org/CORBA/TRANSIENT:1.0) Caught exception TRANSIENT (IDL:omg.org/CORBA/TRANSIENT:1.0) Caught exception TRANSIENT (IDL:omg.org/CORBA/TRANSIENT:1.0) Caught exception TIMEOUT (IDL:omg.org/CORBA/TIMEOUT:1.0) ACE_ASSERT: file Transport_Connector.cpp, line 484 assertion failed for 'connection_handler->is_connecting ()'.Aborting... Abort (core dumped) The stack trace of the core file is the same as in the crash shown above. If I take out the code that sets the round-trip timeouts or run only 1 thread of this, I don't see the crash (or at least not nearly so quickly, I haven't tried for very long like that). Will attach the program momentarily.. Created attachment 310 [details]
Test program to duplicate the assertion failure
Adding my other email as CC accept on behalf of tao-support We have seen the same problem on Solaris, starting with 1.4.2, but it is also present in 1.4.4. There appears to be a race condition, as it is more likely to occur on multi-CPU machines. The test program shows the bug, but when I changed the ACE_ASSERT into a simple cout, like so: if (! connection_handler->is_connecting ()) { std::cout << "assertion failed for connection_handler->is_connecting (): found state: " << connection_handler->get_state() << std::endl; } Obviously, I had to add a small accessor get_state() to the Connection_Handler to print the state at this time. Running this modified code, at first I could not see the message on a single-CPU machine (SunBlade1500). However, running the test on a quad-CPU v440 shows the following messages almost immediately: %css3:/proj/N/ACE+TAO/ACE_wrappers/TAO>./a.out assertion failed for connection_handler->is_connecting (): found state: 5 Caught exception TIMEOUT (IDL:omg.org/CORBA/TIMEOUT:1.0) Caught exception TIMEOUT (IDL:omg.org/CORBA/TIMEOUT:1.0) assertion failed for connection_handler->is_connecting (): found state: 5 Caught exception TIMEOUT (IDL:omg.org/CORBA/TIMEOUT:1.0) Caught exception TIMEOUT (IDL:omg.org/CORBA/TIMEOUT:1.0) Caught exception TIMEOUT (IDL:omg.org/CORBA/TIMEOUT:1.0) So, the state of the connection_handler has is no longer LFS_CONNECTION_WAIT, it has probably been changed to LFS_TIMEOUT by some other thread. Either the ASSERT should check for this state too, or it could just be removed. There shouldn't be any problem with closing a timed-out connection, should it? We are using multiple CPUs in our case as well - either a single or dual Xeon CPU, which are hyperthreaded, so either 2 or 4 logical CPUs.. My $0.02 based on what I have seen so fa (1) The same problem should have been there in x.4.1 too, but it didn't show up since the code path was slightly different. So it got masked. (2) The assert statement is required. If the condition for assert fails the call to close () will read in FMR. I just tried a quick thingy by commenting out assert and running valgrind. Here is a snippet of the error -------------------------------------------------------------------------------- ==16185== Thread 3: ==16185== Invalid read of size 4 ==16185== at 0x1BC4DEEB: ACE_NonBlocking_Connect_Handler<TAO_IIOP_Connection_Handler>::handle_timeout(ACE_Time_Value const&, void const*) (Connector.cpp:126) ==16185== by 0x1BE5A075: ACE_Event_Handler_Handle_Timeout_Upcall<ACE_Recursive_Thread_Mutex>::timeout(ACE_Timer_Queue_T<ACE_Event_Handler*, ACE_Event_Handler_Handle_Timeout_Upcall<ACE_Recursive_Thread_Mutex>, ACE_Recursive_Thread_Mutex>&, ACE_Event_Handler*, void const*, int, ACE_Time_Value const&) (Timer_Queue_T.cpp:411) ==16185== by 0x1BED9D19: ACE_TP_Reactor::handle_timer_events(int&, ACE_TP_Token_Guard&) (Timer_Queue_T.inl:216) .. ------------------------------------------------------------------ (3) The problem stems from the fact that the follower got timed out but there is a race between the leader and the follower in closing the handler. Just some thoughts on this problem. Will see how to address this soon We have seen a similar problem with this assertion. We're using RH EL 3.0 (latest RH updates applied). The problem tends to show up if the network drops out for a moment, or the system is under heavy load. [New Thread -1145058384 (LWP 21704)] [New Thread -1134568528 (LWP 21706)] [New Thread -1124078672 (LWP 21707)] [New Thread -1113588816 (LWP 21708)] [New Thread -1103098960 (LWP 21709)] [New Thread -1092609104 (LWP 21710)] [New Thread -1082119248 (LWP 21711)] ACE_ASSERT: file Transport_Connector.cpp, line 484 assertion failed for 'connection_handler->is_connecting ()'.Aborting... There will be some fixes for this in x.4.5 but we haven't got enough time to put the regression test into the cvs archive and close this bug as really fixed. Fix was in the archive a long time ago, just integrated the regression Mon Oct 31 11:36:12 UTC 2005 Johnny Willemsen <jwillemsen@remedy.nl> * tests/Bug_2074_Regression/*: Integrated regression for bug 2074 |