Please report new issues at https://github.com/DOCGroup
I've observed an assertion failure in Transport_Connector.cpp, sometimes when a remote server has gone down and can no longer be connected to. The stack trace below is one example, where we are trying to narrow a reference to the naming service on a remote machine.

#0  0xffffe002 in ?? ()
#1  0x42028c55 in abort () from /lib/tls/libc.so.6
#2  0x4156e40f in ACE_Log_Msg::log(char const*, ACE_Log_Priority, char*) (this=0x916f2f8, format_str=0x413fc77a "", log_priority=128, argp=0x4efef3fc "\001") at OS_NS_stdlib.inl:40
#3  0x4156e077 in ACE_Log_Msg::log(ACE_Log_Priority, char const*, ...) (this=0x916f2f8, log_priority=128, format_str=0x413fc740 "ACE_ASSERT: file %N, line %l assertion failed for '%s'.%a\n") at Log_Msg.cpp:912
#4  0x413e50ce in TAO_Connector::check_connection_closure(TAO_Connection_Handler*) (this=0x421328d4, connection_handler=0x8e894c4) at Transport_Connector.cpp:484
#5  0x413e4e91 in TAO_Connector::wait_for_connection_completion(TAO::Profile_Transport_Resolver*, TAO_Transport*&, ACE_Time_Value*) (this=0x89a5f20, r=0x4efef738, transport=@0x4efef4b8, timeout=0x4efef730) at Transport.inl:173
#6  0x41349c0d in TAO_IIOP_Connector::make_connection(TAO::Profile_Transport_Resolver*, TAO_Transport_Descriptor_Interface&, ACE_Time_Value*) (this=0x89a5f20, r=0x4efef738, desc=@0x4efef678, timeout=0x4efef730) at IIOP_Connector.cpp:219
#7  0x413e4d6c in TAO_Connector::connect(TAO::Profile_Transport_Resolver*, TAO_Transport_Descriptor_Interface*, ACE_Time_Value*) (this=0x89a5f20, r=0x4efef738, desc=0x4efef678, timeout=0x4efef730) at Transport_Connector.cpp:267
#8  0x413b14d9 in TAO::Profile_Transport_Resolver::try_connect(TAO_Transport_Descriptor_Interface*, ACE_Time_Value*) (this=0x4efef738, desc=0x4efef678, max_time_value=0x4efef730) at Endpoint.i:27
#9  0x41359438 in TAO_Default_Endpoint_Selector::select_endpoint(TAO::Profile_Transport_Resolver*, ACE_Time_Value*) (this=0x8351b50, r=0x4efef738, max_wait_time=0x4efef730) at Invocation_Endpoint_Selectors.cpp:58
#10 0x413b12c7 in TAO::Profile_Transport_Resolver::resolve(ACE_Time_Value*) (this=0x4efef738, max_time_val=0x4efef730) at Profile_Transport_Resolver.cpp:83
#11 0x41357d35 in TAO::Invocation_Adapter::invoke_remote_i(TAO_Stub*, TAO_Operation_Details&, TAO_Pseudo_Var_T<CORBA::Object>&, ACE_Time_Value*&) (this=0x4efef8d8, stub=0x8beb160, details=@0x4efef818, effective_target=@0x4efef7b8, max_wait_time=@0x4efef7b4) at Invocation_Adapter.cpp:248
#12 0x4135799f in TAO::Invocation_Adapter::invoke_i(TAO_Stub*, TAO_Operation_Details&) (this=0x4efef8d8, stub=0x8beb160, details=@0x4efef818) at Invocation_Adapter.cpp:83
#13 0x413577bc in TAO::Invocation_Adapter::invoke(TAO::Exception_Data*, unsigned long) (this=0x4efef8d8, ex_data=0x0, ex_count=0) at Invocation_Adapter.cpp:44
#14 0x413b3d48 in TAO::Remote_Object_Proxy_Broker::_is_a(CORBA::Object*, char const*) (this=0x414962e0, target=0x0, type_id=0x0) at Remote_Object_Proxy_Broker.cpp:38
#15 0x4137618d in CORBA::Object::_is_a(char const*) (this=0x8bd2068, type_id=0x409b4620 "IDL:omg.org/CosNaming/NamingContext:1.0") at Object.cpp:216
#16 0x409a6ee0 in CosNaming::NamingContext::_narrow(CORBA::Object*) (_tao_objref=0x8bd2068) at Object_T.cpp:27

The code involved is as follows:

closed = connection_handler->is_closed ();

// If closed, there is nothing to do here. If not closed,
// it was either opened or is still pending.
if (!closed)
  {
    // Check if the handler has been opened.
    const bool open = connection_handler->is_open ();

    // Some other thread was able to open the handler even
    // though wait failed for this thread.
    if (open)
      {
        // Set the result to 0, we have an open connection
        result = 0;
      }
    else
      {
        // Assert that it is still connecting.
        ACE_ASSERT (connection_handler->is_connecting ());

        // Force close the handler now.
        connection_handler->close_handler ();
      }
  }

My reading of this code is rather naive, not knowing much about it, but couldn't the connection state change from LFS_CONNECTION_WAIT (i.e. is_connecting) to CONNECTION_CLOSED after the initial check for is_closed, causing the assert to fail? It seems to me like it should just do the close_handler unconditionally. I'm tempted to try just commenting out the assert and seeing what happens.
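The suspected check-then-act window can be modeled in a few lines. This is a deterministic sketch with an illustrative state enum and handler type, not TAO's actual classes; it replays the problematic interleaving inline instead of racing real threads:

```cpp
#include <atomic>

// Illustrative states only; TAO's LFS_* enum values may differ.
enum State { CONNECTING, OPEN, TIMEDOUT, CLOSED };

struct Handler {
  std::atomic<State> state{CONNECTING};
  bool is_closed () const     { return state.load () == CLOSED; }
  bool is_open () const       { return state.load () == OPEN; }
  bool is_connecting () const { return state.load () == CONNECTING; }
};

// Mirrors the structure of check_connection_closure(): check is_closed,
// then is_open, then assert is_connecting. Between the checks, a store
// stands in for "some other thread" timing the connect out, so the
// condition the ACE_ASSERT evaluates no longer holds.
bool assert_condition_violated (Handler &h) {
  if (h.is_closed ())
    return false;              // first check passes: still CONNECTING
  const bool open = h.is_open ();
  h.state.store (TIMEDOUT);    // racing thread changes the state here
  if (open)
    return false;
  return !h.is_connecting ();  // true means the ACE_ASSERT would fire
}
```

Because the state can change at any point after the is_closed() check, the assert condition is not stable under concurrent access, which matches the intermittent failures reported above.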
Whoops, forgot the actual assertion failure message:

ACE_ASSERT: file Transport_Connector.cpp, line 484 assertion failed for 'connection_handler->is_connecting ()'.Aborting...
Hi, could you please add example client code that shows this problem, so we can test this issue a bit more? Thanks, Bala
I'll try to come up with some code and/or steps to reproduce this. It will likely be something like: start a naming service, get a reference to it, stop the naming service, call some operation on the naming service (which fails), and then try to get a reference again. I'll have to look at what our code is doing to see whether the failed call and the re-narrow are being made on different threads; it's quite possible that's the case.
I have managed to create a small test program that more or less consistently shows the problem on my test system. It appears that in fact the problem only shows up when CORBA relative round-trip timeouts are in use (which our actual system is in fact using). Basically, all it does is start up 10 threads which rapidly try to obtain and narrow a reference to a naming service on a machine which is not reachable (the 10.175.12.99 in the code can be replaced with any unreachable IP). I built the program with:

g++ -g -I$ACE_ROOT -I$ACE_ROOT/TAO -I$ACE_ROOT/TAO/tao -I$ACE_ROOT/TAO/orbsvcs nstest.cpp -L$ACE_ROOT/lib -lACE -lTAO -lTAO_CosNaming -lTAO_Messaging -o nstest

and when I run it I get output like this:

Caught exception TRANSIENT (IDL:omg.org/CORBA/TRANSIENT:1.0)
Caught exception TRANSIENT (IDL:omg.org/CORBA/TRANSIENT:1.0)
Caught exception TRANSIENT (IDL:omg.org/CORBA/TRANSIENT:1.0)
Caught exception TIMEOUT (IDL:omg.org/CORBA/TIMEOUT:1.0)
ACE_ASSERT: file Transport_Connector.cpp, line 484 assertion failed for 'connection_handler->is_connecting ()'.Aborting...
Abort (core dumped)

The stack trace of the core file is the same as in the crash shown above. If I take out the code that sets the round-trip timeouts, or run only 1 thread, I don't see the crash (or at least not nearly so quickly; I haven't tried for very long like that). Will attach the program momentarily.
Created attachment 310 [details] Test program to duplicate the assertion failure
Adding my other email as CC
accept on behalf of tao-support
We have seen the same problem on Solaris, starting with 1.4.2, and it is also present in 1.4.4. There appears to be a race condition, as it is more likely to occur on multi-CPU machines. The test program shows the bug, but I changed the ACE_ASSERT into a simple cout, like so:

if (! connection_handler->is_connecting ())
  {
    std::cout << "assertion failed for connection_handler->is_connecting (): found state: "
              << connection_handler->get_state () << std::endl;
  }

Obviously, I had to add a small accessor get_state () to the Connection_Handler to print the state at this time. Running this modified code, at first I could not see the message on a single-CPU machine (SunBlade1500). However, running the test on a quad-CPU v440 shows the following messages almost immediately:

%css3:/proj/N/ACE+TAO/ACE_wrappers/TAO>./a.out
assertion failed for connection_handler->is_connecting (): found state: 5
Caught exception TIMEOUT (IDL:omg.org/CORBA/TIMEOUT:1.0)
Caught exception TIMEOUT (IDL:omg.org/CORBA/TIMEOUT:1.0)
assertion failed for connection_handler->is_connecting (): found state: 5
Caught exception TIMEOUT (IDL:omg.org/CORBA/TIMEOUT:1.0)
Caught exception TIMEOUT (IDL:omg.org/CORBA/TIMEOUT:1.0)
Caught exception TIMEOUT (IDL:omg.org/CORBA/TIMEOUT:1.0)

So the state of the connection_handler is no longer LFS_CONNECTION_WAIT; it has probably been changed to LFS_TIMEOUT by some other thread. Either the ASSERT should check for this state too, or it could just be removed. There shouldn't be any problem with closing a timed-out connection, should there?
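The first suggestion (have the assert tolerate the timed-out state too) could look roughly like this. This is a sketch against a hypothetical handler type: the enum values are illustrative rather than TAO's actual LFS_* numbering, and is_timeout () is an assumed accessor, not necessarily part of the real Connection_Handler API:

```cpp
// Illustrative state machine; TAO's LFS_* enum may use different
// names and values (state 5 in the output above is suspected to be
// LFS_TIMEOUT, but that is an inference, not a confirmed mapping).
enum State { IDLE, CONNECTING, OPEN, SUCCESS, FAILURE, TIMEDOUT, CLOSED };

struct Handler {
  State state = CONNECTING;
  bool is_connecting () const { return state == CONNECTING; }
  bool is_timeout () const    { return state == TIMEDOUT; }  // assumed accessor
};

// Relaxed precondition: a handler that raced into the timed-out state
// is still a valid target for close_handler (), so accept both states
// instead of asserting is_connecting () alone.
bool safe_to_close (const Handler &h) {
  return h.is_connecting () || h.is_timeout ();
}
```

As the later comment on the free-memory read points out, simply deleting the assert is riskier than widening it, because the assert also documents which states the close path expects.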
We are using multiple CPUs in our case as well - either a single or dual Xeon CPU, which are hyperthreaded, so either 2 or 4 logical CPUs..
My $0.02 based on what I have seen so far:

(1) The same problem should have been there in x.4.1 too, but it didn't show up since the code path was slightly different, so it got masked.

(2) The assert statement is required. If the condition for the assert fails, the call to close () will result in a free-memory read (FMR). I just tried a quick experiment by commenting out the assert and running valgrind. Here is a snippet of the error:

==16185== Thread 3:
==16185== Invalid read of size 4
==16185==    at 0x1BC4DEEB: ACE_NonBlocking_Connect_Handler<TAO_IIOP_Connection_Handler>::handle_timeout(ACE_Time_Value const&, void const*) (Connector.cpp:126)
==16185==    by 0x1BE5A075: ACE_Event_Handler_Handle_Timeout_Upcall<ACE_Recursive_Thread_Mutex>::timeout(ACE_Timer_Queue_T<ACE_Event_Handler*, ACE_Event_Handler_Handle_Timeout_Upcall<ACE_Recursive_Thread_Mutex>, ACE_Recursive_Thread_Mutex>&, ACE_Event_Handler*, void const*, int, ACE_Time_Value const&) (Timer_Queue_T.cpp:411)
==16185==    by 0x1BED9D19: ACE_TP_Reactor::handle_timer_events(int&, ACE_TP_Token_Guard&) (Timer_Queue_T.inl:216)
..

(3) The problem stems from the fact that the follower got timed out, but there is a race between the leader and the follower in closing the handler.

Just some thoughts on this problem. Will see how to address this soon.
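The invalid read above is the classic symptom of one thread releasing the handler while another thread still holds a raw pointer to it. The usual remedy is to hold a counted reference across the close path; the sketch below models that intent with std::shared_ptr as a stand-in (ACE event handlers use their own intrusive reference counting, so this is an illustration of the idea, not the actual fix applied in TAO):

```cpp
#include <memory>

// Hypothetical minimal handler for illustration.
struct Handler {
  bool closed = false;
  void close_handler () { closed = true; }
};

// Holding a shared_ptr for the duration of the call guarantees the
// handler outlives close_handler (), even if the racing thread drops
// its own reference concurrently; no read of freed memory is possible.
void close_with_reference (const std::shared_ptr<Handler> &h) {
  if (h)
    h->close_handler ();  // h keeps the object alive across this call
}
```

The point is that removing the assert without also pinning the handler's lifetime just converts a noisy abort into a silent use-after-free, which is what the valgrind run demonstrates.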
We have seen a similar problem with this assertion. We're using RH EL 3.0 (latest RH updates applied). The problem tends to show up if the network drops out for a moment, or the system is under heavy load.

[New Thread -1145058384 (LWP 21704)]
[New Thread -1134568528 (LWP 21706)]
[New Thread -1124078672 (LWP 21707)]
[New Thread -1113588816 (LWP 21708)]
[New Thread -1103098960 (LWP 21709)]
[New Thread -1092609104 (LWP 21710)]
[New Thread -1082119248 (LWP 21711)]
ACE_ASSERT: file Transport_Connector.cpp, line 484 assertion failed for 'connection_handler->is_connecting ()'.Aborting...
There will be some fixes for this in x.4.5, but we haven't had enough time to put the regression test into the CVS archive and close this bug as really fixed.
Fix was in the archive a long time ago; just integrated the regression:

Mon Oct 31 11:36:12 UTC 2005  Johnny Willemsen  <jwillemsen@remedy.nl>

  * tests/Bug_2074_Regression/*: Integrated regression for bug 2074