Bug 2074

Summary: Assertion failure in Transport_Connector.cpp
Product: TAO    Reporter: hancockr
Component: ORB    Assignee: DOC Center Support List (internal) <tao-support>
Status: RESOLVED FIXED
Severity: critical    CC: bala, hancock
Priority: P3
Version: 1.4.4
Hardware: x86
OS: Linux
Attachments: Test program to duplicate the assertion failure

Description hancockr 2005-03-26 17:07:46 CST
I've observed an assertion failure in Transport_Connector.cpp that sometimes
occurs when a remote server has gone down and can no longer be connected to.
The stack trace below is one example, where we are trying to narrow a reference
to the naming service on a remote machine.

#0  0xffffe002 in ?? ()
#1  0x42028c55 in abort () from /lib/tls/libc.so.6
#2  0x4156e40f in ACE_Log_Msg::log(char const*, ACE_Log_Priority, char*)
(this=0x916f2f8, format_str=0x413fc77a "", log_priority=128, argp=0x4efef3fc
"\001") at OS_NS_stdlib.inl:40
#3  0x4156e077 in ACE_Log_Msg::log(ACE_Log_Priority, char const*, ...)
(this=0x916f2f8, log_priority=128, format_str=0x413fc740 "ACE_ASSERT: file %N,
line %l assertion failed for '%s'.%a\n") at Log_Msg.cpp:912
#4  0x413e50ce in
TAO_Connector::check_connection_closure(TAO_Connection_Handler*)
(this=0x421328d4, connection_handler=0x8e894c4) at Transport_Connector.cpp:484
#5  0x413e4e91 in
TAO_Connector::wait_for_connection_completion(TAO::Profile_Transport_Resolver*,
TAO_Transport*&, ACE_Time_Value*) (this=0x89a5f20, r=0x4efef738,
transport=@0x4efef4b8, timeout=0x4efef730) at Transport.inl:173
#6  0x41349c0d in
TAO_IIOP_Connector::make_connection(TAO::Profile_Transport_Resolver*,
TAO_Transport_Descriptor_Interface&, ACE_Time_Value*) (this=0x89a5f20,
r=0x4efef738, desc=@0x4efef678, timeout=0x4efef730) at IIOP_Connector.cpp:219
#7  0x413e4d6c in TAO_Connector::connect(TAO::Profile_Transport_Resolver*,
TAO_Transport_Descriptor_Interface*, ACE_Time_Value*) (this=0x89a5f20,
r=0x4efef738, desc=0x4efef678, timeout=0x4efef730) at Transport_Connector.cpp:267
#8  0x413b14d9 in
TAO::Profile_Transport_Resolver::try_connect(TAO_Transport_Descriptor_Interface*,
ACE_Time_Value*) (this=0x4efef738, desc=0x4efef678, max_time_value=0x4efef730)
at Endpoint.i:27
#9  0x41359438 in
TAO_Default_Endpoint_Selector::select_endpoint(TAO::Profile_Transport_Resolver*,
ACE_Time_Value*) (this=0x8351b50, r=0x4efef738, max_wait_time=0x4efef730) at
Invocation_Endpoint_Selectors.cpp:58
#10 0x413b12c7 in TAO::Profile_Transport_Resolver::resolve(ACE_Time_Value*)
(this=0x4efef738, max_time_val=0x4efef730) at Profile_Transport_Resolver.cpp:83
#11 0x41357d35 in TAO::Invocation_Adapter::invoke_remote_i(TAO_Stub*,
TAO_Operation_Details&, TAO_Pseudo_Var_T<CORBA::Object>&, ACE_Time_Value*&)
(this=0x4efef8d8, stub=0x8beb160, details=@0x4efef818,
effective_target=@0x4efef7b8, max_wait_time=@0x4efef7b4) at
Invocation_Adapter.cpp:248
#12 0x4135799f in TAO::Invocation_Adapter::invoke_i(TAO_Stub*,
TAO_Operation_Details&) (this=0x4efef8d8, stub=0x8beb160, details=@0x4efef818)
at Invocation_Adapter.cpp:83
#13 0x413577bc in TAO::Invocation_Adapter::invoke(TAO::Exception_Data*, unsigned
long) (this=0x4efef8d8, ex_data=0x0, ex_count=0) at Invocation_Adapter.cpp:44
#14 0x413b3d48 in TAO::Remote_Object_Proxy_Broker::_is_a(CORBA::Object*, char
const*) (this=0x414962e0, target=0x0, type_id=0x0) at
Remote_Object_Proxy_Broker.cpp:38
#15 0x4137618d in CORBA::Object::_is_a(char const*) (this=0x8bd2068,
type_id=0x409b4620 "IDL:omg.org/CosNaming/NamingContext:1.0") at Object.cpp:216
#16 0x409a6ee0 in CosNaming::NamingContext::_narrow(CORBA::Object*)
(_tao_objref=0x8bd2068) at Object_T.cpp:27

The code involved is as follows:

      closed = connection_handler->is_closed ();

      // If closed, there is nothing to do here.  If not closed,
      // it was either opened or is still pending.
      if (!closed)
        {
          // Check if the handler has been opened.
          const bool open = connection_handler->is_open ();

          // Some other thread was able to open the handler even
          // though wait failed for this thread.
          if (open)
            {
              // Set the result to 0, we have an open connection
              result = 0;
            }
          else
            {
              // Assert that it is still connecting.
              ACE_ASSERT (connection_handler->is_connecting ());

              // Force close the handler now.
              connection_handler->close_handler ();
            }
        }

My reading of this code is rather naive, not knowing much about it, but couldn't
the connection state change from LFS_CONNECTION_WAIT (i.e. is_connecting) to
CONNECTION_CLOSED after the initial check for is_closed, causing the assert to
fail? It seems to me like it should just do the close_handler unconditionally -
I'm tempted to try just commenting out the assert and seeing what happens..
Comment 1 hancockr 2005-03-26 17:08:30 CST
Whoops, forgot the actual assertion failure message:

ACE_ASSERT: file Transport_Connector.cpp, line 484 assertion failed for
'connection_handler->is_connecting ()'.Aborting...
Comment 2 Nanbor Wang 2005-03-26 20:13:48 CST
Hi, could you please add an example client code that shows this problem for us
to test this issue a bit more?

Thanks
bala
Comment 3 hancockr 2005-03-26 20:38:41 CST
I'll try and come up with some code and/or steps to reproduce this - it will
likely be something like starting up a naming service, getting a reference to
it, then stopping the naming service, calling some operation on the naming
service (which fails) and then trying to get a reference again. I'll have to
look at what our code is doing to see if the failed call and the re-narrow are
getting called on different threads - it's quite possible that's the case..
Comment 4 Robert Hancock 2005-03-28 13:12:02 CST
I have managed to create a small test program that more or less consistently
shows the problem on my test system. It appears that in fact the problem only
shows up when CORBA relative round-trip timeouts are in use (which our actual
system is in fact using). Basically all it does is start up 10 threads which
rapidly try to obtain and narrow a reference to a naming service on a machine
which is not reachable (the 10.175.12.99 in the code can be replaced with any
unreachable IP).

I built the program with:

g++ -g -I$ACE_ROOT -I$ACE_ROOT/TAO -I$ACE_ROOT/TAO/tao -I$ACE_ROOT/TAO/orbsvcs
nstest.cpp -L$ACE_ROOT/lib -lACE -lTAO -lTAO_CosNaming -lTAO_Messaging -o nstest

and when I run it I get output like this:

Caught exception TRANSIENT (IDL:omg.org/CORBA/TRANSIENT:1.0)
Caught exception TRANSIENT (IDL:omg.org/CORBA/TRANSIENT:1.0)
Caught exception TRANSIENT (IDL:omg.org/CORBA/TRANSIENT:1.0)
Caught exception TIMEOUT (IDL:omg.org/CORBA/TIMEOUT:1.0)
ACE_ASSERT: file Transport_Connector.cpp, line 484 assertion failed for
'connection_handler->is_connecting ()'.Aborting...
Abort (core dumped)

The stack trace of the core file is the same as in the crash shown above. If I
take out the code that sets the round-trip timeouts or run only 1 thread of
this, I don't see the crash (or at least not nearly so quickly, I haven't tried
for very long like that).

Will attach the program momentarily..
Comment 5 Robert Hancock 2005-03-28 13:13:13 CST
Created attachment 310 [details]
Test program to duplicate the assertion failure
Comment 6 Robert Hancock 2005-03-28 13:20:01 CST
Adding my other email as CC
Comment 7 Johnny Willemsen 2005-04-15 04:09:03 CDT
accept on behalf of tao-support
Comment 8 Anders Olsson 2005-04-15 07:12:05 CDT
We have seen the same problem on Solaris, starting with 1.4.2, but it is also
present in 1.4.4. There appears to be a race condition, as it is more likely
to occur on multi-CPU machines.
The test program shows the bug, but when I changed the ACE_ASSERT into a simple
cout, like so: 
  if (!connection_handler->is_connecting ())
  {
    std::cout << "assertion failed for connection_handler->is_connecting (): "
              << "found state: " << connection_handler->get_state ()
              << std::endl;
  }
 
Obviously, I had to add a small accessor get_state() to the Connection_Handler
to print the state at this time.
Running this modified code, at first I could not see the message on a
single-CPU machine (SunBlade1500).
However, running the test on a quad-CPU v440 shows the following messages
almost immediately:
%css3:/proj/N/ACE+TAO/ACE_wrappers/TAO>./a.out
assertion failed for connection_handler->is_connecting (): found state: 5
Caught exception TIMEOUT (IDL:omg.org/CORBA/TIMEOUT:1.0)
Caught exception TIMEOUT (IDL:omg.org/CORBA/TIMEOUT:1.0)
assertion failed for connection_handler->is_connecting (): found state: 5
Caught exception TIMEOUT (IDL:omg.org/CORBA/TIMEOUT:1.0)
Caught exception TIMEOUT (IDL:omg.org/CORBA/TIMEOUT:1.0)
Caught exception TIMEOUT (IDL:omg.org/CORBA/TIMEOUT:1.0)

So, the state of the connection_handler is no longer LFS_CONNECTION_WAIT;
it has probably been changed to LFS_TIMEOUT by some other thread.
Either the ASSERT should check for this state too, or it could just be removed.
There shouldn't be any problem with closing a timed-out connection, should there?
Comment 9 hancockr 2005-04-15 09:03:26 CDT
We are using multiple CPUs in our case as well - either a single or dual Xeon
CPU, which are hyperthreaded, so either 2 or 4 logical CPUs..
Comment 10 Nanbor Wang 2005-04-17 05:04:02 CDT
My $0.02 based on what I have seen so far:

(1) The same problem should have been there in x.4.1 too, but it didn't show up
since the code path was slightly different. So it got masked. 

(2) The assert statement is required. If the asserted condition doesn't hold,
the call to close () will result in a free memory read (FMR). I just tried a
quick experiment, commenting out the assert and running valgrind. Here is a
snippet of the error:

--------------------------------------------------------------------------------
==16185== Thread 3:
==16185== Invalid read of size 4
==16185==    at 0x1BC4DEEB:
ACE_NonBlocking_Connect_Handler<TAO_IIOP_Connection_Handler>::handle_timeout(ACE_Time_Value
const&, void const*) (Connector.cpp:126)
==16185==    by 0x1BE5A075:
ACE_Event_Handler_Handle_Timeout_Upcall<ACE_Recursive_Thread_Mutex>::timeout(ACE_Timer_Queue_T<ACE_Event_Handler*,
ACE_Event_Handler_Handle_Timeout_Upcall<ACE_Recursive_Thread_Mutex>,
ACE_Recursive_Thread_Mutex>&, ACE_Event_Handler*, void const*, int,
ACE_Time_Value const&) (Timer_Queue_T.cpp:411)
==16185==    by 0x1BED9D19: ACE_TP_Reactor::handle_timer_events(int&,
ACE_TP_Token_Guard&) (Timer_Queue_T.inl:216)
..
------------------------------------------------------------------

(3) The problem stems from the fact that the follower got timed out but there is
a race between the leader and the follower in closing the handler. 

Just some thoughts on this problem. Will see how to address this soon
Comment 11 Michael R. Almond 2005-05-03 15:50:23 CDT
We have seen a similar problem with this assertion.  We're using RH EL 3.0
(latest RH updates applied).  The problem tends to show up if the network drops
out for a moment, or the system is under heavy load.

[New Thread -1145058384 (LWP 21704)]
[New Thread -1134568528 (LWP 21706)]
[New Thread -1124078672 (LWP 21707)]
[New Thread -1113588816 (LWP 21708)]
[New Thread -1103098960 (LWP 21709)]
[New Thread -1092609104 (LWP 21710)]
[New Thread -1082119248 (LWP 21711)]
ACE_ASSERT: file Transport_Connector.cpp, line 484 assertion failed for
'connection_handler->is_connecting ()'.Aborting...
Comment 12 Johnny Willemsen 2005-05-04 03:05:32 CDT
There will be some fixes for this in x.4.5, but we haven't had enough time to
put the regression test into the CVS archive and close this bug as really fixed.
Comment 13 Johnny Willemsen 2005-10-31 06:38:44 CST
Fix was in the archive a long time ago; just integrated the regression test:

Mon Oct 31 11:36:12 UTC 2005  Johnny Willemsen  <jwillemsen@remedy.nl>

        * tests/Bug_2074_Regression/*:
          Integrated regression for bug 2074