Please report new issues at https://github.com/DOCGroup
When the Naming Service is stopped, CosNaming::NamingContext::resolve() crashes on the client. Here's the call stack of the crash:

----------------------------------
ACE_Select_Reactor_T<ACE_Select_Reactor_Token_T<ACE_Token> >::remove_handler (ACE_Event_Handler * 0x02c50850, unsigned long 1023) line 368 + 12 bytes
ACE_Reactor::remove_handler (ACE_Event_Handler * 0x02c50850, unsigned long 1023) line 316
TAO_Transport::close_connection_i() line 697
TAO_Transport::close_connection() line 656
TAO_GIOP_Invocation::close_connection() line 536
TAO_ORB_Core::service_raise_comm_failure (TAO_GIOP_Invocation * 0x0184f6d0, TAO_Profile * 0x012cbeb0, CORBA_Environment & {...}) line 1501
TAO_GIOP_Synch_Invocation::invoke_i (unsigned char 0, CORBA_Environment & {...}) line 768 + 41 bytes
TAO_GIOP_Twoway_Invocation::invoke (TAO_Exception_Data * 0x69494060 _tao_CosNaming_NamingContext_resolve_exceptiondata, unsigned int 3, CORBA_Environment & {...}) line 910 + 17 bytes
CosNaming::_TAO_NamingContext_Remote_Proxy_Impl::resolve (CORBA_Object * 0x02c505e8, const CosNaming::Name & {...}, CORBA_Environment & {...}) line 2998 + 23 bytes
CosNaming::NamingContext::resolve (const CosNaming::Name & {...}, CORBA_Environment & {...}) line 5206
NAMING_CLIENT::NamingClient::resolveOrCreateContext (CosNaming::NamingContext * 0x02c505d8, const char * 0x02cdd2a8) line 337 + 39 bytes
NAMING_CLIENT::NamingClient::set (const ACE_CString & {...}, CORBA_Object * 0x02cb75d8) line 76 + 93 bytes
AdvertiseCommand::Execute_i() line 73 + 74 bytes
MethodRequest::Execute() line 280 + 13 bytes
ActiveObj::svc() line 84 + 8 bytes
ACE_Task_Base::svc_run (void * 0x01225428) line 203 + 11 bytes
ACE_Thread_Hook::start (void * (void *)* 0x69004ade ACE_Task_Base::svc_run(void *), void * 0x01225428) line 12 + 7 bytes
Thread_Hook::start (void * (void *)* 0x69004ade ACE_Task_Base::svc_run(void *), void * 0x01225428) line 19 + 17 bytes
ACE_Thread_Adapter::invoke_i() line 141 + 18 bytes
ACE_Thread_Adapter::invoke() line 91 + 11 bytes
ace_thread_adapter (void * 0x0123f208) line 127 + 10 bytes
_threadstartex (void * 0x0123f270) line 212 + 13 bytes
KERNEL32! 77e837cd()

Here are the local variables for TAO_Transport::close_connection_i() line 697:
-----------------------------------------------------------------------------
+ this        0x02c50960
+ i           0x00000003
- eh          0x02c50850
  + __vfptr   0xdddddddd
    priority_  -572662307
  + reactor_  0xdddddddd

Note that 0xdddddddd is the pattern the MSVC debug heap writes into freed memory, so eh appears to point at an event handler that has already been deleted.
Changed the summary. Looks like the following bug report from Bruce Trask is also related.

From: Bruce Trask <BTRASK@contactsystems.com>

TAO VERSION: 1.1.19
ACE VERSION: 5.1.19
HOST MACHINE and OPERATING SYSTEM: Win 2K
TARGET MACHINE and OPERATING SYSTEM, if different from HOST: Win 2K
COMPILER NAME AND VERSION (AND PATCHLEVEL): Microsoft VC6 SP4

AREA/CLASS/EXAMPLE AFFECTED: Thread Pool Reactor

DOES THE PROBLEM AFFECT:
    COMPILATION? No.
    LINKING? No.
    EXECUTION? Yes. See below.
    OTHER (please specify)?

SYNOPSIS: Client application core dumps when the server goes away and comes back on. Details below.

DESCRIPTION:

I have a client application (application A) that also functions as a server. This application uses the TP Reactor, spawns 3 threads, and calls orb->run() from each of them. Here's the scenario: normally, application A is in communication with another machine running application B, making remote calls on it. During normal operation of the system there are times when application B becomes disconnected, is reset, or is powered down, and then comes back up. When application B comes back up, it lets the Event Service application know that it is ready to go, and the ES informs application A that application B is up and running. Application A then tries to communicate with application B, and that is when the problem occurs. During this attempt to communicate, application A closes some connections, and sometimes there is a core dump.

What I am seeing is this. One thread is calling this function:

void
TAO_Transport::close_connection_i (void)
{
  ACE_Event_Handler *eh = 0;
  {
    ACE_MT (ACE_GUARD (ACE_Lock, guard, *this->handler_lock_));

    eh = this->event_handler_i ();
    this->transition_handler_state_i ();

    if (eh == 0)
      return;
  }

  // Close the underlying connection; it is enough to get an
  // Event_Handler pointer to do this, so we can factor out the code
  // in the base TAO_Transport class.

  // We first try to remove the handler from the reactor.  After that
  // we destroy the handler using handle_close ().  The remove_handler
  // is necessary because if handle_close() were called directly, the
  // reactor would be left with a dangling pointer.
  if (this->ws_->is_registered ())
    {
      this->orb_core_->reactor ()->remove_handler (
          eh,
          ACE_Event_Handler::ALL_EVENTS_MASK |
          ACE_Event_Handler::DONT_CALL);
    }

  (void) eh->handle_close (ACE_INVALID_HANDLE,
                           ACE_Event_Handler::ALL_EVENTS_MASK);

  for (TAO_Queued_Message *i = this->head_; i != 0; i = i->next ())
    {
      i->state_changed (TAO_LF_Event::LFS_CONNECTION_CLOSED);
    }
}

This function is called as part of the following call stack:

TAO_Transport::close_connection_i()
TAO_Transport::close_connection()
TAO_GIOP_Invocation::close_connection()
TAO_ORB_Core::service_raise_comm_failure()
TAO_GIOP_Synch_Invocation::invoke_i()
TAO_GIOP_Twoway_Invocation::invoke()
... application-called function
Meanwhile, at the same time, another thread is executing for some reason, and here is its call stack (pseudo names):

ACE_Svc_Handler::destroy()
TAO_IIOP_Connection_Handler::handle_close()
ACE_Select_Reactor_Handle_Repository::unbind()
ACE_Select_Reactor_T::remove_handler_i()
ACE_Select_Reactor_T::remove_handler()
ACE_TP_Reactor::notify_handle()
ACE_TP_Reactor::handle_events()
ACE_Reactor::handle_events()
TAO_ORB_Core::run()
CORBA_ORB::run()

It is during the call to destroy() that the event handler commits suicide (delete this):

template <PR_ST_1, ACE_SYNCH_DECL> void
ACE_Svc_Handler<PR_ST_2, ACE_SYNCH_USE>::destroy (void)
{
  ACE_TRACE ("ACE_Svc_Handler<PR_ST_2, ACE_SYNCH_USE>::destroy");

  // Only delete ourselves if we're not owned by a module and have
  // been allocated dynamically.
  if (this->mod_ == 0 && this->dynamic_ && this->closing_ == 0)
    // Will call the destructor, which automatically calls <shutdown>.
    // Note that if we are *not* allocated dynamically then the
    // destructor will call <shutdown> automatically when it gets run
    // during cleanup.
    delete this;
}

What seems to be happening is that there is a race between the two threads above. At times the first thread gets to the eh->handle_close() in the first code snippet, but by the time it gets there, eh's guts have already been deleted by the `delete this` in the other thread. Most of the time everything works OK, but intermittently I do see the crash. It seems to happen when this mystery call to destroy() occurs after the first scoped section in the close_connection_i() function above. Most of the time it happens before, in which case eh is 0 and close_connection_i() just returns, so everything is OK.

So, my questions are:

1. Has anyone else seen anything like this?
2. I can't determine what sets off the second call stack, from handle_events(), that ultimately calls `delete this` on the event handler being used by the other thread. It seems to come out of nowhere. Is there anything I'm missing here?
3. For lack of knowing why the second thread is doing what it's doing, it seems that more locking may be needed to protect against the event handler being ripped out from underneath the close_connection() call. Am I correct?

I'm still looking into this one but wanted a sanity check from the community. Thanks in advance for any input.

Here is my diagnosis of the problem. This is part of the mail that I exchanged with Chris and Carlos. My mails point to this as the reason:

------------------------------------------------------------
Carlos & Chris -

Can you guys please look at bug report 1020 and the one Bruce Trask (in ace-users) sent in yesterday? The problem in both cases comes from here:

void
TAO_Transport::close_connection_i (void)
{
  .......
  // Close the underlying connection, it is enough to get an
  // Event_Handler pointer to do this, so we can factor out the code
  // in the base TAO_Transport class.

  // We first try to remove the handler from the reactor.  After that
  // we destroy the handler using handle_close ().  The remove handler
  // is necessary because if the handle_close is called directly, the
  // reactor would be left with a dangling pointer.
  if (this->ws_->is_registered ())
    {
      this->orb_core_->reactor ()->remove_handler (
          eh,
          ACE_Event_Handler::ALL_EVENTS_MASK |
          ACE_Event_Handler::DONT_CALL);
    }

  (void) eh->handle_close (ACE_INVALID_HANDLE,
                           ACE_Event_Handler::ALL_EVENTS_MASK);
  ........
}

The problem, as I see it, is this: one thread works on close_connection_i() while the other is in handle_events() trying to dispatch for a failure. One of the threads, in the function above, marks the handle for deletion, but it is not deleted yet. Before that thread gets to handle_close(), the thread in the reactor wakes up and calls handle_close(). The thread in the above method is left high and dry.

The solution would be to let the thread do everything in one shot, i.e. just call remove_handler() without any mask. That way, the reactor would hold the token before making any changes to its internals. The second thread wouldn't be dispatched, since the thread in the above function holds the token. IMHO, the code should read:

{
  ...
  if (this->ws_->is_registered ())
    {
      this->orb_core_->reactor ()->remove_handler (eh);
    }
  else
    {
      // For the wait-on-read/write case
      (void) eh->handle_close (ACE_INVALID_HANDLE,
                               ACE_Event_Handler::ALL_EVENTS_MASK);
    }
  ...
}

---------------------------------------------------------------------------

A more detailed mail was also sent. The problem is a bit more complex than I first thought. This is my thought process on it:

----------------------------------------------------------------------------

OK, here is the problem again. One of the threads tries to make an invocation, finds that the connection is useless, and tries closing the connection. At the same moment, another thread waiting on select() receives the message and goes to close the connection.

The thread in the reactor does these (the thread on select() holds the reactor's token):

- unblocks from select() because of the closed connection
- picks up the handle that needs attention from the map
- suspends the handle so that other threads don't dispatch to the same handle
- releases the token (this is a very important step)
- calls the handle_* method, which in this case is first a handle_input() and then a remove_handler()

Let us now look at the thread that is going to close the connection from TAO_Transport::close_connection_i (). It:

- calls remove_handler() on the reactor
- grabs the internal token of the reactor (a very important step)
- the remove_handler() calls unbind() on its internal map
- the unbind() call does the following:
  * clears out the bits for the handle in the wait set and suspend set
  * marks state changed in the reactor
  * calls handle_close(). Note this is an upcall.
  * and then some shutdown-related stuff, which is not important for this discussion

OK, now imagine this race condition: the second thread, from close_connection_i(), calling handle_close() from unbind(), and the thread in the reactor trying to make an upcall. The upcall can potentially fail, because the handler could have been deleted if the other thread was faster. The problem will not appear if the thread in the reactor has gone inside handle_input(); then we have sufficient mechanisms to keep the handler alive. But just imagine a case where one thread is done with handle_close() and the other thread is trying to make an upcall -- because the thread in the reactor thought it was safe. As a matter of fact, it is. The problem IMHO kicks in from the thread that comes from TAO's close_connection_i(). For me, the code in unbind() is simply broken:

int
ACE_Select_Reactor_Handler_Repository::unbind (ACE_HANDLE handle,
                                               ACE_Reactor_Mask mask)
{
  ....
  // Clear out the <mask> bits in the Select_Reactor's wait_set.
  this->select_reactor_.bit_ops (handle,
                                 mask,
                                 this->select_reactor_.wait_set_,
                                 ACE_Reactor::CLR_MASK);

  // And suspend_set.
  this->select_reactor_.bit_ops (handle,
                                 mask,
                                 this->select_reactor_.suspend_set_,
                                 ACE_Reactor::CLR_MASK);

  // Note the fact that we've changed the state of the <wait_set_>,
  // which is used by the dispatching loop to determine whether it can
  // keep going or if it needs to reconsult select().
  this->select_reactor_.state_changed_ = 1;

  // Close down the <Event_Handler> unless we've been instructed not
  // to.
  if (ACE_BIT_ENABLED (mask, ACE_Event_Handler::DONT_CALL) == 0)
    eh->handle_close (handle, mask);
  ....
}

Why are we calling handle_close() on a handler without even checking whether the handler is suspended or not? Can we add a check? We can, but that will not solve the issue. If the handler is suspended, what do we do? Just get out without closing the handler? Maybe that will stop the crash, but there *could* be a leak.
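For concreteness, the kind of check contemplated above might look something like this (a sketch only, inspecting the repository's view of the suspend set; as just argued, it would trade the crash for a possible leak):

  // Sketch only: skip the upcall when the handle is still in the
  // suspend set, i.e. some thread is in the middle of dispatching to
  // it.  The suspended handler would then never get its
  // handle_close(), which *could* leak it -- this is why the check
  // alone is not a real fix.
  if (ACE_BIT_ENABLED (mask, ACE_Event_Handler::DONT_CALL) == 0
      && this->select_reactor_.suspend_set_.rd_mask_.is_set (handle) == 0)
    eh->handle_close (handle, mask);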
Here is another way we could try fixing the problem:

- Overload the remove_handler() call in the reactor.
- If the handle that needs to be removed is suspended, then we know in the TP_Reactor that a thread is dispatching to that handler. We just mark the handle as "needs to be removed".
- After the dispatching thread returns, use that same thread to remove the handler. (A rough sketch of this idea follows below.)

Here are some caveats:

* We need to maintain a handle set in the TP_Reactor.
* A thread, after dispatching, needs to grab the token before checking whether the handle it has just dispatched needs to be removed. This would mean waking up another thread. Alternatively, we could protect the handle set using some lighter-weight mechanism. Either way, this could make things a bit slower.
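As an illustration of the deferred-removal idea (every name here -- My_TP_Reactor, to_be_removed_, is_suspended(), remove_handler_now(), post_dispatch() -- is a hypothetical stand-in, not existing ACE API):

// Sketch only: defer removal of a suspended handler until the thread
// dispatching to it has returned from its upcall.
int
My_TP_Reactor::remove_handler (ACE_Event_Handler *eh,
                               ACE_Reactor_Mask mask)
{
  // Serialize with the dispatching threads.
  ACE_GUARD_RETURN (ACE_SYNCH_MUTEX, guard, this->lock_, -1);

  ACE_HANDLE h = eh->get_handle ();

  if (this->is_suspended (h))   // another thread is dispatching to h
    {
      // Defer: just mark the handle "needs to be removed"; the
      // dispatching thread performs the removal after its upcall.
      this->to_be_removed_.set_bit (h);
      return 0;
    }

  return this->remove_handler_now (eh, mask);   // the normal path
}

// Called by a dispatching thread after each upcall, once it has
// re-acquired the token.
void
My_TP_Reactor::post_dispatch (ACE_Event_Handler *eh)
{
  ACE_HANDLE h = eh->get_handle ();

  if (this->to_be_removed_.is_set (h))
    {
      this->to_be_removed_.clr_bit (h);
      (void) this->remove_handler_now (eh,
                                       ACE_Event_Handler::ALL_EVENTS_MASK);
    }
}

Here to_be_removed_ would be the handle set mentioned in the first caveat above (an ACE_Handle_Set would do).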
I modified the $TAO_ROOT/tests/MT_Server test to duplicate the bug. I tried to put the file up as an attachment to this bug, but there was some error; I have a mail out to the webmaster for help. In the meantime, here are the file's contents. It is a slightly modified version of the client.cpp in the MT_Server test. Just replace client.cpp with this one, run the server and then the client on Win2K, and the problem should occur.

// client.cpp,v 1.3 2000/09/13 22:22:36 coryan Exp

#include "ace/Get_Opt.h"
#include "ace/Task.h"
#include "testC.h"

ACE_RCSID(MT_Server, client, "client.cpp,v 1.3 2000/09/13 22:22:36 coryan Exp")

const char *ior = "file://test.ior";
int niterations = 5;
int do_shutdown = 0;

int
parse_args (int argc, char *argv[])
{
  ACE_Get_Opt get_opts (argc, argv, "xk:i:");
  int c;

  while ((c = get_opts ()) != -1)
    switch (c)
      {
      case 'x':
        do_shutdown = 1;
        break;
      case 'k':
        ior = get_opts.optarg;
        break;
      case 'i':
        niterations = ACE_OS::atoi (get_opts.optarg);
        break;
      case '?':
      default:
        ACE_ERROR_RETURN ((LM_ERROR,
                           "usage: %s "
                           "-k <ior> "
                           "-i <niterations> "
                           "\n",
                           argv [0]),
                          -1);
      }
  // Indicates successful parsing of the command line
  return 0;
}

class Worker : public ACE_Task_Base
{
  // = TITLE
  //   Run a server thread
  //
  // = DESCRIPTION
  //   Use the ACE_Task_Base class to run server threads
  //
public:
  Worker (CORBA::ORB_ptr orb);
  // ctor

  virtual int svc (void);
  // The thread entry point.

private:
  CORBA::ORB_var orb_;
  // The orb
};

int
main (int argc, char *argv[])
{
  ACE_TRY_NEW_ENV
    {
      CORBA::ORB_var orb =
        CORBA::ORB_init (argc, argv, "", ACE_TRY_ENV);
      ACE_TRY_CHECK;

      if (parse_args (argc, argv) != 0)
        return 1;

      //////////////////////////////////////
      // Fire up a few threads which are driving handle_events().
      // This is just to cause bug 1020.
      Worker worker (orb.in ());
      if (worker.activate (THR_NEW_LWP | THR_JOINABLE, 3) != 0)
        ACE_ERROR_RETURN ((LM_ERROR,
                           "Cannot activate client threads\n"),
                          1);

      while (1)
        {
          ///////////////////////////////////////
          try
            {
              CORBA::Object_var object =
                orb->string_to_object (ior, ACE_TRY_ENV);
              ACE_TRY_CHECK;

              Simple_Server_var server =
                Simple_Server::_narrow (object.in (), ACE_TRY_ENV);
              ACE_TRY_CHECK;

              if (CORBA::is_nil (server.in ()))
                {
                  ACE_ERROR_RETURN ((LM_ERROR,
                                     "Object reference <%s> is nil\n",
                                     ior),
                                    1);
                }

              for (int i = 0; i != niterations; ++i)
                {
                  CORBA::Long r =
                    server->test_method (i, ACE_TRY_ENV);
                  ACE_TRY_CHECK;

                  // This is just to cause bug 1020
                  if ((i % 3) == 0 && i != 0)
                    {
                      server->shutdown (ACE_TRY_ENV);
                      ACE_TRY_CHECK;
                    }

                  if (r != i)
                    {
                      ACE_DEBUG ((LM_DEBUG,
                                  "(%P|%t) unexpected result = %d for %d",
                                  r, i));
                    }
                }

              if (do_shutdown)
                {
                  server->shutdown (ACE_TRY_ENV);
                  ACE_TRY_CHECK;
                }
            }
          catch (CORBA::SystemException& ex)
            {
            }
        }
    }
  ACE_CATCHANY
    {
      ACE_PRINT_EXCEPTION (ACE_ANY_EXCEPTION,
                           "Exception caught:");
      return 1;
    }
  ACE_ENDTRY;
  return 0;
}

// ****************************************************************

Worker::Worker (CORBA::ORB_ptr orb)
  : orb_ (CORBA::ORB::_duplicate (orb))
{
}

int
Worker::svc (void)
{
  ACE_DECLARE_NEW_CORBA_ENV;
  ACE_TRY
    {
      this->orb_->run (ACE_TRY_ENV);
      ACE_TRY_CHECK;
    }
  ACE_CATCHANY
    {
    }
  ACE_ENDTRY;
  return 0;
}

Hope this helps,
Bruce
btrask@contactsystems.com
Accepted.
Assigning the bug to me
Accepting the bug
*** Bug 1233 has been marked as a duplicate of this bug. ***
Created attachment 130 [details] tarred gzipped file
The last attachment is for bug 1020 only. The name of the file is misleading; forget the name, the contents are for this bug only.
Crashes in the ORB are blockers for the 1.3 release.
*** Bug 1322 has been marked as a duplicate of this bug. ***
Added tao-support to the CC list
Some sort of breakthrough with this bug. A quick background:

Problem:
--------
When you have a multi-threaded client communicating with a server, and the server happens to die, the client crashes horribly.

Reason:
-------
When the server crashed, and the situation was like this:

thread 1: trying to write a request
threads 2..X: in the reactor, driving the handle_events() loop

we got this crash. The real reason behind it is that "thread 1" tried closing connections while one of the threads (2 through X) was woken up by the reactor to close the same connection. So we had two threads trying to close connections without any synchronization. Any usage of synchronization primitives, like a mutex, refcount, or whatever, led to a chicken-and-egg problem. This invariably crashed.

Proposed Solution:
------------------
Had the following guiding principles:

1. Only one thread will close the connection.
2. The other thread should know that and just leave without trying to close again.

The fix leverages the existing fixes for BUG 1269 and BUG 1270. In the scenario presented above, instead of allowing thread 1 to close the connection, we now let that thread call handle_events() on the reactor with a zero timeout value, in a loop. After every call to handle_events(), the status of the connection (as maintained by the connection handler) is checked. Doing this helps achieve the following:

1. If another thread (2 through X) has been dispatched to close the connection, then "thread 1" will not get the connection, since the handle is suspended while the dispatch occurs.
2. If one of threads 2 through X actually closed the connection, "thread 1" can monitor the status after the call to handle_events().
3. "thread 1" will not block, since it uses a zero timeout.
4. If there are no threads other than "thread 1" in the reactor, "thread 1" will learn of the closure event first and will proceed to close the connection. On return from the call to handle_events(), "thread 1" would check the status and then exit.

I have been testing this with $TAO_ROOT/tests/Bug_1020_Basic_Regression on charanga.cs and tao.isis. Seems to be going well so far. I have to test this more with other tests that have been reported in the past. I will attach patches for what is described above shortly.

Shortcoming in the patch:
-------------------------
Need to add a check which prevents "thread 1" from calling handle_events() if the handle has not been registered with the reactor.

Side effect:
------------
Needed to touch the TAO_LF_Event class to expose the state changes.

Purpose of this entry:
----------------------
Need feedback from all quarters on whether this approach is right, or whether I am missing something very, very obvious. Critical comments are welcome.
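To make the approach concrete, here is a minimal sketch of the polling loop described above. close_connection_on_fault() and connection_closed() are illustrative assumptions, not the actual patch; see the attached patches for the real code.

// Sketch only: "thread 1" drives the reactor with a zero timeout and
// re-checks the connection state after every call to handle_events().
void
close_connection_on_fault (TAO_Transport *transport,
                           TAO_ORB_Core *orb_core)
{
  ACE_Reactor *reactor = orb_core->reactor ();

  // Hypothetical state query maintained by the connection handler.
  while (!transport->connection_closed ())
    {
      ACE_Time_Value zero (0, 0);   // never block "thread 1"

      // Give any thread already dispatched to this handle a chance to
      // finish the close; the handle is suspended during its dispatch,
      // so we cannot race with it.
      if (reactor->handle_events (zero) <= 0)
        {
          // Nothing left to dispatch: this thread saw the closure
          // event first, so it can safely close the connection itself.
          transport->close_connection ();
          break;
        }
    }
}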
Created attachment 156 [details] patches for 1020
Just a point: using the Leader/Follower instead of the reactor would be better. Use the same idea, but drive the LF instead of the Reactor, as sketched below.
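Roughly, that would look like the following (a sketch only; the exact TAO_Leader_Follower / TAO_LF_Event calls and the poll interval are assumptions):

// Sketch only: the same polling idea, expressed against the
// Leader/Follower wait strategy instead of the raw reactor.
TAO_Leader_Follower &lf = orb_core->leader_follower ();

// Event tracking the state of this connection (kept by the handler).
TAO_LF_Event *ev = ...;

ACE_Time_Value timeout (0, 100);   // assumed short poll interval

while (!ev->successful () && !ev->error_detected ())
  {
    // Wait inside the LF; another thread may become the leader and
    // close the connection, in which case the event's state changes
    // and the loop exits.
    if (lf.wait_for_event (ev, transport, &timeout) == -1)
      break;
  }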
*** Bug 1354 has been marked as a duplicate of this bug. ***
Created attachment 159 [details] patches to existing files
Created attachment 160 [details] New files
The patches in http://deuce.doc.wustl.edu/bugzilla/showattachment.cgi?attach_id=159 show the diffs against the existing files. Here is what has been done:

1. The transport, on detecting a fault in its output path, calls the leader/follower with a timeout. (The timeout can be made an ORB option if needed.)

2. The LF_Event class has been changed considerably. We had mixed the state changes of the invocation path with those of the connection establishment path. Since that was confusing, the TAO_LF_Event class is now an abstract base class.

3. Two concrete classes have been added: TAO_LF_Invocation_Event, which does the work previously done in TAO_LF_Event, and TAO_LF_CH_Event, which takes care of non-blocking connection establishment and of termination when established connections fail.

4. The new files are in http://deuce.doc.wustl.edu/bugzilla/showattachment.cgi?attach_id=160.

These are the changes suggested for this bug. Bug_1020_Basic_Regression now works fine. This bug also opened up yet another problem, which showed up in the Veritas test cases; there is going to be a separate report on that.
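For orientation, the resulting class shape is roughly the following (a sketch only; the actual state names and method signatures are in attachment 160):

// Sketch of the refactored hierarchy.  TAO_LF_Event keeps the common
// state machine; each concrete class decides which transitions are
// legal for its path.
class TAO_LF_Event
{
public:
  virtual ~TAO_LF_Event (void);

  // Transition the state, e.g. to LFS_CONNECTION_CLOSED.
  void state_changed (int new_state);

protected:
  // Hook for the concrete classes: validate and apply the transition.
  virtual void state_changed_i (int new_state) = 0;

  int state_;
};

// State changes along the invocation path (the work the old
// TAO_LF_Event used to do).
class TAO_LF_Invocation_Event : public TAO_LF_Event
{
protected:
  virtual void state_changed_i (int new_state);
};

// State changes driven by the connection handler: non-blocking
// connection establishment, and termination of established
// connections on failure.
class TAO_LF_CH_Event : public TAO_LF_Event
{
protected:
  virtual void state_changed_i (int new_state);
};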
A workaround has been added to TAO 1.2.6. Here is the relevant ChangeLog entry:

Tue Nov 12 12:47:31 2002  Balachandran Natarajan  <bala@isis-server.isis.vanderbilt.edu>

There are three tests in TAO that pass. I am not closing this bug, since this is only a workaround.
Fixed. This ChangeLogTag:

Mon Jul 07 18:00:38 2003  Irfan Pyarali  <irfan@oomworks.com>

describes the proper fixes to these problems.