Bug 1265

Summary: Potential ORB deadlocks during shutdown
Product: TAO Reporter: Carlos O'Ryan <coryan>
Component: ORBAssignee: DOC Center Support List (internal) <tao-support>
Status: RESOLVED FIXED    
Severity: major    
Priority: P3    
Version: 1.2.3   
Hardware: All   
OS: All   
Bug Depends on:    
Bug Blocks: 1277    
Attachments: Full blown patch file for this bug.

Description Carlos O'Ryan 2002-08-04 13:26:52 CDT
The ORB can deadlock during a call to shutdown or destroy(), basically, a mutex
is held while the ORB destroys the POA, the POA calls application code during
its destructions (as servants are released), and that can call back to the ORB,
thus the deadlock.

The offending code is here:

------------------------------------
void
TAO_ORB_Core::shutdown (CORBA::Boolean wait_for_completion
                        ACE_ENV_ARG_DECL)
{
  ACE_GUARD (TAO_SYNCH_MUTEX, monitor, this->lock_);

  if (this->has_shutdown () == 0)
    {
------------------------------------

It looks like the intent was to make the operation a no-op if the shutdown()
flas was already set, that can easily be achieved by changing the code to:

------------------------------------------------------------------------
void
TAO_ORB_Core::shutdown (CORBA::Boolean wait_for_completion
                        ACE_ENV_ARG_DECL)
{
  {
    ACE_GUARD (TAO_SYNCH_MUTEX, monitor, this->lock_);
    if(this->has_shutdown () != 0)
       return;
    this->has_shutdown_ = 1;
  }

  // Put the rest of the code here...
}
------------------------------------------------------------------------

  a full-blown state machine for the shutdown process might be useful too...
Comment 1 Nanbor Wang 2002-08-04 13:41:59 CDT
Sounds about right. Could you checkin this fix once the beta goes out. If that 
may not be possible please let me know and I can add this to my list

Thanks
Comment 2 Ossama Othman 2002-08-04 19:13:29 CDT
Urm, please don't check in the fix just yet.  Other people, namely me, would 
like to discuss it.  :)

The only problem I see with that fix is that we'd be setting the ORB into 
a "shutdown" state before any of the actual shutdown code has run.  Suppose an 
exception is thrown before shutdown() is completed?  Is it reasonable to allow 
the application to retry the shutdown()?  If so, subsequent calls to shutdown
() will be no-ops.  The current code isn't all that great either since it sets 
the has_shutdown_ flag in the middle of the shutdown() code.

As an interim solution, can we add one or more ACE_Reverse_Locks to release 
the lock during any calls that potentially callback on the ORB instead?
Comment 3 Nanbor Wang 2002-08-04 23:01:47 CDT
I thought about having a reverse lock, but that would have race conditions. 
Imagine that you are leaving the lock to do adapter_registery_.close (), then 
it is possible that two threads could do the same thing which is "bad". There 
are ways to fix that, but the problem starts getting nasty!

If exception are thrown from the shutdown () call, it is better to leave stuff 
instead of calling it back again and have a core dump! If shutdown () had 4 
calls and during the first invocation, the fourth call threw an exception. 
Would you try invoking shutdown () again? :-). Would like to know your opinions 
on these!
Comment 4 Ossama Othman 2002-08-05 00:00:55 CDT
I wasn't implying that we should just add reverse locks blindly.  I'm well aware
of the race condition possibilities.  Let's assume that I've actually done some
multithreaded programming, run into deadlocks and race conditions and corrected
them.  The idea is to determine where and when reverse locks can be employed,
and if not why not.  Perhaps using the ORB lock is too coarse grained.  What
about using finer grained locks?  Also, Carlos' state machine idea sounds
interesting, too.

The argument about a seg fault occuring from calling shutdown() multiple times
is bogus.  Regardless of whether or not shutdown() is called multiple times
exception or not, shutdown() should never seg fault.  The ORB should be robust
enough to avoid such a fatal condition.
Comment 5 Carlos O'Ryan 2002-08-05 08:27:09 CDT
Created attachment 133 [details]
Full blown patch file for this bug.
Comment 6 Carlos O'Ryan 2002-08-05 08:46:44 CDT
Sorry, I added the attachment before seeing the discussion.

Ossama raises some interesting points, here are some comments:

1) Raising exceptions: the main concern is what happens if any of the operations
raises an exception during shutdown().  First that should not happen, we are
basically calling cleanup methods here, raising exceptions during cleanup is
evil (check Herb Sutter's "Exceptional C++" book for details.)  Basically
throwing exceptions in the middle of a cleanup operation is evil because there
is no way to complete the cleanup or backup the changes, so you are stuck with
half-destructed, inconsistent objects in your system, a worse situation than
having those exceptions ignored.

Now, if you *HAVE* to throw exceptions (for example because "now it is not a
right time to cleanup"), then you should check all the exceptional conditions
first and then start cleanup, that's what we do (I believe.)  Unfortunately that
is not enough, cleaning up the ORB involves calls to application code, where
some poor misguided soul could decided to throw.  IMHO we should enforce the "do
not throw" rules by having throw specs (to document our intent) and catching
exceptions (to protect ourselves against stupid compilers and users that
"accidentally" remove the throw specs.)

In short, "do not let exceptions escape from shutdown."

2) Finer grained locking: yes, this could solve my current problem.  But some
poor fool of a user will someday call shutdown() in the middle of shutdown(), no
matter how fine grained your locks you will deadlock in that case.  So I do not
like it as a general solution.

3) Reverse locks: race conditions aside (I assume we can solve them), I see no
advantage in this approach vs. my solution, and mine looks cleaner, but I'm
obviously biased here.

So my final vote is: use my patch but also enforce the exception rules during
cleanup.
Comment 7 Ossama Othman 2002-08-05 09:41:23 CDT
I agree that we shouldn't let exceptions escape from shutdown() once the
shutdown process has actually started.  If I understand you correctly, that
means we'll have to catch all exceptions thrown by TAO_Adapter_Registry::close()
and TAO_ORB_Core::destroy_interceptors(), or prevent them from throwing exceptions
in the first place (enforced/documented with a throw()), right?

I suppose users can always call POA::destroy() manually before ORB::shutdown()
if they want see any POA exceptions that occur during POA destruction.

Anyway, I'm okay with Carlos' new patch.  Thanks Carlos!
Comment 8 Carlos O'Ryan 2002-08-05 09:57:31 CDT
Just to answer Ossama's questions:

> I agree that we shouldn't let exceptions escape from shutdown()
> once the shutdown process has actually started.  If I understand
> you correctly, that means we'll have to catch all exceptions
> thrown by TAO_Adapter_Registry::close()

Yes.

> and TAO_ORB_Core::destroy_interceptors(),

Yes.

> or prevent them from throwing exceptions in the first place

Err, I would say "*AND* prevent them....", paranoia has payed out nicely in this
context.

> (enforced/documented with a throw()), right?

That too :-)

BTW, the question is now: what do we do when one of the application callbacks do
raise in a context where it is not allowed?  I would like to see such problems
logged, but using something that the application can hook to, so they (me :-)
can log them or handle them as they see fit.

This is a matter for a longer discussion, and I will bring it up in devo_group.
Comment 9 Nanbor Wang 2002-08-08 23:01:36 CDT
Accepting for tao-support. Can be fixed after the beta!
Comment 10 Carlos O'Ryan 2002-08-10 21:35:34 CDT
IMHO this is a blocker too, in any case it is very easy to fix since patches 
are already available.
Comment 11 Nanbor Wang 2002-08-27 18:59:43 CDT
Fixed! Please see 

Tue Aug 27 12:40:12 2002  Balachandran Natarajan 
<bala@isis-server.vuse.vanderbilt.edu>

for details.