Bug 1482

Summary: TAO_Muxed_TMS::connection_closed () and AMI
Product: TAO
Reporter: Nanbor Wang <bala>
Component: ORB
Assignee: Nanbor Wang <bala>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P3    
Version: 1.3.1   
Hardware: All   
OS: All   
Attachments: Test TAR for Bala, from CTS

Description Nanbor Wang 2003-04-03 10:53:58 CST
This is a bug report from Carlos at ATD.

---------------------------------------------------------------------
----------------
Sun Dec 29 08:44:06 2002  Balachandran Natarajan  <bala@isis-server.isis.vanderbilt.edu>

        * tao/Muxed_TMS.cpp:
        * tao/Muxed_TMS.h: Use recursive mutex instead of a simple
          mutex. This prevents a deadlock that might occur when a thread
          tries to send a request during reply dispatching.
----------------
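
To make the scenario in that entry concrete: with a plain mutex, the
thread that holds the TMS lock while dispatching a reply would block
forever if the dispatching code turns around and sends a new request
that needs the same lock. A minimal standalone sketch, with
std::recursive_mutex standing in for TAO_SYNCH_RECURSIVE_MUTEX and
made-up function names:

----------------------------------------------------------------
#include <iostream>
#include <mutex>

std::recursive_mutex tms_lock;  // with std::mutex this program would
                                // self-deadlock in send_request()

// Stand-in for sending a request, which must take the TMS lock.
void send_request ()
{
  std::lock_guard<std::recursive_mutex> guard (tms_lock);
  std::cout << "request sent" << std::endl;
}

// Stand-in for reply dispatching: holds the lock while running code
// that may issue a new request on the same connection.
void dispatch_reply ()
{
  std::lock_guard<std::recursive_mutex> guard (tms_lock);
  send_request ();  // same thread re-acquires the lock: OK, recursive
}

int main ()
{
  dispatch_reply ();
  return 0;
}
----------------------------------------------------------------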
I do not believe the fix is correct, or at least, it is not
immediately obvious that it is.  Consider the following code:

----------------------------------------------------------------
     191 TAO_Muxed_TMS::connection_closed (void)
     192 {
     193   ACE_GUARD (TAO_SYNCH_RECURSIVE_MUTEX, ace_mon, this->lock_);
     194   // @@ This should be done using a mutex, the table
     195
     196   REQUEST_DISPATCHER_TABLE::ITERATOR end =
     197     this->dispatcher_table_.end ();
     198   for (REQUEST_DISPATCHER_TABLE::ITERATOR i =
     199          this->dispatcher_table_.begin ();
     200        i != end;
     201        ++i)
     202     {
     203       (*i).int_id_->connection_closed ();
     204     }
     205   this->dispatcher_table_.unbind_all ();
     206 }
     207
----------------------------------------------------------------

        In this code, line 203 may invoke application code as part of
an AMI reply handler (to handle the exception).  The application can do
anything crazy at that point, including making new requests.  The
iterator declared on line 198 may then become invalidated, and if that
happens the system may crash.
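
A minimal standalone sketch of that hazard, with std::map and
std::function standing in for the dispatcher table and the reply
dispatchers (none of these names are TAO's): the second dispatcher's
close upcall reaches back into the table, which is exactly the kind of
structural change that can leave the iterator in the loop above
dangling.

----------------------------------------------------------------
#include <cstdio>
#include <functional>
#include <map>

int main ()
{
  // Stand-in for REQUEST_DISPATCHER_TABLE: request id -> close upcall.
  std::map<int, std::function<void ()> > table;

  table[1] = [] { std::puts ("dispatcher 1 closed"); };
  table[2] = [&table] {
    // Application code running inside the upcall mutates the table,
    // e.g. an AMI handler unbinding a dispatcher or making a request.
    std::puts ("dispatcher 2 closed; its handler unbinds entry 1");
    table.erase (1);
  };

  // The loop shape from connection_closed(): upcalls run mid-iteration.
  // Here the erase hits an already-visited node, which std::map happens
  // to tolerate; an unbind of the current or a not-yet-visited node
  // would leave 'i' (or 'end') dangling -- the crash described above.
  for (auto i = table.begin (); i != table.end (); ++i)
    i->second ();
  return 0;
}
----------------------------------------------------------------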

        I cannot convince myself that the problem is not real.  Can
anybody guide me there?  Also, I can easily see that only the thread
calling connection_closed() can manipulate the map.  But I cannot
convince myself that calling connection_closed() on the first reply
dispatcher won't result in any other reply dispatcher becoming
invalid.

        For example, what if the application runs an event loop while
in the middle of the first connection_closed() upcall?  What if all
the other reply dispatchers are AMI or DII deferred calls and they
time out during that nested event loop?  The dispatchers are deleted
as part of the nested loop and removed from the map.  That either invalidates 
the iterator or makes it possible to access a deleted
object.

        I thought about changing the code as follows:

----------------------------------------------------------------
     191 TAO_Muxed_TMS::connection_closed (void)
     192 {
     193   ACE_GUARD (TAO_SYNCH_RECURSIVE_MUTEX, ace_mon, this->lock_);
     194   REQUEST_DISPATCHER_TABLE tmp =
     195     this->dispatcher_table_;
     196   this->dispatcher_table_.unbind_all();
     197
     198   REQUEST_DISPATCHER_TABLE::ITERATOR end = tmp.end();
     199   for (REQUEST_DISPATCHER_TABLE::ITERATOR i = tmp.begin();
     200        i != end;
     201        ++i)
     202     {
     203       (*i).int_id_->connection_closed ();
     204     }
     205 }
----------------------------------------------------------------

        That deals with the iterators, but not with the deleted
objects.  There is also the extra cost of the copy on lines 194/195;
if the ACE hash map supported a swap() operation a la STL (hint hint),
we could implement the function using:
----------------------------------------------------------------
     194   REQUEST_DISPATCHER_TABLE tmp;
     195   tmp.swap(this->dispatcher_table_);
----------------------------------------------------------------
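
For what it's worth, the STL containers already support that idiom;
here is a sketch of the drain-then-notify pattern with std::map
standing in for the ACE hash map (illustrative names only). The swap
is O(1), and any reentrant bind from inside an upcall lands in the now
empty member table rather than in the snapshot being iterated:

----------------------------------------------------------------
#include <cstdio>
#include <functional>
#include <map>

typedef std::map<int, std::function<void ()> > Dispatcher_Table;

Dispatcher_Table dispatcher_table_;  // stand-in for the member table

void connection_closed ()
{
  // Drain the member table in O(1) before running any upcall.
  Dispatcher_Table tmp;
  tmp.swap (dispatcher_table_);

  // Reentrant binds/unbinds now hit dispatcher_table_, not tmp, so
  // these iterators stay valid.  (As noted above, this still does not
  // protect against a dispatcher being deleted out from under us.)
  for (Dispatcher_Table::iterator i = tmp.begin (); i != tmp.end (); ++i)
    i->second ();
}

int main ()
{
  dispatcher_table_[1] = [] { std::puts ("dispatcher 1 closed"); };
  dispatcher_table_[2] = [] { std::puts ("dispatcher 2 closed"); };
  connection_closed ();
  return 0;
}
----------------------------------------------------------------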
Comment 1 Mark Bober 2003-04-03 11:59:33 CST
Created attachment 193
Test TAR for Bala, from CTS
Comment 2 Nanbor Wang 2003-04-03 13:27:05 CST
Fixed. Please see

Thu Apr  3 13:13:24 2003  Balachandran Natarajan  <bala@dre.vanderbilt.edu>

for details
Comment 3 Nanbor Wang 2003-10-01 15:52:48 CDT
Reopening the bug. The problem hasn't been completely fixed. I still
see deadlocks in my branch.
Comment 4 Nanbor Wang 2003-10-31 10:41:41 CST
Fixed, and this time for real!

Sat Oct  4 23:09:51 2003  Balachandran Natarajan  <bala@dre.vanderbilt.edu>