Bug 1269 - ORB crashes if peer dies while ORB is blocked trying to send requests
Status: RESOLVED FIXED
Alias: None
Product: TAO
Classification: Unclassified
Component: ORB
Version: 1.2.3
Hardware: All All
Importance: P3 critical
Assignee: DOC Center Support List (internal)
URL:
Depends on: 1305 1309
Blocks: 1202 1277
Reported: 2002-08-05 15:14 CDT by Carlos O'Ryan
Modified: 2002-11-02 20:08 CST

Attachments
Regression test for this bug (tarred) (7.59 KB, application/octet-stream)
2002-08-05 15:57 CDT, Carlos O'Ryan
Patches to the ORB core. (37.54 KB, patch)
2002-10-14 10:04 CDT, Carlos O'Ryan
Patches to the protocols. (19.82 KB, patch)
2002-10-14 10:05 CDT, Carlos O'Ryan

Description Carlos O'Ryan 2002-08-05 15:14:28 CDT
OK, we have been here before and fixed many of these bugs, but it is happening
again.  Obviously the regression tests are missing the problem, and the
connection management is busted yet again.

I will be adding a regression test shortly, but having the server crash in the
middle of a request like this:

// IDL
interface Echo
{
  void echo_payload(in Payload x);
};

is not good news.  I think it also crashes if the operation above is a oneway.
Notice that the client has to be in the right state, i.e. blocked trying to
write or using the Reactor/handle_output() loop to send the data.  It also helps
if the error is detected by write() instead of read().
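
To illustrate the last point, here is a minimal POSIX sketch (not taken from TAO or from the regression test) of how a dead peer shows up on the write() side.  The helper name errno_after_peer_close is hypothetical, and a local socketpair() stands in for the real connection between the two ORBs.

```cpp
#include <cerrno>
#include <csignal>
#include <cassert>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper, not part of TAO.  Returns the errno produced by
// writing to a stream socket whose peer has already gone away, 0 if the
// write unexpectedly succeeds, or -1 if the socket pair cannot be created.
int errno_after_peer_close ()
{
  // Without this, the failed write would raise SIGPIPE, whose default
  // action terminates the process before write() can report an error.
  std::signal (SIGPIPE, SIG_IGN);

  int fds[2];
  if (::socketpair (AF_UNIX, SOCK_STREAM, 0, fds) != 0)
    return -1;

  ::close (fds[1]);   // the "peer" dies

  const char payload[] = "echo_payload";
  const ssize_t n = ::write (fds[0], payload, sizeof payload);
  const int err = (n < 0) ? errno : 0;   // capture errno before close()

  ::close (fds[0]);
  return err;
}
```

With SIGPIPE ignored, the write fails with EPIPE, so the sender learns about the dead peer in its output path rather than from a read() returning end-of-file.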

But at this point my analysis is probably premature (read that as "most likely
wrong").  The core files don't lie, though :-)
Comment 1 Carlos O'Ryan 2002-08-05 15:57:13 CDT
Created attachment 134
Regression test for this bug (tarred)
Comment 2 Carlos O'Ryan 2002-08-05 15:58:10 CDT
Doh!  The repo is frozen, so I attached the regression test to the bug.  I will
commit it once the beta is out and they thaw the repo.
Comment 3 Nanbor Wang 2002-08-20 11:19:22 CDT
Last heard that Carlos was looking to fix this. If not, we need to take care of 
this.
Comment 4 Carlos O'Ryan 2002-09-09 16:37:45 CDT
Adding dependency on 1305
Comment 5 Carlos O'Ryan 2002-10-14 10:04:37 CDT
Created attachment 150
Patches to the ORB core.
Comment 6 Carlos O'Ryan 2002-10-14 10:05:01 CDT
Created attachment 151
Patches to the protocols.
Comment 7 Carlos O'Ryan 2002-10-14 10:13:00 CDT
OK.  I attached two patches that fix this bug.
The first patch:

http://deuce.doc.wustl.edu/bugzilla/showattachment.cgi?attach_id=150

modifies the ORB core and solves the problem (at least as far as I can solve it.)
The second patch:

http://deuce.doc.wustl.edu/bugzilla/showattachment.cgi?attach_id=151

simply modifies the pluggable protocols to match the changes in the ORB Core, so
it needs no explanation.

As to the first patch, here are the changes in detail:

1) It eliminates the pending_upcall_ vs. refcount_ fields in the
Connection_Handler.  Having two reference counts is hard to debug and extremely
hard to get right.  It also makes it hard to state when the object is deleted
and hard to analyze the reference counting rules, and it does not actually help
with anything I can see, so it is zapped.

2) The transport_ field in the Connection_Handler is atomically modified.

3) Closing connections is also atomic.

4) When a connection is closed *all* the activations in the Reactor are removed.

The last one is the really important change, but it does not help without (3).

I also documented the reference counting with REFCNT comments in the places
where it is incremented or decremented, that way we can analyze reference
counting statically, and convince ourselves that it is done right.
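
As a rough illustration of changes (1), (3) and (4) and of the REFCNT comment convention, here is a minimal sketch.  It is not the code in the attached patches: Toy_Handler, READ_MASK, WRITE_MASK and every method name below are hypothetical, and a plain bit mask stands in for the handler's activations in the Reactor.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical event masks; a real reactor defines its own.
enum { READ_MASK = 1, WRITE_MASK = 2 };

// Toy handler, NOT the TAO Connection_Handler.
class Toy_Handler
{
public:
  Toy_Handler () : refcount_ (1), closed_ (false), registrations_ (0) {}

  // Number of handlers destroyed so far (for demonstration only).
  static int &destroyed () { static int n = 0; return n; }

  // REFCNT: the caller now holds one more reference.
  void incr_refcount () { ++refcount_; }

  // REFCNT: the caller gives up one reference.  With a single count,
  // the object is deleted exactly when that count reaches zero.
  long decr_refcount ()
  {
    const long remaining = --refcount_;
    if (remaining == 0)
      delete this;
    return remaining;
  }

  // Stand-in for activating the handler in a reactor for some events.
  void register_mask (unsigned mask) { registrations_ |= mask; }
  unsigned registrations () const { return registrations_.load (); }

  // Atomic close: only the first caller performs the close, and it
  // removes *all* activations, not just the one that saw the error.
  bool close_connection ()
  {
    if (closed_.exchange (true))
      return false;              // somebody else already closed it
    registrations_.store (0);
    return true;
  }

private:
  ~Toy_Handler () { ++destroyed (); }   // reachable only via decr_refcount()

  std::atomic<long> refcount_;
  std::atomic<bool> closed_;
  std::atomic<unsigned> registrations_;
};
```

In this sketch a second call to close_connection() is a harmless no-op, and the private destructor forces every owner to go through decr_refcount(), which is what makes the REFCNT comments checkable by reading the code.  A real handler would also have to keep new registrations from racing with the close, which this toy does not attempt.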

Please review the changes and let me know what you think.  Be advised, I do
not have much time to break the changes into smaller portions, so if there is
something you do not like, you had better change it yourselves.
Comment 8 Carlos O'Ryan 2002-10-22 13:28:46 CDT
Not mine anymore.  I submitted the patches and everything.  Returning to the
tao-support tarpit.
Comment 9 Nanbor Wang 2002-11-02 20:08:18 CST
Fixed! Details are available in

Mon Oct 21 22:45:02 2002  Balachandran Natarajan  <bala@isis-server.isis.vanderbilt.edu>

I ran the tests for this in various ways for almost the past two days. The
only crash I have seen in the test was caused by a stack overflow. After
this aggressive testing, we can give some assurance that this is fixed.