Bug 982

Summary: $TAO_ROOT/tests/Oneway_Buffering test fails
Product: TAO
Reporter: Nanbor Wang <bala>
Component: ORB
Assignee: DOC Center Support List (internal) <tao-support>
Status: ASSIGNED ---    
Severity: normal    
Priority: P3    
Version: 1.1.18   
Hardware: All   
OS: All   
Bug Depends on: 1031    
Bug Blocks:    

Description Nanbor Wang 2001-07-22 09:20:31 CDT
We have a test in TAO that has been failing ever since I checked in my
fixes for 575. I have finally had a chance to figure out why it is
failing, after running in different directions (and staring at things
continuously :( ). Here is my finding. I would like everyone to go
through this and let me know how to deal with this problem (whether it
is an application problem or an ORB problem).

Ok, here is the outline.

- There is a guy called "admin" who just monitors stuff
- The server starts up
- The client is started.
- There is some communication to sync up the server, admin and client
(not really of interest to us at this point)
- The client bangs buffered oneways to the server with a good amount of
payload
- The server, after receiving a call, just turns to the admin to say how
much payload it received (this is a two-way call)

Now to the actual problem.  When the client requests start banging in,
the server tries to turn around and make two-way calls on the admin.
In the process, the server

- sends a call
- and then goes to wait () on the LF (Leader/Follower)

Before the reply from the admin gets in, the reactor wakes up the
thread and makes it act as a server to process more oneways (there is only
one thread in every process). This makes the server establish new
connections to the admin to send the payload information. This process
goes on, i.e., getting more messages and establishing new connections to
the admin. Somewhere along the line, the wait () reads the messages and
then the stack unwinds a bit. But establishing more connections keeps
growing our connection cache. After that there is havoc. Linux dumps a core
(we would be lucky if that doesn't happen), as things have become
inconsistent; on Solaris things hang. It looks like there is memory
corruption when we exceed the handle limits on Linux. We get SEGVs in
assorted places. Sometimes the test hangs.

How did this work before 575? The answer is simple.  We never resumed
the handle during the upcall to the server (and, in turn, the remote call
to the admin). We resumed the handle only after our upcall was done. So
things were happy as we proceeded to process the next call and so on.
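
The difference between the two behaviours can be illustrated with a toy
single-threaded model. This is only a rough sketch and not TAO code: the
names ToyServer, Result and simulate are made up for illustration, and the
model assumes that a resumed client handle always gets dispatched before the
admin's reply is read, and that a connection with a reply outstanding cannot
be reused (the Exclusive strategy).

```cpp
#include <algorithm>

// Toy model only; none of these names come from TAO.
struct Result {
  int connections;  // connections ever opened to the admin
  int max_depth;    // deepest nesting of upcalls on the single thread
};

class ToyServer {
public:
  ToyServer(bool resume_before_upcall, int pending_oneways)
    : resume_before_upcall_(resume_before_upcall),
      pending_(pending_oneways) {}

  // Outermost event loop: dispatch one oneway at a time.
  Result run() {
    while (pending_ > 0) {
      --pending_;
      upcall();
    }
    return Result{connections_, max_depth_};
  }

private:
  // One oneway upcall that makes a two-way call on the admin.
  void upcall() {
    ++depth_;
    max_depth_ = std::max(max_depth_, depth_);

    // Exclusive strategy: reuse an idle cached connection if there is
    // one, otherwise open a new connection to the admin.
    if (idle_ > 0)
      --idle_;
    else
      ++connections_;

    // Waiting for the reply. If the client handle was resumed before
    // the upcall, the reactor hands this same thread further oneways
    // (nested upcalls) before the reply is read.
    while (resume_before_upcall_ && pending_ > 0) {
      --pending_;
      upcall();
    }

    // Reply read: the connection goes back to the cache as idle.
    ++idle_;
    --depth_;
  }

  bool resume_before_upcall_;
  int pending_;
  int idle_ = 0;
  int connections_ = 0;
  int depth_ = 0;
  int max_depth_ = 0;
};

Result simulate(bool resume_before_upcall, int oneways) {
  return ToyServer(resume_before_upcall, oneways).run();
}
```

In this model simulate (true, 100) opens 100 connections with upcalls nested
100 deep, while simulate (false, 100) opens a single connection and never
nests. A real run would interleave replies and requests less regularly, so
the numbers only show the trend.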

OK, now to the dissection or post-mortem analysis:
- Is the test wrong?
- One solution is to use a oneway to the admin from the server (I have
tested that and it works), but is that always feasible?
- Another solution could be to use a RW connection handler on the
server.
- Is something wrong with our ORB, I mean the connection purging,
resuming handles, etc.?
- Where do such problems need to be solved: at the ORB level (by doing
some intelligent stuff, I don't know what now :() or at the application
level (by choosing the right configurations)?

Any type of input on this would be great.
Comment 1 Nanbor Wang 2001-07-22 09:24:06 CDT
For the time being, I guess I am going to take a simple way out. I am going to 
change the operation request_received () in the interface 
Oneway_Buffering_Admin to a oneway. I know I am not solving the problem, just 
trying to get around it. The question still remains, i.e., is this an ORB 
problem, and where should it be dealt with?
Comment 2 Nanbor Wang 2001-07-22 16:57:44 CDT
Some more ideas to solve this problem. We could use the Muxed connection 
strategy too. There is an opinion that the default setting in TAO should change 
to the Muxed strategy from the Exclusive strategy used now. The Exclusive 
strategy wastes lots of resources. I guess it is a good change to make after 
TAO 1.2 goes out.
Comment 3 Irfan Pyarali 2001-07-30 12:12:57 CDT
Accepted for tao-support.
Comment 4 Nanbor Wang 2001-09-25 16:16:39 CDT
This bug is not valid. We have come a long way from this. We are using muxed
connections, which solves the problems to an extent. But we run out of stack
space if there are a large number of iterations. The next biggest problem is
that the TP_Reactor does not treat the handles in a fair manner. Please refer
to 1031 for the fixes that we plan to do for this and more.
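
Why muxed connections help only to an extent can be seen in a small toy
model. This is a sketch under simplifying assumptions and not TAO code:
MuxStrategy, MuxResult and simulate_mux are hypothetical names, the client
handle is assumed to be resumed before every upcall, and pending oneways are
assumed to be dispatched before any reply from the admin is read.

```cpp
#include <algorithm>

// Toy model only; these names are illustrative and not part of TAO.
enum class MuxStrategy { Exclusive, Muxed };

struct MuxResult {
  int connections;  // connections opened to the admin
  int max_depth;    // deepest nesting of upcalls on the single thread
};

namespace {

struct MuxModel {
  MuxStrategy strategy;
  int pending;          // oneways still waiting to be dispatched
  int in_use = 0;       // exclusive connections awaiting a reply
  int connections = 0;
  int depth = 0;
  int max_depth = 0;

  void upcall() {
    ++depth;
    max_depth = std::max(max_depth, depth);

    if (strategy == MuxStrategy::Muxed) {
      // All outstanding requests share one connection.
      if (connections == 0)
        connections = 1;
    } else {
      // Each outstanding request needs a connection of its own.
      ++in_use;
      connections = std::max(connections, in_use);
    }

    // While waiting for the reply, the thread is handed more oneways.
    while (pending > 0) {
      --pending;
      upcall();
    }

    if (strategy == MuxStrategy::Exclusive)
      --in_use;
    --depth;
  }
};

}  // namespace

MuxResult simulate_mux(MuxStrategy strategy, int oneways) {
  MuxModel model{strategy, oneways};
  while (model.pending > 0) {
    --model.pending;
    model.upcall();
  }
  return MuxResult{model.connections, model.max_depth};
}
```

In this model the Muxed strategy keeps the connection count at one, but the
nesting depth still equals the number of oneways, which matches the stack
growth seen with a large number of iterations.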
Comment 5 Nanbor Wang 2001-09-25 17:19:07 CDT
I am reopening the bug.  The reason is that we still see failures, though they
are very different from what was reported first. The problems are now with the
TP_Reactor. Adding a dependency on 1031 so that we can close this after we
close that.
Comment 6 Nanbor Wang 2001-09-25 17:19:34 CDT
Accepting the bug 
Comment 7 Nanbor Wang 2001-10-01 10:54:23 CDT
To whomsoever fixes this problem: Please revert this change "Tue Sep 25 17:40:08
2001  Balachandran Natarajan  <bala@cs.wustl.edu>". The change was put in
to lessen the noise caused by this problem.