We have a test in TAO that has been failing ever since I checked in my fixes for 575. After running in different directions (and staring at things continuously :( ), I finally got a chance to figure out why it is failing. Here are my findings. I would like everyone to go through this and let me know how to deal with the problem, and whether it is an application problem or an ORB problem.

OK, here is the outline.

- There is a process called "admin" that just monitors things.
- The server starts up.
- The client is started.
- There is some communication to sync up the server, admin and client (not of interest at this point).
- The client fires buffered oneways at the server with a good amount of payload.
- After receiving a call, the server turns to the admin to report how much payload it received (this is a two-way call).

Now to the actual problem. When the client requests start pouring in, the server turns around to make two-way calls on the admin. For each one, the server sends the call and then goes to wait () on the leader-follower (LF). Before the reply from the admin gets in, the reactor wakes up the thread and makes it act as a server to process more oneways (there is only one thread in every process). This makes the server establish new connections to the admin to send the payload information. The process repeats: more messages arrive and more connections to the admin are established. Somewhere along the line the wait () reads the messages and the stack unwinds a bit, but the new connections keep growing our connection cache.

After that there is havoc. On Linux we get a core dump (we are lucky if that doesn't happen), as things have become inconsistent; on Solaris things hang. It looks like there is memory corruption when we exceed the handle limits on Linux. We get SEGVs in assorted places. Sometimes the test hangs.

How did this work before 575? The answer is simple: we never resumed the handle during the upcall to the server (and in turn the remote call to the admin). We resumed the handle only after our upcall was done. So things were happy as we proceeded to process the next call, and so on.

OK, now to the dissection, or post-mortem analysis.

- Is the test wrong?
- One solution is to use a oneway to the admin from the server (I have tested that and it works), but is that always feasible?
- Another solution could be to use a RW connection handler on the server.
- Is something wrong with our ORB, I mean the connection purging, resuming handles, etc.?
- Where do such problems need to be solved: at the ORB level (by doing some intelligent stuff, I don't know what right now :( ) or at the application level (by choosing the right configurations)?

Any input on this would be great.
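The re-entrancy described in this comment can be illustrated with a toy model. This is only a sketch under stated assumptions, not TAO code: `ToyServer`, `SimResult` and `resume_during_upcall` are hypothetical names, no real ORB or reactor is involved, and the model assumes the worst case in which the admin's reply never arrives before the next oneway is dispatched. It also assumes that a connection with an outstanding reply cannot be reused by a nested call, so each nested upcall opens a new one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Toy model only: none of these names are TAO or ACE APIs.
struct SimResult {
    std::size_t peak_connections;  // connections to the admin opened in total
    std::size_t peak_nesting;      // deepest nested upcall on the single thread
};

class ToyServer {
public:
    ToyServer(std::size_t pending_oneways, bool resume_during_upcall)
        : pending_(pending_oneways), resume_(resume_during_upcall) {}

    SimResult run() {
        while (pending_ > 0) dispatch_oneway();
        return {connections_, peak_depth_};
    }

private:
    void dispatch_oneway() {
        assert(pending_ > 0);
        --pending_;
        ++depth_;
        peak_depth_ = std::max(peak_depth_, depth_);

        // Two-way call to the admin: reuse an idle connection if there is
        // one, otherwise open a new one (assumed non-shareable while busy).
        if (idle_ > 0) {
            --idle_;
        } else {
            ++connections_;
        }

        // Waiting for the reply. If the client handle was resumed before the
        // upcall, the same thread is handed more oneways while it waits.
        if (resume_) {
            while (pending_ > 0) dispatch_oneway();
        }

        // The reply is read and the connection becomes idle again.
        ++idle_;
        --depth_;
    }

    std::size_t pending_;
    bool resume_;
    std::size_t depth_ = 0;
    std::size_t peak_depth_ = 0;
    std::size_t connections_ = 0;
    std::size_t idle_ = 0;
};
```

In this model, `ToyServer(100, true).run()` ends with one connection and one stack level per queued oneway, while `ToyServer(100, false).run()` (handle resumed only after the upcall, as before 575) stays at a single connection and a single level. The real test unwinds partially when replies are read, so the actual growth is presumably less regular than this.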
For the time being, I guess I am going to take the simple way out: I am going to change the operation request_received () in the interface Oneway_Buffering_Admin to a oneway. I know this does not solve the problem and only gets around it. Still, the question remains: is this an ORB problem, and where should it be dealt with?
Some more ideas to solve this problem. We could use the Muxed connection strategy too. There is a suggestion to change the default setting in TAO from the Exclusive strategy to the Muxed strategy, since the Exclusive strategy wastes a lot of resources. I guess it would be a good change to make after TAO 1.2 goes out.
Accepted for tao-support.
This bug is not valid anymore. We have come a long way from this. We are now using muxed connections, which solve the problem to an extent, but we run out of stack space if there is a large number of iterations. The next biggest problem is that the TP_Reactor does not treat the handles in a fair manner. Please refer to 1031 for the fixes that we plan to make for this and more.
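Why muxed connections help only "to an extent" can be sketched with another toy model. Again, this is an illustration under assumptions and not TAO code: `MuxStrategy`, `Usage`, `NestedUpcallModel` and `simulate` are hypothetical names, and the model assumes every queued oneway is dispatched as a nested upcall before any reply from the admin is read.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Toy model only: these names are illustrative, not TAO identifiers.
enum class MuxStrategy { Exclusive, Muxed };

struct Usage {
    std::size_t connections;   // connections opened to the admin
    std::size_t peak_nesting;  // deepest nested upcall on the one thread
};

class NestedUpcallModel {
public:
    NestedUpcallModel(MuxStrategy strategy, std::size_t pending)
        : strategy_(strategy), pending_(pending) {}

    Usage run() {
        if (pending_ > 0) upcall();
        return usage_;
    }

private:
    void upcall() {
        assert(pending_ > 0);
        --pending_;
        usage_.peak_nesting = std::max(usage_.peak_nesting, ++depth_);

        if (strategy_ == MuxStrategy::Muxed) {
            // One shared connection carries every outstanding request.
            usage_.connections = 1;
        } else if (idle_ > 0) {
            --idle_;               // reuse an idle exclusive connection
        } else {
            ++usage_.connections;  // every busy connection forces a new one
        }

        // While waiting for the reply, the thread keeps taking oneways.
        while (pending_ > 0) upcall();

        if (strategy_ == MuxStrategy::Exclusive) ++idle_;
        --depth_;
    }

    MuxStrategy strategy_;
    std::size_t pending_;
    std::size_t depth_ = 0;
    std::size_t idle_ = 0;
    Usage usage_{0, 0};
};

Usage simulate(MuxStrategy strategy, std::size_t oneways) {
    return NestedUpcallModel(strategy, oneways).run();
}
```

In this model, `simulate(MuxStrategy::Muxed, n)` keeps the connection count at one, but `peak_nesting` still grows with `n` exactly as it does for `MuxStrategy::Exclusive`. That matches the observation that the connection cache stops growing while the stack can still be exhausted over many iterations.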
I am reopening the bug. The reason is that we still see failures, although they are very different from what was first reported. The problems are now with the TP_Reactor. Adding a dependency on 1031 so that we can close this after we close that.
Accepting the bug
To whomsoever fixes this problem: please revert the change "Tue Sep 25 17:40:08 2001 Balachandran Natarajan <bala@cs.wustl.edu>". That change was put in to lessen the noise caused by this problem.