Please report new issues at https://github.com/DOCGroup
Running the multithreaded Simple_Naming test causes exceptions and/or segfaults on the client side. To reproduce, start the Naming_Service and then run the client with the -m option:
  $ Naming_Service &
  $ ../tests/Simple_Naming/client -m 25
The problem does not occur on every run, but it is easily reproducible on Irix and on Linux egcs (bambuca). Run the client several times without restarting the server. I was not able to reproduce the problem purely on Solaris (client and server both on Solaris); however, if the server is started on Solaris and a client on Linux, the client eventually starts getting exceptions, and at that point running a client on Solaris reproduces the problem as well. The exceptions received are TRANSIENT with different errnos, connection refused being the most frequent.
I'll take care of this one; it is related to the problem of doing non-blocking writes and non-blocking connects in the ORB.
Simply accepting the bugs.
If bug 132 is resolved then this bug will very likely go away (to the extent that this problem can be avoided).
Changed the summary; it wasn't very clear what it meant.
I don't have time to work on this at any point in the foreseeable future. I'm returning the bug to the main pool; if anybody wants to take it, they are welcome to ask questions on how to approach the problem/enhancement.
Bugs moved to the tao-support placeholder.
*** Bug 609 has been marked as a duplicate of this bug. ***
The last time we tracked this down the kernel was indeed rejecting the request. In other words, the client was doing the right thing (calling connect ()), the server never heard of the request (so there is not much it can do), but the connect() call returned -1. BTW, the exception raised (TRANSIENT) really means "try again later", so the application is getting a good hint about what needs to be done.
Peter Crowther has provided us with a very important piece of intelligence:
Which Windows OS are you on? Certainly up to and including NT4SP6, and probably in Win2K as well, there's a mis-implementation of the TCP state machine, as follows:
When a listening socket (in this case the server) fills its listen queue from a large number of simultaneous connection requests, the standard requires that the server silently drops the excess and the client tries again on a retransmit. WinNT incorrectly sends a TCP RST on overflow, which resets the connection. That would almost certainly be reported as ECONNREFUSED on a Winsock client (I was testing this using a Linux client, so got better reporting).
Carlos: how deep is the backlog parameter set on a listen()? [Sorry; I know I should look, but I'm away from my copy of the source] NT4S can allow backlogs up to 200 before it starts throwing RSTs. Also, don't bother testing this on most UNIXes as they implement correct behaviour.
Two ideas to follow up on, based on Peter's comments:
1) Make the default backlog on NT larger (a lot larger).
2) Make the backlog configurable at run-time, maybe as an option to the IIOP_Factory or an option in the IIOP endpoint (a la priority=...?backlog=...).
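To make idea 2 concrete, here is a rough sketch of a run-time backlog option using plain POSIX sockets rather than ACE. Everything named here is an assumption for illustration: the helper names backlog_from_options and open_listener, the '&' separator between options, and the upper bound of 65535 are not taken from TAO.

```cpp
#include <cstdlib>
#include <string>

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper: pulls "backlog=N" out of an option string such as
// "priority=3&backlog=512". Returns `fallback` when the key is missing or
// its value is not a positive number. The '&' separator is an assumption.
int backlog_from_options(const std::string& options, int fallback) {
  const std::string key = "backlog=";
  std::string::size_type pos = 0;
  while ((pos = options.find(key, pos)) != std::string::npos) {
    // Only accept the key at the start of the string or right after '&'.
    if (pos == 0 || options[pos - 1] == '&') {
      const char* start = options.c_str() + pos + key.size();
      char* end = nullptr;
      const long value = std::strtol(start, &end, 10);
      if (end != start && value > 0 && value <= 65535) {
        return static_cast<int>(value);
      }
      return fallback;
    }
    pos += key.size();
  }
  return fallback;
}

// Hypothetical helper: opens a loopback listener on an ephemeral port with
// the requested backlog. Returns the socket descriptor, or -1 on failure.
// The OS may silently reduce a backlog larger than its own maximum.
int open_listener(int backlog) {
  const int fd = ::socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return -1;

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = 0;  // let the kernel pick a free port

  if (::bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof addr) != 0 ||
      ::listen(fd, backlog) != 0) {
    ::close(fd);
    return -1;
  }
  return fd;
}
```

A server could then call open_listener(backlog_from_options(endpoint_options, default_backlog)), where both arguments are whatever the real configuration provides. A bad or missing option falls back to the default rather than failing the endpoint, which seemed the less surprising choice.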
Here is another very cool hint from Peter:
> Would you happen to know if this is documented
> in the Microsoft knowledge base or something?
Q113576, if you read between the lines. Amusingly, they keep trying to increase the maximum depth without seeming to realise that it's the RST that's the problem. Anyone who's downloaded pages from a busy IIS may have been hit by this --- IIS3, in particular, was notorious for resetting a connection and causing the browser to report 'The connection with the server was reset'. This bug was one of the main causes of early IIS versions being regarded as unreliable.
> The default value is probably 25 or 15, but you can set it at compile
> time.
I'd be inclined to set it deeper. Correct behaviour of listen() is to reduce the backlog to the system's maximum value if a larger value is requested, although no doubt there are some brain-damaged stacks out there that won't do this and fail the request instead. I can vouch for the NT4S limit of 200, too --- I could throw 200 SYNs at a P200 in about 50ms and it would take them all, but it wouldn't if I took the limit up to about 205-6 (depending on how fast the application called accept()).
More info about this. The link to the issue in Microsoft's knowledge base is: http://support.microsoft.com/support/kb/articles/Q113/5/76.asp Notice that it happens on Win2k Server too. This is not Windows' fault, however; it is simply the case that ACE overrides the default backlog.
Removed the dependency on bug 132; this is not a blocking I/O issue, it seems related to OS limits. The fix is probably twofold:
1) Control the default backlog sizes on NT, making them as high as possible.
2) Give the user policies to retry the connection several times (non-blocking?) before reporting a TRANSIENT exception.
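For point 2, a minimal sketch of what a retry policy could look like, written in plain C++ with no ORB code. RetryPolicy and run_with_retry are hypothetical names, and the choice to retry only on ECONNREFUSED is an assumption based on the errno reported most often in this bug.

```cpp
#include <cerrno>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical policy type; nothing with this name exists in ACE or TAO.
struct RetryPolicy {
  int max_attempts;                 // total tries, including the first one
  std::chrono::milliseconds delay;  // pause between consecutive tries
};

// Calls `attempt` until it succeeds, fails with an error other than
// ECONNREFUSED, or the policy runs out of attempts. `attempt` follows the
// connect() convention: 0 on success, -1 with errno set on failure.
// Returns the number of attempts used on success, or -1 on failure, at
// which point the caller would report TRANSIENT.
int run_with_retry(const std::function<int()>& attempt,
                   const RetryPolicy& policy) {
  for (int i = 1; i <= policy.max_attempts; ++i) {
    errno = 0;
    if (attempt() == 0) return i;
    // Only a refused connection is treated as the listen-queue overflow
    // case; any other error is returned to the caller immediately.
    if (errno != ECONNREFUSED) return -1;
    if (i < policy.max_attempts) std::this_thread::sleep_for(policy.delay);
  }
  return -1;
}
```

In real use the callable would have to create a fresh socket for every attempt, since the state of a socket after a failed connect() is not something to rely on. A fixed delay keeps the sketch short; a growing delay may be kinder to a server whose queue is already full.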
*** Bug 970 has been marked as a duplicate of this bug. ***
We had a patch installed from one of our users. We should maybe test things thoroughly to see if it is still a problem.
Wed Sep 5 12:35:33 2001 Balachandran Natarajan <bala@cs.wustl.edu>