Bug 189 - ORB rejects multiple simultaneous connections
Summary: ORB rejects multiple simultaneous connections
Status: ASSIGNED
Alias: None
Product: TAO
Classification: Unclassified
Component: ORB
Version: 1.0
Hardware: All All
Importance: P2 normal
Assignee: DOC Center Support List (internal)
URL:
Duplicates: 970
Depends on: 1185
Blocks:
Reported: 1999-08-01 14:57 CDT by Marina Spivak
Modified: 2001-09-26 10:50 CDT
Description Marina Spivak 1999-08-01 14:57:25 CDT
Running a multithreaded Simple_Naming test causes exceptions and/or segfaults
on the client side.
Reproduce by starting the Naming_Service and then running the client with the
-m option:
$Naming_Service &
$../tests/Simple_Naming/client -m 25
The problem does not occur on every run, but it is easily reproducible on
Irix and on Linux egcs (bambuca). To reproduce, run the client several times
without restarting the server.  I was not able to reproduce the problem purely
on Solaris (both client and server on Solaris). However, if a server is started
on Solaris and a client on Linux, the client will eventually start getting
exceptions, and at that point running a client on Solaris will also reproduce
the problem.
The exceptions received are TRANSIENT with different errnos, connection refused
being the most frequent.
Comment 1 Carlos O'Ryan 1999-08-10 10:11:59 CDT
I'll take care of this one, it is related to the problem of doing non-blocking
writes and non-blocking connects in the ORB.
Comment 2 Carlos O'Ryan 1999-08-11 11:00:59 CDT
Simply accepting the bugs.
Comment 3 Carlos O'Ryan 1999-08-11 11:02:59 CDT
If bug 132 is resolved then this bug will very likely go away (to the extent
that this problem can be avoided).
Comment 4 Carlos O'Ryan 1999-09-03 15:45:59 CDT
Changed the summary; it wasn't very clear what it meant.
Comment 5 Carlos O'Ryan 2000-01-31 23:22:59 CST
I don't have time to work on this at any point in the foreseeable future.
I'm returning the bug to the main pool. If anybody wants to take it, they are
welcome to ask questions on how to approach the problem/enhancement.
Comment 6 Carlos O'Ryan 2000-02-08 13:12:59 CST
Bugs moved to the tao-support placeholder.
Comment 7 Carlos O'Ryan 2000-08-08 13:06:01 CDT
*** Bug 609 has been marked as a duplicate of this bug. ***
Comment 8 Carlos O'Ryan 2001-04-19 12:54:23 CDT
The last time we tracked this down, the kernel was indeed rejecting the
request.  In other words, the client was doing the right thing (calling
connect()), the server never heard of the request (so there is not much it can
do), but the connect() call returned -1.

BTW, the exception raised (TRANSIENT) really means "try again later", so the 
application is getting a good hint about what needs to be done.
Comment 9 Carlos O'Ryan 2001-04-20 11:44:52 CDT
Peter Crowther has provided us with a very important piece of intelligence:

Which Windows OS are you on?  Certainly up to and including NT4SP6, and
probably in Win2K as well, there's a mis-implementation of the TCP state
machine, as follows:

When a listening socket (in this case the server) fills its listen queue
from a large number of simultaneous connection requests, the standard
requires that the server silently drops the excess and the client tries
again on a retransmit.  WinNT incorrectly sends a TCP RST on overflow, which
resets the connection.  That would almost certainly be reported as
ECONNREFUSED on a Winsock client (I was testing this using a Linux client,
so got better reporting).

Carlos: how deep is the backlog parameter set on a listen()?  [Sorry; I know
I should look, but I'm away from my copy of the source]  NT4S can allow
backlogs up to 200 before it starts throwing RSTs.  Also, don't bother
testing this on most UNIXes as they implement correct behaviour.
Comment 10 Carlos O'Ryan 2001-04-20 11:55:34 CDT
Two ideas to follow up on, based on Peter's comments:

1) Make the default backlog on NT larger (a lot larger)
2) Make the backlog configurable at run-time, maybe an option to the
IIOP_Factory or an option in the IIOP endpoint (ala priority=...?backlog=...)
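
A minimal sketch of idea 2, written in Python rather than ACE/TAO C++ to keep
it short. Only the `backlog=` option name comes from the comment above; the
`parse_backlog` and `open_listener` helpers, the `&`-separated option string,
and the default of 200 are assumptions made for illustration and are not TAO's
actual endpoint syntax.

```python
import socket

# Assumed default, borrowed from the NT4S figure quoted earlier in this bug.
DEFAULT_BACKLOG = 200


def parse_backlog(options: str, default: int = DEFAULT_BACKLOG) -> int:
    """Extract a hypothetical 'backlog=N' entry from an '&'-separated option string."""
    for item in options.split("&"):
        key, _, value = item.partition("=")
        if key.strip() == "backlog" and value.strip().isdigit():
            return int(value)
    return default


def open_listener(host: str, port: int, options: str = "") -> socket.socket:
    """Create a listening TCP socket whose backlog is chosen at run time."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, port))
    # The kernel may still clamp this value to its own maximum.
    sock.listen(parse_backlog(options))
    return sock
```

For example, `open_listener("127.0.0.1", 0, "priority=5&backlog=512")` would
request a backlog of 512, while an option string without `backlog=` falls back
to the assumed default.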
Comment 11 Carlos O'Ryan 2001-04-20 16:27:16 CDT
Here is another very cool hint from Peter:

> Would you happen to know if this is documented
> in the Microsoft knowledge base or something?

Q113576, if you read between the lines.  Amusingly, they keep trying to
increase the maximum depth without seeming to realise that it's the RST
that's the problem.

Anyone who's downloaded pages from a busy IIS may have been hit by this ---
IIS3, in particular, was notorious for resetting a connection and causing
the browser to report 'The connection with the server was reset'.  This bug
was one of the main causes of early IIS versions being regarded as
unreliable.

> The default value is probably 25 or 15, but you can set it at compile
> time.

I'd be inclined to set it deeper.  Correct behaviour of listen() is to
reduce the backlog to the system's maximum value if a larger value is
requested, although no doubt there are some brain-damaged stacks out there
that won't do this and fail the request instead.

I can vouch for the NT4S limit of 200, too --- I could throw 200 SYNs
at a P200 in about 50ms and it would take them all, but it wouldn't if I
took the limit up to about 205-6 (depending on how fast the application
called accept()).
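
A quick way to check which behaviour a given stack has is to request a backlog
far larger than any plausible system maximum and see whether listen() fails.
This is only a sketch in Python; the helper name and the value 100000 are
arbitrary choices for illustration, and a successful call does not reveal what
the backlog was actually reduced to.

```python
import socket


def listen_accepts_oversized_backlog(requested: int = 100_000) -> bool:
    """Return True if listen() accepts a backlog well above the likely system maximum."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        sock.listen(requested)       # a well-behaved stack clamps this silently
        return True
    except OSError:
        # The stack rejected the request instead of reducing the backlog.
        return False
    finally:
        sock.close()
```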
Comment 12 Carlos O'Ryan 2001-04-21 20:24:08 CDT
More info about this. Here is the link to the issue in Microsoft's knowledge
base:

http://support.microsoft.com/support/kb/articles/Q113/5/76.asp

Notice that it happens on Win2k Server too. This is not Windows' fault,
however; it is simply the case that ACE overrides the default backlog.
Comment 13 Carlos O'Ryan 2001-04-24 13:52:11 CDT
Removed the dependency on bug 132. This is not a blocking I/O issue; it seems
to be related to OS limits.  The fix is probably twofold:

1) Control the default backlog sizes on NT, making them as high as possible.
2) Give the user policies to retry the connection several times (non-blocking?)
before reporting a TRANSIENT exception.
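
A rough sketch of item 2, in Python rather than inside the ORB. The function
name, the attempt count, and the exponential back-off are assumptions for
illustration, not an existing TAO policy; the sketch uses blocking connects
and retries only on "connection refused", the most frequent errno reported in
this bug.

```python
import socket
import time


def connect_with_retry(host, port, attempts=5, delay=0.1):
    """Retry a refused connect() a few times before giving up.

    Re-raises the last ConnectionRefusedError once all attempts have failed,
    which is the point where an ORB would report TRANSIENT.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return socket.create_connection((host, port), timeout=5.0)
        except ConnectionRefusedError as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Back off so the server has time to drain its listen queue.
                time.sleep(delay * (2 ** attempt))
    raise last_error
```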
Comment 14 Ossama Othman 2001-07-11 15:41:52 CDT
*** Bug 970 has been marked as a duplicate of this bug. ***
Comment 15 Nanbor Wang 2001-09-26 10:50:49 CDT
We had a patch installed from one of our users. We should probably test things
thoroughly to see if it is still a problem.

Wed Sep  5 12:35:33 2001  Balachandran Natarajan  <bala@cs.wustl.edu>