Bug 2935 - Too many connections opened to endpoint.
Summary: Too many connections opened to endpoint.
Status: RESOLVED FIXED
Alias: None
Product: TAO
Classification: Unclassified
Component: ORB (show other bugs)
Version: 1.6.1
Hardware: All All
: P3 major
Assignee: Adam Mitz
URL:
Depends on: 2934
Blocks: 3068 2755 3000 3057
  Show dependency tree
 
Reported: 2007-05-16 11:15 CDT by Dale Wilson
Modified: 2008-10-30 13:53 CDT (History)
1 user (show)

See Also:


Attachments
Dale's changes plus a few of my own. (89.56 KB, patch)
2008-04-30 09:40 CDT, Chad Elliott
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Dale Wilson 2007-05-16 11:15:10 CDT
This problem can show up in either a multi-threaded or a nested upcall situation.  I'm working with it in a nested upcall situation.
 	
A [simplified] description of a typical scenario is:

    * Incoming request #1 triggers Outgoing Request #A
    * Outgoing Request A discovers that there is no connection to the
      endpoint, so it starts to open Connection Alpha.
    * It drops into the leader/follower + reactor to continue to service 
      new events while waiting for Connection Alpha to complete.  At this 
      point it has not updated the transport cache
    * Incoming request #2 triggers Outgoing Request #B [A nested upcall]
    * Outgoing Request B discovers that there is no connection to the
      endpoint (Connection Alpha is invisible) so it opens Connection Beta
      and drops into the leader/follower + reactor.
    * Additional upcalls at this point would open additional connections
      but I won't complicate this scenario by describing the details.
    * Connection Alpha completes.  The call back from the reactor adds
      Connection Alpha to the transport cache marked as busy because the code
      expects to use it immediately.
    * Additional upcalls may occur at this point.  They will find Connection             
      Alpha but it is busy, so they will open yet, another new connection.
    * If the stack unwinds back to OutgoingRequest #B it discovers
      Connection Beta is still not complete so it continues to wait.  Notice
      that Connection Alpha is complete (but marked busy).
    * Connection Beta completes.  The call back from the reactor adds
      Connection Beta to the cache.
    * Outgoing request B can now be delivered over Connection Beta.
    * Outgoing request A can now be delivered over Connection Alpha.

After that the system operates "normally"

I have a test for this and am working on a fix -- sponsored by ATD.
Comment 1 Johnny Willemsen 2007-07-18 03:11:23 CDT
maybe this relates also to bug 2755
Comment 2 Johnny Willemsen 2007-07-23 14:16:06 CDT
added bug 3000 as dependency because that is also related to cache management
Comment 3 Dale Wilson 2007-08-01 13:42:21 CDT
Changes checked in for x.6

Wed Aug  1 15:54:01 UTC 2007  Dale Wilson  <wilsond@ociweb.com>
Comment 4 Johnny Willemsen 2007-08-21 03:14:17 CDT
reopened, code has been reverted from the repository because of test result drop. The regression test now fails again on head
Comment 5 Johnny Willemsen 2007-11-26 13:32:16 CST
Any update on a real fix of this issue (and related ones)?
Comment 6 Chad Elliott 2008-04-30 09:39:06 CDT
Reassigning this bug to me.
Comment 7 Chad Elliott 2008-04-30 09:40:12 CDT
Created attachment 967 [details]
Dale's changes plus a few of my own.

I plan on committing these changes very soon.
Comment 8 Johnny Willemsen 2008-05-01 01:47:12 CDT
I will look at the diff in detail next week, but it seems some ACE_TEXT are missing from the ACE_DEBUG lines
Comment 9 Chad Elliott 2008-05-01 07:20:43 CDT
(In reply to comment #8)
> I will look at the diff in detail next week, but it seems some ACE_TEXT are
> missing from the ACE_DEBUG lines
> 

Why would that be necessary?  There are two log methods on ACE_Log_Msg one that takes ACE_TCHAR* and one that takes ACE_ANTI_TCHAR* (when ACE_HAS_WCHAR is defined).  As long as I'm not printing anything that has wide characters in it, ACE_TEXT is not necessary.

If there is an actual reason why they need to be there that I don't know, I will be glad to go back and add them.
Comment 10 Johnny Willemsen 2008-05-01 12:47:02 CDT
Shouldn't there be ACE_TEXT around the strings passed to ACE_DEBUG? For example:

+
+                  ACE_DEBUG ((LM_DEBUG,
+                              "TAO (%P|%t) - Transport_Connector::connect, "
+                              "got an existing %s Transport[%d] in role %s\n",
+                              base_transport->is_connected () ? "connected" :
+                                                                "unconnected",
+                              base_transport->id (),
+                              cr == TAO::TAO_SERVER_ROLE ? "TAO_SERVER_ROLE" :
+                              cr == TAO::TAO_CLIENT_ROLE ? "TAO_CLIENT_ROLE" :
+                              "TAO_UNSPECIFIED_ROLE" ));

where there would be a ACE_TEXT around the messages. Also %s is for tchar strings, %C is for cstrings. I do know that this has been used incorrectly in the past.

Would it be ok to hold this after x.6.5, we want to fix the service gestalt issue reporter on the mailing list asap and release a new micro release then
Comment 11 Chad Elliott 2008-05-01 12:53:11 CDT
(In reply to comment #10)
> Shouldn't there be ACE_TEXT around the strings passed to ACE_DEBUG? For
> example:
> 
> +
> +                  ACE_DEBUG ((LM_DEBUG,
> +                              "TAO (%P|%t) - Transport_Connector::connect, "
> +                              "got an existing %s Transport[%d] in role %s\n",
> +                              base_transport->is_connected () ? "connected" :
> +                                                                "unconnected",
> +                              base_transport->id (),
> +                              cr == TAO::TAO_SERVER_ROLE ? "TAO_SERVER_ROLE" :
> +                              cr == TAO::TAO_CLIENT_ROLE ? "TAO_CLIENT_ROLE" :
> +                              "TAO_UNSPECIFIED_ROLE" ));
> 
> where there would be a ACE_TEXT around the messages. Also %s is for tchar
> strings, %C is for cstrings. I do know that this has been used incorrectly in
> the past.

As long as wide characters are not involved at all, ACE_TEXT is not necessary.

> 
> Would it be ok to hold this after x.6.5, we want to fix the service gestalt
> issue reporter on the mailing list asap and release a new micro release then
> 

I don't see a problem with that.
Comment 12 Johnny Willemsen 2008-06-25 05:50:10 CDT
Is this now in svn? 
Comment 13 Chad Elliott 2008-06-25 06:40:55 CDT
This is now in the repository with the following ChangeLog entries.

ACE:
Tue May 20 17:55:39 UTC 2008  Chad Elliott  <elliott_c@ociweb.com>
Mon Jun 16 19:06:44 UTC 2008  Chad Elliott  <elliott_c@ociweb.com>

TAO
Tue May 20 18:05:08 UTC 2008  Chad Elliott  <elliott_c@ociweb.com>
Thu May 22 11:43:44 UTC 2008  Chad Elliott  <elliott_c@ociweb.com>
Fri May 30 13:21:20 UTC 2008  Chad Elliott  <elliott_c@ociweb.com>
Tue Jun  3 15:15:10 UTC 2008  Chad Elliott  <elliott_c@ociweb.com>
Thu Jun  5 11:56:09 UTC 2008  Chad Elliott  <elliott_c@ociweb.com>
Thu Jun  5 18:02:10 UTC 2008  Chad Elliott  <elliott_c@ociweb.com>
Wed Jun 11 15:54:42 UTC 2008  Chad Elliott  <elliott_c@ociweb.com>
Wed Jun 11 19:07:03 UTC 2008  Chad Elliott  <elliott_c@ociweb.com>
Wed Jun 18 13:22:32 UTC 2008  Chad Elliott  <elliott_c@ociweb.com>
Wed Jun 18 14:39:38 UTC 2008  Chad Elliott  <elliott_c@ociweb.com>
Comment 14 Johnny Willemsen 2008-07-30 06:51:52 CDT
reopen, this seems to have a too high fail rate for the test stats at this moment:

2008_07_30      23     7.0       69.57  TAO   TAO/tests/Bug_2935_Regression
Comment 15 Adam Mitz 2008-08-07 14:07:53 CDT
(In reply to comment #14)
> reopen, this seems to have a too high fail rate for the test stats at this
> moment:
> 
> 2008_07_30      23     7.0       69.57  TAO   TAO/tests/Bug_2935_Regression
> 

Latest fix has been committed.  I'll check the scoreboard tomorrow to see if we can close this bug.

Thu Aug  7 17:46:45 UTC 2008  Adam Mitz  <mitza@ociweb.com>

        This commit fixes Bug 2935 (again) and resolves other transport cache
        management issues.  The key issue of 2935 is that the transport is
        cached while the connection is in progress, so that other threads see
        that there's a connection in progress and don't start their own.
        That feature is essentially restoring one of the changes from
        Tue May 20 18:05:08 UTC 2008  Chad Elliott  <elliott_c@ociweb.com>
        that were reverted in
        Fri May 30 13:21:20 UTC 2008  Chad Elliott  <elliott_c@ociweb.com>

(see $TAO_ROOT/ChangeLog for more details)
Comment 16 Adam Mitz 2008-08-08 11:12:54 CDT
(In reply to comment #15)

> Latest fix has been committed.  I'll check the scoreboard tomorrow to see if we
> can close this bug.

There were test failures relating to non-blocking connects.  These tests failed:
TAO/tests/Bug_1476_Test/run_test.pl
TAO/tests/Timed_Buffered_Oneways/run_test.pl
TAO/tests/Blocking_Sync_None/run_test.pl
TAO/tests/Oneway_Buffering/run_message_count.pl
TAO/tests/Oneway_Buffering/run_buffer_size.pl
TAO/tests/Oneway_Buffering/run_timeout.pl
TAO/tests/Oneway_Buffering/run_timeout_reactive.pl
TAO/tests/AMI_Buffering/run_message_count.pl
TAO/tests/AMI_Buffering/run_buffer_size.pl
TAO/tests/AMI_Buffering/run_timeout_reactive.pl

Should be fixed by:

Fri Aug  8 15:25:13 UTC 2008  Adam Mitz  <mitza@ociweb.com>

        * tao/Transport_Cache_Manager.cpp:

          Fixed whitespace.

        * tao/Transport_Connector.cpp:

          Add the transport to the cache (ENTRY_CONNECTING) in the non-blocked
          case.  This covers non-blocking invocations such as AMI and oneways
          with SYNC_NONE, etc.
Comment 17 Adam Mitz 2008-10-30 13:53:16 CDT
Should have been closed when x.6.6 was released.