Bug 3626

Summary: Corba client start to leak when TCP connection failed.
Product: TAO Reporter: Steven Xie <steven.xie>
Component: ORBAssignee: DOC Center Support List (internal) <tao-support>
Status: NEW ---    
Severity: normal CC: elliott_c, milan.cvetkovic, mitza, steven.xie
Priority: P3    
Version: 1.6.8   
Hardware: x86   
OS: Solaris   
Attachments: Patch for connection handler leakage on Solaris platform.
Proposed patch to $TAO_ROOT/tao/Connection_Handler.cpp

Description Steven Xie 2009-03-16 14:21:51 CDT
REPEAT BY:
I made a simple change here to reproduce it.
mpathix/home/sx/ace-5.6.8/TAO/tests/Hello$ diff client.cpp client.cpp.ori
48c48
<       Test::Hello_var hello = Test::Hello::_unchecked_narrow(tmp.in ());
---
 >       Test::Hello_var hello = Test::Hello::_narrow(tmp.in ());
57,60c57
< for (;;)
< {
< try
< {
---
 >
65,71d61
< }
< catch(const CORBA::Exception& ex)
< {
<       printf("ignore corba exception\n");
<       continue;
< }
< }
./client -k corbaloc:iiop:fsa:9999/whatever

It looks like  TAO_IIOP_Connection_Handler is leaked somehow.
I can see handler's ref_count down to 1, but handler never got removed after the connection failure.
I can't reproduce it with old 5.5.1. So it seems to be introduced between 5.5.1 and 5.6.8.
Comment 1 Johnny Willemsen 2009-03-16 14:28:28 CDT
Adam, could this be related to all changes of last summer?

Steven, can you retest with 1.6.5, 1.6.6 and 1.6.7, then we know better where this got introduced?
Comment 2 Adam Mitz 2009-03-16 14:32:47 CDT
Steven, please recompile libTAO with -DTAO_LOG_CH_REF_COUNTS and run with -ORBDebugLevel 10.  This will give you full logging of all connection handler reference count activity.
Comment 3 Steven Xie 2009-03-16 14:55:29 CDT
ignore corba exception
ACE (19535|1) SCG:<ctor=80475e0> - config=80685f0 repo=8068668 superceded by repo=8068668
ACE (19535|1) DSB::instance, repo=8068668, name=TAO_ORB_Core_Static_Resources type=8067290 => 806c3a8
TAO (19535|1) - Transport_Cache_Manager_T::fill_set_i: current_size = 0, cache_maximum = 2048
TAO (19535|1) - Transport_Cache_Manager_T::purge: Cache size after purging is [0]
TAO (19535|1) - IIOP_Connector::begin_connection, to <fsa:9999> which should block
TAO (19535|1) - TAO_LF_CH_Event[0]::state_changed_i state 2 prev 0
cons TAO_IIOP_Connection_Handler
11111111111111111111111111
TAO (19535|1) - IIOP_Connection_Handler[137028904] ctor, this = 137012240
add 0x82aa410 2
TAO (19535|1) - IIOP_Connection_Handler[137028904]::add_reference, up to 2
TAO (19535|1) - TAO_LF_CH_Event[137028904]::state_changed_i state 6 prev 2
 0x82aa410
rraaaa  0x82aa410 1
TAO (19535|1) - IIOP_Connection_Handler[137028904]::remove_reference, down to 1
TAO (19535|1) - IIOP_Connector::make_connection, connection to <fsa:9999> failed (errno: Connection refused)
TAO (19535|1) - Transport_Connector::connect, make_connection failed
ACE (19535|1) SCG:<dtor=80475e0> - new repo=8068668
ignore corba exception
Comment 4 Steven Xie 2009-03-19 10:41:05 CDT
Here is what I found, 
int
TAO_Connection_Handler::close_handler (u_long flags)
{
.....
  // We only need to remove the reference from the transport if there
  // were connections pending at the time that the handler is closed or
  // the handler is being closed during a new connection.
  if (pending || ACE_BIT_DISABLED(flags, CLOSE_DURING_NEW_CONNECTION))
    transport->remove_reference ();
From ACE_Connector, the falg is set as CLOSE_DURING_NEW_CONNECTION.
In this case, remove_reference won't be called, which causes the leak.
Based on the comments, Should it be
 if (pending || ACE_BIT_ENABLED(flags, CLOSE_DURING_NEW_CONNECTION))
    transport->remove_reference ();
instead?
It works for me now. But I don't know why it's DISABLED checked? and what side effects will the change be?



Comment 5 Johnny Willemsen 2009-03-19 10:43:59 CDT
Added Chad, Chad, you added the lines Steven Xie points at, can you have a look?
Comment 6 Steven Xie 2009-04-09 08:39:18 CDT
Created attachment 1124 [details]
Patch for connection handler leakage on Solaris platform.
Comment 7 Steven Xie 2009-04-09 08:43:34 CDT
I tested this patch on Solaris x86/Linux platform. Both work fine.
Comment 8 Adam Mitz 2009-04-10 08:22:58 CDT
(In reply to comment #7)
> I tested this patch on Solaris x86/Linux platform. Both work fine.
> 

Does this mean you have run the entire TAO test suite on those platforms with this patch?
Comment 9 Steven Xie 2009-05-19 10:43:49 CDT
There's still  some problem. I only catch this issue on Solaris, it doesn't show up on linux.
Comment 11 milan.cvetkovic 2009-05-19 17:24:15 CDT
This can be reproduced on linux machines as well:

cat >block_connect.conf <<EOF
dynamic Client_Strategy_Factory Service_Object *
        libTAO.so.1.6.8:_make_TAO_Default_Client_Strategy_Factory ()
        "-ORBConnectStrategy blocked"
EOF

./client -ORBSvcConf block_connect.conf -ORBDebugLevel 10 \
 -k corbaloc:iiop:myhost:9999/whatever

after making sure there is no corba server running on myhost:9999
Comment 12 Steven Xie 2009-05-20 10:20:21 CDT
The issue is exposed when different system errors we got from OS(by default linux won't see this issue, but solaris will see it right away).
"pending" is always false based on the testings we did.
Neither ACE_BIT_DISABLED(flags, CLOSE_DURING_NEW_CONNECTION) or ACE_BIT_ENABLED(flags, CLOSE_DURING_NEW_CONNECTION) covers all the cases.
Is it ok to remove this condition check?
"if (pending || ACE_BIT_DISABLED(flags, CLOSE_DURING_NEW_CONNECTION))"

Comment 13 Steven Xie 2009-05-20 10:38:22 CDT
(In reply to comment #12)
> The issue is exposed when different system errors we got from OS(by default
> linux won't see this issue, but solaris will see it right away).
> "pending" is always false based on the testings we did.
> Neither ACE_BIT_DISABLED(flags, CLOSE_DURING_NEW_CONNECTION) or
> ACE_BIT_ENABLED(flags, CLOSE_DURING_NEW_CONNECTION) covers all the cases.
> Is it ok to remove this condition check?
> "if (pending || ACE_BIT_DISABLED(flags, CLOSE_DURING_NEW_CONNECTION))"
> 

for example, when connection failed on the same host, solaris will set errno=146,
when connection failed on remote host, errno will be 11.

Comment 14 milan.cvetkovic 2009-05-22 09:51:52 CDT
The fix Steven mentioned in comment #6 (reversing the modifications in

https://svn.dre.vanderbilt.edu/viewvc/Middleware/trunk/TAO/tao/Connection_Handler.cpp?r1=80603&r2=81742 

seems to work properly.

We have run all tests in TAO by executing

$ACE_ROOT/bin/auto_run_tests.pl -o -t

and found the exact same set of failed test cases with and without the patch.

This modification was entered as part of fixes for Bug 2935 based on the ReleaseDoc:

> Tue May 20 18:05:08 UTC 2008  Chad Elliott  <elliott_c@ociweb.com>

$TAO_ROOT/tests/Bug_2935_Regression does not fail.

Comment 15 milan.cvetkovic 2009-05-26 11:14:29 CDT
Created attachment 1155 [details]
Proposed patch to $TAO_ROOT/tao/Connection_Handler.cpp

The patch (by Steven.Xie at mpathix.com) used for full TAO test from Comment #14
Comment 16 Johnny Willemsen 2009-05-26 13:33:53 CDT
what about using ACE_BIT_ENABLED instead of DISABLED?

Comment 17 milan.cvetkovic 2009-05-26 13:57:48 CDT
That patch, although consistent with the comments in code, is not good.

It resulted with handle leakage in some scenarios. We never run the full TAO test, but we ran out of file handles in client pretty quickly when the server process was not running.
Comment 18 Steven Xie 2009-05-27 13:46:05 CDT
(In reply to comment #16)
> what about using ACE_BIT_ENABLED instead of DISABLED?
> 

As comments 12, it's all depends on which errno we got from OS. ACE_BIT_ENABLED still has leaks under certain senario. The only option for us is remove this check fully.
"if (pending || ACE_BIT_DISABLED(flags, CLOSE_DURING_NEW_CONNECTION))"
Comment 19 Johnny Willemsen 2009-05-27 13:50:13 CDT
Chad, any ideas as author of the patch?