Bug 3683

Summary: issue with error recovery with rw strategy
Product: TAO Reporter: Johnny Willemsen <jwillemsen>
Component: ORBAssignee: DOC Center Support List (internal) <tao-support>
Status: NEW ---    
Severity: normal    
Priority: P3    
Version: 1.6.9   
Hardware: All   
OS: Solaris   
Bug Depends on:    
Bug Blocks: 3682    

Description Johnny Willemsen 2009-05-29 05:01:10 CDT
memory leak with rw strategy
Comment 1 Johnny Willemsen 2009-05-29 07:17:55 CDT
When trying to send a very large message on Solaris we see this behaviour below. The client just does 1 call, the ORB seems to do a retry which loops forever. This is using the bug_3683_regression which just has been added to the repository, only echo_client_i is modified to pass a very long string.

74 20 61 62 72 65 61 73  74 20 61 62 72 69 64 67   t abreast abridg
65 20 61 62 72 69 64 67  6d 65 6e 74 20 61 62 72   e abridgment abr
6f 61 64 20 61 62 72 6f  67 61 74 65 20 61 62 72   oad abrogate abr
75 70 74 20 61 62 73 63  65 73 73 20 61 62 73 63   upt abscess absc
69 73 73 61 20 61 62 73  63 69 73 73 61 65 20 61   issa abscissae a
62 73 65 6e 63 65 20 61  62 73 65 6e 74 20 61 62   bsence absent ab
73 65 6e 74 65 65 20 61  62 73 65 6e 74 65 65 69   sentee absenteei
73 6d 20 61 62 73 65 6e  74 69 61 20 61 62 73 65   sm absentia abse
6e 74 6d 69 6e 64 65 64  20 61 62 73 69 6e 74 68   ntminded absinth
65 20 61 62 73 6f 6c 75  74 65 20 61 62 73 6f 6c   e absolute absol
75 74 69 6f 6e 20 61 62  73 6f 6c 76 65 20 61 62   ution absolve ab
TAO (4557|1) - Transport[9]::cleanup_queue, byte_count = 65536
TAO (4557|1) - Transport[9]::cleanup_queue, after transfer, bc = 0, all_sent = 0, ml = 761248
TAO (4557|1) - Transport[9]::drain_queue_helper, byte_count = 65536, head_is_empty = 0
TAO (4557|1) - Transport[9]::drain_queue_i, helper retval = 1
TAO (4557|1) - Transport[9]::send_synchronous_message_i, error while scheduling flush - Not enough space
TAO (4557|1) - Connection_Handler[9]::close_connection_eh, purging entry from cache
TAO (4557|1) - Transport_Cache_Manager_T::mark_connected, false Transport[9]
TAO (4557|1) - Transport[9]::cleanup_queue_i, cleaning up complete queue
TAO (4557|1) - Transport[9]::cleanup_queue_i, discarded 0 messages, 0 bytes.
TAO (4557|1) - Transport[9]::cleanup_queue_i, cleaning up complete queue
TAO (4557|1) - Transport[9]::cleanup_queue_i, discarded 0 messages, 0 bytes.
TAO (4557|1) - TAO_LF_CH_Event[9]::state_changed_i, state LFS_SUCCESS->LFS_CONNECTION_CLOSED
TAO (4557|1) - Connection_Handler[9]::close_connection_eh
TAO (4557|1) - IIOP_Transport[9]::send_message, write failure - Not enough space
TAO (4557|1) - Remote_Invocation::send_message, failure while sending message
TAO (4557|1) - Connection_Handler[9]::close_connection_eh, purging entry from cache
TAO (4557|1) - Transport[9]::cleanup_queue_i, cleaning up complete queue
TAO (4557|1) - Transport[9]::cleanup_queue_i, discarded 0 messages, 0 bytes.
TAO (4557|1) - Transport[9]::cleanup_queue_i, cleaning up complete queue
TAO (4557|1) - Transport[9]::cleanup_queue_i, discarded 0 messages, 0 bytes.
TAO (4557|1) - Connection_Handler[9]::close_connection_eh
TAO (4557|1) - Transport[9]::make_idle
TAO (4557|1) - Transport[9]::cleanup_queue_i, cleaning up complete queue
TAO (4557|1) - Transport[9]::cleanup_queue_i, discarded 0 messages, 0 bytes.
TAO (4557|1) - Invocation_Adapter::invoke_i, handling forwarded locations
TAO (4557|1) - Request timeout is 15768000000 milliseconds
TAO (4557|1) - Transport_Cache_Manager_T::fill_set_i, current_size = 0, cache_maximum = 128
TAO (4557|1) - Transport_Cache_Manager_T::purge, Cache size after purging is [0]
TAO (4557|1) - IIOP_Connector::begin_connection, to <css5:35878> which should block
TAO (4557|1) - TAO_LF_CH_Event[0]::state_changed_i, state LFS_IDLE->LFS_CONNECTION_WAIT
TAO (4557|1) - IIOP_Connection_Handler[8134192] ctor, this=612630
TAO (4557|1) - IIOP_Connection_Handler::open, The local addr is <136.225.108.120:44248>
TAO (4557|1) - IIOP_Connection_Handler::open, IIOP connection to peer <136.225.108.120:35878> on 9
TAO (4557|1) - Transport::post_open, tport id changed from 8134192 to 9
TAO (4557|1) - Transport[9]::post_open, cache_map_entry_ is 0
TAO (4557|1) - TAO_LF_CH_Event[9]::state_changed_i, state LFS_CONNECTION_WAIT->LFS_SUCCESS
TAO (4557|1) - Cache_IntId_T::Cache_IntId_T, this=fffffd7fffdfe2f0 Transport[9] is connected
TAO (4557|1) - Cache_IntId_T::recycle_state, ENTRY_UNKNOWN->ENTRY_IDLE_AND_PURGABLE Transport[9] IntId=fffffd7fffdfe2f0
TAO (4557|1) - Transport_Cache_Manager_T::bind_i, Transport[9] @ hash:index{-1998456674:0}
TAO (4557|1) - Transport_Cache_Manager_T::bind_i: Success Transport[9] @ hash:index{-1998456674:0}. Cache size is [1]
TAO (4557|1) - IIOP_Connector::make_connection, new connected connection to <css5:35878> on Transport[9]
TAO (4557|1) - Transport_Connector::connect, opening Transport[9] in TAO_CLIENT_ROLE
TAO (4557|1) - Transport_Cache_Manager_T::is_entry_available_i[9], true state is ENTRY_IDLE_AND_PURGABLE
TAO (4557|1) - Cache_IntId_T::recycle_state, ENTRY_IDLE_AND_PURGABLE->ENTRY_BUSY Transport[9] IntId=466560
TAO (4557|1) - Transport_Cache_Manager_T::find_i, Found available Transport[9] @hash:index {-1998456674:0}
TAO (4557|1) - Transport_Connector::connect, got an existing connected Transport[9] in role TAO_CLIENT_ROLE
TAO (4557|1) - Codeset_Manager_i::set_tcs, setting char translator (00010001)
TAO (4557|1) - Codeset_Manager_i::set_tcs, setting wchar translator (00010109)
TAO (4557|1) - Exclusive_TMS::request_id - <1>
TAO (4557|1) - Codeset_Manager_i::generate_service_context, using tcs_c <ISO8859_1> (00010001), tcs_w <UTF-16> (00010109)
TAO (4557|1) - GIOP_Message_Base::dump_msg, send GIOP message v1.2, 826772 data bytes, my endian, Type Request[1]
GIOP message - HEXDUMP 826784 bytes (showing first 912 bytes)
47 49 4f 50 01 02 01 00  94 9d 0c 00 01 00 00 00   GIOP............
03 00 00 00 00 00 00 00  31 00 00 00 14 01 0f 00   ........1.......
4e 55 50 00 00 00 1a 00  00 00 00 01 00 00 00 52   NUP............R
6f 6f 74 50 4f 41 00 63  68 69 6c 64 5f 70 6f 61   ootPOA.child_poa
00 00 00 00 00 01 00 00  00 45 63 68 6f 00 00 00   .........Echo...
0c 00 00 00 65 63 68 6f  5f 73 74 72 69 6e 67 00   ....echo_string.
Comment 2 Johnny Willemsen 2009-05-29 07:18:08 CDT
the problem doesn't appear when we don't use rw
Comment 3 Johnny Willemsen 2009-05-29 07:42:55 CDT
Seems we need to flush the data explicitly, but the flushing strategy is LF based by defalt. By using "static Client_Strategy_Factory "-ORBClientConnectionHandler RW"" we don't use LF, but the flushing strategy was not changed. Will retest with a different flushing strategy. if the flushing strategy gives the problem, the orb at least shouldn't keep retrying


Comment 4 Johnny Willemsen 2009-05-29 08:13:26 CDT
I think there are two bugs, first the flushing strategy can use LF and LF can't be used for clientconnection handling

Also, when the flush fails, we give a TAO_INVOKE_RESTART back to Invocation_Adapter::invoke_i. This assumes we got a location forward. The problem is that we only should do a location forward when for example Invocation_Adapter::invoke_twoway does return TAO_INVOKE_RESTART and we have a reply status indicating location forward. I think we need to pass this reply status back to Invocation_Adapter::invoke_i and only do a look while status == restart and we have a reply status location forward

    if (status == TAO_INVOKE_RESTART &&
        (synch.reply_status () == GIOP::LOCATION_FORWARD ||
         synch.reply_status () == GIOP::LOCATION_FORWARD_PERM))
Comment 5 Johnny Willemsen 2009-05-29 08:25:43 CDT
Invocation_Adapter::invoke_i is definitively wrong, we should check whether we really got a location forward or not. This would mean breaking several internal APIs
Comment 6 Johnny Willemsen 2009-06-02 02:07:59 CDT
Tue  Jun  2 06:47:43 UTC 2009  Johnny Willemsen  <jwillemsen@remedy.nl>

        * docs/Options.html:
          When we use ClientConnectionHandler RW we don't use the LF
          loop. In case the buffers are full, we use the flushing
          strategy for which the default is leader follower, but that
          will not work because we don't use the lf loop in this case.
          Added a note the ClientConnectionHandler RW needs a
          FlusingStrategy of blocking

        * tests/Bug_3683_Regression/Echo_Client_i.cpp
        * tests/Bug_3683_Regression/Echo_Client_i.h
        * tests/Bug_3683_Regression/run_test.pl
        * tests/Bug_3683_Regression/Simple_util.cpp
          Use a commandline argument -p to indicate the size of the
          string in megabytes.

        * tests/Bug_3683_Regression/svc.conf
          Set the FlushingStrategy to blocking
Comment 7 Johnny Willemsen 2009-06-02 02:09:02 CDT
this is more a configuration issue, RW and blocking flushing should be used together. if RW and lf flushing is used, it exhibits an issue with the error recovery in the orb