Bug 2755 - Transport Cache Manager: races in recache_transport and is_wakeup_useful
Summary: Transport Cache Manager: races in recache_transport and is_wakeup_useful
Status: RESOLVED WORKSFORME
Alias: None
Product: TAO
Classification: Unclassified
Component: ORB (show other bugs)
Version: 1.5.3
Hardware: x86 Linux
: P3 normal
Assignee: Chris Reed
URL:
Depends on: 3024 2935
Blocks:
  Show dependency tree
 
Reported: 2006-12-19 11:20 CST by Chris Reed
Modified: 2010-09-17 06:43 CDT (History)
3 users (show)

See Also:


Attachments
Potential fix; Note; I've changed the return value for purge_entry to return 1 if the entry is NULL - this is required for recache_*_transport to not cache an entry if another thread has already removed it and is about to re-add the transport. (5.88 KB, patch)
2006-12-19 11:22 CST, Chris Reed
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Chris Reed 2006-12-19 11:20:58 CST
Handling inputs on multiple concurrent incoming connections can lead to the TAO
Transport Cache getting a single transport being present at multiple indices in
the transport cache, which are un-purgable.

    SYNOPSIS:
A test that has JacORB open multiple (less than 100) connections to TAO and
sending requests on those connections concurrently leads to
Transport_Cache_Manager index growing to over 500, resulting in purge trying to
purge transports, but being unable to as most of them are marked as busy.
TAO then stops responding to any connections from that ORB.

Also, using:
 -ORBSvcConfDirective static Resource_Factory "-ORBMuxedConnectionMax 1"
causes TAO to crash, due to a race between make_idle and is_wakeup_useful.

    DESCRIPTION:
We are using bidirectional GIOP, and the
IIOP_Connection_Handler::process_listen_point_list method is calling
recache_transport.  In recache_transport, if two threads do this interleaved,
they can both call purge_entry, and then both call cache_transport, which adds
it to the cache twice.  Then only one cache entry is marked idle by
process_listen_point_list.

There is also a race condition between make_idle setting the
last_entry_returned_ pointer to a cache entry which is then re-cached (i.e. the
entry is deleted and a new entry created) and is_wakeup_useful which reads that
pointer.  Fixed so is_wakeup_useful performs a search in the
Transport_Cache_Manager (just like find_i)
Comment 1 Chris Reed 2006-12-19 11:22:19 CST
Created attachment 642 [details]
Potential fix; Note; I've changed the return value for purge_entry to return 1 if the entry is NULL - this is required for recache_*_transport to not cache an entry if another thread has already removed it and is about to re-add the transport.
Comment 2 Andy Guy 2007-02-10 11:22:25 CST
I applied this patch to 1.5.1 but have found further problems with JacORB 
clients, BiDir and MuxedConnectionMax = 1. JacORB repeatedly sends the BiDir 
service context meaning that recache_transport (or as it is now 
recache_idl_transport) is repeatedly called even when an outgoing request is 
being made by another thread. Both before and after the patch the incoming 
request with the unnecessary service context will try and make the transport 
idle even if it is in the busy state - leading to problems if there is more 
than one thread trying to make outgoing calls. Packets are then corrupted 
(usually a request is duplicated or dropped).

Given the transport will be in the idle state before recache is called for the 
first time just preserving the state should be enough and therefore fixing 
recache to do this and remove the make_idle call would be enough. It would 
probably also make sense to ignore further bidir service contexts once one has 
been set. My reading of the standard makes it unclear if further ones are 
allowed - they could change the set listening end points?
Comment 3 Andy Guy 2007-02-12 05:32:31 CST
There is another potential problem. If the bidir service context has more than 
one end-point then the following sequence can occur:

1. First end point is put into the transport cache and the transport is made 
idle.

2. An out going request is started setting the transport to busy.

3. Second end point is put into the cache & transport is made idle - but 2 is 
still using it.

4. A second out going request is start on the transport as it is idle.

The two going requests will then corrupted the transport.

So I believe both fixes are required. If transport is already in bidir mode 
don't do anything (this is purely for performance). Make recache_transport 
preserve the state and don't call make_idle.
Comment 4 Dale Wilson 2007-07-24 13:30:22 CDT
The fix for bug #2935 (see the dependency above) will probably also fix this problem although it takes a different approach.

When I check in the 2935 changes I'll post another comment here.  Chris and/or Andy: it would be great if you could rerun the test with JacORB when this fix is in the repo.  -- watch this space--

Dale
Comment 5 Johnny Willemsen 2007-08-06 07:39:47 CDT
back to reporter, please retest when x.6 is availabe later this month
Comment 6 Phil Mesnier 2007-08-14 13:01:18 CDT
Added a dependency on bug 3024. The transport cache refactoring takes care of at least the is_wakeup_useful race.
Comment 7 Adam Mitz 2008-08-07 14:34:43 CDT
2935 has been (hopefully) fixed.  I read the patch in the attachment but now the code in trunk has diverged significantly.  Andy/Chris is there a regression test available (even one that requires JacORB)?
(Note that 3024 is not resolved yet.)
Comment 8 Johnny Willemsen 2010-09-17 06:43:02 CDT
closing, no fix and recent testing with jacorb shows stable usage