Handling input on multiple concurrent incoming connections can leave the TAO Transport Cache with a single transport present at multiple indices, and those duplicate entries cannot be purged.

SYNOPSIS: A test in which JacORB opens multiple (fewer than 100) connections to TAO and sends requests on those connections concurrently causes the Transport_Cache_Manager index to grow to over 500 entries. Purging then tries to remove transports but cannot, because most of them are marked as busy, and TAO stops responding to any connections from that ORB. In addition, using:

-ORBSvcConfDirective static Resource_Factory "-ORBMuxedConnectionMax 1"

causes TAO to crash, due to a race between make_idle and is_wakeup_useful.

DESCRIPTION: We are using bidirectional GIOP, and IIOP_Connection_Handler::process_listen_point_list calls recache_transport. If two threads interleave in recache_transport, they can both call purge_entry and then both call cache_transport, which adds the transport to the cache twice; only one of the two cache entries is then marked idle by process_listen_point_list. There is also a race between make_idle, which sets the last_entry_returned_ pointer to a cache entry that may subsequently be re-cached (i.e. the entry is deleted and a new entry created), and is_wakeup_useful, which reads that pointer. The fix makes is_wakeup_useful perform a search in the Transport_Cache_Manager (just like find_i).
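To make the second race concrete, here is a minimal sketch (simplified placeholder types and names, not TAO's actual Transport_Cache_Manager API) of the difference between dereferencing a remembered entry pointer and re-searching the cache, which is what the fix makes is_wakeup_useful do, mirroring find_i:

```cpp
// Illustrative sketch only; Transport, Cache_Entry and Cache are
// placeholders, not TAO classes.
#include <map>
#include <mutex>

struct Transport { int id; };
struct Cache_Entry { Transport *transport; };

class Cache
{
public:
  // UNSAFE (the reported race): a concurrent recache can delete the
  // entry that last_entry_returned_ points at and create a new one,
  // so dereferencing the remembered pointer reads a dead entry.
  bool is_wakeup_useful_unsafe (Transport *t)
  {
    return last_entry_returned_ != nullptr
      && last_entry_returned_->transport == t;   // may be dangling
  }

  // FIXED: search the cache under the lock, as find_i does, so the
  // answer is based on a live entry rather than a remembered pointer.
  bool is_wakeup_useful (Transport *t)
  {
    std::lock_guard<std::mutex> guard (lock_);
    return entries_.find (t->id) != entries_.end ();
  }

private:
  std::mutex lock_;
  std::map<int, Cache_Entry> entries_;
  Cache_Entry *last_entry_returned_ = nullptr;
};
```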
Created attachment 642 [details]
Potential fix

Note: I've changed the return value of purge_entry so that it returns 1 if the entry is NULL. This is required so that recache_*_transport does not cache an entry when another thread has already removed it and is about to re-add the transport.
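A rough sketch of that idea follows; the real purge_entry and recache signatures in TAO differ, and the entry and cache types here are simplified placeholders:

```cpp
// Illustrative sketch only, not the attached patch itself.
#include <mutex>

struct Cache_Entry { /* ... */ };

class Cache
{
public:
  // Returns 1 when the entry is already null (another thread purged it
  // and is about to re-add the transport), 0 when this call removed it.
  int purge_entry (Cache_Entry *&entry)
  {
    std::lock_guard<std::mutex> guard (lock_);
    if (entry == nullptr)
      return 1;                 // another thread owns the re-add
    remove_from_index (entry);  // placeholder for the real unlinking
    entry = nullptr;
    return 0;
  }

  void recache_transport (Cache_Entry *&entry /*, transport */)
  {
    if (this->purge_entry (entry) != 0)
      return;                   // skip: another thread will re-cache it
    // cache_transport (...);    // re-add exactly once
  }

private:
  void remove_from_index (Cache_Entry *) {}
  std::mutex lock_;
};
```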
I applied this patch to 1.5.1 but have found further problems with JacORB clients, BiDir, and MuxedConnectionMax = 1. JacORB repeatedly sends the BiDir service context, which means recache_transport (or, as it is now, recache_idl_transport) is called repeatedly, even while an outgoing request is being made by another thread. Both before and after the patch, the incoming request carrying the unnecessary service context tries to make the transport idle even if it is in the busy state, which causes problems when more than one thread is making outgoing calls: packets are then corrupted (usually a request is duplicated or dropped). Since the transport is already in the idle state before recache is called for the first time, simply preserving the existing state should be sufficient; fixing recache to do that and removing the make_idle call would be enough. It would probably also make sense to ignore further BiDir service contexts once one has been set. My reading of the standard leaves it unclear whether further ones are allowed; they could change the set of listening end points.
There is another potential problem. If the BiDir service context has more than one end point, the following sequence can occur:

1. The first end point is put into the transport cache and the transport is made idle.
2. An outgoing request starts, setting the transport to busy.
3. The second end point is put into the cache and the transport is made idle, but the request from step 2 is still using it.
4. A second outgoing request starts on the transport because it appears idle.

The two outgoing requests then corrupt the transport. So I believe both fixes are required:

- If the transport is already in BiDir mode, do nothing (this is purely for performance).
- Make recache_transport preserve the current state and not call make_idle, as sketched below.
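A rough sketch of the second fix, preserving the state on recache rather than forcing the transport idle; the state enum, method names and where the state lives are simplified placeholders, not TAO's actual cache-state machinery:

```cpp
// Illustrative sketch only.
#include <mutex>

enum class Cache_State { IDLE, BUSY };

struct Transport
{
  Cache_State cache_state = Cache_State::IDLE;
};

class Cache
{
public:
  void recache_transport (Transport *t)
  {
    std::lock_guard<std::mutex> guard (lock_);
    // Carry the current state over to the new cache entry. A transport
    // that another thread is using for an outgoing call stays BUSY, so
    // a redundant BiDir service context (or a second listen point)
    // cannot hand the same connection to a second outgoing request.
    Cache_State preserved = t->cache_state;
    purge_entry_i (t);
    insert_entry_i (t, preserved);   // no unconditional make_idle here
  }

private:
  void purge_entry_i (Transport *) {}
  void insert_entry_i (Transport *t, Cache_State s) { t->cache_state = s; }
  std::mutex lock_;
};
```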
The fix for bug #2935 (see the dependency above) will probably also fix this problem, although it takes a different approach. When I check in the 2935 changes I'll post another comment here. Chris and/or Andy: it would be great if you could rerun the test with JacORB once this fix is in the repo. -- watch this space -- Dale
Back to reporter; please retest when x.6 is available later this month.
Added a dependency on bug 3024. The transport cache refactoring takes care of at least the is_wakeup_useful race.
2935 has been (hopefully) fixed. I read the patch in the attachment, but the code in trunk has now diverged significantly. Andy/Chris, is there a regression test available (even one that requires JacORB)? (Note that 3024 is not resolved yet.)
Closing; no fix, and recent testing with JacORB shows stable usage.