Bug 1622 - Race condition in TAO_SSLIOP_Endpoint can cause crash
Summary: Race condition in TAO_SSLIOP_Endpoint can cause crash
Status: NEW
Alias: None
Product: TAO
Classification: Unclassified
Component: SSLIOP Pluggable Protocol (show other bugs)
Version: 1.3
Hardware: x86 Windows 2000
: P5 normal
Assignee: DOC Center Support List (internal)
URL:
Depends on:
Blocks:
 
Reported: 2003-10-27 10:44 CST by david.kinder
Modified: 2006-04-21 12:15 CDT (History)
0 users

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description david.kinder 2003-10-27 10:44:50 CST
SYNOPSIS:

Instances of TAO_SSLIOP_Endpoint contain a reference to a 
TAO_SSLIOP_Credentials object. This credentials object is not protected by a 
lock. According to the comment in SSLIOP_Endpoint.h, this is so that the 
transport cache can be scanned without additional locking.

However, the SSLIOP connector replaces the credentials object in an endpoint in 
TAO_SSLIOP_Connector::ssliop_connect(). If the endpoint is in the transport 
cache and is accessed by another thread while the credentials object is being 
replaced, the application can dereference an invalid credentials pointer, and 
crash.

DESCRIPTION:

We have been running tests on our multithreaded distributed application, which 
is implemented using CORBA over SSLIOP. In one test, a very occasional crash is 
seen, which I believe is due to a race condition in the SSLIOP implementation.

The thread that crashes is always scanning the transport cache. The relevant 
parts of the top of the thread's call stack are:

  tao_ssliopd.dll!TAO_SSLIOP_Endpoint::hash()
  taod.dll!TAO_Base_Transport_Property::hash()
  taod.dll!TAO_Cache_ExtId::hash()
  taod.dll!ACE_Hash<TAO_Cache_ExtId>::operator()()
  ...
  taod.dll!TAO_Transport_Cache_Manager::bind_i()
  taod.dll!TAO_Transport_Cache_Manager::cache_idle_transport()

Analysis of the crashed application shows that the crash occurs when the 
TAO_SSLIOP_Endpoint::hash() attempts to dereference what it believes is a 
pointer to a TAO_SSLIOP_Credentials object. This pointer is invalid, leading to 
the failure.

Further analysis of the machine code generated for the hash() function shows 
that, as one of the first steps in the function, the credentials_ member 
variable of the TAO_SSLIOP_Endpoint object (i.e. the pointer to the credentials 
object) is loaded into a register. At this point, the pointer is valid. 
However, about 10 instructions later the pointer value in the register is 
dereferenced. At this point, the pointer is invalid, and does not match the 
value in the credentials_ member any more. The value in credentials_ has 
changed, not the value in the register.

The conclusion I drew from the above was that another thread was replacing the 
credentials object in the endpoint while the crashing thread was calling hash
(). The replacement of the credentials object led to the previous credentials 
object being deleted, even though the thread in the hash() method was using it.

Analysis of the other threads in the application shows that one other thread is 
attempting to make a call to a remote CORBA object via SSLIOP. The top of this 
thread's call stack is

  tao_ssliopd.dll!TAO_SSLIOP_Connector::ssliop_connect()
  tao_ssliopd.dll!TAO_SSLIOP_Connector::connect()
  taod.dll!TAO_GIOP_Invocation::perform_call()
  taod.dll!TAO_Default_Endpoint_Selector::endpoint_from_profile()
  taod.dll!TAO_Default_Endpoint_Selector::select_endpoint()
  taod.dll!TAO_GIOP_Invocation::start()
  taod.dll!TAO_GIOP_Twoway_Invocation::start()

The endpoint passed to ssliop_connect() is the endpoint of the local ORB, so it 
is present in the transport cache. Looking at the ssliop_connect() code, after 
setting up a service handler, the connector makes the following calls to set up 
this local endpoint:

  ssl_endpoint->qop (qop);
  ssl_endpoint->trust (trust);
  ssl_endpoint->credentials (credentials.in ());

It appears that the last of these lines is the problem. If this line is 
executed while another thread is in TAO_SSLIOP_Endpoint::hash(), the updating 
of the credentials object causes the thread in hash() to fail, as it is not 
expecting the credentials object to be replaced, and has no way of protecting 
against it.

As an aside, this problem was previously masked by bug #1508, which resulted in 
the credentials object being nil in almost all circumstances.

This was all tested with TAO 1.3.0, but looking at CVS the problem is still 
present in the latest version.

SAMPLE FIX/WORKAROUND:

Having got this far, I was intending to produce a patch to fix this problem. 
However, there is a problem: If the requirement is that the transport cache can 
be scanned without holding additional locks, there is no completely safe way to 
replace the credentials member of the SSLIOP endpoint. I can see a couple of 
possible approaches:

1) Don't call the credentials object in TAO_SSLIOP_Endpoint::hash(). This is 
the temporary solution I've adopted, and it does prevent the crash in our test 
harness. However, calling other TAO_SSLIOP_Endpoint methods that access the 
credentials object could equally well cause this crash, so it is hardly an 
ideal solution.

2) Hold a reference to the old credentials object in ssliop_connect(), and only 
delete the old credentials object at the end of the function. This should 
ensure that any function using the old credentials object will exit before the 
credentials object is deleted. However, it does not guarantee that this will 
happen: it just relies on functions that access the credentials object only 
doing so for a short time. It could also fail as an approach on any 
architecture that cannot read or write a pointer value as an atomic operation.

3) Don't update the credentials object in the endpoint in ssliop_connect() if 
one is already present in the endpoint. Since the endpoint in question is that 
of the local ORB, it isn't clear to me whether it would be valid for the 
credentials on this endpoint to ever change, once created.

4) Protect access to the credentials_ member of TAO_SSLIOP_Endpoint with a 
lock. This should solve the problem, but will mean locks being acquired while 
scanning the transport cache. That could be addressed by either not including 
the credentials hash in the endpoint hash, or by precomputing the credentials 
hash and storing it as a member in the endpoint, so that the lock would not 
need to be held while calculating the hash.

If someone could provide guidance on the approach to take, I will happily look 
into generating a patch to solve all this.
Comment 1 Ossama Othman 2003-11-06 15:08:57 CST
I'll look into the problem.
Comment 2 david.kinder 2004-02-09 05:01:28 CST
The caching of endpoint hash values implemented for 1.4.0 means that this 
problem doesn't show up any more. It is still theoretically possible that a 
race condition could occur though, if a method accessed the endpoint's 
credentials object while the SSLIOP Connector was replacing it.