Bug 1912

Summary: ReplicationManager will not update the IOGR of every member if one of the members is down
Product: TAO Reporter: Sebastien Roy <sroy>
Component: Fault Tolerance Service Assignee: Sebastien Roy <sroy>
Status: NEW ---    
Severity: normal    
Priority: P3    
Version: 1.4.1   
Hardware: x86   
OS: Windows 2000   

Description Sebastien Roy 2004-08-20 14:24:11 CDT
I have 2 replicated servers, each of them a member of the same group and 
registered in the replication manager. Server 1 is primary.
If I kill server 1 and try to set server 2 to primary (from an external application 
connected to the ReplicationManager), server 2 will not necessarily receive 
its updated IOGR via tao_update_object_group. This depends on the order of the 
members in the group.

When some group attributes change in the replication manager, it tries to 
notify all of its members with the new IOGR, in the 
PG_Object_Group::distribute_iogr() function. However, if the first member to be 
notified is the server that was killed, the TAO_UpdateObjectGroup::_narrow() 
function call will throw an exception that is trapped at a higher level than 
distribute_iogr(). Because of that, the other members will not be notified of 
the new IOGR.

There was a big comment chunk at this place in the code that looks related:
// Unchecked narrow means the member doesn't have to actually implement the TAO_UpdateObjectGroup interface
// PortableGroup::TAO_UpdateObjectGroup_var uog = PortableGroup::TAO_UpdateObjectGroup::_unchecked_narrow ( info->member_);
// but it doesn work: error message at replica is:
// TAO-FT (2996|976) - Wrong version information within the interceptor [1 | 0]
// TAO_Perfect_Hash_OpTable:find for operation 'tao_update_object_group' (length=23) failed
// back to using _narrow

On my end, I have put the narrow call in a try block, and when I catch an 
exception, I do "continue" to go on with the next member.
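
The workaround can be sketched roughly as below. This is a standalone illustration, not the actual TAO code: Member, narrow_member() and the std::vector of members are hypothetical stand-ins for the group's member info and for TAO_UpdateObjectGroup::_narrow(), and only the try/continue structure is meant to carry over to distribute_iogr().

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a group member; not a TAO type.
struct Member
{
  std::string name;
  bool alive;        // false simulates a killed server
  int iogr_version;  // last IOGR version this member received
};

// Hypothetical stand-in for TAO_UpdateObjectGroup::_narrow():
// throws when the member cannot be reached.
Member &narrow_member (Member &m)
{
  if (!m.alive)
    throw std::runtime_error ("cannot reach " + m.name);
  return m;
}

// Pushes the new IOGR version to every reachable member and
// returns how many members were updated.
std::size_t distribute_iogr (std::vector<Member> &members, int new_version)
{
  std::size_t updated = 0;
  for (Member &m : members)
    {
      try
        {
          Member &uog = narrow_member (m);
          uog.iogr_version = new_version;
          ++updated;
        }
      catch (const std::exception &)
        {
          // Unreachable member: skip it and go on with the next one,
          // so one dead member cannot block the others.
          continue;
        }
    }
  return updated;
}
```

With the dead member listed first, the second member still gets the new version, which is the case that fails without the try block.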

Thanks,
 Sébastien Roy
 Software Engineer
 Positron Public Safety Systems.
Comment 1 Clemens Krainer 2004-08-24 14:41:23 CDT
Hello,

I also experimented with the FT part of TAO last week and had the same problem.
In my opinion, the lines

   PortableGroup::TAO_UpdateObjectGroup_var uog = 
      PortableGroup::TAO_UpdateObjectGroup::_unchecked_narrow ( info->member_);

in distribute_iogr() should be the right code, because the invoked
tao_update_object_group() is a pseudo method handled by the
FT_ServerRequest_Interceptor.

In the method FT_ServerRequest_Interceptor::receive_request_service_contexts() the
version is checked before tao_update_object_group() can update the version
information. This leads to the "wrong version" problem.
I think that the first lines of the receive_request_service_contexts() method
should be:

  CORBA::String_var op = ri->operation (ACE_ENV_SINGLE_ARG_PARAMETER);
  // No version check if we receive new version information
  if (ACE_OS::strcmp (op.in (), "tao_update_object_group") == 0)
    {
       return;
    }



Kind regards,

Clemens.
Comment 2 Johnny Willemsen 2007-09-20 03:07:18 CDT
To the reporter: can you provide a regression test as a reproducer?