Bug 4060

Summary: [tao-bugs] Load Manager: After upgrading from 5.5.1 to 6.1.0 using round robin strategy in load manager causes load
Product: TAO
Reporter: Rory Crawford <rory.crawford>
Component: Load Balancer
Assignee: DOC Center Support List (internal) <tao-support>
Status: NEW ---    
Severity: normal    
Priority: P3    
Version: 2.1.0   
Hardware: x86   
OS: Linux   

Description Rory Crawford 2012-07-04 00:21:24 CDT
OVERVIEW:
    The load manager core dumps, and our application, which relies on the load manager, then stops functioning. The core dump occurs (on the listed platforms) only when running against a load manager compiled without debug. The core file produces a usable stack trace, which pointed us to the TAO_LB_RoundRobin::next_member function. When compiled with debug, the load manager returns a bad parameter exception instead of core dumping, but although it keeps running, no subsequent calls to it are serviced correctly.

    SYNOPSIS:
    After upgrading from 5.5.1 to 6.1.0, we immediately encountered core dumps from the load manager when round robin load balancing was taking place.

    DESCRIPTION:

    After further investigation, we tracked this down to the method TAO_LB_RoundRobin::next_member. The failure occurs when only a single location is provided and the method is called a second time.

    STEPS TO REPRODUCE:
    Use load manager with RoundRobin strategy and only provide a single location.

    WHY THE PROBLEM OCCURS:
    1st call to TAO_LB_RoundRobin::next_member
    Line 096: As this is the first time, the location_index_map_ does not have an entry, so the if clause is bypassed.
    Line 160: the id is added to location_index_map_ with the value 1.
    Line 163: copy_locations is called, which clears the last_locations_ vector and pushes the location onto the vector (i.e. effectively storing it at position 0).

    Note: this means that the last_locations_ vector has a single element stored at position 0.

    2nd call to TAO_LB_RoundRobin::next_member
    Line 096: id is found in the location_index_map_
    Line 098: variable i is set to 1 (as per the value in location_index_map_, see 1st call above)
    Line 110: loop variable k starts at 1 (as it is set from variable i).
    Line 114: string comparison core dumps (Linux, Solaris).

    Reason for core dump:

    Line 114: this->last_locations_[k][0].id.in()

    k is 1, but the last_locations_ vector has no element at position 1 (see the Note in the 1st call section above), so the string comparison reads past the end of the vector.
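
The mismatch between the stored index and the vector size can be illustrated with a small standalone sketch. It does not use TAO; RoundRobinState, after_first_call and second_call_index_is_valid are hypothetical names, and std::map / std::vector of std::string stand in for the real member types, which are assumptions here.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the strategy's bookkeeping (not the TAO types).
struct RoundRobinState {
  std::map<int, std::size_t> location_index_map;  // object group id -> stored index
  std::vector<std::string> last_locations;        // copy of the member locations
};

// Mimics the state described for the 1st call: the id is stored with the
// value 1, and the single location ends up at vector position 0.
RoundRobinState after_first_call(int group_id, const std::string& location) {
  RoundRobinState s;
  s.location_index_map[group_id] = 1;
  s.last_locations.clear();
  s.last_locations.push_back(location);
  return s;
}

// Returns true if the index the 2nd call would start from is a valid
// position in last_locations.
bool second_call_index_is_valid(const RoundRobinState& s, int group_id) {
  auto it = s.location_index_map.find(group_id);
  if (it == s.location_index_map.end()) {
    return true;  // id unknown: the lookup branch is bypassed, no indexing happens
  }
  return it->second < s.last_locations.size();
}
```

With a single location, the stored index (1) equals the vector size (1), so the check fails for the id that was registered in the first call.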

    REPEAT BY:
    As per above.

    SAMPLE FIX/WORKAROUND:
    We have temporarily resolved this issue by adding a bounds check to the loop condition on line 110:

    for (CORBA::ULong k = i; k > 0 && !found && k < last_locations_.size(); --k)

    Alternatively, using the load manager's Random strategy also works.
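
The effect of the extra bounds check in the workaround loop can be sketched outside TAO as follows. find_from is a hypothetical helper, the vector of std::string is an assumed simplification of last_locations_, and the loop body is illustrative only; just the loop condition mirrors the workaround.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: scans downward from start index i using the same
// loop condition as the workaround. Returns the matching index, or -1.
long find_from(const std::vector<std::string>& last_locations,
               std::size_t i,
               const std::string& wanted) {
  bool found = false;
  long match = -1;
  // The added clause k < last_locations.size() ends the loop before any
  // element at an out-of-range position is read.
  for (std::size_t k = i; k > 0 && !found && k < last_locations.size(); --k) {
    if (last_locations[k] == wanted) {
      found = true;
      match = static_cast<long>(k);
    }
  }
  return match;
}
```

Note that in this sketch an out-of-range start index ends the loop immediately rather than being clamped to the last valid position, so the single-location case performs no comparison at all.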