1647 – Message corruption with SSLIOP and TP_Reactor

Bug 1647 - Message corruption with SSLIOP and TP_Reactor

Summary: Message corruption with SSLIOP and TP_Reactor

Status:	RESOLVED FIXED

Alias:	None

Product:	TAO
Classification:	Unclassified
Component:	SSLIOP Pluggable Protocol (show other bugs)
Version:	1.3.5
Hardware:	All All

Importance:	P3 blocker
Assignee:	Ossama Othman

URL:

Duplicates (1):	2320 (view as bug list)
Depends on:
Blocks:

Reported:	2003-11-13 12:19 CST by Ossama Othman
Modified:	2007-02-19 13:01 CST (History)
CC List:	2 users (show)

See Also:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Ossama Othman 2003-11-13 12:19:45 CST

From Eric Malenfant <EMalenfant@interstarinc.com>:

    TAO VERSION: 1.3.1
    ACE VERSION: 5.3.1

    HOST MACHINE and OPERATING SYSTEM:
        Intel P4, Windows 2000 Pro
        Winsock 2

    TARGET MACHINE and OPERATING SYSTEM, if different from HOST: 
    <same>
    
    COMPILER NAME AND VERSION (AND PATCHLEVEL):
    MSVC 6, SP5

    AREA/CLASS/EXAMPLE AFFECTED:
    ACE_TP_Reactor and TAO_SSLIOP

    DOES THE PROBLEM AFFECT:
        COMPILATION? no
        LINKING? no
        EXECUTION? yes (see description below)

    SYNOPSIS:
ACE_TP_Reactor may end up dispatching same handle on two threads, causing
message corruption in TAO_SSLIOP.

    DESCRIPTION:
First, a little background:

- When ACE_Event_Handler::handle_input() returns > 0, the ACE_Reactor should
  call it again.

- The ACE_TP_Reactor does this with a straightforward "while" loop,
something
  like:
    int status;
    do{
        status = (event_handler->*callback) (handle);
    while (status > 0);

- With the ACE_TP_Reactor, ACE_Event_Handlers can resume their own handle
  before handle_input() returns.

- TAO_SSLIOP_Connection_Handler::handle_input() can return > 0 if its 
  pos_io_hook() detects that there is still data in openssl's buffer.


Now, consider this sequence:

(Thread A) is running the TP_Reactor, and wakes up as data arrives on an 
           SSLIOP connection.
(Thread A) calls TAO_SSLIOP_Connection_Handler, which reads a message, resumes
           the handle, and sees that there is data left in openssl's buffer.
(Thread B) enters the TP_Reactor and wakes up as more data arrived on the same
           SSLIOP connection
(Thread B) starts reading the message.
(Thread A) TAO_SSLIOP_Connection_Handler::handle_input() return 1. As a 
           consequence, TP_Reactor reinvokes it immediately.
-> We now have two threads reading the same handle simultaneously, potentially
   leading to very weird things. In our tests, we saw corrupted messages
   being dispatched to the servant.

    REPEAT BY:
We "discovered" that problem by running a "congestion" test where a 
multithreaded client sends an octet sequence (of approx. 500 bytes) 
and a long to a multithreaded server, using SSLIOP. 
The server sleeps for the time indicated by the second argument (the long).
We rapidly saw that the server was receiving "impossible" values for its
sleep time, and saw that this was because of message corruption on the
server side, because two threads were simultaneously reading on the same
connection.

    SAMPLE FIX/WORKAROUND:
Assuming that ACE_Event_Handler::handle_input() returning > 0 does not mean
"Call me again immediately", but more something like "Leave my handle in the

ready set so that I will be called again next time someone wants to dispatch
an event", we tought about doing just that, i.e., putting the Handler's handle
in the Reactor's ready set.

However, the TP_Reactor flushes the ready set when its "state_changed_"
member is set, as a result of "register/remove_handler()". Also, we felt that
this may cause fairness problems. So we expanded the idea a bit and added
another handle set to the TP_Reactor (called the "scheduled_set"). Handles of
Event_Handlers returning > 0 are put in that scheduled_set, and the 
TP_Reactor's dispatching loop was modified so that, when the ready_set is
empty, instead of immediately going to wait into select(), we:
- Look in the scheduled_set
- If it is not empty, we call select() with a zero timeout (so that other
  handles have a chance to have their events processed even if another
  handle stays a long time in the "scheduled_set")
- Merge the result from select() and the contents of the scheduled_set
  into the ready_set
- Dispatch events from the ready_set, removing the handles from the ready_set
  and the scheduled_set as they are dispatched.

Comment 1 Ossama Othman 2003-11-26 09:52:33 CST

Eric has some valid points.

Comment 2 zhangw 2006-04-12 16:45:33 CDT

This bug is fixed. See the ChangeLog.

Mon Mar 27 18:55:51 UTC 2006  Wallace Zhang  <zhangw@ociweb.com>

        * orbsvcs/tests/Security/MT_SSLIOP/run_test_harsh.pl:
        * tao/Connection_Handler.cpp:
        * tao/Resume_Handle.h:
        * tao/Resume_Handle.cpp:

          Merged in fixes from OCI 1.4a.
          Tue Feb 21 16:29:32 UTC 2006  Don Busch  <busch_d@ociweb.com>

           * tao/Connection_Handler.cpp
           * tao/Resume_Handle.h
           * tao/Resume_Handle.cpp
           * orbsvcs/tests/Security/MT_SSLIOP/run_test_harsh.pl

           RT8248(Bug 1647) is a race condition involving two threads
           active in the same connection handler at the same time.
           The race is fixed by
           ensuring that a connection handler that has allowed its handle
           to be resumed in the Reactor does not return "1" from
           handle_input. "1" is the Reactor's "call me back immediately"
           value. Essentially, you can't give up ownership of yourself twice
           --  you give up owner-ship when you resume the handle, so you
           can't ask to be called back immediately.  (The SSLIOP handler
           is the only one that ever returns 1, so that's the only handler
           in which this manifests itself)

           The additional test (run_test_harsh.pl) is a longer (~5 minute)
           version of the MT_SSLIOP test that fails without this change,
           but succeeds with it.

           Also moved the code for Ciju's "connection close" fix of
           "Fri Dec 16 14:40:54 2005" (this entry is moved from OCI 1.4a)
           from the Connection_Handler.cpp to the Resume_Handle.cpp.
             Fri Dec 16 14:40:54 2005  Ciju John  <john_c@ociweb.com>

             * tao/Connection_Handler.cpp:

               When the client closes its connections, it sends an EOF
               to the server. The sever reactor gets an event for that
               handler. It first suspends the handler and then dispatches
               a thread to handle_input().Some where down the lane, a read()
               is done on the handle and automatically the handle is resumed.
               At this point another thread CAN enter the reactor and
               since the handle being closed has been resumed, the reactor
               will add it to its fd_set. As expected the reactor gets an
               event on that handle and dispatches someone else to handle
               that event.
               What this patch does is prevent the closing handler from being
               resumed. Thus it never does get added to the reactor
               fd_set and the possibility of getting a nil handler is removed.

Comment 3 Johnny Willemsen 2007-02-19 13:01:06 CST

*** Bug 2320 has been marked as a duplicate of this bug. ***