Please report new issues athttps://github.com/DOCGroup
From Eric Malenfant <EMalenfant@interstarinc.com>: TAO VERSION: 1.3.1 ACE VERSION: 5.3.1 HOST MACHINE and OPERATING SYSTEM: Intel P4, Windows 2000 Pro Winsock 2 TARGET MACHINE and OPERATING SYSTEM, if different from HOST: <same> COMPILER NAME AND VERSION (AND PATCHLEVEL): MSVC 6, SP5 AREA/CLASS/EXAMPLE AFFECTED: ACE_TP_Reactor and TAO_SSLIOP DOES THE PROBLEM AFFECT: COMPILATION? no LINKING? no EXECUTION? yes (see description below) SYNOPSIS: ACE_TP_Reactor may end up dispatching same handle on two threads, causing message corruption in TAO_SSLIOP. DESCRIPTION: First, a little background: - When ACE_Event_Handler::handle_input() returns > 0, the ACE_Reactor should call it again. - The ACE_TP_Reactor does this with a straightforward "while" loop, something like: int status; do{ status = (event_handler->*callback) (handle); while (status > 0); - With the ACE_TP_Reactor, ACE_Event_Handlers can resume their own handle before handle_input() returns. - TAO_SSLIOP_Connection_Handler::handle_input() can return > 0 if its pos_io_hook() detects that there is still data in openssl's buffer. Now, consider this sequence: (Thread A) is running the TP_Reactor, and wakes up as data arrives on an SSLIOP connection. (Thread A) calls TAO_SSLIOP_Connection_Handler, which reads a message, resumes the handle, and sees that there is data left in openssl's buffer. (Thread B) enters the TP_Reactor and wakes up as more data arrived on the same SSLIOP connection (Thread B) starts reading the message. (Thread A) TAO_SSLIOP_Connection_Handler::handle_input() return 1. As a consequence, TP_Reactor reinvokes it immediately. -> We now have two threads reading the same handle simultaneously, potentially leading to very weird things. In our tests, we saw corrupted messages being dispatched to the servant. REPEAT BY: We "discovered" that problem by running a "congestion" test where a multithreaded client sends an octet sequence (of approx. 500 bytes) and a long to a multithreaded server, using SSLIOP. The server sleeps for the time indicated by the second argument (the long). We rapidly saw that the server was receiving "impossible" values for its sleep time, and saw that this was because of message corruption on the server side, because two threads were simultaneously reading on the same connection. SAMPLE FIX/WORKAROUND: Assuming that ACE_Event_Handler::handle_input() returning > 0 does not mean "Call me again immediately", but more something like "Leave my handle in the ready set so that I will be called again next time someone wants to dispatch an event", we tought about doing just that, i.e., putting the Handler's handle in the Reactor's ready set. However, the TP_Reactor flushes the ready set when its "state_changed_" member is set, as a result of "register/remove_handler()". Also, we felt that this may cause fairness problems. So we expanded the idea a bit and added another handle set to the TP_Reactor (called the "scheduled_set"). Handles of Event_Handlers returning > 0 are put in that scheduled_set, and the TP_Reactor's dispatching loop was modified so that, when the ready_set is empty, instead of immediately going to wait into select(), we: - Look in the scheduled_set - If it is not empty, we call select() with a zero timeout (so that other handles have a chance to have their events processed even if another handle stays a long time in the "scheduled_set") - Merge the result from select() and the contents of the scheduled_set into the ready_set - Dispatch events from the ready_set, removing the handles from the ready_set and the scheduled_set as they are dispatched.
Eric has some valid points.
This bug is fixed. See the ChangeLog. Mon Mar 27 18:55:51 UTC 2006 Wallace Zhang <zhangw@ociweb.com> * orbsvcs/tests/Security/MT_SSLIOP/run_test_harsh.pl: * tao/Connection_Handler.cpp: * tao/Resume_Handle.h: * tao/Resume_Handle.cpp: Merged in fixes from OCI 1.4a. Tue Feb 21 16:29:32 UTC 2006 Don Busch <busch_d@ociweb.com> * tao/Connection_Handler.cpp * tao/Resume_Handle.h * tao/Resume_Handle.cpp * orbsvcs/tests/Security/MT_SSLIOP/run_test_harsh.pl RT8248(Bug 1647) is a race condition involving two threads active in the same connection handler at the same time. The race is fixed by ensuring that a connection handler that has allowed its handle to be resumed in the Reactor does not return "1" from handle_input. "1" is the Reactor's "call me back immediately" value. Essentially, you can't give up ownership of yourself twice -- you give up owner-ship when you resume the handle, so you can't ask to be called back immediately. (The SSLIOP handler is the only one that ever returns 1, so that's the only handler in which this manifests itself) The additional test (run_test_harsh.pl) is a longer (~5 minute) version of the MT_SSLIOP test that fails without this change, but succeeds with it. Also moved the code for Ciju's "connection close" fix of "Fri Dec 16 14:40:54 2005" (this entry is moved from OCI 1.4a) from the Connection_Handler.cpp to the Resume_Handle.cpp. Fri Dec 16 14:40:54 2005 Ciju John <john_c@ociweb.com> * tao/Connection_Handler.cpp: When the client closes its connections, it sends an EOF to the server. The sever reactor gets an event for that handler. It first suspends the handler and then dispatches a thread to handle_input().Some where down the lane, a read() is done on the handle and automatically the handle is resumed. At this point another thread CAN enter the reactor and since the handle being closed has been resumed, the reactor will add it to its fd_set. As expected the reactor gets an event on that handle and dispatches someone else to handle that event. What this patch does is prevent the closing handler from being resumed. Thus it never does get added to the reactor fd_set and the possibility of getting a nil handler is removed.
*** Bug 2320 has been marked as a duplicate of this bug. ***