Bug 2895 - TP_Reactor breaks promise of dispatching socket event to only one thread
Summary: TP_Reactor breaks promise of dispatching socket event to only one thread
Status: RESOLVED FIXED
Alias: None
Product: ACE
Classification: Unclassified
Component: ACE Core (show other bugs)
Version: 5.5.7
Hardware: All All
: P3 enhancement
Assignee: DOC Center Support List (internal)
URL:
Depends on: 2395
Blocks:
  Show dependency tree
 
Reported: 2007-04-13 09:16 CDT by Joe Guan
Modified: 2007-05-14 01:55 CDT (History)
0 users

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Joe Guan 2007-04-13 09:16:20 CDT
Hi Joe,

I see your point about adding the additional check. The hard point is what
an additional check would mean for other users. I can't tell at this moment
whether it will do any harm or not. Maybe other people have made other
assumptions. Could you store your remarks and ideas in bugzilla, then we at
least don't loose this and we can then think a little bit whether this has
impact or not.

Johnny


>> Thanks for the information on the related bug.
>> 
>> Just looked through that bug entry and the fix put in ACE code,
>> it works with one assumption (or requirement):
>> 
>> in handle_nnn (nnn: input, output, exception, not including timeout)
>> NEVER call reactor's remove_handler. (as indicated in TP_Reactor.h).
>> 
>> 
>> Back in our case, because the piece of code was written well before
>> ACE introduces reference counting on event handler objects, it does
>> its own reference counting around calls like remove_handler in 
>> handle_input, thus breaks the "later" assumption. Unfortunately there
>> isn't any compile-time assertions (it could be hard to implement one)
>> or other ways for the enforcement of that assumption. (or out of my
>> awareness simply)
>> 
>> It may be worth for you to consider making stricter check when resume
>> a handle to ease that assumption (for event handler objects enabling
>> reference counting, this assumption could be totally 
>> eliminated without
>> compromising correctness) as I put in the "SAMPLE FIX" section, though
>> it has a little performance penalty (search and compare); 
>> besides, it's
>> the fastest fix in our case.
>> 
>> Thanks,
>> Joe
>> 
>> 
>> 
>> Johnny Willemsen wrote:
>
>>> > Hi,
>>> > 
>>> > Thanks for your PRF and the detailed synopsis. I thought 
>
>> this was fixed by
>
>>> > bug 2395 which is also referred to in the code. Do you 
>
>> return -1 from the
>
>>> > handle_input? If yes, can you file this issue in bugzilla (see
>>> > http://deuce.doc.wustl.edu/bugzilla/index.cgi) and have a look at
>>> > ACE_wrappers/examples/Reactor/TP_Reactor. This example is 
>
>> the reproducer
>
>>> > program for bug 2395 and I wonder what the difference is 
>
>> from your use case
>
>>> > and this test program. This reactor code is real critical 
>
>> path code and we
>
>>> > have to be 100% sure that this is a valid fix and doesn't 
>
>> break applications
>
>>> > that behave well before we can commit it.
>>> > 
>>> > Regards,
>>> > 
>>> > Johnny Willemsen
>>> > Remedy IT
>>> > Postbus 101
>>> > 2650 AC  Berkel en Rodenrijs
>>> > The Netherlands
>>> > www.theaceorb.nl / www.remedy.nl  
>>> > 
>>> > 
>>> > 
>>
>>>> >>-----Original Message-----
>>>> >>From: ace-users-bounces@cse.wustl.edu 
>>>> >>[mailto:ace-users-bounces@cse.wustl.edu] On Behalf Of Joe
>>>> >>Sent: Thursday, April 12, 2007 1:37 AM
>>>> >>To: ace-users@cse.wustl.edu
>>>> >>Subject: [ace-users] TP_Reactor breaks promise of ...
>>>> >>
>>>> >>ACE VERSION: 5.5.7
>>>> >>
>>>> >>HOST MACHINE and OPERATING SYSTEM:
>>>> >>   AMD Opteron(tm) Processor 252, 2.4GHz
>>>> >>   RHEL4, Linux 2.6.9-42.0.10.ELsmp #1 SMP
>>>> >>
>>>> >>TARGET MACHINE and OPERATING SYSTEM, if different from HOST:
>>>> >>
>>>> >>COMPILER NAME AND VERSION (AND PATCHLEVEL):
>>>> >>   g++ (GCC) 3.4.6 20060404 (Red Hat 3.4.6-3)
>>>> >>
>>>> >>THE $ACE_ROOT/ace/config.h FILE
>>>> >>   #include "ace/config-linux.h"
>>>> >>
>>>> >>
>>>> >>THE $ACE_ROOT/include/makeinclude/platform_macros.GNU FILE
>>>> >>   include $(ACE_ROOT)/include/makeinclude/platform_linux.GNU
>>>> >>
>>>> >>CONTENTS OF $ACE_ROOT/bin/MakeProjectCreator/config/default.features
>>>> >>     (used by MPC when you generate your own makefiles):
>>>> >>
>>>> >>AREA/CLASS/EXAMPLE AFFECTED:
>>>> >>   ace/TP_Reactor.cpp
>>>> >>
>>>> >>
>>>> >>DOES THE PROBLEM AFFECT:
>>>> >>   COMPILATION?  no
>>>> >>   LINKING? no
>>>> >>   On Unix systems, did you run make realclean first? yes
>>>> >>   EXECUTION? yes
>>>> >>   OTHER (please specify)?
>>>> >>
>>>> >>[Please indicate whether ACE, your application, or both are 
>
>> affected.]
>
>>>> >>   Application is affected.
>>>> >>
>>>> >>SYNOPSIS:
>>>> >>   ACE promises for socket events, the TP_Reactor would 
>>>> >>dispatch that to
>>>> >>   only one thread. This is broken under certain situation described
>>>> >>   below.
>>>> >>
>>>> >>DESCRIPTION:
>>>> >>   In our application, we saw two threads are responding to the same
>>>> >>   incoming message from one TCP connection:
>>>> >>
>>>> >>   1> connection 1 (e.g. sockfd = 12) is closed by peer, 
>
>> thread T1 is
>
>>>> >>      dispatched to handle this event with event handler object eh1
>>>> >>      (derived from ACE_Svc_Handler), so eh1->handle_input 
>
>> is invoked;
>
>>>> >>
>>>> >>   2> T1 is doing the cleaning, close connection 1 (fd=12);
>>>> >>      (as to why we close connection 1 at this time is 
>>>> >>mentioned later)
>>>> >>
>>>> >>   3> After dispatching, T1 returns to ACE, it's gonna 
>>>> >>acquire a token,
>>>> >>      then to resume_i (h=12), but right before acquire a 
>>>> >>token, context
>>>> >>      switch happens;
>>>> >>
>>>> >>   4> The peer application (the client) is creating a new 
>
>> connection,
>
>>>> >>      unfortunately, the new connection has fd=12 too; then the peer
>>>> >>      sends a message through this new connection; new event handler
>>>> >>      object eh2 is created;
>>>> >>
>>>> >>   5> T2 is dispatched with eh2, fd=12, T2 now enters 
>>>> >>eh2->handle_input,
>>>> >>      which means connection2 with fd=12 has already been 
>>>> >>suspended; now
>>>> >>      context switch happens to T1;
>>>> >>
>>>> >>   6> T1 tries to resume the handling of connection 1, 
>
>> which has fd=12
>
>>>> >>      too. Now though TP_Reactor has the knowledge of 
>
>> fd1/eh1, it just
>
>>>> >>      blindly resume connection2, since there is a eh2 
>
>> associated with
>
>>>> >>      it, this resumption would succeed.
>>>> >>
>>>> >>   7> Now, if T2 hasn't read data from connection 2, 
>
>> another thread T3
>
>>>> >>      could be dispatched with eh2, connection2 again at the 
>>>> >>same time.
>>>> >>
>>>> >>   Retrospect on step 2>, if connection is not closed until eh1 is
>>>> >>   destroyed (happens after connection resumption in step 
>
>> 6>), we are
>
>>>> >>   good, the above is not happening; but, in some applications, this
>>>> >>   is not acceptable (as in ours):
>>>> >>   -- after resumption, eh->remove_reference may not be able 
>>>> >>to destroy
>>>> >>      eh object, the actual connection time is undeterministic;
>>>> >>   -- the server application has to make each connection as 
>>>> >>short-lived
>>>> >>      as possible to serve enormous amount of clients;
>>>> >>   -- ACE makes an additional unnecessary assumption: app 
>
>> should well
>
>>>> >>      behave, close a conn before an eh object destroyed is 
>>>> >>not accpeted.
>>>> >>
>>>> >>   I hope I make some sense in this report.
>>>> >>
>>>> >>REPEAT BY:
>>>> >>
>>>> >>SAMPLE FIX/WORKAROUND:
>>>> >>in ace/TP_Reactor.cpp: line 629:
>>>> >>
>>>> >>   if (dispatch_info.event_handler_ != this->notify_handler_ &&
>>>> >>       dispatch_info.resume_flag_ ==
>>>> >>             ACE_Event_Handler::ACE_REACTOR_RESUMES_HANDLER)
>>>> >>     this->resume_i (dispatch_info.handle_);
>>>> >>
>>>> >>should be changed as:
>>>> >>
>>>> >>   if (dispatch_info.event_handler_ != this->notify_handler_ &&
>>>> >>       dispatch_info.resume_flag_ ==
>>>> >>             ACE_Event_Handler::ACE_REACTOR_RESUMES_HANDLER &&
>>>> >>       this->handler_rep_.find (dispatch_info.handle_) ==
>>>> >>             dispatch_info.event_handler_ )
>>>> >>   {
>>>> >>     this->resume_i (dispatch_info.handle_);
>>>> >>   }
>>>> >>
>>>> >>
>>>> >>Thanks,
>>>> >>Joe
>>>> >>
>>>> >>
Comment 1 Johnny Willemsen 2007-05-14 01:55:13 CDT
Fri May 11 21:49:01 UTC 2007  Ossama Othman  <ossama_othman at symantec dot com>

        * ace/TP_Reactor.cpp (dispatch_i):

          Return the number of event handlers dispatched (typically 1).
          Previously zero was returned regardless of whether or not
          socket/IO event handlers were dispatched.

          (post_process_socket_event):

          Only remove or resume the event handler used during the upcall.
          A different event handler may have been registered during the
          upcall if the handle was closed and then reopened, for example.
          Make sure we're removing and/or resuming the event handler
          used during the upcall.

        * tests/Reactor_Remove_Resume_Test.cpp:

          New test that verifies reactors only remove and/or resume the
          event handler used during the upcall.  It is generally only
          relevant for thread pool based reactors.

        * tests/run_test.lst:
        * tests/tests.mpc:

          Added new Reactor_Remove_Resume_Test.