Bug 3704 - Servers with blocking flushing strategy can loop with 100% CPU usage
Summary: Servers with blocking flushing strategy can loop with 100% CPU usage
Status: RESOLVED FIXED
Alias: None
Product: TAO
Classification: Unclassified
Component: ORB (show other bugs)
Version: 1.6.9
Hardware: All Solaris
: P3 normal
Assignee: Adam Mitz
URL:
Depends on:
Blocks: 3682
  Show dependency tree
 
Reported: 2009-06-18 10:41 CDT by Adam Mitz
Modified: 2009-06-26 15:45 CDT (History)
1 user (show)

See Also:


Attachments
Proposed patch (5.66 KB, patch)
2009-06-18 14:41 CDT, Adam Mitz
Details
Revised patch, using the new approach (20.62 KB, patch)
2009-06-25 15:27 CDT, Adam Mitz
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Adam Mitz 2009-06-18 10:41:22 CDT
By inspection all the parts of this bug are still present in the DOC 
group trunk.
- server-side sockets are non-blocking
- blocking flushing strategy just runs a loop around handle_output
- the OS can return -1/EWOULDBLOCK at any time for a non-blocking socket

Therefore TAO can be stuck in this loop at 100% CPU for an arbitrary 
length of time.

Proposed solution:
TAO_Block_Flushing_Strategy::flush_transport() and flush_message() should put the socket in blocking mode and restore the previous mode before they return.  This could be done in the "guard" style similar to mutex lock/unlock.
Comment 1 Adam Mitz 2009-06-18 14:41:02 CDT
Created attachment 1167 [details]
Proposed patch
Comment 2 Johnny Willemsen 2009-06-19 01:35:08 CDT
get_blocking_io can be const
Comment 3 Adam Mitz 2009-06-19 08:50:59 CDT
(In reply to comment #2)
> get_blocking_io can be const
> 

OK, I changed it in my local workspace.
Comment 4 Vladimir Zykov 2009-06-24 08:22:56 CDT
After the fix for this bug was reverted the fix for 3682 doesn't work any more. The problem is that send can return on solaris with errno=EAGAIN and in this case drain_queue_helper() returns 0 which leads to handle_output() returns 0 too and the reactor doesn't call this handle any more.
Comment 5 Adam Mitz 2009-06-24 08:40:38 CDT
(In reply to comment #4)
> After the fix for this bug was reverted the fix for 3682 doesn't work any more.

The fix for this bug was never committed, so it was also never reverted.

> The problem is that send can return on solaris with errno=EAGAIN and in this
> case drain_queue_helper() returns 0 which leads to handle_output() returns 0
> too and the reactor doesn't call this handle any more.

That should be fixed once this fix is indeed committed.  Do you see the 3682 regression test failing now (after June 18)?
Comment 6 Vladimir Zykov 2009-06-24 08:56:37 CDT
(In reply to comment #5)
> (In reply to comment #4)
> > After the fix for this bug was reverted the fix for 3682 doesn't work any more.
> 
> The fix for this bug was never committed, so it was also never reverted.

Sorry, I didn't read carefully the changelog for commit 85701 which mentions this bug. Actually the above commit broke the fix for 3682.

> 
> > The problem is that send can return on solaris with errno=EAGAIN and in this
> > case drain_queue_helper() returns 0 which leads to handle_output() returns 0
> > too and the reactor doesn't call this handle any more.
> 
> That should be fixed once this fix is indeed committed.  Do you see the 3682
> regression test failing now (after June 18)?
> 

3682 doesn't have own test but it's reproduced by Single_Read and AMH_Oneway tests. They are failing now.
Comment 7 Adam Mitz 2009-06-24 09:02:35 CDT
(In reply to comment #6)
> (In reply to comment #5)
> > (In reply to comment #4)
> > > After the fix for this bug was reverted the fix for 3682 doesn't work any more.
> > 
> > The fix for this bug was never committed, so it was also never reverted.
> 
> Sorry, I didn't read carefully the changelog for commit 85701 which mentions
> this bug. Actually the above commit broke the fix for 3682.
> 
> > 
> > > The problem is that send can return on solaris with errno=EAGAIN and in this
> > > case drain_queue_helper() returns 0 which leads to handle_output() returns 0
> > > too and the reactor doesn't call this handle any more.
> > 
> > That should be fixed once this fix is indeed committed.  Do you see the 3682
> > regression test failing now (after June 18)?
> > 
> 
> 3682 doesn't have own test but it's reproduced by Single_Read and AMH_Oneway
> tests. They are failing now.
> 

OK, I think those should be fixed once I commit this patch -- which should be today.  I will be checking all of the tests once I commit it but please let me know if I miss something.
Comment 8 Adam Mitz 2009-06-24 16:39:24 CDT
After discussing this with a colleague I'm going to try a different approach.  

The problem with the current proposed patch is that it sets the socket to blocking mode which will also impact the read side of the socket.  The read side is registered with the reactor, so if the flushing strategy were to set blocking mode in between the reactor noticing the handle is ready and handle_input making the read call, that read could block (depending on how much data is available) and the thread would be stuck in handle_input.

The new approach would be to use ACE::handle_write_ready() in the blocking flushing strategy.  It requires the ACE_HANDLE so it can't be used directly but will need to delegate to the connection handler.  I'll start working on the code for this.
Comment 9 Adam Mitz 2009-06-25 15:27:04 CDT
Created attachment 1169 [details]
Revised patch, using the new approach
Comment 10 Vladimir Zykov 2009-06-26 04:53:14 CDT
(In reply to comment #9)
> Created an attachment (id=1169) [details]
> Revised patch, using the new approach
> 

I checked this patch together with 3682. On solaris those two Single_Read and AMH_Oneway do work. I'm trying to check other tests I selected for testing 3682 on solaris.
Comment 11 Vladimir Zykov 2009-06-26 06:35:53 CDT
With this patch and maximum debug level I get sigill on solaris in line Transport.cpp:1104

          ACE_DEBUG ((LM_DEBUG,
              ACE_TEXT ("TAO (%P|%t) - Transport[%d]::drain_queue_i, ")
              ACE_TEXT ("helper retval = %d\n"),
              this->id (), retval));

and the same thing is in line Transport.cpp:1077

              ACE_DEBUG ((LM_DEBUG,
                 ACE_TEXT ("TAO (%P|%t) - Transport[%d]::drain_queue_i, ")
                 ACE_TEXT ("helper retval = %d\n"),
                 this->id (), retval));

retval is object now not int.
Comment 12 Adam Mitz 2009-06-26 06:41:02 CDT
(In reply to comment #11)
> With this patch and maximum debug level I get sigill on solaris in line
> Transport.cpp:1104
> 
> and the same thing is in line Transport.cpp:1077
> 
> retval is object now not int.
> 

Thanks, I'll fix those.
Comment 13 Adam Mitz 2009-06-26 15:45:40 CDT
Fri Jun 26 20:41:53 UTC 2009  Adam Mitz  <mitza@ociweb.com>

        * tao/Block_Flushing_Strategy.h:
        * tao/Block_Flushing_Strategy.cpp:
        * tao/Connection_Handler.h:
        * tao/Connection_Handler.inl:
        * tao/Connection_Handler.cpp:
        * tao/IIOP_Connection_Handler.h:
        * tao/IIOP_Connection_Handler.cpp:
        * tao/Transport.h:
        * tao/Transport.cpp:

          This resolves bug #3704.
          TAO_Transport::handle_output() now uses an enum return value to
          distinguish the EWOULDBLOCK case from other normal or error
          conditions.  The Block_Flushing_Strategy checks for EWOULDBLOCK
          and calls a new method, handle_write_ready(), on the Connection
          Handler.  The Connection Handler calls ACE::handle_write_ready()
          passing the specific handle.

        * tao/Strategies/DIOP_Connection_Handler.h:
        * tao/Strategies/DIOP_Connection_Handler.cpp:
        * tao/Strategies/SCIOP_Connection_Handler.h:
        * tao/Strategies/SCIOP_Connection_Handler.cpp:
        * tao/Strategies/SHMIOP_Connection_Handler.h:
        * tao/Strategies/SHMIOP_Connection_Handler.cpp:
        * tao/Strategies/UIOP_Connection_Handler.h:
        * tao/Strategies/UIOP_Connection_Handler.cpp:
        * orbsvcs/orbsvcs/HTIOP/HTIOP_Connection_Handler.h:
        * orbsvcs/orbsvcs/HTIOP/HTIOP_Connection_Handler.cpp:
        * orbsvcs/orbsvcs/PortableGroup/UIPMC_Connection_Handler.h:
        * orbsvcs/orbsvcs/PortableGroup/UIPMC_Connection_Handler.cpp:
        * orbsvcs/orbsvcs/PortableGroup/UIPMC_Mcast_Connection_Handler.h:
        * orbsvcs/orbsvcs/PortableGroup/UIPMC_Mcast_Connection_Handler.cpp:
        * orbsvcs/orbsvcs/SSLIOP/SSLIOP_Connection_Handler.h:
        * orbsvcs/orbsvcs/SSLIOP/SSLIOP_Connection_Handler.cpp:

          Added handle_write_ready() to the non-IIOP protocols.