Bug 2781

Summary: No support for never-blocking oneways
Product: TAO
Component: ORB
Reporter: dfaure
Assignee: dfaure
Status: NEW
Severity: major
Priority: P3
Version: 1.5.5
Hardware: PowerPC
OS: other
Attachments:
    automated test for many oneway calls to a busy receiver
    results on linux
    Mac results (this test is not really reliable if it sometimes fails due to timing reasons...)

Description dfaure 2007-01-25 03:22:36 CST
There doesn't seem to be a way to make notifications using oneway calls such 
that the sender never blocks - including when the receiver is busy and unable 
to receive calls for a long period of time. I'm looking for a real "fire and 
forget" notification that never blocks -- and if too many notifications are 
being sent out, it would even be OK to lose a few.

What is happening (with the default policies) is that after the sender sends 
around 1280 notifications, it blocks in write() when sending the next one, 
because the OS buffer for the socket is full.

#0  0x90035d04 in writev ()
#1  0x06401890 in ACE_OS::writev(int, iovec const*, int) (handle=42, 
iov=0xbfffe310, iovcnt=1) 
at /Developer/ACETAO-1.5/ACE_wrappers/ace/OS_NS_sys_uio.inl:50
#2  0x06401b9c in ACE_OS::sendv(int, iovec const*, int) (handle=42, 
buffers=0xbfffe310, n=1) 
at /Developer/ACETAO-1.5/ACE_wrappers/ace/OS_NS_sys_socket.inl:818
#3  0x062f4e30 in ACE::sendv(int, iovec const*, int, ACE_Time_Value const*) 
(handle=42, iov=0xbfffe310, iovcnt=1, timeout=0x0) at ACE.cpp:1812
#4  0x08f42a4c in ACE_SOCK_IO::sendv(iovec const*, unsigned long, 
ACE_Time_Value const*) const (this=0x140c4d0, iov=0xbfffe310, n=1, timeout=0x0) 
at /Developer/ACETAO-1.5/ACE_wrappers/ace/SOCK_IO.inl:126
#5  0x08dc47dc in TAO_IIOP_Transport::send(iovec*, int, unsigned long&, 
ACE_Time_Value const*) (this=0x140bd40, iov=0xbfffe310, iovcnt=1, 
bytes_transferred=@0xbfffe290, max_wait_time=0x0) at IIOP_Transport.cpp:71
#6  0x08e1ce0c in TAO_Transport::drain_queue_helper(int&, iovec*) 
(this=0x140bd40, iovcnt=@0xbfffe300, iov=0xbfffe310) at Transport.cpp:764
#7  0x08e1d204 in TAO_Transport::drain_queue_i() (this=0x140bd40) at 
Transport.cpp:881
#8  0x08e1c034 in TAO_Transport::send_message_block_chain_i(ACE_Message_Block 
const*, unsigned long&, ACE_Time_Value*) (this=0x140bd40, mb=0x7848828, 
bytes_transferred=@0xbfffe4ac) at Transport.cpp:480
#9  0x08e1dd6c in TAO_Transport::send_asynchronous_message_i(TAO_Stub*, 
ACE_Message_Block const*, ACE_Time_Value*) (this=0x140bd40, stub=0x140d020, 
message_block=0x7848828, max_wait_time=0xbfffe8b0) at Transport.cpp:1122
#10 0x08e1dbe8 in TAO_Transport::send_message_shared_i(TAO_Stub*, int, 
ACE_Message_Block const*, ACE_Time_Value*) (this=0x140bd40, stub=0x140d020, 
message_semantics=0, message_block=0x7848828, max_wait_time=0xbfffe8b0) at 
Transport.cpp:1074
#11 0x08dc4cc4 in TAO_IIOP_Transport::send_message_shared(TAO_Stub*, int, 
ACE_Message_Block const*, ACE_Time_Value*) (this=0x140bd40, stub=0x140d020, 
message_semantics=0, message_block=0x7848828, max_wait_time=0xbfffe8b0) at 
IIOP_Transport.cpp:200
#12 0x08dc4b8c in TAO_IIOP_Transport::send_message(TAO_OutputCDR&, TAO_Stub*, 
int, ACE_Time_Value*) (this=0x140bd40, stream=@0x7848828, stub=0x140d020, 
message_semantics=0, max_wait_time=0xbfffe8b0) at IIOP_Transport.cpp:166
#13 0x08dc4a98 in TAO_IIOP_Transport::send_request(TAO_Stub*, TAO_ORB_Core*, 
TAO_OutputCDR&, int, ACE_Time_Value*) (this=0x140bd40, stub=0x140d020, 
orb_core=0x780e400, stream=@0x7848828, message_semantics=0, 
max_wait_time=0xbfffe8b0) at IIOP_Transport.cpp:144
#14 0x08df8818 in TAO::Remote_Invocation::send_message(TAO_OutputCDR&, short, 
ACE_Time_Value*) (this=0xbfffe810, cdr=@0x7848828, message_semantics=0, 
max_wait_time=0xbfffe8b0) at Remote_Invocation.cpp:161
#15 0x08e01d6c in TAO::Synch_Oneway_Invocation::remote_oneway(ACE_Time_Value*) 
(this=0xbfffe810, max_wait_time=0xbfffe8b0) at Synch_Invocation.cpp:723
#16 0x08dc7c84 in 
TAO::Invocation_Adapter::invoke_oneway(TAO_Operation_Details&, 
TAO_Pseudo_Var_T<CORBA::Object>&, TAO::Profile_Transport_Resolver&, 
ACE_Time_Value*&) (this=0xbfffeb00, details=@0xbfffe9e0, 
effective_target=@0xbfffe950, r=@0xbfffe8d0, max_wait_time=@0xbfffe964) at 
Invocation_Adapter.cpp:347
#17 0x08dc7a08 in TAO::Invocation_Adapter::invoke_remote_i(TAO_Stub*, 
TAO_Operation_Details&, TAO_Pseudo_Var_T<CORBA::Object>&, ACE_Time_Value*&) 
(this=0xbfffeb00, stub=0x140d020, details=@0xbfffe9e0, 
effective_target=@0xbfffe950, max_wait_time=@0xbfffe964) at 
Invocation_Adapter.cpp:275
#18 0x08dc7410 in TAO::Invocation_Adapter::invoke_i(TAO_Stub*, 
TAO_Operation_Details&) (this=0xbfffeb00, stub=0x140d020, details=@0xbfffe9e0) 
at Invocation_Adapter.cpp:91
#19 0x08dc72a0 in TAO::Invocation_Adapter::invoke(TAO::Exception_Data*, 
unsigned long) (this=0xbfffeb00, ex_data=0x0, ex_count=0) at 
Invocation_Adapter.cpp:50
#20 0x00031978 in QMgr::Reader::notify(char const*, char const*, char const*) 
(this=0x140bca0, queueName=0x141c840 "Test", seqnum=0x141c850 "1287", 
message=0x141c860 "m1287") at .gen/QMgrC.cpp:155

I have tried checking the size of the queue in TAO_Transport, but it's always 
empty, since each message is actually written out to the socket. So I can't 
just add some logic to "skip sending this notification if the transport queue 
is full".

I have tried setting the sync scope policy to Messaging::SYNC_NONE; no change, 
it still blocks when constraints_reached=true (as opposed to blocking due to 
try_sending_first), via TAO_Block_Flushing_Strategy::schedule_output(). (I 
don't think changing the flushing strategy would help, since this is a 
multithreaded application where the other threads do completely unrelated 
things; I certainly don't want them to start blocking on sending this thread's 
messages.)
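
For reference, this is roughly how I set up the override (a minimal sketch, 
error handling omitted; reader_obj stands for the receiver's object 
reference):

    #include "tao/Messaging/Messaging.h"

    // Override the SyncScope policy to SYNC_NONE on one object reference.
    CORBA::Any sync_any;
    sync_any <<= Messaging::SYNC_NONE;

    CORBA::PolicyList policies (1);
    policies.length (1);
    policies[0] =
      orb->create_policy (Messaging::SYNC_SCOPE_POLICY_TYPE, sync_any);

    CORBA::Object_var overridden =
      reader_obj->_set_policy_overrides (policies, CORBA::ADD_OVERRIDE);
    policies[0]->destroy ();

    QMgr::Reader_var oneway_reader =
      QMgr::Reader::_narrow (overridden.in ());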

So I have tried to force queueing (SYNC_NONE as well as a 
TAO::BufferingConstraint with mode = TAO::BUFFER_MESSAGE_COUNT and 
message_count = CORBA::ULong(-1)), but now the messages are never sent at all. 
I guess I need to flush the queue at some point (maybe from a timer), but I'm 
afraid that would still block the sender for as long as the receiver isn't 
responding.
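
Again a sketch of what I mean (the buffering-constraint types ship with the 
TAO Messaging library; the exact header locations vary between TAO versions, 
so this is from memory):

    // TAO::BufferingConstraint / TAO::BUFFERING_CONSTRAINT_POLICY_TYPE
    // come with the TAO Messaging library.
    TAO::BufferingConstraint bc;
    bc.mode = TAO::BUFFER_MESSAGE_COUNT;
    bc.message_count = CORBA::ULong (-1); // never flush based on count
    bc.message_bytes = 0;
    bc.timeout = 0;

    CORBA::Any bc_any;
    bc_any <<= bc;

    CORBA::PolicyList policies (1);
    policies.length (1);
    policies[0] =
      orb->create_policy (TAO::BUFFERING_CONSTRAINT_POLICY_TYPE, bc_any);

    CORBA::Object_var buffered =
      reader_obj->_set_policy_overrides (policies, CORBA::ADD_OVERRIDE);
    policies[0]->destroy ();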

Finally I tried to get the following behavior: "try sending first, but if it 
doesn't work after a few seconds, give up and queue for later, and send that 
one when the next notification is to be sent [or from a timer]". However, 
setting a Messaging::RELATIVE_RT_TIMEOUT_POLICY_TYPE policy with a 
TimeBase::TimeT of 2 * 10000000u (i.e. two seconds) has no effect.
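
This is how I set it up (a sketch again, same include as above, error 
handling omitted):

    // Two-second relative roundtrip timeout; TimeBase::TimeT counts in
    // units of 100ns, so 2 * 10000000 == 2s.
    TimeBase::TimeT relative_rt_timeout = 2 * 10000000u;
    CORBA::Any timeout_any;
    timeout_any <<= relative_rt_timeout;

    CORBA::PolicyList policies (1);
    policies.length (1);
    policies[0] =
      orb->create_policy (Messaging::RELATIVE_RT_TIMEOUT_POLICY_TYPE,
                          timeout_any);

    CORBA::Object_var timed =
      reader_obj->_set_policy_overrides (policies, CORBA::ADD_OVERRIDE);
    policies[0]->destroy ();
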
The timeout value is set correctly by the invocation adapter, but it isn't used 
by the transport:
466     int
467     TAO_Transport::send_message_block_chain_i (const ACE_Message_Block *mb,
468                                                size_t &bytes_transferred,
469                                                ACE_Time_Value *)
The ACE_Time_Value parameter doesn't even have a name - it is simply ignored!

If it's really the case that sending oneways can always end up blocking, then a 
solution could be a separate thread which takes care of making the oneway 
calls. It seems overkill to multiply the number of threads by 2 just for that, 
but I guess this would work... if nothing else does.
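
Something like the following is what I have in mind (a very rough sketch: 
Oneway_Sender, Notification and post() are made-up names, shutdown and 
cleanup are omitted, and QMgrC.h is whatever the generated stub header is 
called in the build):

    #include "ace/Task.h"
    #include "ace/Message_Block.h"
    #include "ace/SString.h"
    #include "QMgrC.h" // generated stub header

    // Hypothetical payload carried through the queue.
    struct Notification
    {
      ACE_CString queue, seq, msg;
    };

    class Oneway_Sender : public ACE_Task<ACE_MT_SYNCH>
    {
    public:
      Oneway_Sender (QMgr::Reader_ptr reader)
        : reader_ (QMgr::Reader::_duplicate (reader))
      {
        this->msg_queue ()->high_water_mark (64 * 1024); // bound the queue
        this->activate ();                               // one worker thread
      }

      // Called from application threads; returns immediately, dropping
      // the notification if the worker has fallen too far behind.
      void post (const char *q, const char *s, const char *m)
      {
        Notification *n = new Notification;
        n->queue = q; n->seq = s; n->msg = m;
        ACE_Message_Block *mb =
          new ACE_Message_Block ((char *) n, sizeof (*n));
        mb->wr_ptr (sizeof (*n)); // mark the payload as written
        // ACE queue timeouts are absolute, so a zero deadline (already in
        // the past) turns this into a non-blocking enqueue attempt.
        ACE_Time_Value nonblocking = ACE_Time_Value::zero;
        if (this->putq (mb, &nonblocking) == -1)
          {
            delete n;        // queue full: lose this notification
            mb->release ();
          }
      }

      virtual int svc (void)
      {
        ACE_Message_Block *mb = 0;
        while (this->getq (mb) != -1)
          {
            Notification *n = (Notification *) mb->base ();
            try
              {
                // Only this thread can block in the oneway call.
                this->reader_->notify (n->queue.c_str (), n->seq.c_str (),
                                       n->msg.c_str ());
              }
            catch (const CORBA::Exception &)
              {
                // Receiver gone or call timed out: drop and carry on.
              }
            delete n;
            mb->release ();
          }
        return 0;
      }

    private:
      QMgr::Reader_var reader_;
    };

The application threads would then call sender->post ("Test", "1287", 
"m1287") instead of invoking notify() directly, and could never block - at 
the price of dropped notifications under sustained overload.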

Ah, I also tried using DIOP. But because it doesn't support twoways at all, 
there's no way to ping() the receiver to check that it's actually alive 
(initially, and regularly). (Of course pinging receivers can block the sender 
too, but that is done from a timer, not from the normal code paths, so the 
risk of deadlocks is lower.) This might be another solution, with a redesign: 
pinging via two asynchronous oneways (a ping and a pong back) over DIOP...

Here's a self-contained testcase that shows that it's impossible to send
more than 1446 oneway messages to a receiver in a blocked thread
(orb event loop temporarily not running).

Save all attached files in an empty directory, run qmake (from either qt3 or 
qt4, both work), then make, then ./oneway_test

At the TAO level, the transport queue is always empty. The data is all being 
buffered at some level below (I guess in the operating system). This is why 
the various SyncScope policies don't seem to help in any way. So in the end, 
either TAO queues too much (nothing gets sent) or we never queue 
(try_sending_first=true, and this blocks at the 1447th message).
A working timeout would be a solution, but I'm open to other suggestions.
(Hmm, always queueing [somehow] and having another thread continuously flush 
the queue? I really wanted to avoid additional threads though).

More details in oneway_test.cpp. It's written like a unit test; just run it 
and you'll see whether it gets to the end.
Comment 1 dfaure 2007-01-25 03:25:40 CST
Created attachment 651 [details]
automated test for many oneway calls to a busy receiver
Comment 2 Johnny Willemsen 2007-02-20 08:56:06 CST
Adam, can you have a look at this? With the commit below you added a 
changelog entry saying that oneways would never block, but this report is 
about the fact that oneways do block in some cases.

Tue Aug 15 14:56:35 UTC 2006  Adam Mitz  <mitza@ociweb.com>
Comment 3 Steve Totten 2007-02-20 14:51:00 CST
Adam is busy at the moment.

We need a PRF from the original reporter.  Also, would the original reporter
please run $TAO_ROOT/tests/Oneway_Timeouts/run_test.pl and report the results?
Comment 4 Johnny Willemsen 2007-02-21 01:53:48 CST
to reporter
Comment 5 dfaure 2007-02-21 09:04:40 CST
The PRF was on the tao-bugs list in November already...

    TAO VERSION: 1.5  - also tried with 1.5.4
    ACE VERSION: 5.5  - and 5.5.4

    HOST MACHINE and OPERATING SYSTEM: Apple G5, Mac OS 10.3
       - also tried on Linux-2.6.17 Kubuntu-6.10 amd64.

    COMPILER NAME AND VERSION (AND PATCHLEVEL): gcc 3.3 20030304
       - gcc-4.1.2 on Linux

    CONTENTS OF $ACE_ROOT/ace/config.h: config-macosx-panther.h
       - config-linux.h on Linux

    CONTENTS OF 
$ACE_ROOT/include/makeinclude/platform_macros.GNU: platform_macosx.GNU
       - platform_linux.GNU on Linux

    AREA/CLASS/EXAMPLE AFFECTED: 
TAO_Transport::send_message_block_chain_i, making oneway calls

    DOES THE PROBLEM AFFECT: EXECUTION

    SYNOPSIS: see report

I'll attach the test results. What about my test program? Does it work for you?
Comment 6 Steve Totten 2007-02-21 09:11:34 CST
dfaure@klaralvdalens-datakonsult.se wrote:
> I'll attach the test results.

Thanks.

> What about my test program? Does it work for you?

I was unable to view the attachment in my web browser. What file format was it?
Try attaching a .tar.gz file.

Steve
Comment 7 dfaure 2007-02-21 09:11:42 CST
Created attachment 664 [details]
results on linux
Comment 8 dfaure 2007-02-21 09:12:21 CST
Created attachment 665 [details]
Mac results (this test is not really reliable if it sometimes fails due to timing reasons...)
Comment 9 dfaure 2007-02-21 09:15:39 CST
It -is- a .tar.gz file... I uploaded it to 
http://kdab.net/~dfaure/oneway_test.tar.gz now.
Comment 10 Steve Totten 2007-02-21 13:03:44 CST
dfaure@klaralvdalens-datakonsult.se wrote:

> It -is- a .tar.gz file... I uploaded it to 
> http://kdab.net/~dfaure/oneway_test.tar.gz now.

Thanks, I have the test, now.  Strange that when I clicked on the
attachment listed in the Bug, all I got was an option to save the .cgi
file.

I looked very briefly at your test.  I'm not sure how much time I
or anyone on my team will have in the next several days to look at
this.

dfaure@klaralvdalens-datakonsult.se wrote:
> (this test is not really reliable if it sometimes fails due to
> timing reasons...)

Yeah, testing for timeout conditions is not easy on an operating
system where you don't have much control over scheduling.

I looked at the results of the Oneway_Timeout test on the DOC
scoreboard and saw some inconsistent results across platforms
there, too.  Sometimes, it reports that something took longer than
expected.  Other times, it fails because an executable was still
running when the test exited (which may mean it was hung or it
hadn't finished its work yet--another kind of timeout condition).

A colleague of mine wrote:
> I beat my head against this test a few weeks ago. There's a particular 
> test case that expects the connection to be deferred so that the client 
> queues up a bunch of messages that are then blasted out as soon as the 
> connection completes. But in many cases, BSD, HP, and Solaris I think, 
> the connection completes right away and thus no client-side queuing. The 
> server side then kicks this out as an error since the time from the first 
> message to the last message is too long.

Anyway, there is little time to spare on this problem from my end.  If
you can help debug what is going on, and maybe convert your test into
a new one-button test that could be added to the repository (including
an MPC file and making sure it builds and runs across various platforms
and build configurations), it would be much appreciated.