Bug 3730 - Unmarshalling/invoking AMI replies rapidly leads to deep recursion and Seg Fault (worked in 1.4.8)
Status: NEW
Alias: None
Product: TAO
Classification: Unclassified
Component: AMI
Version: 1.7.1
Hardware: All
OS: Linux
Importance: P3 normal
Assignee: DOC Center Support List (internal)
URL:
Depends on:
Blocks:
 
Reported: 2009-08-24 18:11 CDT by David Michael
Modified: 2009-10-15 15:26 CDT
CC List: 1 user

See Also:


Attachments
Multi-threaded client with two orbs (client & server) that exhibits the problem (10.73 KB, text/plain)
2009-08-24 18:11 CDT, David Michael
Details

Description David Michael 2009-08-24 18:11:30 CDT
Created attachment 1183 [details]
Multi-threaded client with two orbs (client & server) that exhibits the problem

I'm working on an application that sends many outgoing AMI calls, and therefore receives many incoming replies (as well as many incoming SMI calls).

My design is as follows:
1 thread handles all incoming requests using a Server ORB.  Call this the "server thread".
n threads send outgoing calls using a client ORB
1 thread calls perform_work on the client ORB repeatedly to allow AMI replies to be unmarshalled and then invoked on the server ORB.  Call this the "client work thread".

It appears that if I send many AMI requests rapidly, the perform_work call keeps recursing until ultimately (somewhere between 1000 and 10000 messages on my platform) the process seg faults because the stack overflows.

I modified the code of $TAO_ROOT/tests/AMI/client.cc to create the attached two_orb_client.cc, which approximates the design of my application and causes the bad behavior.  Run as follows:
> server -o test.ior
in another shell:
> two_orb_client -i 10000 -ORBCollocation no
If that doesn't cause a seg fault on your platform, try a larger number of iterations (the -i parameter).

The stack trace in the client worker thread looks like this (many, many lines omitted):
#1  0x0000002a95f94fa8 in TAO_Transport::send_synch_message_helper_i (
    this=0x5fd7f0, synch_message=@0x43209f30, max_wait_time=0x0)
    at ../../../TAO/tao/Transport.cpp:783
#2  0x0000002a95f9536a in TAO_Transport::send_synchronous_message_i (
    this=0x5fd7f0, mb=0x5fd9e0, max_wait_time=0x0)
    at ../../../TAO/tao/Transport.cpp:628
#3  0x0000002a95f95ff5 in TAO_Transport::send_message_shared_i (this=0x5fd7f0, 
    stub=Variable "stub" is not available.
) at ../../../TAO/tao/Transport.cpp:1326
#4  0x0000002a95f96096 in TAO_Transport::send_message_shared (this=0x5fd7f0, 
    stub=0x586ec0, message_semantics=TAO_TWOWAY_REQUEST, 
    message_block=0x5fd9e0, max_wait_time=0x0)
    at ../../../TAO/tao/Transport.cpp:305
#5  0x0000002a95f36a07 in TAO_IIOP_Transport::send_message (this=0x5fd7f0, 
    stream=@0x5fd9e0, stub=0x586ec0, message_semantics=TAO_TWOWAY_REQUEST, 
    max_wait_time=0x0) at ../../../TAO/../ace/CDR_Stream.inl:502
#6  0x0000002a95f36985 in TAO_IIOP_Transport::send_request (this=0x5fd7f0, 
    stub=0x586ec0, orb_core=Variable "orb_core" is not available.
) at ../../../TAO/tao/IIOP_Transport.cpp:217
#7  0x0000002a95f6e72d in TAO::Remote_Invocation::send_message (
    this=0x4320a1a0, cdr=@0x5fd9e0, message_semantics=TAO_TWOWAY_REQUEST, 
    max_wait_time=0x0) at ../../../TAO/tao/Profile_Transport_Resolver.inl:45
#8  0x0000002a95f79dfe in TAO::Synch_Twoway_Invocation::remote_twoway (
    this=0x4320a1a0, max_wait_time=0x0)
    at ../../../TAO/tao/Synch_Invocation.cpp:126
#9  0x0000002a95f3c83b in TAO::Invocation_Adapter::invoke_twoway (
    this=0x4320a560, details=Variable "details" is not available.
) at ../../../TAO/tao/Invocation_Adapter.cpp:299
#10 0x0000002a95f3c113 in TAO::Invocation_Adapter::invoke_remote_i (
    this=0x4320a560, stub=0x586ec0, details=@0x4320a450, 
    effective_target=@0x4320a3e0, max_wait_time=@0x4320a338)
    at ../../../TAO/tao/Invocation_Adapter.cpp:270
#11 0x0000002a95f3cc67 in TAO::Invocation_Adapter::invoke_i (this=0x4320a560, 
    stub=0x586ec0, details=@0x4320a450)
    at ../../../TAO/tao/Invocation_Adapter.cpp:92
#12 0x0000002a95f3bfaa in TAO::Invocation_Adapter::invoke (this=0x4320a560, 
    ex_data=0x52aae0, ex_count=Variable "ex_count" is not available.
) at ../../../TAO/tao/Invocation_Adapter.cpp:50
#13 0x000000000040c103 in A::AMI_AMI_TestHandler::foo (this=0x8b8990, 
    ami_return_val=931234, out_l=931233) at ami_testC.cpp:889
#14 0x000000000040c2eb in A::AMI_AMI_TestHandler::foo_reply_stub (
    _tao_in=@0x2a9698ddd8, _tao_reply_handler=0x5862b0, reply_status=0)
    at ami_testC.cpp:920
#15 0x0000002a9559a739 in TAO_Asynch_Reply_Dispatcher::dispatch_reply (
    this=0x2a9698db48, params=Variable "params" is not available.
) at ../../../TAO/tao/Objref_VarOut_T.inl:47
#16 0x0000002a95f48dec in TAO_Muxed_TMS::dispatch_reply (this=Variable "this" is not available.
)
    at ../../../TAO/../ace/Intrusive_Auto_Ptr.inl:28
#17 0x0000002a95f1b680 in TAO_GIOP_Message_Base::process_reply_message (
    this=0x2a969259d0, params=@0x43209eb0, qd=Variable "qd" is not available.
)
    at ../../../TAO/tao/Transport.inl:27
#18 0x0000002a95f92672 in TAO_Transport::process_parsed_messages (
    this=0x2a96925810, qd=0x2a97e222f8, rh=@0x4320aa30)
    at ../../../TAO/tao/Transport.inl:130
#19 0x0000002a95f92e3c in TAO_Transport::process_queue_head (
    this=0x2a96925810, rh=@0x4320aa30) at ../../../TAO/tao/Transport.cpp:2561
#20 0x0000002a95f93ae8 in TAO_Transport::handle_input (this=0x2a96925810, 
    rh=@0x4320aa30, max_wait_time=0x0) at ../../../TAO/tao/Transport.cpp:1640
#21 0x0000002a95f14185 in TAO_Connection_Handler::handle_input_internal (
    this=0x2a9691d6a8, h=-1, eh=0x2a9691d5e0)
    at ../../../TAO/tao/Connection_Handler.inl:17
#22 0x0000002a95f142fe in TAO_Connection_Handler::handle_input_eh (
    this=0x2a9691d6a8, h=-1, eh=0x2a9691d5e0)
    at ../../../TAO/tao/Connection_Handler.cpp:241
#23 0x0000002a962564d7 in ACE_Select_Reactor_Notify::dispatch_notify (this=Variable "this" is not available.
)
    at ../../ace/Select_Reactor_Base.cpp:832
#24 0x0000002a9626dff5 in ACE_TP_Reactor::handle_notify_events (this=Variable "this" is not available.
)
    at ../../ace/TP_Reactor.cpp:381
#25 0x0000002a9626f1d8 in ACE_TP_Reactor::dispatch_i (this=0x54e410, 
    max_wait_time=Variable "max_wait_time" is not available.
) at ../../ace/TP_Reactor.cpp:233
#26 0x0000002a9626f293 in ACE_TP_Reactor::handle_events (this=0x54e410, 
    max_wait_time=0x0) at ../../ace/TP_Reactor.cpp:173
#27 0x0000002a95f40614 in TAO_Leader_Follower::wait_for_event (this=0x54e070, 
    event=0xb273a0, transport=Variable "transport" is not available.
) at ../../../TAO/../ace/Reactor.inl:128
#28 0x0000002a95f79701 in TAO::Synch_Twoway_Invocation::wait_for_reply (
    this=0x4320ade0, max_wait_time=0x43209eb0, rd=@0x0, bd=@0x0)
    at ../../../TAO/tao/Transport.inl:47
#29 0x0000002a95f79eaa in TAO::Synch_Twoway_Invocation::remote_twoway (
    this=0x4320ade0, max_wait_time=0x0)
    at ../../../TAO/../ace/Intrusive_Auto_Ptr.inl:40
#30 0x0000002a95f3c83b in TAO::Invocation_Adapter::invoke_twoway (
    this=0x4320b1a0, details=Variable "details" is not available.
) at ../../../TAO/tao/Invocation_Adapter.cpp:299
#31 0x0000002a95f3c113 in TAO::Invocation_Adapter::invoke_remote_i (
    this=0x4320b1a0, stub=0x586ec0, details=@0x4320b090, 
    effective_target=@0x4320b020, max_wait_time=@0x4320af78)
    at ../../../TAO/tao/Invocation_Adapter.cpp:270
#32 0x0000002a95f3cc67 in TAO::Invocation_Adapter::invoke_i (this=0x4320b1a0, 
    stub=0x586ec0, details=@0x4320b090)
    at ../../../TAO/tao/Invocation_Adapter.cpp:92
#33 0x0000002a95f3bfaa in TAO::Invocation_Adapter::invoke (this=0x4320b1a0, 
    ex_data=0x52aae0, ex_count=Variable "ex_count" is not available.
) at ../../../TAO/tao/Invocation_Adapter.cpp:50
#34 0x000000000040c103 in A::AMI_AMI_TestHandler::foo (this=0x8b9220, 
    ami_return_val=931234, out_l=931233) at ami_testC.cpp:889
#35 0x000000000040c2eb in A::AMI_AMI_TestHandler::foo_reply_stub (
    _tao_in=@0x2a97675038, _tao_reply_handler=0x5862b0, reply_status=0)
    at ami_testC.cpp:920
#36 0x0000002a9559a739 in TAO_Asynch_Reply_Dispatcher::dispatch_reply (
    this=0x2a97674da8, params=Variable "params" is not available.
) at ../../../TAO/tao/Objref_VarOut_T.inl:47
#37 0x0000002a95f48dec in TAO_Muxed_TMS::dispatch_reply (this=Variable "this" is not available.
)
    at ../../../TAO/../ace/Intrusive_Auto_Ptr.inl:28
#38 0x0000002a95f1b680 in TAO_GIOP_Message_Base::process_reply_message (
    this=0x61c400, params=@0x43209eb0, qd=Variable "qd" is not available.
)
    at ../../../TAO/tao/Transport.inl:27
#39 0x0000002a95f92672 in TAO_Transport::process_parsed_messages (
    this=0x61c240, qd=0x2a97e398e0, rh=@0x4320b670)
    at ../../../TAO/tao/Transport.inl:130
#40 0x0000002a95f92e3c in TAO_Transport::process_queue_head (this=0x61c240, 
    rh=@0x4320b670) at ../../../TAO/tao/Transport.cpp:2561
#41 0x0000002a95f93ae8 in TAO_Transport::handle_input (this=0x61c240, 
    rh=@0x4320b670, max_wait_time=0x0) at ../../../TAO/tao/Transport.cpp:1640
#42 0x0000002a95f14185 in TAO_Connection_Handler::handle_input_internal (
    this=0x6140d8, h=-1, eh=0x614010)
    at ../../../TAO/tao/Connection_Handler.inl:17
#43 0x0000002a95f142fe in TAO_Connection_Handler::handle_input_eh (
    this=0x6140d8, h=-1, eh=0x614010)
    at ../../../TAO/tao/Connection_Handler.cpp:241
#44 0x0000002a962564d7 in ACE_Select_Reactor_Notify::dispatch_notify (this=Variable "this" is not available.
)
    at ../../ace/Select_Reactor_Base.cpp:832
#45 0x0000002a9626dff5 in ACE_TP_Reactor::handle_notify_events (this=Variable "this" is not available.
)
    at ../../ace/TP_Reactor.cpp:381
#46 0x0000002a9626f1d8 in ACE_TP_Reactor::dispatch_i (this=0x54e410, 
    max_wait_time=Variable "max_wait_time" is not available.
) at ../../ace/TP_Reactor.cpp:233
#47 0x0000002a9626f293 in ACE_TP_Reactor::handle_events (this=0x54e410, 
    max_wait_time=0x0) at ../../ace/TP_Reactor.cpp:173
#48 0x0000002a95f40614 in TAO_Leader_Follower::wait_for_event (this=0x54e070, 
    event=0x80c3a0, transport=Variable "transport" is not available.
) at ../../../TAO/../ace/Reactor.inl:128
#49 0x0000002a95f79701 in TAO::Synch_Twoway_Invocation::wait_for_reply (
    this=0x4320ba20, max_wait_time=0x43209eb0, rd=@0x0, bd=@0x0)
    at ../../../TAO/tao/Transport.inl:47
#50 0x0000002a95f79eaa in TAO::Synch_Twoway_Invocation::remote_twoway (
    this=0x4320ba20, max_wait_time=0x0)
    at ../../../TAO/../ace/Intrusive_Auto_Ptr.inl:40
#51 0x0000002a95f3c83b in TAO::Invocation_Adapter::invoke_twoway (
    this=0x4320bde0, details=Variable "details" is not available.
) at ../../../TAO/tao/Invocation_Adapter.cpp:299
#52 0x0000002a95f3c113 in TAO::Invocation_Adapter::invoke_remote_i (
    this=0x4320bde0, stub=0x586ec0, details=@0x4320bcd0, 
    effective_target=@0x4320bc60, max_wait_time=@0x4320bbb8)
    at ../../../TAO/tao/Invocation_Adapter.cpp:270
#53 0x0000002a95f3cc67 in TAO::Invocation_Adapter::invoke_i (this=0x4320bde0, 
    stub=0x586ec0, details=@0x4320bcd0)
    at ../../../TAO/tao/Invocation_Adapter.cpp:92
#54 0x0000002a95f3bfaa in TAO::Invocation_Adapter::invoke (this=0x4320bde0, 
    ex_data=0x52aae0, ex_count=Variable "ex_count" is not available.
) at ../../../TAO/tao/Invocation_Adapter.cpp:50
#55 0x000000000040c103 in A::AMI_AMI_TestHandler::foo (this=0x8b6760, 
    ami_return_val=931234, out_l=931233) at ami_testC.cpp:889
#56 0x000000000040c2eb in A::AMI_AMI_TestHandler::foo_reply_stub (
    _tao_in=@0x2a96eba800, _tao_reply_handler=0x5862b0, reply_status=0)
    at ami_testC.cpp:920
#57 0x0000002a9559a739 in TAO_Asynch_Reply_Dispatcher::dispatch_reply (
    this=0x2a96eba570, params=Variable "params" is not available.
) at ../../../TAO/tao/Objref_VarOut_T.inl:47
#58 0x0000002a95f48dec in TAO_Muxed_TMS::dispatch_reply (this=Variable "this" is not available.
)
    at ../../../TAO/../ace/Intrusive_Auto_Ptr.inl:28
#59 0x0000002a95f1b680 in TAO_GIOP_Message_Base::process_reply_message (
    this=0x587440, params=@0x43209eb0, qd=Variable "qd" is not available.
)
    at ../../../TAO/tao/Transport.inl:27
#60 0x0000002a95f92672 in TAO_Transport::process_parsed_messages (
    this=0x587280, qd=0x2a97bfadd0, rh=@0x4320c2b0)
    at ../../../TAO/tao/Transport.inl:130
#61 0x0000002a95f92e3c in TAO_Transport::process_queue_head (this=0x587280, 
    rh=@0x4320c2b0) at ../../../TAO/tao/Transport.cpp:2561
#62 0x0000002a95f93ae8 in TAO_Transport::handle_input (this=0x587280, 
    rh=@0x4320c2b0, max_wait_time=0x0) at ../../../TAO/tao/Transport.cpp:1640
#63 0x0000002a95f14185 in TAO_Connection_Handler::handle_input_internal (
    this=0x587128, h=-1, eh=0x587060)
    at ../../../TAO/tao/Connection_Handler.inl:17
#64 0x0000002a95f142fe in TAO_Connection_Handler::handle_input_eh (
    this=0x587128, h=-1, eh=0x587060)
    at ../../../TAO/tao/Connection_Handler.cpp:241
#65 0x0000002a962564d7 in ACE_Select_Reactor_Notify::dispatch_notify (this=Variable "this" is not available.
)
    at ../../ace/Select_Reactor_Base.cpp:832
#66 0x0000002a9626dff5 in ACE_TP_Reactor::handle_notify_events (this=Variable "this" is not available.
)
    at ../../ace/TP_Reactor.cpp:381
#67 0x0000002a9626f1d8 in ACE_TP_Reactor::dispatch_i (this=0x54e410, 
    max_wait_time=Variable "max_wait_time" is not available.
) at ../../ace/TP_Reactor.cpp:233
#68 0x0000002a9626f293 in ACE_TP_Reactor::handle_events (this=0x54e410, 
    max_wait_time=0x0) at ../../ace/TP_Reactor.cpp:173
#69 0x0000002a95f40614 in TAO_Leader_Follower::wait_for_event (this=0x54e070, 
    event=0xb28900, transport=Variable "transport" is not available.
) at ../../../TAO/../ace/Reactor.inl:128
...
...
...
#69148 0x0000002a9626f1f4 in ACE_TP_Reactor::dispatch_i (this=0x54e410, 
    max_wait_time=Variable "max_wait_time" is not available.
) at ../../ace/TP_Reactor.cpp:244
#69149 0x0000002a9626f293 in ACE_TP_Reactor::handle_events (this=0x54e410, 
    max_wait_time=0x0) at ../../ace/TP_Reactor.cpp:173
#69150 0x0000002a95f50a7d in TAO_ORB_Core::run (this=0x53efe0, tv=0x0, 
    perform_work=1) at ../../../TAO/../ace/Reactor.inl:128
#69151 0x0000000000419d9d in ClientWorker::svc (this=0x7fbffff2a0)
    at ../../../../TAO/tests/AMI/two_orb_client.cpp:417
#69152 0x0000002a962700da in ACE_Task_Base::svc_run (args=Variable "args" is not available.
)
    at ../../ace/Task.cpp:275
#69153 0x0000002a962712f5 in ACE_Thread_Adapter::invoke_i (this=Variable "this" is not available.
)
    at ../../ace/Thread_Adapter.cpp:149
#69154 0x0000002a96271258 in ACE_Thread_Adapter::invoke (this=0x5d5820)
    at ../../ace/Thread_Adapter.cpp:98
#69155 0x00000033d9506137 in start_thread () from /lib64/tls/libpthread.so.0
#69156 0x00000033d88c7533 in clone () from /lib64/tls/libc.so.6


I haven't done it in this test, but in my real application I give perform_work a max wait time of one second, and it isn't honored either.

When I encountered this problem, I rolled back our version of TAO to 1.4.8, and it seems to work fine there.
Comment 1 Steve Totten 2009-08-25 15:48:37 CDT
Hi David,

Your deep stack trace indicates your client worker thread is experiencing nested upcalls.  Your AMI reply handler is making a synchronous twoway invocation on an object hosted by the server ORB.  While waiting for a reply, it receives another AMI callback and makes another synchronous twoway invocation.  While waiting for a reply, it receives another AMI callback, and so on.

You could try allowing collocation by removing the "-ORBCollocation no" option.  Then, the twoway invocations from the reply handler would be invoked on the same thread as the one making the call, and that thread would not be subject to nested AMI callbacks.  However, I believe you turned off collocation at my suggestion in a previous bug report.  (I would have to go back and review it to recall the details.)

I think you are running into issues that are very common for "middle-tier" server/client applications.  Your application is performing the roles of both server and client, which makes it difficult to scale up without hitting problems such as deeply nested upcalls (or, in this case, deeply nested AMI callbacks).  You are on the right track by using AMI.  But, AMI alone does not go far enough.  It only makes the client side of your application behave asynchronously.  You should consider adding TAO's Asynchronous Method Handling (AMH) to make the server side also behave asynchronously.  We at OCI have found this combination of AMH with AMI to be a very powerful technique for making middle-tier applications scalable and efficient.

Here are some resources about AMH/AMI for learning more:

http://www.cs.wustl.edu/~schmidt/PDF/AMH.pdf
http://cnb.ociweb.com/cnb/CORBANewsBrief-200308.html

In addition, OCI's TAO Developer's Guide contains a chapter about AMH that details how to combine AMH and AMI in a middle-tier application.  In fact, we have just finished a new version of the Developer's Guide that is consistent with the OCI TAO 1.6a release.  The new guide will be available for purchase from our web site very soon.  Contact sales@ociweb.com in the meantime.

Finally, OCI's Advanced CORBA Programming with TAO training course offers a module on AMH/AMI for middle-tier applications.  See http://www.ociweb.com/training/Distributed-Computing for more information.

Steve
Comment 2 David Michael 2009-10-15 12:05:35 CDT
Sorry for taking so long to get back to you; my wife had a baby shortly after I submitted this, and then other priorities trumped this work for a while.

Steve, you nailed it...  this is a middle-tier kind of program I'm working on.  I wondered if I could badger you with a couple of further questions.  If you don't have time or need to check to see if we're covered for support, that's fine.  But hopefully you can answer quickly off the top of your well-informed head :-)
* You mention AMH as a possible solution.  I read up a little on it, and I'm concerned that it wouldn't really solve my problem.  The "reply" that gets sent from the client worker thread is actually handled very rapidly on the server side (this obviously is also true in the attached code).  There's very little work that my lone "server" thread does;  it pretty much always sends a job off almost immediately to some other thread (a lot like I would do with AMH, I expect).  I don't think AMH would make it significantly faster, so I'm wondering if I could still run into the same situation.  It seems like the client worker thread could still do nested upcalls if replies are coming in rapidly, since it doesn't know or care how calls get handled by the server.  Am I missing something here?
* I've been trying like crazy to avoid nested upcalls because they cause so much trouble for a multi-threaded app.  But I can't turn them off on my "client" ORB because I'm using AMI.  The whole reason I'm using two orbs is to protect myself from nested upcalls, but you're right, they apparently still happen under the hood in the AMI reply handling code (which of course uses SMI).  This behavior seems to me to be a bug in TAO, if for no other reason than because my timeout parameter to ORB::perform_work gets ignored completely in this case.  For what it's worth, this stack overflow doesn't seem to happen with 1.4.8, so in some sense this is a regression (although it isn't tested).  What do you think, bug or no bug?  (By the way, I'm not sure in what version it changed;  we just happen to be using 1.4.8 currently).
* Can I somehow workaround this by having both ORBs perform work in 1 thread (and allow collocation)?  It seems like there might not be a good way to make sure I don't starve either ORB or use excess CPU by constantly checking both.  But maybe somebody's figured out how to do that well.

...side note:  If there was a way to turn off nested upcalls for AMI (but use 1 ORB), I wouldn't be causing such trouble :-).
Comment 3 Steve Totten 2009-10-15 14:05:30 CDT
Hi Dave,

(In reply to comment #2)
> Sorry for taking so long to get back to you, my wife had a baby shortly after I
> submitted, and then other priorities trumped this work for a while.

Congratulations on your new baby!

> Steve, you nailed it...  this is a middle-tier kind of program I'm working on. 
> I wondered if I could badger you with a couple of further questions.  If you
> don't have time or need to check to see if we're covered for support, that's
> fine.  But hopefully you can answer quickly off the top of your well-informed
> head :-)

No problem.

> * You mention AMH as a possible solution.  I read up a little on it, and I'm
> concerned that it wouldn't really solve my problem.  The "reply" that gets sent
> from the client worker thread is actually handled very rapidly on the server
> side (this obviously is also true in the attached code).  There's very little
> work that my lone "server" thread does;  it pretty much always sends a job off
> almost immediately to some other thread (a lot like I would do with AMH, I
> expect).  I don't think AMH would make it significantly faster, so I'm
> wondering if I could still run in to the same situation.  It seems like the
> client worker thread could still do nested upcalls if replies are coming in
> rapidly, since it doesn't know or care how calls get handled by the server.  Am
> I missing something here?

Using AMH in conjunction with AMI, I think you could eliminate the "client-only
ORB".  All incoming invocations would be handled asynchronously (the AMH side
of things) and all outgoing invocations would be sent asynchronously (the AMI
side of things).  Even the reply handler callbacks would be subject to the
AMH model.  So, there should be no possibility of nested upcalls.

> * I've been trying like crazy to avoid nested upcalls because they cause so
> much trouble for a multi-threaded app.  But I can't turn them off on my
> "client" ORB because I'm using AMI.

Right, because options like -ORBWaitStrategy rw are incompatible with AMI.
Those options are intended to help avoid nested upcalls when requests are
sent synchronously.

> The whole reason I'm using two orbs is to
> protect myself from nested upcalls, but you're right, they apparently still
> happen under the hood in the AMI reply handling code (which of course uses
> SMI).

The "two-ORB" approach to avoiding nested upcalls is also intended for use
when requests are sent synchronously.  When you use AMI, the "client ORB" has
to become a "server ORB" (thus, the calls to perform_work()), so it defeats
the purpose of separating server and client behavior between two ORBs.

> This behavior seems to me to be a bug in TAO, if for no other reason
> than because my timeout parameter to ORB::perform_work gets ignored completely
> in this case.  For what it's worth, this stack overflow doesn't seem to happen
> with 1.4.8, so in some sense this is a regression (although it isn't tested). 
> What do you think, bug or no bug?  (By the way, I'm not sure in what version it
> changed;  we just happen to be using 1.4.8 currently).

The differences from 1.4.8 to 1.7.1 are numerous, to say the least.  It would
take a lot of digging to try to figure out just what change introduced this
new problem you are encountering.

> * Can I somehow workaround this by having both ORBs perform work in 1 thread
> (and allow collocation)?  It seems like there might not be a good way to make
> sure I don't starve either ORB or use excess CPU by constantly checking both. 
> But maybe somebody's figured out how to do that well.

I was just thinking that, since you are using AMI, you should be able to
eliminate the 2nd ORB.  As you say you are already handing off work from the
"server" ORB's event loop to a worker thread, you already have asynchronous
handling of incoming requests.  If your worker threads would just make their
AMI invocations through the same "server" ORB, which already has a
perform_work() or run() event loop, then the AMI reply handler callbacks
would be dispatched in whatever thread(s) were running the ORB's event loop.
These would be synchronous callbacks, but perhaps you could employ some sort
of asynchronous handling of them, by always queuing the results and replying
immediately from the reply handler, then processing the results and updating
your state model via another thread.

> ...side note:  If there was a way to turn off nested upcalls for AMI (but use 1
> ORB), I wouldn't be causing such trouble :-).

Maybe the client strategy factory needs a new -ORBAMICallbackWaitStrategy
option!  (Just what TAO needs... more options.)
Comment 4 David Michael 2009-10-15 15:26:52 CDT
(In reply to comment #3)
> (In reply to comment #2)
> Congratulations on your new baby!
Thanks!

> Using AMH in conjunction with AMI, I think you could eliminate the "client-only
> ORB".  All incoming invocations would be handled asynchronously (the AMH side
> of things) and all outgoing invocations would be sent asynchronously (the AMI
> side of things).  Even the reply handler callbacks would be subject to the
> AMH model.  So, there should be no possibility of nested upcalls.
Hmm, I think I get what you're saying, but I'm not positive.  I'll have to think about it, but I'm a bit skeptical that we wouldn't still see nested upcalls.  We get nested upcalls on rare occasions for AMI requests, and it seems likely we'd get them in similar rare circumstances when sending an AMH response.  In either case, the nested upcall would be invoked on a thread other than the server thread.  I guess the saving grace is that the code for receiving the messages and dispatching work to the right worker threads could be simple, rock-solid, and thread safe.  And the nested upcalls wouldn't have the opportunity to blow up the stack, since no invocation would ever lead directly to another invocation in the same threading context.  I guess that's probably what you're saying.  That might be a possibility for me, it would just be a significant redesign.  I'll think about that.
 
> > * I've been trying like crazy to avoid nested upcalls because they cause so
> > much trouble for a multi-threaded app.  But I can't turn them off on my
> > "client" ORB because I'm using AMI.
> 
> Right, because options like -ORBWaitStrategy rw are incompatible with AMI.
> Those options are intended to help avoid nested upcalls when requests are
> sent synchronously.
> 
Ah, but what to do to avoid nested upcalls when requests are sent asynchronously...  that's the mystery I've been trying to solve :-)

> The "two-ORB" approach to avoiding nested upcalls is also intended for use
> when requests are sent synchronously.  When you use AMI, the "client ORB" has
> to become a "server ORB" (thus, the calls to perform_work()), so it defeats
> the purpose of separating server and client behavior between two ORBs.
> 
I've slowly been coming to that realization, that AMI and the two-ORB solution don't really mix.  I've been trying to get around it by making sure that all of the "server-like" work of the AMI calls still happens in a combination of a "client orb perform_work thread" and the server thread.  I guess even that might not be enough, since the AMI request could still make a nested upcall to handle an AMI reply...  yuck.

> I was just thinking that, since you are using AMI, you should be able to
> eliminate the 2nd ORB.  As you say you are already handing off work from the
> "server" ORB's event loop to a worker thread, you already have asynchronous
> handling of incoming requests.  If your worker threads would just make their
> AMI invocations through the same "server" ORB, which already has a
> perform_work() or run() event loop, then the AMI reply handler callbacks
> would be dispatched in whatever thread(s) were running the ORB's event loop.
> These would be synchronous callbacks, but perhaps you could employ some sort
> of asynchronous handling of them, by always queuing the results and replying
> immediately from the reply handler, then processing the results and updating
> your state model via another thread.
> 
If I cut down to 1 ORB, I have almost exactly the design you're talking about already.  The callbacks go through the same sort of quick hand-off to another thread.  The problem with it is that the AMI requests can also have nested upcalls...  which is why I've gone down this particular rabbit hole.  If a worker thread has a nested upcall on its AMI invocation, this is going to invoke stuff using the worker thread that I only expect to happen in the server thread.  That is unless I'm assuming too much.  I know this can happen in the single-threaded version of my software.  Maybe an AMI invocation won't do nested upcalls if there's another thread that's calling perform_work on the ORB???

> ...side note:  If there was a way to turn off nested upcalls for AMI (but use 1
> > ORB), I wouldn't be causing such trouble :-).
> 
> Maybe the client strategy factory needs a new -ORBAMICallbackWaitStrategy
> option!  (Just what TAO needs... more options.)
> 
It certainly is a challenge to figure out the right way to configure things...  but such an option would've saved me a LOT of time & trouble!