| Summary: | Unmarshalling/invoking AMI replies rapidly leads to deep recursion and Seg Fault (worked in 1.4.8) | | |
|---|---|---|---|
| Product: | TAO | Reporter: | David Michael <damicha> |
| Component: | AMI | Assignee: | DOC Center Support List (internal) <tao-support> |
| Status: | NEW --- | | |
| Severity: | normal | CC: | totten_s |
| Priority: | P3 | | |
| Version: | 1.7.1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Attachments: | Multi-threaded client with two orbs (client & server) that exhibits the problem | | |
Description
David Michael
2009-08-24 18:11:30 CDT
Hi David,

Your deep stack trace indicates your client worker thread is experiencing nested upcalls. Your AMI reply handler is making a synchronous twoway invocation on an object hosted by the server ORB. While waiting for a reply, it receives another AMI callback and makes another synchronous twoway invocation. While waiting for a reply, it receives another AMI callback, and so on.

You could try allowing collocation by removing the "-ORBCollocation no" option. Then the twoway invocations from the reply handler would be invoked on the same thread as the one making the call, and that thread would not be subject to nested AMI callbacks. However, I believe you turned off collocation at my suggestion in a previous bug report. (I would have to go back and review it to recall the details.)

I think you are running into issues that are very common for "middle-tier" server/client applications. Your application is performing the roles of both server and client, which makes it difficult to scale up without hitting problems such as deeply nested upcalls (or, in this case, deeply nested AMI callbacks).

You are on the right track by using AMI. But AMI alone does not go far enough: it only makes the client side of your application behave asynchronously. You should consider adding TAO's Asynchronous Method Handling (AMH) to make the server side also behave asynchronously. We at OCI have found this combination of AMH with AMI to be a very powerful technique for making middle-tier applications scalable and efficient.

Here are some resources for learning more about AMH/AMI:

http://www.cs.wustl.edu/~schmidt/PDF/AMH.pdf
http://cnb.ociweb.com/cnb/CORBANewsBrief-200308.html

In addition, OCI's TAO Developer's Guide contains a chapter about AMH that details how to combine AMH and AMI in a middle-tier application. In fact, we have just finished a new version of the Developer's Guide that is consistent with the OCI TAO 1.6a release. The new guide will be available for purchase from our web site very soon; contact sales@ociweb.com in the meantime.

Finally, OCI's Advanced CORBA Programming with TAO training course offers a module on AMH/AMI for middle-tier applications. See http://www.ociweb.com/training/Distributed-Computing for more information.

Steve
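For illustration, here is a minimal sketch of the failure mode described above. The IDL and class names are hypothetical (not taken from the attached reproducer), and the handler class follows TAO's AMI callback mapping, which may differ in detail across versions. The point is only that the reply handler makes a synchronous twoway call, and every further AMI reply dispatched while that call is blocked adds another level of recursion:

```cpp
// Hypothetical IDL, compiled with AMI callbacks (-GC) enabled:
//
//   interface Status { void report (in long id); };   // hosted by the "server" ORB
//   interface Job    { long submit (in long id); };   // remote service invoked via AMI
//
#include "JobS.h"      // hypothetical generated skeletons (includes POA_AMI_JobHandler)
#include "StatusC.h"   // hypothetical generated stubs for Status

// Reply handler dispatched in the client worker thread.  The synchronous
// twoway call it makes is what opens the door to nested dispatch: while
// report() blocks waiting for its reply, the ORB can deliver the next AMI
// reply on this same thread, which calls report() again, and so on, one
// stack frame per nesting level.
class JobReplyHandler : public virtual POA_AMI_JobHandler
{
public:
  JobReplyHandler (Status_ptr status)
    : status_ (Status::_duplicate (status))
  {
  }

  virtual void submit (CORBA::Long /* ami_return_val */)   // AMI reply callback
  {
    status_->report (42);   // synchronous twoway made from inside the callback
  }

  virtual void submit_excep (::Messaging::ExceptionHolder *)
  {
    // handle the exceptional reply
  }

private:
  Status_var status_;
};
```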
---

David Michael:

Sorry for taking so long to get back to you; my wife had a baby shortly after I submitted, and then other priorities trumped this work for a while.

Steve, you nailed it... this is a middle-tier kind of program I'm working on. I wondered if I could badger you with a couple of further questions. If you don't have time or need to check to see if we're covered for support, that's fine. But hopefully you can answer quickly off the top of your well-informed head :-)

* You mention AMH as a possible solution. I read up a little on it, and I'm concerned that it wouldn't really solve my problem. The "reply" that gets sent from the client worker thread is actually handled very rapidly on the server side (this obviously is also true in the attached code). There's very little work that my lone "server" thread does; it pretty much always sends a job off almost immediately to some other thread (a lot like I would do with AMH, I expect). I don't think AMH would make it significantly faster, so I'm wondering if I could still run into the same situation. It seems like the client worker thread could still do nested upcalls if replies are coming in rapidly, since it doesn't know or care how calls get handled by the server. Am I missing something here?

* I've been trying like crazy to avoid nested upcalls because they cause so much trouble for a multi-threaded app. But I can't turn them off on my "client" ORB because I'm using AMI. The whole reason I'm using two ORBs is to protect myself from nested upcalls, but you're right, they apparently still happen under the hood in the AMI reply handling code (which of course uses SMI). This behavior seems to me to be a bug in TAO, if for no other reason than because my timeout parameter to ORB::perform_work gets ignored completely in this case. For what it's worth, this stack overflow doesn't seem to happen with 1.4.8, so in some sense this is a regression (although I haven't tested this thoroughly). What do you think, bug or no bug? (By the way, I'm not sure in what version it changed; we just happen to be using 1.4.8 currently.)

* Can I somehow work around this by having both ORBs perform work in one thread (and allow collocation)? It seems like there might not be a good way to make sure I don't starve either ORB or use excess CPU by constantly checking both. But maybe somebody's figured out how to do that well. (A sketch of such a loop follows this comment.)

...side note: If there was a way to turn off nested upcalls for AMI (but use one ORB), I wouldn't be causing such trouble :-).
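On the last question above (servicing both ORBs from one thread), a minimal sketch of such a loop is shown below, using the perform_work() overload that takes an ACE_Time_Value, the same timeout parameter mentioned in the comment. This only illustrates the time-sliced, round-robin event-loop mechanics; whether it avoids the nested AMI dispatch is exactly what the rest of the thread debates:

```cpp
#include <tao/ORB.h>
#include <ace/Time_Value.h>

// One thread alternately services both ORBs with bounded time slices.
void drive_both_orbs (CORBA::ORB_ptr server_orb,
                      CORBA::ORB_ptr client_orb,
                      volatile bool &running)
{
  const ACE_Time_Value slice (0, 50 * 1000);   // 50 ms per ORB per iteration

  while (running)
    {
      ACE_Time_Value tv = slice;               // reset each pass in case the ORB adjusts it
      server_orb->perform_work (tv);

      tv = slice;
      client_orb->perform_work (tv);
    }
}
```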
The "two-ORB" approach to avoiding nested upcalls is also intended for use when requests are sent synchronously. When you use AMI, the "client ORB" has to become a "server ORB" (thus, the calls to perform_work()), so it defeats the purpose of separating server and client behavior between two ORBs. > This behavior seems to me to be a bug in TAO, if for no other reason > than because my timeout parameter to ORB::perform_work gets ignored completely > in this case. For what it's worth, this stack overflow doesn't seem to happen > with 1.4.8, so in some sense this is a regression (although it isn't tested). > What do you think, bug or no bug? (By the way, I'm not sure in what version it > changed; we just happen to be using 1.4.8 currently). The differences from 1.4.8 to 1.7.1 are numerous, to say the least. It would take a lot of digging to try to figure out just what change introduced this new problem you are encountering. > * Can I somehow workaround this by having both ORBs perform work in 1 thread > (and allow collocation)? It seems like there might not be a good way to make > sure I don't starve either ORB or use excess CPU by constantly checking both. > But maybe somebody's figured out how to do that well. I was just thinking that, since you are using AMI, you should be able to eliminate the 2nd ORB. As you say you are already handing off work from the "server" ORB's event loop to a worker thread, you already have asynchronous handling of incoming requests. If your worker threads would just make their AMI invocations through the same "server" ORB, which already has a perform_work() or run() event loop, then the AMI reply handler callbacks would be dispatched in whatever thread(s) were running the ORB's event loop. These would be synchronous callbacks, but perhaps you could employ some sort of asynchronous handling of them, by always queuing the results and replying immediately from the reply handler, then processing the results and updating your state model via another thread. > ...side note: If there was a way to turn off nested upcalls for AMI (but use 1 > ORB), I wouldn't be causing such trouble :-). Maybe the client strategy factory needs a new -ORBAMICallbackWaitStrategy option! (Just what TAO needs... more options.) (In reply to comment #3) > (In reply to comment #2) > Congratulations on your new baby! Thanks! > Using AMH in conjunction with AMI, I think you could eliminate the "client-only > ORB". All incoming invocations would be handled asynchronously (the AMH side > of things) and all outgoing invocations would be sent asynchronously (the AMI > side of things). Even the reply handler callbacks would be subject to the > AMH model. So, there should be no possibility of nested upcalls. Hmm, I think I get what you're saying, but I'm not positive. I'll have to think about it, but I'm a bit skeptical that we wouldn't still see nested upcalls. We get nested upcalls in rare occasions for AMI requests, and it seems likely we'd get them in similar rare circumstances when sending an AMH response. In either case, the nested upcall would be invoked on a thread other than the server thread. I guess the saving grace is that the code for receiving the messages and dispatching work to the right worker threads could be simple, rock-solid, and thread safe. And the nested upcalls wouldn't have the opportunity to blow up the stack, since no invocation would ever lead directly to another invocation in the same threading context. I guess that's probably what you're saying. 
---

David Michael:

(In reply to comment #3)

> (In reply to comment #2)
> Congratulations on your new baby!

Thanks!

> Using AMH in conjunction with AMI, I think you could eliminate the "client-only ORB". All incoming invocations would be handled asynchronously (the AMH side of things) and all outgoing invocations would be sent asynchronously (the AMI side of things). Even the reply handler callbacks would be subject to the AMH model. So, there should be no possibility of nested upcalls.

Hmm, I think I get what you're saying, but I'm not positive. I'll have to think about it, but I'm a bit skeptical that we wouldn't still see nested upcalls. We get nested upcalls on rare occasions for AMI requests, and it seems likely we'd get them in similar rare circumstances when sending an AMH response. In either case, the nested upcall would be invoked on a thread other than the server thread. I guess the saving grace is that the code for receiving the messages and dispatching work to the right worker threads could be simple, rock-solid, and thread safe. And the nested upcalls wouldn't have the opportunity to blow up the stack, since no invocation would ever lead directly to another invocation in the same threading context. I guess that's probably what you're saying.

That might be a possibility for me; it would just be a significant redesign. I'll think about that.

> > * I've been trying like crazy to avoid nested upcalls because they cause so much trouble for a multi-threaded app. But I can't turn them off on my "client" ORB because I'm using AMI.
>
> Right, because options like -ORBWaitStrategy rw are incompatible with AMI. Those options are intended to help avoid nested upcalls when requests are sent synchronously.

Ah, but what to do to avoid nested upcalls when requests are sent asynchronously... that's the mystery I've been trying to solve :-)

> The "two-ORB" approach to avoiding nested upcalls is also intended for use when requests are sent synchronously. When you use AMI, the "client ORB" has to become a "server ORB" (thus, the calls to perform_work()), so it defeats the purpose of separating server and client behavior between two ORBs.

I've slowly been coming to that realization, that AMI and the two-ORB solution don't really mix. I've been trying to get around it by making sure that all of the "server-like" work of the AMI calls still happens in a combination of a "client ORB perform_work thread" and the server thread. I guess even that might not be enough, since the AMI request could still make a nested upcall to handle an AMI reply... yuck.

> I was just thinking that, since you are using AMI, you should be able to eliminate the 2nd ORB. As you say, you are already handing off work from the "server" ORB's event loop to a worker thread, so you already have asynchronous handling of incoming requests. If your worker threads would just make their AMI invocations through the same "server" ORB, which already has a perform_work() or run() event loop, then the AMI reply handler callbacks would be dispatched in whatever thread(s) were running the ORB's event loop. These would be synchronous callbacks, but perhaps you could employ some sort of asynchronous handling of them, by always queuing the results and replying immediately from the reply handler, then processing the results and updating your state model via another thread.

If I cut down to one ORB, I have almost exactly the design you're talking about already. The callbacks go through the same sort of quick hand-off to another thread. The problem with it is that the AMI requests can also have nested upcalls... which is why I've gone down this particular rabbit hole. If a worker thread gets a nested upcall during its AMI invocation, that will run code on the worker thread that I only expect to run in the server thread. That is, unless I'm assuming too much. I know this can happen in the single-threaded version of my software. Maybe an AMI invocation won't do nested upcalls if there's another thread that's calling perform_work on the ORB?

> > ...side note: If there was a way to turn off nested upcalls for AMI (but use one ORB), I wouldn't be causing such trouble :-).
>
> Maybe the client strategy factory needs a new -ORBAMICallbackWaitStrategy option! (Just what TAO needs... more options.)

It certainly is a challenge to figure out the right way to configure things... but such an option would've saved me a LOT of time & trouble!
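For completeness, the AMH side of the AMH/AMI combination discussed in this thread could look roughly like the sketch below. The Job interface is the same hypothetical one used in the earlier sketches, and the skeleton and response-handler names follow TAO's AMH mapping (POA_AMH_<Interface>, AMH_<Interface>ResponseHandler), which may vary by TAO version. The only point is that the servant returns from the upcall immediately and a worker thread sends the reply later:

```cpp
#include "JobS.h"      // hypothetical generated skeletons, with AMH code generation enabled

// AMH servant: the upcall never blocks.  It stashes the response handler
// and returns, so the thread running the ORB's event loop stays free to
// dispatch the next request; a worker thread sends the deferred reply.
class Job_AMH_Servant : public virtual POA_AMH_Job
{
public:
  Job_AMH_Servant () : pending_id_ (0) {}

  virtual void submit (AMH_JobResponseHandler_ptr _tao_rh,
                       CORBA::Long id)
  {
    // A real servant would queue (handler, id) pairs per request; this
    // sketch just remembers the most recent one.
    pending_rh_ = AMH_JobResponseHandler::_duplicate (_tao_rh);
    pending_id_ = id;
    // ... hand the pending work to a worker thread here ...
  }

  // Called later, from a worker thread, once the result is known.
  void complete (CORBA::Long result)
  {
    pending_rh_->submit (result);          // sends the deferred reply
  }

private:
  AMH_JobResponseHandler_var pending_rh_;
  CORBA::Long pending_id_;
};
```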