At the end of TAO_Transport::send_asynchronous_message_i(), when it attempts to flush the queue, it first calls schedule_output(), which marks the corresponding handler in the reactor for output, and then it releases handler_lock_ and calls flush_transport(). handler_lock_ is acquired again later when the queue is drained from handle_output(). Thus we have a situation where the reactor already knows about pending work while handler_lock_ is unlocked, which makes it possible for any other thread running the reactor loop to drain the queue in our transport. With the leader/follower flushing strategy, which checks queue_is_empty() before running the reactor loop, it can happen that queue_is_empty() returns false, but before the reactor enters the event loop in orb->run() the transport's queue is drained from the other thread; when max_wait_time is equal to 0, the loop then never ends. This is what happens in Big_Request_Muxing on Solaris.
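To make the interleaving concrete, here is a small self-contained model of the race (illustrative names only; this is not TAO code). One thread performs flush_transport()'s check-then-block pattern while another thread drains the queue in between, so the blocking wait starts after the wakeup it needed has already fired:

  // race_model.cpp -- toy model of the check-then-block race
  #include <chrono>
  #include <condition_variable>
  #include <deque>
  #include <iostream>
  #include <mutex>
  #include <thread>

  std::mutex handler_lock;            // models handler_lock_
  std::condition_variable wakeup;     // models the reactor waking the leader
  std::deque<int> queue {42};         // one queued message

  int main ()
  {
    // 1. flush_transport() checks queue_is_empty(): the queue is NOT empty.
    bool saw_pending = !queue.empty ();

    // 2. Before we enter the event loop, another thread running the reactor
    //    drains the queue (this models handle_output() in that thread).
    std::thread drainer ([] {
      std::lock_guard<std::mutex> g (handler_lock);
      queue.clear ();
      wakeup.notify_all ();           // the wakeup fires before anyone waits
    });
    drainer.join ();

    // 3. Now we "run the reactor", waiting for output work that is already
    //    gone.  With max_wait_time == 0 the real wait is unbounded; the demo
    //    bounds it at one second so it terminates.
    if (saw_pending)
    {
      std::unique_lock<std::mutex> g (handler_lock);
      if (!wakeup.wait_for (g, std::chrono::seconds (1),
                            [] { return !queue.empty (); }))
        std::cout << "missed the wakeup: the real loop never ends" << std::endl;
    }
    return 0;
  }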
Created attachment 1163 [details] This patch fixes Big_Request_Muxing, but I'm not sure it's the best thing we can do. What I am sure of is that it doesn't make sense to check queue_is_empty() in flush_transport(). Also, running the reactor forever (when max_wait_time == 0) was probably not intended in that method.
Isn't the intention of the original code to flush all messages in the queue, and to do that for at most max_wait_time? Now you run the orb for just a fixed amount of time, which also doesn't seem the correct thing to do.
(In reply to comment #2)
> Isn't the intention of the original code to flush all messages in the queue,
> and to do that for at most max_wait_time? Now you run the orb for just a
> fixed amount of time, which also doesn't seem the correct thing to do.

If max_wait_time is not 0 then the code is the same as it was before. The patch only fixes the situation where orb->run() can run forever. Basically, flush_transport() is only called from Transport.cpp, and in all cases that call is preceded by schedule_output(), which schedules work in the reactor. Since we only care about that pending work, we run the loop as long as work_pending() is true. Even if some other thread dispatches the work before we enter the reactor loop, we will block for at most 100 usec.
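For reference, a sketch of what the patched flush_transport() looks like under this approach. It assumes the TAO_ORB_Core::run(ACE_Time_Value *, int) and ACE_Reactor::work_pending() interfaces and is not the literal attachment:

  #include "tao/Transport.h"
  #include "tao/ORB_Core.h"
  #include "ace/Reactor.h"

  int
  flush_transport_sketch (TAO_Transport *transport,
                          ACE_Time_Value *max_wait_time)
  {
    TAO_ORB_Core *orb_core = transport->orb_core ();

    if (max_wait_time != 0)
      {
        // Unchanged behavior: a bounded flush.
        return orb_core->run (max_wait_time, 1);
      }

    // Only wait for the work that schedule_output() just registered: run
    // the reactor loop while it reports pending work, with a short upper
    // bound per pass so a drain from another thread cannot strand us.
    while (orb_core->reactor ()->work_pending ())
      {
        ACE_Time_Value tv (0, 100);   // at most ~100 usec per iteration
        if (orb_core->run (&tv, 1) == -1)
          return -1;
      }
    return 0;
  }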
The documentation of this method says:

/// Wait until the transport has no messages queued.

With your change we just run the orb for a short time; there is no check anymore whether the messages are all sent. This change just doesn't look OK.
(In reply to comment #4)
> The documentation of this method says:
>
> /// Wait until the transport has no messages queued.
>
> With your change we just run the orb for a short time; there is no check
> anymore whether the messages are all sent.

There is a while loop which will run the orb as long as there are messages scheduled for output. It looks like you don't dispute that the problem exists. Then how do you think we can fix it in a better way? I was asking for a better solution when submitting the patch, so comments are still welcome.
(In reply to comment #5)
> (In reply to comment #4)
> > The documentation of this method says:
> >
> > /// Wait until the transport has no messages queued.
> >
> > With your change we just run the orb for a short time; there is no check
> > anymore whether the messages are all sent.
>
> There is a while loop which will run the orb as long as there are messages
> scheduled for output.

You loop while work_pending() is true, and that is not the same as messages being scheduled for output on this transport (as far as I know).

> It looks like you don't dispute that the problem exists. Then how do you
> think we can fix it in a better way? I was asking for a better solution when
> submitting the patch, so comments are still welcome.

I have not really looked at the problem, just at the patch, and that doesn't seem right. What about first putting the work into the queue and then scheduling output?
(In reply to comment #6)
> What about first putting the work into the queue and then scheduling output?

That's what is done in Transport.cpp. It first puts a message into the internal queue, then it calls TAO_Leader_Follower_Flushing_Strategy::schedule_output() (which schedules work in the reactor), and then it calls flush_transport() (which runs the reactor). The latter triggers a call to TAO_Transport::handle_output(), which drains that internal queue as far as it can. Also, as far as I understand, TAO_Transport::handle_output() returns 0 or -1 depending on whether there is more data in its internal queue or not. Depending on the returned value, the reactor decides whether more output is required. So, in this way we maintain a queue in the reactor similar to the one we have in the transport. In other words, as long as we return 0 the reactor will know that more work is pending.
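The return-value contract described here can be modeled in isolation (illustrative names; this is not ACE/TAO code): the reactor keeps re-dispatching output while the handler reports that data remains queued.

  // handle_output_model.cpp -- toy model of the drain/re-dispatch contract
  #include <deque>
  #include <iostream>

  struct Transport
  {
    std::deque<int> queue {1, 2, 3};

    // Drains as far as it can (one item per call here).  Per the
    // description above, 0 means "more data queued, keep dispatching me"
    // and -1 means "queue drained, no more output needed".
    int handle_output ()
    {
      if (!queue.empty ())
        queue.pop_front ();
      return queue.empty () ? -1 : 0;
    }
  };

  int main ()
  {
    Transport t;
    // The "reactor": it mirrors the transport's queue simply by looking
    // at the handler's return value.
    while (t.handle_output () == 0)
      std::cout << "more output pending" << std::endl;
    std::cout << "queue drained" << std::endl;
    return 0;
  }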
I don't have time to dive into the details, but queue_is_empty() seems like something that has to be checked.
You made some changes in recent weeks that also should have solved these bugs. Are they then real fixes, or just treating symptoms? Are they really OK, or should we maybe reconsider them?
(In reply to comment #9)
> You made some changes in recent weeks that also should have solved these
> bugs. Are they then real fixes, or just treating symptoms? Are they really
> OK, or should we maybe reconsider them?

I have not yet committed my changes for bug#3682. While testing that fix I discovered that Big_Request_Muxing was failing, but later I realized that it fails from time to time even without my change. So, these bugs are related in that they can both be reproduced with oneways on Solaris, but they don't depend on each other.
In 85640. Mon Jun 15 10:19:16 UTC 2009 Vladimir Zykov <vz@prismtech.com> .... * tao/Leader_Follower_Flushing_Strategy.cpp: This fixes Bug#3697. The comment in the code explains why this fix is better than the code used before. This should fix Big_Request_Muxing on Solaris.
Reopening. I just don't get why you no longer check the queue. What if data arrives on other threads for different transports? Then you keep looping here.
To what extent should we use perform_work() instead of run() with a hardcoded default? And what if the orb can send out the data directly?
Johnny, I agree with your comments. Here is an update to the code. In 85641. Mon Jun 15 12:45:55 UTC 2009 Vladimir Zykov <vz@prismtech.com> * tao/Leader_Follower_Flushing_Strategy.cpp: Improved the fix for Bug#3697. The while loop now depends on the transport's queue and the work in the reactor.
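The loop in this revision plausibly has the following shape (a sketch under the same interface assumptions as the earlier one, not the committed diff):

  // Keep flushing only while the reactor reports pending work AND this
  // transport still has queued data.
  while (orb_core->reactor ()->work_pending ()
         && !transport->queue_is_empty ())
    {
      ACE_Time_Value tv (0, 100);
      if (orb_core->run (&tv, 1) == -1)
        return -1;
    }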
I would switch the condition: first check whether the queue is empty, then check work_pending().
Mon Jun 15 13:54:01 UTC 2009 Vladimir Zykov <vz@prismtech.com> * tao/Leader_Follower_Flushing_Strategy.cpp: Improved the fix for Bug#3697 even more.
My remarks should have been addressed before a commit. I do now see that when max_wait_time != 0 the loop is gone. Is that intended? Your remarks are always about max_wait_time == 0.
(In reply to comment #17)
> I do now see that when max_wait_time != 0 the loop is gone. Is that
> intended? Your remarks are always about max_wait_time == 0.

Yes, it was intended. If max_wait_time != 0 then the orb will run for max_wait_time seconds, and max_wait_time is reduced by the time spent in orb->run(). There is no point in trying to do orb->run() again in that case.
Reverted the change in 85652 as there are problems. Tue Jun 16 07:06:14 UTC 2009 Vladimir Zykov <vz@prismtech.com>
Created attachment 1184 [details] An updated version of the previous patch.
In rev 86599. Thu Sep 3 09:01:53 UTC 2009 Vladimir Zykov <vz@prismtech.com> .... * tao/Leader_Follower_Flushing_Strategy.cpp: Changed the code to poll the reactor instead of running it indefinitely. This fixes bug#3697.
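A minimal sketch of the polling approach (again assuming the interfaces above; the committed code may differ in details): each pass asks the reactor for work without letting it block indefinitely, so the loop can re-check the transport's queue between passes.

  while (!transport->queue_is_empty ())
    {
      ACE_Time_Value poll = ACE_Time_Value::zero;  // non-blocking pass
      if (orb_core->run (&poll, 1) == -1)
        return -1;
    }

A real implementation would also honor max_wait_time and avoid a pure busy-wait, e.g. by combining this with work_pending() and a small timeout.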
Created attachment 1193 [details] Additional patch for bug 1551. This additional patch should fix bug 1551 on Solaris. What I observed on Solaris is that if the server dies while a client is still trying to send requests (the scenario of 1551), the client calls TAO_Transport::schedule_output_i() and then tries to run the reactor in TAO_Leader_Follower_Flushing_Strategy::flush_transport() to dispatch the messages queued in the transport. However, the reactor's loop doesn't call TAO_Transport::handle_output(), and even when max_wait_time has a specific value it is not decreased by the time spent in the reactor. So neither the transport's queue is flushed nor a timeout occurs, and thus flush_transport() never exits. As for why handle_output() is not called: it seems that Solaris detects that the connection to the server is no longer valid and never marks the corresponding handle as ready for writing in select().
In rev 86658. Tue Sep 8 16:16:17 UTC 2009 Vladimir Zykov <vz@prismtech.com> * tao/Leader_Follower_Flushing_Strategy.cpp: Additional fix for bug#3697. It should fix Bug_1551_Regression, which was broken by the original fix.
The previous change broke Oneway_Send_Timeouts with the LF flushing strategy on Linux. I'm looking into it.
In revision 86672. Wed Sep 9 12:38:15 UTC 2009 Vladimir Zykov <vz@prismtech.com> * tao/ORB_Core.cpp: * tao/Leader_Follower_Flushing_Strategy.cpp: * tao/Messaging/Messaging_Policy_i.cpp: * tao/ORB_Core.h: Reverted fixes for bug#3682 and bug#3697 as there are problems with them and the new x.7.3 release is very close.
I found why handle_output() was not called. This happened because the reactor was not running at the time we expected it to work. :) Instead of orb_core->run(), which only checks has_shutdown() and returns, we had to use orb_core->orb()->perform_work(), which throws BAD_INV_ORDER in case the orb was already shut down (maybe in another thread). The same probably applies to the other places where orb_core->run() is called in a loop whose condition is not directly related to the orb_core. There are 3 more such places in TAO, namely TAO_Reactive_Flushing_Strategy::flush_message(), TAO_Reactive_Flushing_Strategy::flush_transport(), and TAO_Reactive_Connect_Strategy::wait_i(). An updated patch is coming.
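A sketch of the resulting loop shape (illustrative, assuming TAO's CORBA::ORB::perform_work(ACE_Time_Value &) extension; the real patch differs in details): perform_work() actually drives the reactor, and the BAD_INV_ORDER it raises after shutdown gives the loop a guaranteed exit that a loop around orb_core->run() did not have.

  try
    {
      while (!transport->queue_is_empty ())
        {
          ACE_Time_Value tv (0, 100);
          orb_core->orb ()->perform_work (tv);
        }
    }
  catch (const ::CORBA::BAD_INV_ORDER &)
    {
      // The ORB was already shut down (possibly from another thread);
      // stop flushing instead of spinning.
      return -1;
    }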
In revision 88011. Wed Dec 9 09:40:10 UTC 2009 Vladimir Zykov <vladimir.zykov@prismtech.com> * tests/Bug_1361_Regression/Echo_Caller.cpp: * tests/Bug_1361_Regression/Echo_Caller.h: * tests/Bug_1361_Regression/server.cpp: * tests/Bug_1361_Regression/Server_Thread_Pool.cpp: * tests/Bug_1361_Regression/Server_Thread_Pool.h: * tests/Bug_1361_Regression/run_test.pl: Changed the test so that it doesn't shut down the orb until all threads are done with the remote calls. Substantially extended the time for server shutdown, since the threads in the server's pool don't handle the shutdown message until they have sent all (50) remote messages. * tao/ORB_Core.cpp: * tao/Messaging/Messaging_Policy_i.cpp: * tao/ORB_Core.h: This fixes Bug#3682. SYNC_WITH_TRANSPORT is now really the default synch scope policy in TAO. * tao/Leader_Follower_Flushing_Strategy.cpp: Changed the code to poll the reactor instead of running it indefinitely. This fixes bug#3697.
There are still problems with bug 1361 related to this fix. I'm trying to investigate them.
Now 1361 works on Solaris. So, I think this is fixed.