Bug 3682 - Problem with oneways on Solaris
Summary: Problem with oneways on Solaris
Status: RESOLVED FIXED
Alias: None
Product: TAO
Classification: Unclassified
Component: ORB
Version: 1.6.8
Hardware: All Solaris
Importance: P3 normal
Assignee: Vladimir Zykov
URL:
Depends on: 3683 3704
Blocks: 3773
Reported: 2009-05-27 08:36 CDT by Vladimir Zykov
Modified: 2011-03-03 06:25 CST

Attachments
A proposed fix (1.45 KB, patch)
2009-05-27 09:11 CDT, Vladimir Zykov
A new fix against recent code from SVN (1.96 KB, patch)
2009-05-28 09:15 CDT, Vladimir Zykov

Description Vladimir Zykov 2009-05-27 08:36:59 CDT
Oneways in TAO are implemented as asynchronous messages. The code responsible for sending them is in TAO_Transport::send_asynchronous_message_i(). TAO tries to send messages for as long as it can; when it cannot, it puts new messages in a queue inside TAO_Transport. Later, depending on the buffering constraints and the flushing strategy, it either sends the queued messages immediately (blocking strategy) or schedules output in the ORB's reactor (leader_follower and reactive strategies). The latter approach is problematic for Solaris builds. I observed the following problem on that system: if too many oneways are sent very quickly, then at some point write() to the socket returns with errno=EAGAIN and the message is put on the queue as described above. However, the constraints are such that the flushing code is never called.
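
The failure mode described above can be sketched with a small standalone model. This is only an illustration under stated assumptions, not TAO code: ToySocket, ToyTransport, schedule_on_queue and stuck_after_burst are hypothetical names, and a full toy socket stands in for write() returning EAGAIN.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical stand-in for a non-blocking socket. The kernel buffer holds
// `capacity` messages; a full buffer models write() failing with EAGAIN.
struct ToySocket {
  std::size_t capacity;
  std::size_t in_flight;
  explicit ToySocket(std::size_t cap) : capacity(cap), in_flight(0) {}
  bool try_write(const std::string&) {
    assert(in_flight <= capacity);
    if (in_flight == capacity) return false;  // would block
    ++in_flight;
    return true;
  }
  void drain() { in_flight = 0; }  // the peer has read everything
};

// Hypothetical model of the send path; this is not TAO_Transport.
struct ToyTransport {
  ToySocket sock;
  std::deque<std::string> queue;
  bool flush_scheduled;
  bool schedule_on_queue;  // does queuing a message schedule a flush?

  ToyTransport(std::size_t cap, bool schedule)
      : sock(cap), flush_scheduled(false), schedule_on_queue(schedule) {}

  void send_oneway(const std::string& msg) {
    // Keep ordering: once anything is queued, later messages queue too.
    if (queue.empty() && sock.try_write(msg)) return;
    queue.push_back(msg);
    if (schedule_on_queue) flush_scheduled = true;
  }

  // What a reactor callback would do once the socket is writable again.
  void handle_output() {
    if (!flush_scheduled) return;  // nobody registered for output
    while (!queue.empty() && sock.try_write(queue.front())) queue.pop_front();
    flush_scheduled = !queue.empty();
  }
};

// Sends `count` oneways through a socket of capacity 4, then gives the
// reactor several chances to flush. Returns how many messages stay queued.
std::size_t stuck_after_burst(bool schedule_on_queue, std::size_t count) {
  ToyTransport t(4, schedule_on_queue);
  for (std::size_t i = 0; i < count; ++i) t.send_oneway("oneway");
  for (int pass = 0; pass < 8; ++pass) {
    t.sock.drain();
    t.handle_output();
  }
  return t.queue.size();
}
```

In this model a burst of 10 messages leaves 6 of them queued forever when queuing does not schedule a flush, even though the socket becomes writable again. With scheduling enabled the queue drains after two reactor passes.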

At the same time, TAO assumes sync scope = Messaging::SYNC_WITH_TRANSPORT by default for oneway calls, but this default is not applied consistently throughout the code. It looks like TAO needs a default hook set in TAO_ORB_Core::sync_scope_hook_ that sets the sync scope appropriately (i.e. to Messaging::SYNC_WITH_TRANSPORT).
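
A rough sketch of the default-hook idea follows. It assumes a simplified hook signature and uses hypothetical names (ToyOrbCore, SyncScopeHook, default_sync_scope_hook, resolves_to_with_transport); the real TAO_ORB_Core interface differs and is not reproduced here.

```cpp
#include <cassert>

// Mirrors the four Messaging sync scopes by name only.
enum class SyncScope { None, WithTransport, WithServer, WithTarget };

// Simplified, assumed hook signature: the hook may set the scope.
using SyncScopeHook = void (*)(bool& has_scope, SyncScope& scope);

// Hypothetical stand-in for the ORB core's hook slot.
struct ToyOrbCore {
  SyncScopeHook sync_scope_hook = nullptr;
  void call_sync_scope_hook(bool& has_scope, SyncScope& scope) const {
    // Without a hook the caller's values are left untouched.
    if (sync_scope_hook != nullptr) sync_scope_hook(has_scope, scope);
  }
};

// A default hook: if no policy chose a scope, pick SYNC_WITH_TRANSPORT.
void default_sync_scope_hook(bool& has_scope, SyncScope& scope) {
  if (!has_scope) {
    has_scope = true;
    scope = SyncScope::WithTransport;
  }
}

// Asks the core for the scope of a oneway call that has no explicit policy.
bool resolves_to_with_transport(const ToyOrbCore& core) {
  bool has_scope = false;
  SyncScope scope = SyncScope::None;
  core.call_sync_scope_hook(has_scope, scope);
  assert(has_scope || scope == SyncScope::None);  // unset scope is untouched
  return has_scope && scope == SyncScope::WithTransport;
}
```

With no hook installed, each caller has to remember the intended default on its own, which is where inconsistencies can creep in. Installing the default hook once makes every caller see the same scope.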

The reproducer for this issue is a Single_Read test.
Comment 1 Johnny Willemsen 2009-05-27 08:44:49 CDT
check also the changes of Carlos of yesterday
Comment 2 Vladimir Zykov 2009-05-27 09:11:16 CDT
Created attachment 1156
A proposed fix

It fixes Single_Read and AMH_Oneway tests on Solaris.
Comment 3 Vladimir Zykov 2009-05-27 09:12:17 CDT
(In reply to comment #1)
> check also the changes of Carlos of yesterday
> 

I'll check my fix with the latest code from SVN.
Comment 4 Johnny Willemsen 2009-05-27 09:13:15 CDT
make sure argument names in header/cpp are listed the same, else doxygen gets confused.
Comment 5 Vladimir Zykov 2009-05-28 09:15:40 CDT
Created attachment 1157
A new fix against recent code from SVN

The first one was not complete. However, with this new fix Big_AMI starts failing (again on Solaris). This happens because the client in that test assumes that no replies can come from the server until the synchronous message is sent. That is not true with sync scope SYNC_WITH_TRANSPORT: we run the reactor to flush queued messages and, as a side effect, receive replies from the server. So, I think Big_AMI has to be changed.
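
The side effect can be illustrated with a toy model. It assumes that a reply is already waiting in the reactor when the next request is sent; ToyReactor, ToyClient and replies_before_explicit_run are hypothetical names, not part of TAO or of the Big_AMI test.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical reactor: a queue of ready callbacks.
struct ToyReactor {
  std::queue<std::function<void()>> events;
  void handle_events() {
    while (!events.empty()) {
      std::function<void()> e = std::move(events.front());
      events.pop();
      e();
    }
    assert(events.empty());
  }
};

// Hypothetical AMI client that counts reply callbacks.
struct ToyClient {
  ToyReactor reactor;
  int replies_seen = 0;

  // Sends one request. The server's reply is modelled as an event that is
  // ready in the reactor as soon as the request has been handed over.
  void send(bool flush_through_reactor) {
    reactor.events.push([this] { ++replies_seen; });
    // Flushing via the reactor also dispatches whatever else is ready.
    if (flush_through_reactor) reactor.handle_events();
  }
};

// Number of replies delivered before the client runs the ORB explicitly.
int replies_before_explicit_run(bool flush_through_reactor, int requests) {
  ToyClient client;
  for (int i = 0; i < requests; ++i) client.send(flush_through_reactor);
  return client.replies_seen;
}
```

A test that asserts "zero replies until I run the ORB" holds only in the first mode. Once flushing goes through the reactor, the test has to count replies wherever they arrive.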
Comment 6 Johnny Willemsen 2009-05-29 07:18:50 CDT
Vladimir, can you also have a look at 3683? This seems related, sending a lot of data using rw.
Comment 7 Vladimir Zykov 2009-05-29 07:51:31 CDT
(In reply to comment #6)
> Vladimir, can you also have a look at 3683? This seems related, sending a lot
> of data using rw.
> 

Johnny, how should it fail? In my local OpenSolaris environment (it was the fastest to check) 3683 works. I checked it with my patch applied.
Comment 8 Johnny Willemsen 2009-05-29 07:57:00 CDT
I have to commit an updated run-test.pl for you to run it. It failed on one host here: the client just hangs forever. I am not sure if you are also testing with the RW strategy, but it seems the client strategy and the flushing strategy have some dependency.
Comment 9 Vladimir Zykov 2009-06-15 06:13:05 CDT
In 85640.

Mon Jun 15 10:19:16 UTC 2009  Vladimir Zykov  <vz@prismtech.com>

....
        * tests/Big_AMI/client.cpp:
        * tests/Portable_Interceptors/AMI/client.cpp:
        * tests/Bug_1270_Regression/client.cpp:
        * tests/Bug_1270_Regression/Echo.cpp:
        * tests/Bug_1270_Regression/server.cpp:

          Fixed tests after the change for Bug#3682. In these tests it
          was assumed that nothing could be received from server until
          we run orb explicitly. The latter is not true with synch scope
          policy SYNC_WITH_TRANSPORT.

....
        * tao/ORB_Core.cpp:
        * tao/ORB_Core.h:
        * tao/Messaging/Messaging_Policy_i.cpp:

          This fixes Bug#3682. SYNC_WITH_TRANSPORT is now really
          default synch scope policy in TAO. This must fix Single_Read
          and AMH_Oneway tests on Solaris.
Comment 10 Vladimir Zykov 2009-06-16 02:11:47 CDT
Reverted the change in 85652 as there are problems.

Tue Jun 16 07:06:14 UTC 2009  Vladimir Zykov  <vz@prismtech.com>
Comment 11 Vladimir Zykov 2009-09-03 04:21:55 CDT
In rev 86599.

Thu Sep  3 09:01:53 UTC 2009  Vladimir Zykov  <vz@prismtech.com>

....
        * tests/Big_AMI/client.cpp:
        * tests/Bug_1361_Regression/Echo_Caller.cpp:
        * tests/Bug_1361_Regression/server.cpp:
        * tests/Bug_1361_Regression/Server_Timer.cpp:
        * tests/Bug_1361_Regression/Server_Timer.h:
        * tests/Bug_1361_Regression/client.cpp:
        * tests/Bug_1361_Regression/Echo.cpp:
        * tests/Bug_1361_Regression/Server_Thread_Pool.cpp:
        * tests/Portable_Interceptors/AMI/client.cpp:
        * tests/Bug_1270_Regression/client.cpp:
        * tests/Bug_1270_Regression/Echo.cpp:
        * tests/Bug_1270_Regression/Echo_Caller.cpp:
        * tests/Bug_1270_Regression/server.cpp:
        * tests/Bug_1270_Regression/Server_Timer.cpp:
        * tests/Bug_1270_Regression/Server_Timer.h:
        * tests/Bug_1270_Regression/run_test.pl:

          Fixed tests after the change for Bug#3682 and Bug#3697. In
          some of these tests it was assumed that nothing could be
          received from server until we run orb explicitly. The latter
          is not true with synch scope policy SYNC_WITH_TRANSPORT. Cleaned
          up the code of the tests.

        * tao/ORB_Core.cpp:
        * tao/Messaging/Messaging_Policy_i.cpp:
        * tao/ORB_Core.h:

          This fixes Bug#3682. SYNC_WITH_TRANSPORT is now really
          default synch scope policy in TAO. This must fix Single_Read
          and AMH_Oneway tests on Solaris.
....
Comment 12 Vladimir Zykov 2009-09-09 07:43:20 CDT
In revision 86672.

Wed Sep  9 12:38:15 UTC 2009  Vladimir Zykov  <vz@prismtech.com>

        * tao/ORB_Core.cpp:
        * tao/Leader_Follower_Flushing_Strategy.cpp:
        * tao/Messaging/Messaging_Policy_i.cpp:
        * tao/ORB_Core.h:

          Reverted fixes for bug#3682 and bug#3697 as there are
          problems with them and the new x.7.3 release is very close.
Comment 13 Vladimir Zykov 2009-12-09 03:52:32 CST
In revision 88011.

Wed Dec  9 09:40:10 UTC 2009  Vladimir Zykov  <vladimir.zykov@prismtech.com>

        * tests/Bug_1361_Regression/Echo_Caller.cpp:
        * tests/Bug_1361_Regression/Echo_Caller.h:
        * tests/Bug_1361_Regression/server.cpp:
        * tests/Bug_1361_Regression/Server_Thread_Pool.cpp:
        * tests/Bug_1361_Regression/Server_Thread_Pool.h:
        * tests/Bug_1361_Regression/run_test.pl:
          Changed the test so that it doesn't shutdown the orb until
          all threads are done with the remote calls. Substantially
          extended the time for server shutdown since threads in server's
          pool don't handle shutdown message until they send all (50)
          remote messages.

        * tao/ORB_Core.cpp:
        * tao/Messaging/Messaging_Policy_i.cpp:
        * tao/ORB_Core.h:
          This fixes Bug#3682. SYNC_WITH_TRANSPORT is now really
          default synch scope policy in TAO.

        * tao/Leader_Follower_Flushing_Strategy.cpp:
          Changed the code to poll the reactor instead of running
          it indefinitely. This fixes bug#3697.
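
One possible reading of "poll instead of run" is sketched below. It is an assumption about the shape of the change, not the actual code in Leader_Follower_Flushing_Strategy.cpp; ToyPollReactor, flush_by_polling and max_idle_polls are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

enum class Event { SocketWritable, Unrelated };

// Hypothetical reactor with a non-blocking dispatch pass.
struct ToyPollReactor {
  std::deque<Event> ready;
  // Returns false immediately when nothing is ready; never waits.
  bool poll_once(Event& out) {
    if (ready.empty()) return false;
    out = ready.front();
    ready.pop_front();
    return true;
  }
};

// Flushes `queued` messages, one per SocketWritable event. The loop checks
// its own exit condition after every pass, so control returns to the caller
// even if the reactor goes quiet. Returns how many messages remain queued.
std::size_t flush_by_polling(ToyPollReactor& reactor, std::size_t queued,
                             std::size_t max_idle_polls) {
  std::size_t idle = 0;
  while (queued > 0 && idle < max_idle_polls) {
    Event e = Event::Unrelated;
    if (!reactor.poll_once(e)) {
      ++idle;  // nothing ready on this pass
      continue;
    }
    if (e == Event::SocketWritable) {
      assert(queued > 0);
      --queued;
    }
  }
  return queued;
}
```

A loop that instead blocks inside the reactor until some other party ends the event loop has no such exit, which would match the hang mentioned in comment 8.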
Comment 14 Vladimir Zykov 2011-03-03 06:25:45 CST
This bug is fixed. The last problems for which I had to reopen it are fixed now.