When the first message is sent with SYNC_NONE, it puts the message on a queue and tries to establish the connection. When a subsequent message is sent, it tries to "reuse" the existing transport. However, some refcount is apparently getting hosed, because it dumps core. You already have an example that should cause it... the Timed_Buffered_Oneways test. Give the client a '-e' on the command line to make it use eager buffering. This is reproducible in other programs as well.
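For reference, here is a minimal sketch of how a client would typically enable this SYNC_NONE / eager-buffering combination through the standard Messaging and TAO policies. This is only a sketch; the Timed_Buffered_Oneways client's actual setup, and the exact header names, may differ between TAO versions.

  #include "tao/Messaging/Messaging.h"   // Messaging::SYNC_SCOPE_POLICY_TYPE
  #include "tao/TAOC.h"                  // TAO::BufferingConstraint (header name may vary)

  // Returns a new reference with SYNC_NONE + buffering-constraint overrides.
  CORBA::Object_ptr
  enable_eager_buffering (CORBA::ORB_ptr orb, CORBA::Object_ptr obj)
  {
    CORBA::PolicyList policies (2);
    policies.length (2);

    // SyncScope = SYNC_NONE: the ORB may return before the request is
    // handed to the transport, so the message can simply be queued.
    Messaging::SyncScope sync_scope = Messaging::SYNC_NONE;
    CORBA::Any sync_any;
    sync_any <<= sync_scope;
    policies[0] =
      orb->create_policy (Messaging::SYNC_SCOPE_POLICY_TYPE, sync_any);

    // Buffering constraint; BUFFER_FLUSH here, as the test appears to use.
    TAO::BufferingConstraint bc;
    bc.mode          = TAO::BUFFER_FLUSH;
    bc.message_count = 0;
    bc.message_bytes = 0;
    bc.timeout       = 0;
    CORBA::Any bc_any;
    bc_any <<= bc;
    policies[1] =
      orb->create_policy (TAO::BUFFERING_CONSTRAINT_POLICY_TYPE, bc_any);

    // Apply the overrides to this one object reference.
    CORBA::Object_var result =
      obj->_set_policy_overrides (policies, CORBA::SET_OVERRIDE);

    policies[0]->destroy ();
    policies[1]->destroy ();

    return result._retn ();
  }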
Actually, it is not just on multiple connection attempts. In addition, it seems that the reactor loop is entered, which could cause "other" stuff to happen. Sending with SYNC_NONE / TAO_EAGER_BUFFERING should shove the message on a queue and not call the event loop at all. I believe the crash in the test program is due to something happening in the event loop. The event handler's reference count is decremented because IIOP_Connection_Handler::close() is being called. However, the reference count is now out of whack: the caller of wait_for_connection_completion() is holding a reference, and there is also a reference in the cache. The close() method calls remove_reference(), which means the reference count is now off.
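Purely as an illustration of the accounting described above (this is not TAO source, just a sketch assuming a simple intrusive reference count on the handler):

  // Illustrative only -- not TAO code.
  struct Handler
  {
    long refcount_;

    Handler () : refcount_ (1) {}                    // creator holds the first reference
    void add_reference ()    { ++this->refcount_; }
    void remove_reference () { if (--this->refcount_ == 0) delete this; }
  };

  // Expected owners while the connect is pending:
  //   * the transport cache               -> 1 reference
  //   * wait_for_connection_completion()  -> 1 reference
  //
  // If IIOP_Connection_Handler::close() also calls remove_reference()
  // without either owner having released its reference first, the count
  // drops below the number of live owners, and the handler can be deleted
  // while the waiter (or the cache) is still using it -- hence the crash.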
Some issues related to this are already fixed and will be part of the upcoming x.5.2; could you retest when it is available?
I can surely retest. Do I need to wait for the release, or is there a way I can grab the "fixed" code and retest? Do the fixes also address the issue of going into the reactor loop? I really do not think the reactor loop should be entered for EAGER_BUFFERING. In addition, I think DELAYED_BUFFERING should have similar behavior (i.e., not enter the event loop, use the existing transport, and not initiate multiple connections). In my understanding, the only difference between EAGER_BUFFERING and DELAYED_BUFFERING is that EAGER will always enqueue, while DELAYED will try the write before queueing (if the queue is empty and the connection exists).

BTW, unless there are other pieces of code that specifically check for SYNC_NONE/EAGER_BUFFERING, I believe a simple change to support DELAYED_BUFFERING in a similar fashion is to use something like

  Profile_Transport_Resolver resolver (
    effective_target.in (),
    stub,
    (details.response_flags () != CORBA::Octet (Messaging::SYNC_NONE))
      && (details.response_flags () != CORBA::Octet (TAO::SYNC_DELAYED_BUFFERING)));

which would give DELAYED_BUFFERING the same "blocked_connect" setting as SYNC_NONE (assuming that's still how it is done).
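To spell out the intended semantics, here is a conceptual sketch only; must_queue() and its parameters are made-up names standing in for whatever check the transport actually performs, and the comparisons mirror the response_flags checks quoted above.

  #include "tao/Messaging/Messaging.h"
  #include "tao/TAOC.h"   // TAO::SYNC_DELAYED_BUFFERING (header name may vary)

  bool
  must_queue (CORBA::Octet sync_scope,
              bool queue_is_empty,
              bool connection_established)
  {
    // EAGER_BUFFERING (SYNC_NONE): always enqueue; never block on the
    // connect and never enter the event loop from the sending thread.
    if (sync_scope == CORBA::Octet (Messaging::SYNC_NONE))
      return true;

    // DELAYED_BUFFERING: try an immediate write only when nothing is
    // already queued and a transport is available; otherwise enqueue,
    // just like the eager case.
    if (sync_scope == CORBA::Octet (TAO::SYNC_DELAYED_BUFFERING))
      return !(queue_is_empty && connection_established);

    // Any other sync scope: send synchronously, no queueing here.
    return false;
  }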
OK. I got the current CVS and reran the Timed_Buffered_Oneways test with "-e" on the client. The crash no longer happens, but the server never gets any messages. A quick strace shows that the client makes a successful TCP/IP connection to the server and "sends" all its messages, but the server never receives a single message (it never comes off its select call until the client goes away and the TCP/IP connection is closed). Maybe that test should be run multiple times, once with each possible command-line argument setting. I have not looked deeply into WHY the messages are not being delivered, but it looks like the client is setting a buffering constraint of BUFFER_FLUSH, so I'd expect to see at least all but the last message...
I looked at this a little more... The client starts the connection and then goes into the event loop to wait for it. In this event loop, a timeout is delivered to the handler and it is closed, setting the LF_Event state to CLOSED. The code opening the connection sees the connect error and calls reset_state() to set it back to WAITING. However, nothing else ever calls back to reset the state of the LF_Event object. In addition, open() is never called on the connection handler. I can see from strace that the connection is happening, on both the client and the server. So the connection is successful, but the code appears to have decided to ignore anything on that transport connection.

NOTE: If I set "-t 0" then the connection succeeds. I think there is some strange interplay when setting RELATIVE_RT_TIMEOUT_POLICY_TYPE. The ReliableOneways seem to be OK (though I've not done extensive testing). Even if I set this value to something very large, the timeout still happens on the first connect attempt and the connection handler interprets that as a connection failure.
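For context, the timeout in question is the relative roundtrip timeout policy. A hedged sketch of how it is typically applied follows; the test may instead set it through PolicyCurrent or at the ORB level, and TimeBase::TimeT is in 100-nanosecond units.

  #include "tao/Messaging/Messaging.h"   // Messaging::RELATIVE_RT_TIMEOUT_POLICY_TYPE

  // Returns a new reference with a relative roundtrip timeout override.
  // 'timeout' is in TimeBase::TimeT units (100 ns), e.g. 50000000 == 5 s.
  CORBA::Object_ptr
  apply_relative_rt_timeout (CORBA::ORB_ptr orb,
                             CORBA::Object_ptr obj,
                             TimeBase::TimeT timeout)
  {
    CORBA::Any timeout_any;
    timeout_any <<= timeout;

    CORBA::PolicyList policies (1);
    policies.length (1);
    policies[0] =
      orb->create_policy (Messaging::RELATIVE_RT_TIMEOUT_POLICY_TYPE,
                          timeout_any);

    CORBA::Object_var result =
      obj->_set_policy_overrides (policies, CORBA::SET_OVERRIDE);

    policies[0]->destroy ();
    return result._retn ();
  }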
accept on behalf of tao-support
Jody, isn't this one of the issues that ATD funded OCI to fix?
Do you have a regression test?