Bug 2574 - Multiple connection requests cause core dump with SYNC_NONE
Status: ASSIGNED
Product: TAO
Classification: Unclassified
Component: ORB
Version: 1.5.1
Hardware: Other Linux
Importance: P3 critical
Assignee: DOC Center Support List (internal)
Depends on: 2189
Reported: 2006-06-29 16:25 CDT by Jody Hagins
Modified: 2007-12-17 13:08 CST

Description Jody Hagins 2006-06-29 16:25:43 CDT
When the first message is sent with SYNC_NONE, it puts the message on a queue
and tries to establish the connection.  When a subsequent message is sent, it
tries to "reuse" the existing transport.  However, some reference count is
apparently getting corrupted, since the process dumps core.

You already have an example that should cause it... the Timed_Buffered_Oneways
test.  Give the client a '-e' on the command line to make it use eager buffering.

This is reproducible in other programs as well.
Comment 1 Jody Hagins 2006-06-29 23:01:30 CDT
Actually, it is not just on multiple connection attempts.

In addition, it seems that the reactor loop is entered, which could cause
"other" stuff to happen.  Sending a SYNC_NONE / TAO_EAGER_BUFFERING should shove
it on a queue and not call the event loop at all.  I believe the crash in the
test program is due to something happening in the event loop.

The event_handler reference count is decremented because
IIOP_Connection_Handler::close() is being called.  However, the caller of
wait_for_connection_completion() is holding a reference, and there is also a
reference in the cache.  The close() method calls remove_reference(), so the
reference count is now one lower than those two owners expect.

Comment 2 Johnny Willemsen 2006-06-30 01:53:11 CDT
Some related issues have already been fixed and will be part of the upcoming
x.5.2.  Could you retest when it is available?
Comment 3 Jody Hagins 2006-06-30 07:38:50 CDT
I can surely retest.  Do I need to wait for the release, or is there a way I can
grab the "fixed" code and retest?

Do the fixes also address the issue of going into the reactor loop?  I really do
not think the reactor loop should be entered for EAGER_BUFFERING.

In addition, I think DELAYED_BUFFERING should also have similar behavior (i.e.,
not enter the event loop -- use existing transport -- not initiate multiple
connections).

In my understanding, the only difference between EAGER_BUFFERING and
DELAYED_BUFFERING is that EAGER will always enqueue, and DELAYED will try the
write before queueing (if the queue is empty and the connection exists).

BTW, unless there are other pieces of code that specifically check for
SYNC_NONE/EAGER_BUFFERING, I believe a simple change to give
DELAYED_BUFFERING the same treatment is to use something like


    // Third argument: request a blocked connect only when the reply
    // semantics are neither SYNC_NONE nor SYNC_DELAYED_BUFFERING.
    Profile_Transport_Resolver resolver (
      effective_target.in (),
      stub,
      (details.response_flags () != CORBA::Octet (Messaging::SYNC_NONE))
          && (details.response_flags () !=
              CORBA::Octet (TAO::SYNC_DELAYED_BUFFERING)));


which would give DELAYED_BUFFERING the same "blocked_connect" treatment as
SYNC_NONE (assuming that's still how it is done).
Comment 4 Jody Hagins 2006-06-30 11:44:56 CDT
OK.  I got the current CVS, and reran the Timed_Buffered_Oneways test with "-e"
on the client.

The crash no longer happens, but the server never gets any messages.  A quick
strace shows that the client makes a successful TCP/IP connection to the server,
and the client "sends" all its messages, but the server never receives a single
message (it never comes off its select call until the client goes away and the
TCP/IP connection is closed).

Maybe that test should be run multiple times, with each possible command line
argument setting.

I have not looked deeply into WHY the messages are not being delivered, but it
looks like the client is setting a buffering constraint of BUFFER_FLUSH, so I'd
expect to at least see all but the last message...
Comment 5 Jody Hagins 2006-06-30 18:15:30 CDT
I looked at this a little more...

The client starts the connection, then goes into the event loop to wait for the
connection.  In this event loop, a timeout fires on the handler and it is
closed, setting the LF_Event state to CLOSED.  The code opening the connection
sees the connect error and calls reset_state() to set it back to WAITING.
However, nothing else ever calls back to reset the state of the LF_Event
object.  In addition, open() is never called on the connection handler.

I can see from strace that the connection is happening, both in the client and
the server.  So, the connection is successful, but the code appears to have
decided to ignore anything on that transport connection.

NOTE: If I set "-t 0" then the connection succeeds.  I think there is some
strange interplay when setting RELATIVE_RT_TIMEOUT_POLICY_TYPE.  The
ReliableOneways seem to be OK (though I've not done extensive testing).

Even if I set this value to something very large, the timeout still happens on
the first connect attempt, and the connection handler interprets that as a
connection failure.
Comment 6 Johnny Willemsen 2006-08-09 09:08:17 CDT
accept on behalf of tao-support
Comment 7 Johnny Willemsen 2006-11-24 09:03:58 CST
Jody, isn't this one of the issues that ATD funded OCI to fix?
Comment 8 Johnny Willemsen 2007-12-17 13:08:16 CST
Do you have a regression test?