Bug 2228 - Big_Request_Muxing test deadlocks
Summary: Big_Request_Muxing test deadlocks
Status: REOPENED
Alias: None
Product: TAO
Classification: Unclassified
Component: Test
Version: 1.4.6
Hardware: All
OS: All
Importance: P3 blocker
Assignee: DOC Center Support List (internal)
URL:
Depends on:
Blocks:
 
Reported: 2005-08-31 14:49 CDT by ciju john
Modified: 2007-07-25 13:13 CDT

See Also:


Attachments

Description ciju john 2005-08-31 14:49:01 CDT
This report describes a deadlock discovered in an alternate version of the
'ACE_wrappers/TAO/tests/Big_Request_Muxing' test. The tarred code for this was
sent out on the devo-group. Hopefully it will be incorporated into the DOC repo;
otherwise I will cut and paste the relevant changes into this ticket.

Broadly, in this modified test the client spawns 6 threads. Each thread pair
sets its Messaging::SYNC_SCOPE_POLICY_TYPE policy to Messaging::SYNC_WITH_TARGET,
Messaging::SYNC_WITH_TARGET, and Messaging::SYNC_NONE respectively. Each thread
then invokes more_data() 1000 times. The deadlock happens on one of the
Messaging::SYNC_NONE threads.
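
For reference, here is a minimal sketch (not the actual test code) of how a
client thread typically applies such a SyncScope policy override before invoking
more_data(); the helper name, the 'receiver' argument, and the TestC.h include
are assumptions:

#include "tao/Messaging/Messaging.h"
#include "TestC.h"  // stub header generated from the test IDL (assumed name)

// Return a new Payload_Receiver reference carrying the given SyncScope
// override; the caller stores the result in a _var and invokes
// more_data() through it.
Test::Payload_Receiver_ptr
apply_sync_scope (CORBA::ORB_ptr orb,
                  Test::Payload_Receiver_ptr receiver,
                  Messaging::SyncScope scope)
{
  CORBA::Any scope_as_any;
  scope_as_any <<= scope;                        // e.g. Messaging::SYNC_NONE

  CORBA::PolicyList policies (1);
  policies.length (1);
  policies[0] =
    orb->create_policy (Messaging::SYNC_SCOPE_POLICY_TYPE, scope_as_any);

  // _set_policy_overrides() yields a new reference with the override applied.
  CORBA::Object_var overridden =
    receiver->_set_policy_overrides (policies, CORBA::ADD_OVERRIDE);

  policies[0]->destroy ();

  return Test::Payload_Receiver::_narrow (overridden.in ());
}

Here is the stack trace of the hung SYNC_NONE thread: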

#0  0x4093da27 in select () from /lib/tls/libc.so.6
#1  0x40620099 in ACE_OS::select (width=10, rfds=0x807a8dc, wfds=0x0, efds=0x0,
timeout=0x0) at OS_NS_sys_select.inl:39
#2  0x40634ed0 in ACE_Select_Reactor_T<ACE_Select_Reactor_Token_T<ACE_Token>
>::wait_for_multiple_events (this=0x807a3d0,
    dispatch_set=@0x807a8d0, max_wait_time=0x0) at Select_Reactor_T.cpp:1142
#3  0x4064403f in ACE_TP_Reactor::get_event_for_dispatching (this=0x807a3d0,
max_wait_time=0x0) at TP_Reactor.cpp:556
#4  0x40643a85 in ACE_TP_Reactor::dispatch_i (this=0x807a3d0, max_wait_time=0x0,
guard=@0x429b2330) at TP_Reactor.cpp:250
#5  0x406438e6 in ACE_TP_Reactor::handle_events (this=0x807a3d0,
max_wait_time=0x0) at TP_Reactor.cpp:172
#6  0x4063f91d in ACE_Reactor::handle_events (this=0x807ad20, max_wait_time=0x0)
at Reactor.inl:166
#7  0x4043f860 in TAO_ORB_Core::run (this=0x806d580, tv=0x0, perform_work=1) at
ORB_Core.cpp:1878
#8  0x4044a133 in TAO_Leader_Follower_Flushing_Strategy::flush_transport
(this=0x807b5d8, transport=0x80816d8)
    at Leader_Follower_Flushing_Strategy.cpp:55
#9  0x40373711 in TAO_Transport::send_message_shared_i (this=0x80816d8,
stub=0x807d1c8, message_semantics=0, message_block=0x8087cc0,
    max_wait_time=0x0) at Transport.cpp:1187
#10 0x4038433f in TAO_IIOP_Transport::send_message_shared (this=0x80816d8,
stub=0x807d1c8, message_semantics=0,
    message_block=0x8087cc0, max_wait_time=0x0) at IIOP_Transport.cpp:187
#11 0x4038424e in TAO_IIOP_Transport::send_message (this=0x80816d8,
stream=@0x8087cc0, stub=0x807d1c8, message_semantics=0,
    max_wait_time=0x0) at IIOP_Transport.cpp:156
#12 0x403841bf in TAO_IIOP_Transport::send_request (this=0x80816d8,
stub=0x807d1c8, orb_core=0x806d580, stream=@0x8087cc0,
    message_semantics=0, max_wait_time=0x0) at IIOP_Transport.cpp:133
#13 0x40405b54 in TAO::Remote_Invocation::send_message (this=0x429b2670,
cdr=@0x8087cc0, message_semantics=0, max_wait_time=0x0)
    at Remote_Invocation.cpp:165
#14 0x40407b3a in TAO::Synch_Oneway_Invocation::remote_oneway (this=0x429b2670,
max_wait_time=0x0) at Synch_Invocation.cpp:715
#15 0x40404377 in TAO::Invocation_Adapter::invoke_oneway (this=0x429b2870,
details=@0x429b27e0, effective_target=@0x429b27a0,
    r=@0x429b2720, max_wait_time=@0x429b2798) at Invocation_Adapter.cpp:342
#16 0x4040410b in TAO::Invocation_Adapter::invoke_remote_i (this=0x429b2870,
stub=0x807d1c8, details=@0x429b27e0,
    effective_target=@0x429b27a0, max_wait_time=@0x429b2798) at
Invocation_Adapter.cpp:268
#17 0x40403bc5 in TAO::Invocation_Adapter::invoke_i (this=0x429b2870,
stub=0x807d1c8, details=@0x429b27e0)
    at Invocation_Adapter.cpp:86
#18 0x40403ab8 in TAO::Invocation_Adapter::invoke (this=0x429b2870, ex_data=0x0,
ex_count=0) at Invocation_Adapter.cpp:45
#19 0x0804fa4b in Test::Payload_Receiver::more_data (this=0x805d5d8,
the_payload=@0x429b2900) at TestC.cpp:255
#20 0x080525a0 in Client_Task::validate_connection (this=0xbffff5b0) at
Client_Task.cpp:97
#21 0x0805210b in Client_Task::svc (this=0xbffff5b0) at Client_Task.cpp:43
#22 0x406cc487 in ACE_Task_Base::svc_run (args=0xbffff5b0) at Task.cpp:204
#23 0x406899c6 in ACE_Thread_Adapter::invoke_i (this=0x807cac0) at
Thread_Adapter.cpp:150
#24 0x406898ff in ACE_Thread_Adapter::invoke (this=0x807cac0) at
Thread_Adapter.cpp:94
#25 0x4064f262 in ace_thread_adapter (args=0x807cac0) at Base_Thread_Adapter.cpp:132
#26 0x40748b63 in start_thread () from /lib/tls/libpthread.so.0
#27 0x4094418a in clone () from /lib/tls/libc.so.6

As can be seen, the client thread is waiting for events in select(). What
normally happens is this:

- The transport, in TAO_Transport::send_message_shared_i(), upon detecting that
its queue needs to be flushed, sets the ACE_Event_Handler::WRITE_MASK for its
event handler via TAO_Transport::schedule_output_i().
- It then calls flush_transport() on its flushing strategy, which goes into the
select and, upon receiving the WRITE_MASK trigger, flushes out the queue (see
the sketch after this list).
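
A minimal sketch of that normal sequence, written in terms of plain ACE reactor
calls rather than the TAO internals named above (the helper name and the bounded
timeout are illustrative assumptions):

#include "ace/Reactor.h"
#include "ace/Event_Handler.h"
#include "ace/Time_Value.h"

// Illustrative helper, not TAO source: arm the write mask for the
// transport's event handler, then run the reactor so the queued output
// can be drained from handle_output().
int
schedule_and_flush (ACE_Reactor &reactor, ACE_Event_Handler &handler)
{
  // Roughly what TAO_Transport::schedule_output_i() achieves: ask the
  // reactor to watch the handle for writability.
  if (reactor.schedule_wakeup (&handler,
                               ACE_Event_Handler::WRITE_MASK) == -1)
    return -1;

  // Roughly what flush_transport() does via TAO_ORB_Core::run(): run the
  // event loop; when select() reports the handle writable, the reactor
  // dispatches handle_output(), which drains the queue.  If the write
  // mask has been cleared by this point, select() has nothing to wait
  // for and the thread blocks forever -- the deadlock described below.
  ACE_Time_Value bounded_wait (5);  // illustrative; the real code waits forever
  return reactor.handle_events (bounded_wait);
}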

However, in this deadlock, as can be seen in frame #1, the select has NULL for
the write fd set (wfds=0x0). Somehow this mask was cleared before the thread
entered select(). Since there is no event to trigger a wakeup, the thread
deadlocks.

I am debugging this for now and could use any help/advice.

Ciju
Comment 1 ciju john 2006-10-17 16:28:07 CDT
I believe this issue is no longer valid. Marking it as invalid.

Ciju
Comment 2 Johnny Willemsen 2006-10-18 01:11:19 CDT
John, are the test changes (the regression) you mention in svn? I am reopening
this so that you see my question; I am not sure you would get it otherwise. If
the regression is not in svn, could you add it for this bug?
Comment 3 ciju john 2007-07-25 13:13:28 CDT
The aforementioned modifications to the Big_Request_Muxing test were added in:

Thu Sep  1 16:56:12 2005  Ciju John  <john_c@ociweb.com>

I believe this issue may still be valid: even after recent extensive modifications from Simon Massey, there still seems to be an indication of a deadlock. Look at the test failure on Solaris10_Studio11_Debug from today (2007_07_25_07_29). The test failed because the server timed out waiting for the client packages. I don't believe this is related to the test timeout. However, this failure is so rare that I am not sure whether it really is a deadlock problem. Since the test passes on most platforms, I recommend marking this issue as resolved.

Ciju