Bug 3569 - TAO behaves badly when having more than 1024 connections
Summary: TAO behaves badly when having more than 1024 connections
Status: NEW
Alias: None
Product: TAO
Classification: Unclassified
Component: ORB
Version: 1.6.8
Hardware: All
OS: Linux
Importance: P3 critical
Assignee: DOC Center Support List (internal)
URL:
Depends on:
Blocks: 3531
Reported: 2009-02-11 05:41 CST by Johnny Willemsen
Modified: 2010-08-20 12:11 CDT



Attachments
Server log at ORBDebugLevel 10 (257.24 KB, text/x-log)
2009-02-11 05:46 CST, Johnny Willemsen
Callstacks of all the threads (48.90 KB, text/plain)
2009-02-11 05:48 CST, Johnny Willemsen
Callstacks with inline=0/optimize=0 (66.16 KB, text/plain)
2009-02-11 07:10 CST, Johnny Willemsen

Description Johnny Willemsen 2009-02-11 05:41:48 CST
We have a master that pings 500 clients with 20 threads as fast as possible. The open_files limit is set to 8000 so that we have a transport cache of 4000 entries. After running for around 15 seconds the master just stops pinging.
Comment 1 Johnny Willemsen 2009-02-11 05:46:54 CST
Created attachment 1073 [details]
Server log at ORBDebugLevel 10
Comment 2 Johnny Willemsen 2009-02-11 05:48:06 CST
Created attachment 1074 [details]
Callstacks of all the threads
Comment 3 Johnny Willemsen 2009-02-11 06:52:09 CST
Tried the patch from 3531, but then the master doesn't hang; instead it loops forever with the logging below (level 10). No calls are handled anymore.

Feb 11 13:49:11.812 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) enter reactor event loop
Feb 11 13:49:11.812 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) exit reactor event loop
Feb 11 13:49:11.812 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) enter reactor event loop
Feb 11 13:49:11.812 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) exit reactor event loop
Feb 11 13:49:11.812 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) enter reactor event loop
Feb 11 13:49:11.812 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) exit reactor event loop
Feb 11 13:49:11.812 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) enter reactor event loop
Feb 11 13:49:11.812 2009@LM_DEBUG@TAO (12682|1235728704) - Leader_Follower[1017]::wait_for_event, (leader) exit reactor event loop
[the enter/exit pair above repeats continuously; log truncated]
Comment 4 Johnny Willemsen 2009-02-11 07:10:02 CST
Created attachment 1075 [details]
Callstacks with inline=0/optimize=0

Another set of callstacks from a run with inline=0/optimize=0. As a test I changed all members of Leader_Follower to Atomic_Op to make sure they are not causing a race condition.
Comment 5 Johnny Willemsen 2009-02-11 07:29:10 CST
(In reply to comment #4)
> Created an attachment (id=1075) [details]
> Callstacks with inline=0/optimize=0
> 
> another set of callstacks of a run with inline=0/optimize=0. As a test I
> changed all members of Leader_Follower to Atomic_Op to make sure they are not
> causing a race condition
> 

Thread 16 is just looping on the reactor.
Comment 6 Russell Mora 2009-02-12 09:39:07 CST
This looks like the problem I experienced with the initial patch for 3531 - the client leader thread is not successfully surrendering leadership to the waiting event-loop thread. Instead it just keeps giving up and re-acquiring leadership. :-(

Try the new patch in 3531 - it should be better in this respect (however, it is still not 100% reliable; read the notes in 3531).

I don't think, though, that this is relevant to the original problem - the first set of callstacks looks normal; it looks like the ORB is just waiting for replies. Are the replies arriving at all?

(BTW, in gdb try the command "thread apply all bt" :-) )
Comment 7 Johnny Willemsen 2009-02-12 09:40:44 CST
(In reply to comment #6)
> This looks like the problem that I experienced with the initial patch for 3531
> - the client leader thread is not successfully surrendering leadership to the
> waiting event loop thread.  Instead it just keeps on giving up and re-acquiring
> leadership. :-(
> 
> Try the new patch in 3531 - it should be better in this respect (however, it is
> still not 100% reliable, read the notes in 3531).
> 
> I don't think though this is relevant to the original problem - the first set
> of callstacks looks normal - looks like the orb is just waiting for replies. 
> Are the replies arriving at all?
> 
> (BTW, in gdb try the command "thread apply all bt" :-) )
> 

Yes, this seems to be a different problem; I think we get more than 1024 handles in select.
Comment 8 Johnny Willemsen 2009-03-07 13:47:23 CST
Updated summary: this is all about the behaviour when TAO has more than 1024 connections, meaning we call select() with more than 1024 handles. The leader thread then just loops forever, no calls are handled anymore, and the server doesn't respond to any client request.