Bug 1025

Summary: ORB loses memory
Product: TAO
Reporter: krc
Component: ORB
Assignee: DOC Center Support List (internal) <tao-support>
Status: RESOLVED FIXED
Severity: normal
CC: reis
Priority: P3
Version: 1.1.18
Hardware: SPARC
OS: Solaris
Bug Depends on:
Bug Blocks: 1277

Description krc 2001-09-11 07:22:02 CDT
The ORB on the server side leaks about 4 KB of memory every time a connection
with a client goes away.  To reproduce the problem, do the following:

1.  Start up a simple server application.

2.  Run a simple client application that calls a CORBA method on the server
and then immediately exits.  Run the client repeatedly; the simplest way is a
shell script with a loop that runs the client several hundred times.  (A
minimal client sketch follows these steps.)

3.  As the client is invoked over and over, monitor the memory use of the
server application with a Solaris utility such as "top".  The server's memory
use grows continuously, increasing by about 4 MB for every 1000 invocations of
the client.
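
For reference, here is a minimal sketch of the kind of client meant in step 2.
The interface, operation, and IOR file name are illustrative only (this is not
the reporter's actual test), and the sketch assumes a TAO build with native
C++ exceptions; any client that makes one remote call and then exits will do.

// Minimal single-call client (illustrative names only).
#include "TestC.h"   // stubs for a hypothetical Test::Hello interface

int main (int argc, char *argv[])
{
  try
    {
      CORBA::ORB_var orb = CORBA::ORB_init (argc, argv);

      // Obtain the server reference from an IOR file written by the server.
      CORBA::Object_var obj = orb->string_to_object ("file://server.ior");
      Test::Hello_var hello = Test::Hello::_narrow (obj.in ());

      // One remote call, then shut down and exit; the connection to the
      // server goes away when this process exits.
      CORBA::String_var reply = hello->get_string ();

      orb->destroy ();
    }
  catch (const CORBA::Exception &)
    {
      return 1;
    }
  return 0;
}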
Comment 1 Chris Cleeland 2001-09-12 15:46:37 CDT
If we have a test case that consistently exhibits this behavior, that would be
great.  It would also be good to get a PRF for the system on which the
behavior's observed so we can verify the bug on that platform.

Does the problem happen with 1.1.19?
Comment 2 Nanbor Wang 2001-09-12 15:54:48 CDT
Chris,
We can see the problem clearly if we run the client again and again.  We just
don't remove cache entries on the server properly.  I am looking into where
the problem is.  It is seen in 1.1.19 also.  We close handles, but we don't
remove the entries from the cache.  We can write a test case for it; it is
quite simple.  In the Hello test, just make a remote call to the server.
While processing the remote call, the server should not see more than one
connection in the cache.  The run_test.pl script can make around 30 to 40
clients connect to the server one after another, and the above check should
still hold.

I think I sent you a mail on this.
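
For what it is worth, a sketch of that check is below.  It assumes the server
can reach TAO's internal transport cache through an accessor chain along the
lines of orb->orb_core ()->lane_resources ().transport_cache ().current_size ();
this is internal, unsupported API whose exact headers and names vary between
TAO versions, so treat it as an assumption rather than a documented interface.

// Sketch of the "no more than one connection in the cache" check, meant to
// be called from inside the servant's upcall.  Internal TAO API -- the
// accessor chain is an assumption, see the note above.
#include "tao/corba.h"
#include "tao/ORB_Core.h"
#include "tao/Thread_Lane_Resources.h"
#include "tao/Transport_Cache_Manager.h"

bool
cache_is_sane (CORBA::ORB_ptr orb)
{
  size_t entries =
    orb->orb_core ()->lane_resources ().transport_cache ().current_size ();

  // With clients connecting one after another, anything beyond a single
  // cached connection means old entries are not being purged.
  return entries <= 1;
}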
Comment 3 Nanbor Wang 2001-09-13 13:57:27 CDT
This problem has been fixed.  We also have a test case in
$TAO_ROOT/tests/Cache_Growth_Test that checks whether the server's cache grows
in size.  Here is the relevant ChangeLog entry:

Thu Sep 13 12:46:45 2001  Balachandran Natarajan  <bala@cs.wustl.edu>

This was easy to figure out.  I have tested and made sure that the memory
doesn't grow, but we still don't have a foolproof method of testing that.


Comment 4 Nanbor Wang 2001-09-26 16:30:39 CDT
Breaking up bug 1038 into two bugs: 

1) The deadlock on the client due to a socket failure on the server (bug 1038)

2) ORB loses memory. (this bug)

BTW, the title used to be "ORB loses memory whenever a connection with client 
goes away" but the testcase from 1038 shows that it can happen without the 
client "going away".

An additional problem that Bala has asked me to log along with this bug is
that -ORBFlushingStrategy blocking does not seem to work properly.  This is
conjecture based on observations made while chasing the memory leaks.

The memory being leaked is the copy of the buffer that the TP_Reactor makes
when the socket becomes blocked and the thread is freed up to go service other
things.  If I set -ORBFlushingStrategy to blocking, I would expect that no
buffer copy would occur, since the thread blocks on the socket and is not
allowed to return to the pool until the data is successfully sent to the
client.  Therefore, I should not see the memory leak reported in this bug when
running in that mode, but I do.
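
As an aside for anyone reproducing this: the flushing strategy is chosen with
the -ORBFlushingStrategy ORB option (e.g. starting the server with
-ORBFlushingStrategy blocking).  Below is a sketch of selecting it
programmatically by handing the option to ORB_init; the ACE_ARGV usage is just
one convenient way to build the argument vector, and the rest of the server
setup is elided.

#include "tao/corba.h"
#include "ace/ARGV.h"

int main (int, char *[])
{
  // Build an argument vector equivalent to starting the server with
  // "server -ORBFlushingStrategy blocking".
  ACE_ARGV args;
  args.add ("server");
  args.add ("-ORBFlushingStrategy");
  args.add ("blocking");

  int argc = args.argc ();
  CORBA::ORB_var orb = CORBA::ORB_init (argc, args.argv ());

  // ... create and activate the servant, run the ORB ...

  orb->destroy ();
  return 0;
}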

I have included a testcase that creates a global buffer (buffer_) that is
reused by all requests to the client, using:

// The trailing 0 is the sequence's "release" flag: the reply sequence wraps
// buffer_ without taking ownership, so the same buffer is reused for every
// request instead of being copied or freed here.
return new File::Descriptor::DataBuffer (num_bytes,
                                         num_bytes,
                                         buffer_,
                                         0);

This testcase is helpful in reproducing this bug and the original problem that
initiated this bug report.  The new testcase seems to produce the bug faster
than the original one.  (See bug 1038 for the testcase.)


------- Additional Comments From Balachandran Natarajan 2001-09-26 09:19 -------

Here is my analysis of why there is a memory hog.  Below is an extract from my
email to Chris, Chad, and Carlos.

bala@cs.wustl.edu writes:
--------------------
I think we have a semi-serious problem on hand.  It gives rise to memory
growth problems and can be serious; it can also lead to weird behaviours that
I am not ready to think about.

Here is how it occurs.  Imagine a server with two threads and a client with,
say, 10 threads.  As part of the reply, the server needs to send large data,
say 10 MB.  Below is the sequence of steps:

1. The connections are muxed, so there may be more than one message in
   a single read for a thread.

2. As a result of the above we send notifies to the reactor, which
   unblocks the next thread and sends it to the same handler.

3. At step #2 the number of upcalls on the handler is 3.  As there are
   two threads in the handler, this is fine.

4. The first thread tries to send a 10 MB reply.  It is not able to,
   and so it schedules output and waits on the reactor.

5. The same happens for thread 2.  Note that at this point the upcall
   count has not dropped, since handle_input () has not completed for
   either thread.

6. As the threads have gone back to the reactor, the reactor can
   potentially wake them up and send them to the same handler, as
   there could be more events.

7. Now the upcall count goes to 5 and both threads go to the reactor
   again.

8. After some time, if the client crashes, we just call handle_close (),
   which decrements the upcall count by 1.  If the upcall count is not
   0 we don't do anything; we expect all the threads (only 2 in our
   case) to unwind and decrement the upcall count.  (A sketch of this
   accounting appears at the end of this comment.)

9. That just doesn't happen, and there is a leak.  Worse still, the
   transports stay in the cache.

10. The above can happen for one or more connections in this setup.

11. The worst part would start occurring if the servers were actually
    making remote calls back to the client.  They could try to call
    clients that are no longer available and possibly crash.

I have been able to reproduce everything I have said above (steps 1-10).  Any
ideas?  The memory hog comes from the lingering transports: the messages queued
on them and the transports' own memory.

-------End mail-------------------------------------
I don't think we have time before 1.2/1.2.1 to fix these.  We need to look
into them once we start working on connections again, maybe when GIOP 1.3
comes out or when we work on the reactor or anything connected to this area.
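
To make the accounting described in steps 3-9 concrete, here is a minimal,
hypothetical sketch of the idea.  The class and method names are illustrative
only; this is not TAO's actual connection-handler code, just an illustration
of an upcall count that only triggers cleanup when it drops to zero.

// Hypothetical sketch of the upcall accounting in steps 3-9 (not TAO's real
// handler code).  Cleanup only happens when the count reaches zero, which
// the threads parked in the reactor never make happen.
class Connection_Handler
{
public:
  Connection_Handler (void) : upcall_count_ (0) {}

  void upcall_begin (void) { ++this->upcall_count_; }

  void upcall_end (void)
  {
    // A thread that has scheduled output and gone back to wait on the
    // reactor (steps 4-7) never gets here, so the count stays above zero.
    if (--this->upcall_count_ == 0)
      this->cleanup ();
  }

  void handle_close (void)
  {
    // Step 8: on a client crash we decrement once and return if the count
    // is still non-zero, expecting the remaining threads to unwind --
    // which, per step 9, they never do.  The transport, its queued
    // messages, and its cache entry linger.
    if (--this->upcall_count_ == 0)
      this->cleanup ();
  }

private:
  void cleanup (void)
  {
    // In the real ORB this would purge the transport from the cache and
    // release the queued messages; omitted here.
  }

  int upcall_count_;
};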



Comment 5 Nanbor Wang 2001-09-26 17:13:18 CDT
Accepting the bug and adding Jonathan Reis <reis@stentor.com> to the CC list.
Comment 6 Carlos O'Ryan 2002-08-12 18:38:15 CDT
Please re-examine this bug; after the fixes for 1202 (and its children) the
connections should be closed.  If you have a regression test, please run it
and let us know whether there are any remaining issues or whether we can close
this bug.

I am also marking this bug as a blocker for 1.3; resource leakage is a blocker
for a major release.
Comment 7 Nanbor Wang 2003-01-15 12:20:45 CST
Based on the fixes for 1202, 1020, etc., I think we have nailed this down.
Please run some regressions at your end.  If you still find problems, please
register a new bug, since the focus of this bug is not clear.