| Summary: | ORB loses memory | | |
| --- | --- | --- | --- |
| Product: | TAO | Reporter: | krc |
| Component: | ORB | Assignee: | DOC Center Support List (internal) <tao-support> |
| Status: | RESOLVED FIXED | | |
| Severity: | normal | CC: | reis |
| Priority: | P3 | | |
| Version: | 1.1.18 | | |
| Hardware: | SPARC | | |
| OS: | Solaris | | |
| Bug Depends on: | | | |
| Bug Blocks: | 1277 | | |
Description
krc 2001-09-11 07:22:02 CDT
If we have a test case that consistently exhibits this behavior, that would be great. It would also be good to get a PRF for the system on which the behavior is observed, so we can verify the bug on that platform. Does the problem happen with 1.1.19?

Chris, we can see the problem clearly if we run the client again and again: we simply do not remove the cache entries on the server properly. I can see where the problem is, and it shows up in 1.1.19 as well. We close the handles, but we don't remove the entries from the cache. Writing a test case for this is simple: in the hello test, just make a remote call to the server; while servicing the remote call, the server should not see more than one connection in its cache. run_test.pl can make around 30 to 40 clients connect to the server one after another, and the above check should still hold. I think I sent you a mail on this.

This problem has been fixed. We also have a test case in $TAO_ROOT/tests/Cache_Growth_Test that checks whether the server's cache grows in size. Here is the relevant ChangeLog entry:

Thu Sep 13 12:46:45 2001  Balachandran Natarajan <bala@cs.wustl.edu>

This was easy to figure out. I have tested and made sure that the memory doesn't grow, but we still don't have a foolproof method of testing that.

Breaking up bug 1038 into two bugs: 1) the deadlock on the client due to a socket failure on the server (bug 1038), and 2) the ORB loses memory (this bug). BTW, the title used to be "ORB loses memory whenever a connection with a client goes away", but the test case from 1038 shows that it can happen without the client "going away".

An additional problem that Bala has asked me to log along with this bug is that -ORBFlushingStrategy blocking does not seem to work properly. This is conjecture based on observations of the memory leaks. The memory that is being leaked is the copy of the buffer that the TP_Reactor makes when the socket becomes blocked and the thread is freed up to go service other things. If I set -ORBFlushingStrategy to blocking, I would expect that a buffer copy would NOT occur, since the thread blocks on the socket and is not allowed to return to the pool until the data is successfully sent to the client. Therefore, I should not see the memory leak reported in this bug if I run in that mode, but I do. I have included a test case that creates a global buffer (buffer_) that is reused for all requests to the client using:

    return new File::Descriptor::DataBuffer (num_bytes, num_bytes, buffer_, 0);

This test case is helpful in reproducing both this bug and the original bug that initiated this report, and it seems to produce the bug faster than the original one (see bug 1038 for the test case).
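For context, a minimal sketch of the kind of servant method the line above implies is shown below. Only the DataBuffer construction itself is quoted from this report; the header name, interface, operation name, servant class, and buffer size are assumptions for illustration. Assuming File::Descriptor::DataBuffer is an unbounded IDL sequence<octet>, the (maximum, length, data, release) constructor with release = 0 hands the ORB a non-owning view of one preallocated buffer, so every reply reuses the same storage instead of allocating a fresh payload.

```cpp
// Hedged sketch only -- not the actual test case from bug 1038.
#include "testS.h"   // hypothetical name for the IDL-generated skeleton header

// Allocated once and reused by every request (the "buffer_" from the report);
// the 10 MB size is an assumption taken from the analysis below.
static CORBA::Octet *buffer_ = new CORBA::Octet[10 * 1024 * 1024];

File::Descriptor::DataBuffer *
Descriptor_Impl::get_buffer (CORBA::ULong num_bytes)
{
  // max = num_bytes, length = num_bytes, data = buffer_, release = 0:
  // the sequence marshals the contents but must not delete buffer_.
  return new File::Descriptor::DataBuffer (num_bytes, num_bytes, buffer_, 0);
}
```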
------- Additional Comments From Balachandran Natarajan 2001-09-26 09:19 -------

Here is my analysis of why there is a memory hog. Below is an extract from my email to folks like Chris (C), Chad and Carlos.

bala@cs.wustl.edu writes:
--------------------
I think we have a semi-serious problem on hand. This gives rise to memory growth problems, and it can be serious. It can also lead to weird behaviours that I am not ready to think about. Here is how it occurs. Imagine a server with two threads and a client with, say, 10 threads. The server, as part of the reply, needs to send large data, say 10 MB. Below is the sequence of steps:

1. The connections are muxed, so there may be more than one message in a single read for a thread.
2. As a result of the above, we send notifies to the reactor, which unblocks the next thread and sends it to the same handler.
3. At step #2 the number of upcalls on the handler is 3. As there are two threads in the handler, this is fine.
4. The first thread tries to send the 10 MB reply. It is not able to, so it schedules output and waits on the reactor.
5. The same happens for thread 2. Note that at this point the number of upcalls hasn't dropped, as handle_input () is not complete for either thread.
6. As the threads have gone back to the reactor, the reactor can potentially wake them up and send them to the same handler, since there could be more events.
7. Now the upcall count goes to 5, and both threads go to the reactor again.
8. After some time, if the client crashes, we just call handle_close (), which decrements the upcall count by 1. If the upcall count is not 0 we don't do anything; we expect all the threads (only 2 in our case) to unwind and decrement the upcall count.
9. That just doesn't happen, and there is a leak. Worse still, the transports are still in the cache.
10. The above could happen for >= 1 connection in the above setup.
11. The worst part would start occurring if the servers were really making remote calls back to the client. They could try to call clients that are not available and possibly crash.

I have been able to reproduce everything I have said above (steps 1-10). Any ideas? The memory hog comes from the lingering transports, the messages queued on them, and their own memory.
-------End mail-------------------------------------

(A simplified sketch of the upcall accounting described above appears at the end of this report.)

I don't think we have time before 1.2/1.2.1 to fix these. We need to look into this once we start working on connections again, maybe when GIOP 1.3 comes out, or when we work on the reactor or anything connected to this area.

Accepting the bug and added Jonathan Reis <reis@stentor.com> to the CC list.

Please re-examine this bug; after the fixes for 1202 (and its children) the connections should be closed. If you have a regression test, please run it and let us know if there are any remaining issues or if we can close this bug. I am also marking this bug as a blocker for 1.3; resource leakage is a blocker for a major release.

Based on the fixes for 1202, 1020, etc. I think we have nailed this down. Please run some regressions at your end. If you still find problems, please register a new bug, since the focus of this bug is not clear.
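To make the accounting failure in the analysis above easier to follow, here is a minimal, self-contained sketch. It is not TAO code: the class and method names (UpcallTracker, enter_upcall, leave_upcall) are invented for illustration, and the counts simply mirror the step numbers from the email. It shows that once the worker threads park in the reactor without completing handle_input(), handle_close() can only decrement the count once, the count never reaches zero, and the transport with its queued messages is never reclaimed.

```cpp
// Hypothetical, simplified model of the upcall accounting described above.
#include <cstdio>

struct UpcallTracker
{
  int pending_upcalls;   // number of threads "inside" handle_input ()
  bool closed;           // handle_close () has been requested
  bool destroyed;        // transport, queued messages, etc. released

  UpcallTracker () : pending_upcalls (0), closed (false), destroyed (false) {}

  void enter_upcall () { ++pending_upcalls; }

  // A thread that finishes handle_input () would call this; in the scenario
  // above the threads park in the reactor instead, so it is never reached.
  void leave_upcall ()
  {
    if (--pending_upcalls == 0 && closed)
      destroy ();
  }

  void handle_close ()
  {
    closed = true;
    --pending_upcalls;          // step 8: decrement by one only
    if (pending_upcalls == 0)
      destroy ();               // never happens in this scenario
  }

  // In the real ORB this would be where the transport and the messages
  // queued on it are released and the cache entry removed.
  void destroy () { destroyed = true; }
};

int main ()
{
  UpcallTracker h;

  h.enter_upcall ();   // thread 1 dispatched to the handler
  h.enter_upcall ();   // thread 2 dispatched to the handler
  h.enter_upcall ();   // steps 2-3: a notify queues one more upcall -> count 3

  // Steps 4-7: both threads fail to flush the 10 MB reply, schedule output,
  // and go back to the reactor, which dispatches them to the same handler
  // again -- the count climbs instead of dropping.
  h.enter_upcall ();
  h.enter_upcall ();   // count is now 5; no leave_upcall () has run

  h.handle_close ();   // step 8: client crash; count drops to 4, not 0

  std::printf ("pending upcalls = %d, destroyed = %s\n",
               h.pending_upcalls, h.destroyed ? "yes" : "no");
  // Prints: pending upcalls = 4, destroyed = no -> the transport and its
  // queued messages are never reclaimed (step 9).
  return 0;
}
```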