Please report new issues at https://github.com/DOCGroup
When the Naming Service is stopped, CosNaming::NamingContext::resolve() crashes on the client. Here's the call stack of the crash:

----------------------------------
ACE_Select_Reactor_T<ACE_Select_Reactor_Token_T<ACE_Token> >::remove_handler (ACE_Event_Handler * 0x02c50850, unsigned long 1023) line 368 + 12 bytes
ACE_Reactor::remove_handler (ACE_Event_Handler * 0x02c50850, unsigned long 1023) line 316
TAO_Transport::close_connection_i() line 697
TAO_Transport::close_connection() line 656
TAO_GIOP_Invocation::close_connection() line 536
TAO_ORB_Core::service_raise_comm_failure (TAO_GIOP_Invocation * 0x0184f6d0, TAO_Profile * 0x012cbeb0, CORBA_Environment & {...}) line 1501
TAO_GIOP_Synch_Invocation::invoke_i (unsigned char 0, CORBA_Environment & {...}) line 768 + 41 bytes
TAO_GIOP_Twoway_Invocation::invoke (TAO_Exception_Data * 0x69494060 _tao_CosNaming_NamingContext_resolve_exceptiondata, unsigned int 3, CORBA_Environment & {...}) line 910 + 17 bytes
CosNaming::_TAO_NamingContext_Remote_Proxy_Impl::resolve (CORBA_Object * 0x02c505e8, const CosNaming::Name & {...}, CORBA_Environment & {...}) line 2998 + 23 bytes
CosNaming::NamingContext::resolve (const CosNaming::Name & {...}, CORBA_Environment & {...}) line 5206
NAMING_CLIENT::NamingClient::resolveOrCreateContext (CosNaming::NamingContext * 0x02c505d8, const char * 0x02cdd2a8) line 337 + 39 bytes
NAMING_CLIENT::NamingClient::set (const ACE_CString & {...}, CORBA_Object * 0x02cb75d8) line 76 + 93 bytes
AdvertiseCommand::Execute_i() line 73 + 74 bytes
MethodRequest::Execute() line 280 + 13 bytes
ActiveObj::svc() line 84 + 8 bytes
ACE_Task_Base::svc_run (void * 0x01225428) line 203 + 11 bytes
ACE_Thread_Hook::start (void * (void *)* 0x69004ade ACE_Task_Base::svc_run(void *), void * 0x01225428) line 12 + 7 bytes
Thread_Hook::start (void * (void *)* 0x69004ade ACE_Task_Base::svc_run(void *), void * 0x01225428) line 19 + 17 bytes
ACE_Thread_Adapter::invoke_i() line 141 + 18 bytes
ACE_Thread_Adapter::invoke() line 91 + 11 bytes
ace_thread_adapter (void * 0x0123f208) line 127 + 10 bytes
_threadstartex (void * 0x0123f270) line 212 + 13 bytes
KERNEL32! 77e837cd()

Here are the local variables for TAO_Transport::close_connection_i() line 697:
-----------------------------------------------------------------------------
+ this        0x02c50960
+ i           0x00000003
- eh          0x02c50850
  + __vfptr   0xdddddddd
    priority_  -572662307
  + reactor_  0xdddddddd

Note that 0xdddddddd is the pattern the MSVC debug heap writes into freed memory, so eh appears to point at an event handler that has already been deleted.
Changed the summary. Looks like the following bug report from Bruce Trask is also related.

From: Bruce Trask <BTRASK@contactsystems.com>

TAO VERSION: 1.1.19
ACE VERSION: 5.1.19
HOST MACHINE and OPERATING SYSTEM: Win 2K
TARGET MACHINE and OPERATING SYSTEM, if different from HOST: Win 2K
COMPILER NAME AND VERSION (AND PATCHLEVEL): Microsoft VC6 SP4

AREA/CLASS/EXAMPLE AFFECTED: Thread Pool Reactor

DOES THE PROBLEM AFFECT:
    COMPILATION? No.
    LINKING? No.
    EXECUTION? Yes. See below.
    OTHER (please specify)?

SYNOPSIS: Client application core dumps when the server goes away and comes back on. Details below.

DESCRIPTION:

I have a client application (application A) that also functions as a server. This application uses the TP Reactor, spawns 3 threads, and calls orb->run() from each of them. Here's the scenario: normally, application A is in communication with another machine running application B, making remote calls on it. During normal operation of the system there are times when application B becomes disconnected, is reset, or is powered down, and then comes back up. When application B comes back up, it lets the Event Service application know that it is ready to go, and the ES informs application A that application B is up and running. Application A then tries to communicate with application B, and that is when the problem occurs. During this attempt to communicate, application A closes some connections, and sometimes there is a core dump.

What I am seeing is this. One thread is calling this function:

void
TAO_Transport::close_connection_i (void)
{
  ACE_Event_Handler *eh = 0;
  {
    ACE_MT (ACE_GUARD (ACE_Lock, guard, *this->handler_lock_));

    eh = this->event_handler_i ();
    this->transition_handler_state_i ();

    if (eh == 0)
      return;
  }

  // Close the underlying connection; it is enough to get an
  // Event_Handler pointer to do this, so we can factor out the code
  // in the base TAO_Transport class.

  // We first try to remove the handler from the reactor.  After that
  // we destroy the handler using handle_close ().  The remove_handler
  // is necessary because if handle_close() were called directly, the
  // reactor would be left with a dangling pointer.
  if (this->ws_->is_registered ())
    {
      this->orb_core_->reactor ()->remove_handler (
          eh,
          ACE_Event_Handler::ALL_EVENTS_MASK |
          ACE_Event_Handler::DONT_CALL);
    }

  (void) eh->handle_close (ACE_INVALID_HANDLE,
                           ACE_Event_Handler::ALL_EVENTS_MASK);

  for (TAO_Queued_Message *i = this->head_; i != 0; i = i->next ())
    {
      i->state_changed (TAO_LF_Event::LFS_CONNECTION_CLOSED);
    }
}

This function is called as part of the following call stack:

TAO_Transport::close_connection_i()
TAO_Transport::close_connection()
TAO_GIOP_Invocation::close_connection()
TAO_ORB_Core::service_raise_comm_failure()
TAO_GIOP_Synch_Invocation::invoke_i()
TAO_GIOP_Twoway_Invocation::invoke()
... application-called function
Meanwhile, at the same time, another thread is executing for some reason, and here is its call stack (pseudo names):

ACE_Svc_Handler::destroy()
TAO_IIOP_Connection_Handler::handle_close()
ACE_Select_Reactor_Handle_Repository::unbind()
ACE_Select_Reactor_T::remove_handler_i()
ACE_Select_Reactor_T::remove_handler()
ACE_TP_Reactor::notify_handle()
ACE_TP_Reactor::handle_events()
ACE_Reactor::handle_events()
TAO_ORB_Core::run()
CORBA_ORB::run()

It is during the call to destroy() that the event handler commits suicide (delete this):

template <PR_ST_1, ACE_SYNCH_DECL> void
ACE_Svc_Handler<PR_ST_2, ACE_SYNCH_USE>::destroy (void)
{
  ACE_TRACE ("ACE_Svc_Handler<PR_ST_2, ACE_SYNCH_USE>::destroy");

  // Only delete ourselves if we're not owned by a module and have
  // been allocated dynamically.
  if (this->mod_ == 0 && this->dynamic_ && this->closing_ == 0)
    // Will call the destructor, which automatically calls <shutdown>.
    // Note that if we are *not* allocated dynamically then the
    // destructor will call <shutdown> automatically when it gets run
    // during cleanup.
    delete this;
}

What seems to be happening is that there is a race between the two threads above. At times the first thread gets to the eh->handle_close() in the first code snippet, but by the time it gets there, eh's guts have already been deleted by the `delete this` in the other thread. Most of the time everything works OK, but intermittently I do see the crash. It seems to happen when this mystery call to destroy() occurs after the first scoped section in the close_connection_i() function above. Most of the time it happens before, in which case eh is 0 and close_connection_i() just returns, so everything is OK.

So, my questions are:

1. Has anyone else seen anything like this?
2. I can't determine what sets off the second call stack, from handle_events(), that ultimately calls `delete this` on the event handler being used by the other thread. It seems to come out of nowhere. Is there anything I'm missing here?
3. For lack of knowing why the second thread is doing what it's doing, it seems that more locking may be needed to protect against the event handler being ripped out from underneath the close_connection() call. Am I correct?

I'm still looking into this one but wanted a sanity check from the community. Thanks in advance for any input.

Here is my diagnosis of the problem. This is part of the mail that I exchanged with Chris and Carlos. My mails point to this as the reason:

------------------------------------------------------------
Carlos & Chris -

Can you guys please look at bug report 1020 and the one Bruce Trask (in ace-users) sent in yesterday? The problem in both cases comes from here:

void
TAO_Transport::close_connection_i (void)
{
  .......
  // Close the underlying connection, it is enough to get an
  // Event_Handler pointer to do this, so we can factor out the code
  // in the base TAO_Transport class.

  // We first try to remove the handler from the reactor.  After that
  // we destroy the handler using handle_close ().  The remove handler
  // is necessary because if the handle_close is called directly, the
  // reactor would be left with a dangling pointer.
  if (this->ws_->is_registered ())
    {
      this->orb_core_->reactor ()->remove_handler (
          eh,
          ACE_Event_Handler::ALL_EVENTS_MASK |
          ACE_Event_Handler::DONT_CALL);
    }

  (void) eh->handle_close (ACE_INVALID_HANDLE,
                           ACE_Event_Handler::ALL_EVENTS_MASK);
  ........
}

The problem, as I see it, is this: one thread works on close_connection_i() while the other is in handle_events() trying to dispatch for a failure. One of the threads, in the function above, marks the handle for deletion, but it is not deleted yet. Before that thread gets to handle_close(), the thread in the reactor wakes up and calls handle_close(). The thread in the above method is left high and dry.

The solution would be to let the thread do everything in one shot, i.e. just call remove_handler() without any mask. That way, the reactor would hold the token before making any changes to its internals. The second thread wouldn't be dispatched, since the thread in the above function holds the token. IMHO, the code should read:

{
  ...
  if (this->ws_->is_registered ())
    {
      this->orb_core_->reactor ()->remove_handler (eh);
    }
  else
    {
      // For the wait-on-read/write case
      (void) eh->handle_close (ACE_INVALID_HANDLE,
                               ACE_Event_Handler::ALL_EVENTS_MASK);
    }
  ...
}

---------------------------------------------------------------------------

A more detailed mail was also sent. The problem is a bit more complex than I first thought. This is my thought process on it:

----------------------------------------------------------------------------

OK, here is the problem again. One of the threads tries to make an invocation, finds that the connection is useless, and tries closing the connection. At the same moment, another thread waiting on select() receives the message and goes to close the connection.

The thread in the reactor does these (the thread on select() holds the reactor's token):

- unblocks from select() because of the closed connection
- picks up the handle that needs attention from the map
- suspends the handle so that other threads don't dispatch to the same handle
- releases the token (this is a very important step)
- calls the handle_* method, which in this case is first a handle_input() and then a remove_handler()

Let us now look at the thread that is going to close the connection from TAO_Transport::close_connection_i (). It:

- calls remove_handler() on the reactor
- grabs the internal token of the reactor (a very important step)
- the remove_handler() calls unbind() on its internal map
- the unbind() call does the following:
  * clears out the bits for the handle in the wait set and suspend set
  * marks state changed in the reactor
  * calls handle_close(). Note this is an upcall.
  * and then some shutdown-related stuff, which is not important for this discussion

OK, now imagine this race condition: the second thread, from close_connection_i(), calling handle_close() from unbind(), and the thread in the reactor trying to make an upcall. The upcall can potentially fail, because the handler could have been deleted if the other thread was faster. The problem will not appear if the thread in the reactor has gone inside handle_input(); then we have sufficient mechanisms to keep the handler alive. But just imagine a case where one thread is done with handle_close() and the other thread is trying to make an upcall -- because the thread in the reactor thought it was safe. As a matter of fact, it is. The problem IMHO kicks in from the thread that comes from TAO's close_connection_i(). For me, the code in unbind() is simply broken:

int
ACE_Select_Reactor_Handler_Repository::unbind (ACE_HANDLE handle,
                                               ACE_Reactor_Mask mask)
{
  ....
  // Clear out the <mask> bits in the Select_Reactor's wait_set.
  this->select_reactor_.bit_ops (handle,
                                 mask,
                                 this->select_reactor_.wait_set_,
                                 ACE_Reactor::CLR_MASK);

  // And suspend_set.
  this->select_reactor_.bit_ops (handle,
                                 mask,
                                 this->select_reactor_.suspend_set_,
                                 ACE_Reactor::CLR_MASK);

  // Note the fact that we've changed the state of the <wait_set_>,
  // which is used by the dispatching loop to determine whether it can
  // keep going or if it needs to reconsult select().
  this->select_reactor_.state_changed_ = 1;

  // Close down the <Event_Handler> unless we've been instructed not
  // to.
  if (ACE_BIT_ENABLED (mask, ACE_Event_Handler::DONT_CALL) == 0)
    eh->handle_close (handle, mask);
  ....
}

Why are we calling handle_close() on a handler without even checking whether the handler is suspended or not? Can we add a check? We can, but that will not solve the issue. If the handler is suspended, what do we do? Just get out without closing the handler? Maybe that will stop the crash, but there *could* be a leak.
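For concreteness, the kind of check contemplated above might look something like this (a sketch only, inspecting the repository's view of the suspend set; as just argued, it would trade the crash for a possible leak):

  // Sketch only: skip the upcall when the handle is still in the
  // suspend set, i.e. some thread is in the middle of dispatching to
  // it.  The suspended handler would then never get its
  // handle_close(), which *could* leak it -- this is why the check
  // alone is not a real fix.
  if (ACE_BIT_ENABLED (mask, ACE_Event_Handler::DONT_CALL) == 0
      && this->select_reactor_.suspend_set_.rd_mask_.is_set (handle) == 0)
    eh->handle_close (handle, mask);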
Here is another way we could try fixing the problem:

- Overload the remove_handler() call in the reactor.
- If the handle that needs to be removed is suspended, then we know in the TP_Reactor that a thread is dispatching to that handler. We just mark the handle as "needs to be removed".
- After the dispatching thread returns, use that same thread to remove the handler. (A rough sketch of this idea follows below.)

Here are some caveats:

* We need to maintain a handle set in the TP_Reactor.
* A thread, after dispatching, needs to grab the token before checking whether the handle it has just dispatched needs to be removed. This would mean waking up another thread. Alternatively, we could protect the handle set using some lighter-weight mechanism. Either way, this could make things a bit slower.
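As an illustration of the deferred-removal idea (every name here -- My_TP_Reactor, to_be_removed_, is_suspended(), remove_handler_now(), post_dispatch() -- is a hypothetical stand-in, not existing ACE API):

// Sketch only: defer removal of a suspended handler until the thread
// dispatching to it has returned from its upcall.
int
My_TP_Reactor::remove_handler (ACE_Event_Handler *eh,
                               ACE_Reactor_Mask mask)
{
  // Serialize with the dispatching threads.
  ACE_GUARD_RETURN (ACE_SYNCH_MUTEX, guard, this->lock_, -1);

  ACE_HANDLE h = eh->get_handle ();

  if (this->is_suspended (h))   // another thread is dispatching to h
    {
      // Defer: just mark the handle "needs to be removed"; the
      // dispatching thread performs the removal after its upcall.
      this->to_be_removed_.set_bit (h);
      return 0;
    }

  return this->remove_handler_now (eh, mask);   // the normal path
}

// Called by a dispatching thread after each upcall, once it has
// re-acquired the token.
void
My_TP_Reactor::post_dispatch (ACE_Event_Handler *eh)
{
  ACE_HANDLE h = eh->get_handle ();

  if (this->to_be_removed_.is_set (h))
    {
      this->to_be_removed_.clr_bit (h);
      (void) this->remove_handler_now (eh,
                                       ACE_Event_Handler::ALL_EVENTS_MASK);
    }
}

Here to_be_removed_ would be the handle set mentioned in the first caveat above (an ACE_Handle_Set would do).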
I modified the $TAO_ROOT/tests/MT_Server test to duplicate the bug. I tried to put the file up as an attachment to this bug, but there was some error; I have a mail out to the webmaster for help. In the meantime, here are the file's contents. It is a slightly modified version of the client.cpp in the MT_Server test. Just replace client.cpp with this one, run the server and then the client on Win2K, and the problem should occur.

// client.cpp,v 1.3 2000/09/13 22:22:36 coryan Exp

#include "ace/Get_Opt.h"
#include "ace/Task.h"
#include "testC.h"

ACE_RCSID(MT_Server, client, "client.cpp,v 1.3 2000/09/13 22:22:36 coryan Exp")

const char *ior = "file://test.ior";
int niterations = 5;
int do_shutdown = 0;

int
parse_args (int argc, char *argv[])
{
  ACE_Get_Opt get_opts (argc, argv, "xk:i:");
  int c;

  while ((c = get_opts ()) != -1)
    switch (c)
      {
      case 'x':
        do_shutdown = 1;
        break;
      case 'k':
        ior = get_opts.optarg;
        break;
      case 'i':
        niterations = ACE_OS::atoi (get_opts.optarg);
        break;
      case '?':
      default:
        ACE_ERROR_RETURN ((LM_ERROR,
                           "usage: %s "
                           "-k <ior> "
                           "-i <niterations> "
                           "\n",
                           argv [0]),
                          -1);
      }
  // Indicates successful parsing of the command line
  return 0;
}

class Worker : public ACE_Task_Base
{
  // = TITLE
  //   Run a server thread
  //
  // = DESCRIPTION
  //   Use the ACE_Task_Base class to run server threads
  //
public:
  Worker (CORBA::ORB_ptr orb);
  // ctor

  virtual int svc (void);
  // The thread entry point.

private:
  CORBA::ORB_var orb_;
  // The orb
};

int
main (int argc, char *argv[])
{
  ACE_TRY_NEW_ENV
    {
      CORBA::ORB_var orb =
        CORBA::ORB_init (argc, argv, "", ACE_TRY_ENV);
      ACE_TRY_CHECK;

      if (parse_args (argc, argv) != 0)
        return 1;

      //////////////////////////////////////
      // Fire up a few threads which are driving handle_events().
      // This is just to cause bug 1020.
      Worker worker (orb.in ());
      if (worker.activate (THR_NEW_LWP | THR_JOINABLE, 3) != 0)
        ACE_ERROR_RETURN ((LM_ERROR,
                           "Cannot activate client threads\n"),
                          1);

      while (1)
        {
          ///////////////////////////////////////
          try
            {
              CORBA::Object_var object =
                orb->string_to_object (ior, ACE_TRY_ENV);
              ACE_TRY_CHECK;

              Simple_Server_var server =
                Simple_Server::_narrow (object.in (), ACE_TRY_ENV);
              ACE_TRY_CHECK;

              if (CORBA::is_nil (server.in ()))
                {
                  ACE_ERROR_RETURN ((LM_ERROR,
                                     "Object reference <%s> is nil\n",
                                     ior),
                                    1);
                }

              for (int i = 0; i != niterations; ++i)
                {
                  CORBA::Long r =
                    server->test_method (i, ACE_TRY_ENV);
                  ACE_TRY_CHECK;

                  // This is just to cause bug 1020
                  if ((i % 3) == 0 && i != 0)
                    {
                      server->shutdown (ACE_TRY_ENV);
                      ACE_TRY_CHECK;
                    }

                  if (r != i)
                    {
                      ACE_DEBUG ((LM_DEBUG,
                                  "(%P|%t) unexpected result = %d for %d",
                                  r, i));
                    }
                }

              if (do_shutdown)
                {
                  server->shutdown (ACE_TRY_ENV);
                  ACE_TRY_CHECK;
                }
            }
          catch (CORBA::SystemException& ex)
            {
            }
        }
    }
  ACE_CATCHANY
    {
      ACE_PRINT_EXCEPTION (ACE_ANY_EXCEPTION,
                           "Exception caught:");
      return 1;
    }
  ACE_ENDTRY;
  return 0;
}

// ****************************************************************

Worker::Worker (CORBA::ORB_ptr orb)
  : orb_ (CORBA::ORB::_duplicate (orb))
{
}

int
Worker::svc (void)
{
  ACE_DECLARE_NEW_CORBA_ENV;
  ACE_TRY
    {
      this->orb_->run (ACE_TRY_ENV);
      ACE_TRY_CHECK;
    }
  ACE_CATCHANY
    {
    }
  ACE_ENDTRY;
  return 0;
}

Hope this helps,
Bruce
btrask@contactsystems.com
Accepted.
Assigning the bug to me
Accepting the bug
*** Bug 1233 has been marked as a duplicate of this bug. ***
Created attachment 130 [details] tarred gzipped file
The last attachment is for bug 1020 only. The name of the file is misleading; forget the name, the contents are for this bug only.
Crashes in the ORB are blockers for the 1.3 release.
*** Bug 1322 has been marked as a duplicate of this bug. ***
Added tao-support to the CC list
Some sort of breakthrough with this bug. A quick background:

Problem:
--------
When you have a multi-threaded client communicating with a server, and the server happens to die, the client crashes horribly.

Reason:
-------
When the server crashed, and the situation was like this:

thread 1: trying to write a request
threads 2..X: in the reactor, driving the handle_events() loop

we got this crash. The real reason behind it is that "thread 1" tried closing connections while one of the threads (2 through X) was woken up by the reactor to close the same connection. So we had two threads trying to close connections without any synchronization. Any usage of synchronization primitives, like a mutex, refcount, or whatever, led to a chicken-and-egg problem. This invariably crashed.

Proposed Solution:
------------------
Had the following guiding principles:

1. Only one thread will close the connection.
2. The other thread should know that and just leave without trying to close again.

The fix leverages the existing fixes for BUG 1269 and BUG 1270. In the scenario presented above, instead of allowing thread 1 to close the connection, we now let that thread call handle_events() on the reactor with a zero timeout value, in a loop. After every call to handle_events(), the status of the connection (as maintained by the connection handler) is checked. Doing this helps achieve the following:

1. If another thread (2 through X) has been dispatched to close the connection, then "thread 1" will not get the connection, since the handle is suspended while the dispatch occurs.
2. If one of threads 2 through X actually closed the connection, "thread 1" can monitor the status after the call to handle_events().
3. "thread 1" will not block, since it uses a zero timeout.
4. If there are no threads other than "thread 1" in the reactor, "thread 1" will learn of the closure event first and will proceed to close the connection. On return from the call to handle_events(), "thread 1" would check the status and then exit.

I have been testing this with $TAO_ROOT/tests/Bug_1020_Basic_Regression on charanga.cs and tao.isis. Seems to be going well so far. I have to test this more with other tests that have been reported in the past. I will attach patches for what is described above shortly.

Shortcoming in the patch:
-------------------------
Need to add a check which prevents "thread 1" from calling handle_events() if the handle has not been registered with the reactor.

Side effect:
------------
Needed to touch the TAO_LF_Event class to expose the state changes.

Purpose of this entry:
----------------------
Need feedback from all quarters on whether this approach is right, or whether I am missing something very, very obvious. Critical comments are welcome.
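To make the approach concrete, here is a minimal sketch of the polling loop described above. close_connection_on_fault() and connection_closed() are illustrative assumptions, not the actual patch; see the attached patches for the real code.

// Sketch only: "thread 1" drives the reactor with a zero timeout and
// re-checks the connection state after every call to handle_events().
void
close_connection_on_fault (TAO_Transport *transport,
                           TAO_ORB_Core *orb_core)
{
  ACE_Reactor *reactor = orb_core->reactor ();

  // Hypothetical state query maintained by the connection handler.
  while (!transport->connection_closed ())
    {
      ACE_Time_Value zero (0, 0);   // never block "thread 1"

      // Give any thread already dispatched to this handle a chance to
      // finish the close; the handle is suspended during its dispatch,
      // so we cannot race with it.
      if (reactor->handle_events (zero) <= 0)
        {
          // Nothing left to dispatch: this thread saw the closure
          // event first, so it can safely close the connection itself.
          transport->close_connection ();
          break;
        }
    }
}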
Created attachment 156 [details] patches for 1020
Just a point: using the Leader/Follower instead of the reactor would be better. Use the same idea, but drive the LF instead of the Reactor, as sketched below.
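Roughly, that would look like the following (a sketch only; the exact TAO_Leader_Follower / TAO_LF_Event calls and the poll interval are assumptions):

// Sketch only: the same polling idea, expressed against the
// Leader/Follower wait strategy instead of the raw reactor.
TAO_Leader_Follower &lf = orb_core->leader_follower ();

// Event tracking the state of this connection (kept by the handler).
TAO_LF_Event *ev = ...;

ACE_Time_Value timeout (0, 100);   // assumed short poll interval

while (!ev->successful () && !ev->error_detected ())
  {
    // Wait inside the LF; another thread may become the leader and
    // close the connection, in which case the event's state changes
    // and the loop exits.
    if (lf.wait_for_event (ev, transport, &timeout) == -1)
      break;
  }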
*** Bug 1354 has been marked as a duplicate of this bug. ***
Created attachment 159 [details] patches to existing files
Created attachment 160 [details] New files
The patches in http://deuce.doc.wustl.edu/bugzilla/showattachment.cgi?attach_id=159 show the diffs against the existing files. Here is what has been done:

1. The transport, on detecting a fault in its output path, calls the leader/follower with a timeout. (The timeout can be made an ORB option if needed.)

2. The LF_Event class has been changed considerably. We had mixed the state changes of the invocation path with those of the connection establishment path. Since that was confusing, the TAO_LF_Event class is now an abstract base class.

3. Two concrete classes have been added: TAO_LF_Invocation_Event, which does the work previously done in TAO_LF_Event, and TAO_LF_CH_Event, which takes care of non-blocking connection establishment and of termination when established connections fail.

4. The new files are in http://deuce.doc.wustl.edu/bugzilla/showattachment.cgi?attach_id=160.

These are the changes suggested for this bug. Bug_1020_Basic_Regression now works fine. This bug also opened up yet another problem, which showed up in the Veritas test cases; there is going to be a separate report on that.
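For orientation, the resulting class shape is roughly the following (a sketch only; the actual state names and method signatures are in attachment 160):

// Sketch of the refactored hierarchy.  TAO_LF_Event keeps the common
// state machine; each concrete class decides which transitions are
// legal for its path.
class TAO_LF_Event
{
public:
  virtual ~TAO_LF_Event (void);

  // Transition the state, e.g. to LFS_CONNECTION_CLOSED.
  void state_changed (int new_state);

protected:
  // Hook for the concrete classes: validate and apply the transition.
  virtual void state_changed_i (int new_state) = 0;

  int state_;
};

// State changes along the invocation path (the work the old
// TAO_LF_Event used to do).
class TAO_LF_Invocation_Event : public TAO_LF_Event
{
protected:
  virtual void state_changed_i (int new_state);
};

// State changes driven by the connection handler: non-blocking
// connection establishment, and termination of established
// connections on failure.
class TAO_LF_CH_Event : public TAO_LF_Event
{
protected:
  virtual void state_changed_i (int new_state);
};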
A workaround has been added to TAO 1.2.6. Here is the relevant ChangeLog entry:

Tue Nov 12 12:47:31 2002  Balachandran Natarajan  <bala@isis-server.isis.vanderbilt.edu>

There are three tests in TAO that pass. I am not closing this bug, since this is only a workaround.
Fixed. This ChangeLogTag:

Mon Jul 07 18:00:38 2003  Irfan Pyarali  <irfan@oomworks.com>

describes the proper fixes to these problems.