Bug 2994 - Deadlock in Two_DLL_ORB
Summary: Deadlock in Two_DLL_ORB
Status: ASSIGNED
Alias: None
Product: TAO
Classification: Unclassified
Component: ORB (show other bugs)
Version: 1.5.10
Hardware: All Windows XP
: P3 enhancement
Assignee: DOC Center Support List (internal)
URL:
Depends on: 2995
Blocks: 3027 3032 2996 3026
  Show dependency tree
 
Reported: 2007-07-18 06:39 CDT by Johnny Willemsen
Modified: 2007-08-08 05:09 CDT (History)
3 users (show)

See Also:


Attachments
callstack (7.61 KB, application/x-msdownload)
2007-07-18 06:40 CDT, Johnny Willemsen
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Johnny Willemsen 2007-07-18 06:39:53 CDT
The Two_DLL_ORB test in ORB_Local_Conf results in a deadlock. This is triggered by the fact that the CSD_Framework now also uses the PI library. There are two threads trying to load the PI library that result in a deadlock:

First ones load the PI through the ORB, second through the CSD Initializer

00445B42 ACE_DLL_Manager::open_dll(this=:00D74CBC, dll_name=:00D95158, open_mode=0, handle=NULL)
00443AD6 ACE_DLL::open_i(this=:00D9502C, dll_filename=:00D95070, open_mode=0, close_handle_on_destruction=1, handle=NULL)
004439E2 ACE_DLL::open(this=:00D9502C, dll_filename=:00D95070, open_mode=0, close_handle_on_destruction=1)
004865EF ACE_Location_Node::open_dll(this=:00D95020, yyerrno=:0149E0C8)
004868ED ACE_Function_Node::symbol(this=:00D95020,  =:00D76750, yyerrno=:0149E0C8, gobbler=:0149E0C4)
00487099 ACE_Service_Type_Factory::make_service_type(this=:00D95080, cfg=:00D76750)
004A27FE ACE_Service_Gestalt::initialize(this=:00D76750, stf=:00D95080, parameters=:00D95104)
0048615A ACE_Dynamic_Node::apply(this=:00D950D0, config=:00D76750, yyerrno=:0149E7DC)
004B5D44 ace_yyparse(ace_svc_conf_parameter=:0149E7D4)
004A307B ACE_Service_Gestalt::process_directives_i(this=:00D76750, param=:0149E7D4)
004A3458 ACE_Service_Gestalt::process_directive(this=:00D76750, directive=:0063A984)
0049E011 ACE_Service_Config::process_directive(directive=:0063A984)
005DD384 PortableInterceptor::register_orb_initializer(init=:00D817BC)
01523EA1 TAO_CSD_Framework_Loader::init()
015012AF TAO_CSD_ThreadPool::init()
01501288 _STCON0_()
3267E2F5 C:\PROGRA~1\Borland\CBUILD~1\Bin\CC3260MT.DLL
3267E813 C:\PROGRA~1\Borland\CBUILD~1\Bin\CC3260MT.DLL
0150113C CSD_Framework
7C91CBAB ntdll.dll
7C916178 ntdll.dll
7C9162DA ntdll.dll
7C801BB9 C:\WINDOWS\system32\kernel32.dll
7C801D6E C:\WINDOWS\system32\kernel32.dll
7C801DA4 C:\WINDOWS\system32\kernel32.dll
004798A8 ACE_OS::dlopen(fname=:00D81C9C, mode=0)
00443EE2 ACE_DLL_Handle::open(this=:00D8FF9C, dll_name=:00D6F948, open_mode=0, handle=NULL)
00445C5E ACE_DLL_Manager::open_dll(this=:00D74CBC, dll_name=:00D6F948, open_mode=0, handle=NULL)
00443AD6 ACE_DLL::open_i(this=:00D8AC54, dll_filename=:00D8AC10, open_mode=0, close_handle_on_destruction=1, handle=NULL)
004439E2 ACE_DLL::open(this=:00D8AC54, dll_filename=:00D8AC10, open_mode=0, close_handle_on_destruction=1)
004865EF ACE_Location_Node::open_dll(this=:00D8AC48, yyerrno=:0149F228)
004868ED ACE_Function_Node::symbol(this=:00D8AC48,  =:00D76750, yyerrno=:0149F228, gobbler=:0149F224)
00487099 ACE_Service_Type_Factory::make_service_type(this=:00D7F3A8, cfg=:00D76750)
004A27FE ACE_Service_Gestalt::initialize(this=:00D76750, stf=:00D7F3A8, parameters=:00D89478)
0048615A ACE_Dynamic_Node::apply(this=:00D6FBF0, config=:00D76750, yyerrno=:0149F970)
004B5D44 ace_yyparse(ace_svc_conf_parameter=:0149F968)
004A307B ACE_Service_Gestalt::process_directives_i(this=:00D76750, param=:0149F968)
004A3251 ACE_Service_Gestalt::process_file(this=:00D76750, file=:00D8ACF4)
004A3F9D ACE_Service_Gestalt::process_directives(this=:00D76750)
004A3929 ACE_Service_Gestalt::open_i(this=:00D76750,  =:00D8ABFC,  =:006494B0,  =false,  =false, ignore_debug_flag=false)
004A41BA ACE_Service_Gestalt::open(this=:00D76750, argc=3, argv=:00D8ABEC, logger_key=:006494B0, ignore_static_svcs=false, ignore_default_svc_conf=false, ignore_debug_flag=false)
005FE5A1 __UNNS__TAO_Internal_01c7c92cb8adf260::open_private_services_i(pcfg=:00D76750, argc=:0149FAB4, argv=:00D8ABEC, skip_service_config_open=false, ignore_default_svc_conf_file=false)
005FE30D TAO::ORB::open_services(pcfg=:00D76750, argc=:0149FDAC, argv=:00D74B84)
005CAC2D CORBA::ORB_init(argc=:0149FDAC, argv=:00D74B84, orbid=NULL)
013816BF Client_Worker::test_main(this=:00D74C74, argc=0, argv=:00D74B84)
01383A25 Service_Config_ORB_DLL::svc(this=:00D76588)
004B8D29 ACE_Task_Base::svc_run(args=:00D76588)
004B93DC ACE_Thread_Adapter::invoke_i(this=:00D7F3A8)
004B92D0 ACE_Thread_Adapter::invoke(this=:00D7F3A8)
00427C55 ace_thread_adapter(args=:00D7F3A8)
3267F32C C:\PROGRA~1\Borland\CBUILD~1\Bin\CC3260MT.DLL
3267F35F C:\PROGRA~1\Borland\CBUILD~1\Bin\CC3260MT.DLL
7C80B683 C:\WINDOWS\system32\kernel32.dll




00443D40 ACE_DLL_Handle::open(this=:00D6FCCC, dll_name=:00D9B3B4, open_mode=0, handle=NULL)
00445C5E ACE_DLL_Manager::open_dll(this=:00D74CBC, dll_name=:00D9B3B4, open_mode=0, handle=NULL)
00443AD6 ACE_DLL::open_i(this=:00D9B288, dll_filename=:00D9B2CC, open_mode=0, close_handle_on_destruction=1, handle=NULL)
004439E2 ACE_DLL::open(this=:00D9B288, dll_filename=:00D9B2CC, open_mode=0, close_handle_on_destruction=1)
004865EF ACE_Location_Node::open_dll(this=:00D9B27C, yyerrno=:0149E0C8)
004868ED ACE_Function_Node::symbol(this=:00D9B27C,  =:00D88B1C, yyerrno=:0149E0C8, gobbler=:0149E0C4)
00487099 ACE_Service_Type_Factory::make_service_type(this=:00D9B2DC, cfg=:00D88B1C)
004A27FE ACE_Service_Gestalt::initialize(this=:00D88B1C, stf=:00D9B2DC, parameters=:00D9B360)
0048615A ACE_Dynamic_Node::apply(this=:00D9B32C, config=:00D88B1C, yyerrno=:0149E7DC)
004B5D44 ace_yyparse(ace_svc_conf_parameter=:0149E7D4)
004A307B ACE_Service_Gestalt::process_directives_i(this=:00D88B1C, param=:0149E7D4)
004A3458 ACE_Service_Gestalt::process_directive(this=:00D88B1C, directive=:0063A984)
0049E011 ACE_Service_Config::process_directive(directive=:0063A984)
005DD384 PortableInterceptor::register_orb_initializer(init=:00D96238)
01523EA1 TAO_CSD_Framework_Loader::init()
015012AF TAO_CSD_ThreadPool::init()
01501288 _STCON0_()
3267E2F5 C:\PROGRA~1\Borland\CBUILD~1\Bin\CC3260MT.DLL
3267E813 C:\PROGRA~1\Borland\CBUILD~1\Bin\CC3260MT.DLL
0150113C CSD_Framework
7C91CBAB ntdll.dll
7C916178 ntdll.dll
7C9162DA ntdll.dll
7C801BB9 C:\WINDOWS\system32\kernel32.dll
7C801D6E C:\WINDOWS\system32\kernel32.dll
7C801DA4 C:\WINDOWS\system32\kernel32.dll
004798A8 ACE_OS::dlopen(fname=:00D942E8, mode=0)
00443EE2 ACE_DLL_Handle::open(this=:00D940F8, dll_name=:00D940E0, open_mode=0, handle=NULL)
00445C5E ACE_DLL_Manager::open_dll(this=:00D74CBC, dll_name=:00D940E0, open_mode=0, handle=NULL)
00443AD6 ACE_DLL::open_i(this=:00D93F94, dll_filename=:00D93FDC, open_mode=0, close_handle_on_destruction=1, handle=NULL)
004439E2 ACE_DLL::open(this=:00D93F94, dll_filename=:00D93FDC, open_mode=0, close_handle_on_destruction=1)
004865EF ACE_Location_Node::open_dll(this=:00D93F88, yyerrno=:0149F228)
004868ED ACE_Function_Node::symbol(this=:00D93F88,  =:00D88B1C, yyerrno=:0149F228, gobbler=:0149F224)
00487099 ACE_Service_Type_Factory::make_service_type(this=:00D93FF4, cfg=:00D88B1C)
004A27FE ACE_Service_Gestalt::initialize(this=:00D88B1C, stf=:00D93FF4, parameters=:00D94080)
0048615A ACE_Dynamic_Node::apply(this=:00D94048, config=:00D88B1C, yyerrno=:0149F970)
004B5D44 ace_yyparse(ace_svc_conf_parameter=:0149F968)
004A307B ACE_Service_Gestalt::process_directives_i(this=:00D88B1C, param=:0149F968)
004A3251 ACE_Service_Gestalt::process_file(this=:00D88B1C, file=:00D6FE94)
004A3F9D ACE_Service_Gestalt::process_directives(this=:00D88B1C)
004A3929 ACE_Service_Gestalt::open_i(this=:00D88B1C,  =:00D6FDB8,  =:006494B0,  =false,  =false, ignore_debug_flag=false)
004A41BA ACE_Service_Gestalt::open(this=:00D88B1C, argc=3, argv=:00D80FA8, logger_key=:006494B0, ignore_static_svcs=false, ignore_default_svc_conf=false, ignore_debug_flag=false)
005FE5A1 __UNNS__TAO_Internal_01c7c92cb8adf260::open_private_services_i(pcfg=:00D88B1C, argc=:0149FAB4, argv=:00D80FA8, skip_service_config_open=false, ignore_default_svc_conf_file=false)
005FE30D TAO::ORB::open_services(pcfg=:00D88B1C, argc=:0149FDAC, argv=:00D8122C)
005CAC2D CORBA::ORB_init(argc=:0149FDAC, argv=:00D8122C, orbid=NULL)
013816BF Client_Worker::test_main(this=:00D811EC, argc=0, argv=:00D8122C)
01383A25 Service_Config_ORB_DLL::svc(this=:00D78798)
004B8D29 ACE_Task_Base::svc_run(args=:00D78798)
004B93DC ACE_Thread_Adapter::invoke_i(this=:00D7874C)
004B92D0 ACE_Thread_Adapter::invoke(this=:00D7874C)
00427C55 ace_thread_adapter(args=:00D7874C)
3267F32C C:\PROGRA~1\Borland\CBUILD~1\Bin\CC3260MT.DLL
3267F35F C:\PROGRA~1\Borland\CBUILD~1\Bin\CC3260MT.DLL
7C80B683 C:\WINDOWS\system32\kernel32.dll
Comment 1 Johnny Willemsen 2007-07-18 06:40:50 CDT
Created attachment 821 [details]
callstack

Callstacks
Comment 2 Johnny Willemsen 2007-07-19 07:10:46 CDT
increased priority and severity
Comment 3 Iliyan Jeliazkov 2007-07-23 07:24:31 CDT
Hi,

Does anybody know why is it necessary to obtain a DLL Handle lock inside symbol()? 

...
void *
ACE_DLL_Handle::symbol (const ACE_TCHAR *sym_name, int ignore_errors)
{
  ACE_TRACE ("ACE_DLL_Handle::symbol");
  ACE_MT (ACE_GUARD_RETURN (ACE_Thread_Mutex, ace_mon, this->lock_, 0));
...

It doesn't look like symbol() is modifying the handle instance, so perhaps locking is not necessary.
Comment 4 Johnny Willemsen 2007-07-23 13:55:46 CDT
Is there the possibility that someone unloads the DLL by hand while another thread is trying to use symbol?

(In reply to comment #3)
> Hi,
> 
> Does anybody know why is it necessary to obtain a DLL Handle lock inside
> symbol()? 
> 
> ...
> void *
> ACE_DLL_Handle::symbol (const ACE_TCHAR *sym_name, int ignore_errors)
> {
>   ACE_TRACE ("ACE_DLL_Handle::symbol");
>   ACE_MT (ACE_GUARD_RETURN (ACE_Thread_Mutex, ace_mon, this->lock_, 0));
> ...
> 
> It doesn't look like symbol() is modifying the handle instance, so perhaps
> locking is not necessary.
> 

Comment 5 Iliyan Jeliazkov 2007-07-24 00:20:31 CDT
(In reply to comment #4)
> Is there the possibility that someone unloads the DLL by hand while another
> thread is trying to use symbol?

No, I don't think so: the ACE_DLL_Handle's are reference counted so the last one to use one (and close it) causes it to disappear.


BTW, can you still reproduce the lock-up? I did a clean build and have been unsuccessfully struggling for the rest of the afternoon to reproduce it.  Please, attach your config.h and any default.features files.
Comment 6 Johnny Willemsen 2007-07-24 00:24:06 CDT
> No, I don't think so: the ACE_DLL_Handle's are reference counted so the last
> one to use one (and close it) causes it to disappear.

Ok

> BTW, can you still reproduce the lock-up? I did a clean build and have been
> unsuccessfully struggling for the rest of the afternoon to reproduce it. 
> Please, attach your config.h and any default.features files.

I had it with borland, normal config.h file. Also the test stats show that several builds have problems. I will rebuild later today and rerun
Comment 7 Iliyan Jeliazkov 2007-07-24 08:04:13 CDT
(In reply to comment #6)
> > BTW, can you still reproduce the lock-up? I did a clean build and have been
> > unsuccessfully struggling for the rest of the afternoon to reproduce it. 
> > Please, attach your config.h and any default.features files.
> 
> I had it with borland, normal config.h file. Also the test stats show that
> several builds have problems. I will rebuild later today and rerun

Unfortunately, I don't have a borland compiler ...

I'll take a look at the test stats. Any particular test to watch for?

Also, I think you mentioned a problem, you thought was related - something about the order of initialization/finalization of DiffServ.dll and PI.dll (I think).  Can you point me to the test in question?

Thanks,

--Iliyan
Comment 8 Johnny Willemsen 2007-07-24 08:38:43 CDT
See TAO/tests/ORB_Local_Config/Two_DLL_ORB

> Also, I think you mentioned a problem, you thought was related - something
> about the order of initialization/finalization of DiffServ.dll and PI.dll (I
> think).  Can you point me to the test in question?

There is no test about that one, will have to ask the reporter to make one
Comment 9 Johnny Willemsen 2007-07-24 10:00:24 CDT
Just did another test run. Still fails on svn head. When removing the lock from the symbol method the test runs perfect. 

Steve, do you see any risk of removing the lock from the symbol method?
Comment 10 Steve Huston 2007-07-24 15:33:55 CDT
With respect to the symbol() method, I could maybe see obtaining the lock, taking a copy of the handle (via get_handle()), release the lock, and check for ACE_INVALID_HANDLE then proceed.

However, it seems that the issue here may instead be holding the lock across the dlopen() call in ACE_DLL_Handle::open() - if a DLL's initialization tries to load another DLL, it would deadlock. I didn't analyze all the possibilities here, but it may be necessary to temporarily release the lock around the dlopen() call.
Comment 11 Iliyan Jeliazkov 2007-07-25 00:33:37 CDT
(In reply to comment #10)
> With respect to the symbol() method, I could maybe see obtaining the lock,
> taking a copy of the handle (via get_handle()), release the lock, and check
> for ACE_INVALID_HANDLE then proceed.
> 
> However, it seems that the issue here may instead be holding the lock across
> the dlopen() call in ACE_DLL_Handle::open() - if a DLL's initialization tries
> to load another DLL, it would deadlock. I didn't analyze all the possibilities
> here, but it may be necessary to temporarily release the lock around the
> dlopen() call.


The ACE_DLL_Handle::lock_ is an instance member and the guard in open() grabs it upon entry, so DLL_Handle will not deadlock just because one DLL causes another ::open() call.

Also, since the ACE_Thread_Mutex is recursive (isn't it? at least on windows?) - if ACE_DLL_Handle::open () re-enters, it will not deadlock. (But if ACE_Thread_Mutex is not recursive, it will!)


It is however, conceivable to have a thread in DLL1::open() -> ... DLL2::open() and another in DLL2:open() -> ... -> DLL1::open(). Thus T1 would wait for DLL2, while T2 will wait for DLL1, which would be a typical deadlock.

The other place this lock is referenced from is symbol(), so therefore one can use either open() or system().

I agree with Steve about a possible approach to resolving this issue. Perhaps, a reverse lock, combined with re-checking the this->handle_ before we assign to it...
Comment 12 Abdul Sowayan 2007-07-25 08:02:38 CDT
>Also, since the ACE_Thread_Mutex is recursive (isn't it? 
>at least on windows?) - if ACE_DLL_Handle::open () 
>re-enters, it will not deadlock. (But if
>ACE_Thread_Mutex is not recursive, it will!)

No, it isn't recursive on all platforms. If you want such behavior, use ACE_Recursive_Thread_Mutex

Thanks,
Abdul
Comment 13 Iliyan Jeliazkov 2007-07-25 12:54:02 CDT
(In reply to comment #12)
> >Also, since the ACE_Thread_Mutex is recursive (isn't it? 
> >at least on windows?) - if ACE_DLL_Handle::open () 
> >re-enters, it will not deadlock. (But if
> >ACE_Thread_Mutex is not recursive, it will!)
> 
> No, it isn't recursive on all platforms. If you want such behavior, use
> ACE_Recursive_Thread_Mutex


Thanks, I was suspecting that much although I don't know a whole lot about ACE_DLL and the related stuff ...


Johnny, Steve,

It looks that apart from having developed a test (Two_DLL_ORB) that trips this problem, I can't provide much insight into this issue.  My knowledge and expertise into the workings of ACE_DLL_Handle is rather limited.  In the absence of other takers, I can try to tackle it but it may not be until the end of the week when I get a chance to work on it.  Please, consider re-assigning it.

--Iliyan
Comment 14 Iliyan Jeliazkov 2007-07-29 23:58:26 CDT
Okay, fine :) ... What I see in the debugger when the test deadlocks doesn't make sense to me. Perhaps an Windows expert can help? 

As Johnny noted: T1 loads PI through the ORB and T2 - through the CSD (ORB) Initializer. My observation is slightly different in that T1 is in DLL_Handle::symbol() instead of open(). Similar to Johnny's T2 is in DLL_Handle::open():


T1's call stack (abbreviated):
...
 	ACEd.dll!ACE_OS::thread_mutex_unlock(_RTL_CRITICAL_SECTION * m=0x00ed077c)  Line 3662 + 0xc bytes	C++
 	ACEd.dll!ACE_Thread_Mutex::release()  Line 81 + 0x9 bytes	C++
 	010af368()	
 	ACEd.dll!ACE_DLL_Handle::open(const char * dll_name=0x01220000, int open_mode=17494632, HINSTANCE__ * handle=0x00000000)  Line 239 + 0xf bytes	C++
 	ntdll.dll!7c919ba0() 	
 	kernel32.dll!7c80adde() 	
 	ACEd.dll!ACE_DLL_Handle::symbol(const char * sym_name=0x00eb8c90, int ignore_errors=0)  Line 329 + 0x15 bytes	C++
 	ACEd.dll!ACE_DLL::symbol(const char * sym_name=0x00eb8c90, int ignore_errors=0)  Line 186 + 0x13 bytes	C++
 	ACEd.dll!ACE_Function_Node::symbol(ACE_Service_Gestalt * __formal=0x00ec0938, int & yyerrno=0, void (void *)* * gobbler=0x010af39c)  Line 597 + 0x11 bytes	C++
 	ACEd.dll!ACE_Service_Type_Factory::make_service_type(ACE_Service_Gestalt * cfg=0x00ec0938)  Line 796 + 0x29 bytes	C++
 	ACEd.dll!ACE_Service_Gestalt::initialize(const ACE_Service_Type_Factory * stf=0x00ebb5a8, const char * parameters=0x00eb8d78)  Line 583 + 0xc bytes	C++
 	ACEd.dll!ACE_Dynamic_Node::apply(ACE_Service_Gestalt * config=0x00ec0938, int & yyerrno=0)  Line 264 + 0x1d bytes	C++
 	ACEd.dll!ace_yyparse(void * ace_svc_conf_parameter=0x010afb24)  Line 1126 + 0x27 bytes	C++
 	ACEd.dll!ACE_Service_Gestalt::process_directives_i(ACE_Svc_Conf_Param * param=0x010afb24)  Line 825 + 0x9 bytes	C++
 	ACEd.dll!ACE_Service_Gestalt::process_directive(const char * directive=0x006d9760)  Line 960 + 0xc bytes	C++
 	TAOd.dll!TAO_ORB_Core::orbinitializer_registry_i()  Line 1627 + 0x10 bytes	C++
 	TAOd.dll!TAO_ORB_Core::orbinitializer_registry()  Line 440 + 0x8 bytes	C++
 	TAOd.dll!CORBA::ORB_init(int & argc=0, char * * argv=0x00ebd830, const char * orbid=0x00000000)  Line 1320 + 0xf bytes	C++
 	ORB_DLL_Serverd.dll!Server_Worker::test_main(int argc=0, char * * argv=0x00ebd830)  Line 63 + 0x30 bytes	C++
...



T2's call stack (abbreviated):
...
 	ACEd.dll!ACE_OS::thread_mutex_lock(_RTL_CRITICAL_SECTION * m=0x00ed077c)  Line 3581 + 0xc bytes	C++
 	ACEd.dll!ACE_Thread_Mutex::acquire()  Line 53 + 0x9 bytes	C++
 	ACEd.dll!ACE_Guard<ACE_Thread_Mutex>::acquire()  Line 12 + 0xa bytes	C++
 	ACEd.dll!ACE_Guard<ACE_Thread_Mutex>::ACE_Guard<ACE_Thread_Mutex>(ACE_Thread_Mutex & l={...})  Line 39	C++
	ACEd.dll!ACE_DLL_Handle::open(const char * dll_name=0x00edfc48, int open_mode=0, HINSTANCE__ * handle=0x00000000)  Line 56 + 0xf bytes	C++
 	ACEd.dll!ACE_DLL_Manager::open_dll(const char * dll_name=0x00edfc48, int open_mode=0, HINSTANCE__ * handle=0x00000000)  Line 590 + 0x14 bytes	C++
...
TAOd.dll!PortableInterceptor::register_orb_initializer(PortableInterceptor::ORBInitializer * init=0x00eda83c)  Line 60 + 0xd bytes	C++
 	TAO_CSD_Frameworkd.dll!TAO_CSD_Framework_Loader::init()  Line 46 + 0x11 bytes	C++
...
 	ACEd.dll!ACE_OS::dlopen(const char * fname=0x00ed0db0, int mode=0)  Line 157 + 0xc bytes	C++
 	ACEd.dll!ACE_DLL_Handle::open(const char * dll_name=0x00eb9190, int open_mode=0, HINSTANCE__ * handle=0x00000000)  Line 131 + 0x12 bytes	C++
 	ACEd.dll!ACE_DLL_Manager::open_dll(const char * dll_name=0x00eb9190, int open_mode=0, HINSTANCE__ * handle=0x00000000)  Line 590 + 0x14 bytes	C++
 	ACEd.dll!ACE_DLL::open_i(const char * dll_filename=0x00eb8ed8, int open_mode=0, int close_handle_on_destruction=1, HINSTANCE__ * handle=0x00000000)  Line 167 + 0x25 bytes	C++
 	ACEd.dll!ACE_DLL::open(const char * dll_filename=0x00eb8ed8, int open_mode=0, int close_handle_on_destruction=1)  Line 128	C++
 	ACEd.dll!ACE_Location_Node::open_dll(int & yyerrno=0)  Line 416 + 0x18 bytes	C++
 	ACEd.dll!ACE_Function_Node::symbol(ACE_Service_Gestalt * __formal=0x00ebea80, int & yyerrno=0, void (void *)* * gobbler=0x011ef124)  Line 588 + 0xc bytes	C++
 	ACEd.dll!ACE_Service_Type_Factory::make_service_type(ACE_Service_Gestalt * cfg=0x00ebea80)  Line 796 + 0x29 bytes	C++
 	ACEd.dll!ACE_Service_Gestalt::initialize(const ACE_Service_Type_Factory * stf=0x00eb8f28, const char * parameters=0x00eb9088)  Line 583 + 0xc bytes	C++
 	ACEd.dll!ACE_Dynamic_Node::apply(ACE_Service_Gestalt * config=0x00ebea80, int & yyerrno=0)  Line 264 + 0x1d bytes	C++
 	ACEd.dll!ace_yyparse(void * ace_svc_conf_parameter=0x011ef8b0)  Line 1126 + 0x27 bytes	C++
 	ACEd.dll!ACE_Service_Gestalt::process_directives_i(ACE_Svc_Conf_Param * param=0x011ef8b0)  Line 825 + 0x9 bytes	C++
 	ACEd.dll!ACE_Service_Gestalt::process_file(const char * file=0x00ebb4b8)  Line 924 + 0xf bytes	C++
 	ACEd.dll!ACE_Service_Gestalt::process_directives()  Line 1211 + 0x11 bytes	C++
 	ACEd.dll!ACE_Service_Gestalt::open_i(const char * __formal=0x00eb8c40, const char * __formal=0x00eb8c40, const char * __formal=0x00eb8c40, const char * __formal=0x00eb8c40, bool ignore_debug_flag=false)  Line 1065 + 0x8 bytes	C++
 	ACEd.dll!ACE_Service_Gestalt::open(int argc=3, char * * argv=0x00eb8bf4, const char * logger_key=0x006e9b10, bool ignore_static_svcs=false, bool ignore_default_svc_conf=false, bool ignore_debug_flag=false)  Line 48 + 0x3a bytes	C++
 	TAOd.dll!`anonymous namespace'::open_private_services_i(ACE_Service_Gestalt * pcfg=0x00ebea80, int & argc=3, char * * argv=0x00eb8bf4, bool skip_service_config_open=false, bool ignore_default_svc_conf_file=false)  Line 460 + 0x40 bytes	C++
 	TAOd.dll!TAO::ORB::open_services(ACE_Service_Gestalt * pcfg=0x00ebea80, int & argc=0, char * * argv=0x00ebe638)  Line 389 + 0x21 bytes	C++
 	TAOd.dll!CORBA::ORB_init(int & argc=0, char * * argv=0x00ebe638, const char * orbid=0x00000000)  Line 1303 + 0x3b bytes	C++
...



Looks like T1 is unlocking the same lock that T2 is trying to obtain (_RTL_CRITICAL_SECTION * m=0x00ed077c). 

Why would the two deadlock?


The stack above ::symbol() looks fishy, because it is apparently still in ACE_OS::dlsym() and should therefore still hold the lock.  This explains why T2 is waiting to obtain it.  But it doesn't explain why T1 doesn't return from ACE_OS::dlsym() (i.e. GetProcAddress())


Hmm, the process explorer from sysinternals.com shows the following stack trace for T1:

...
ntoskrnl.exe!ExReleaseResourceLite+0x206
ntoskrnl.exe!RtlAppendUnicodeToString+0x2b8
ntoskrnl.exe!CcSetDirtyPinnedData+0x3f0
ntoskrnl.exe!NtWaitForSingleObject+0x94
ntoskrnl.exe!ZwYieldExecution+0xb78
ntdll.dll!KiFastSystemCallRet
ntdll.dll!RtlEnterCriticalSection+0x46
ntdll.dll!LdrGetProcedureAddress+0x18
!GetProcAddress+0x3e
!ACE_OS::dlsym+0x2b
!ACE_DLL_Handle::symbol+0xb7
!ACE_DLL::symbol+0x44
!ACE_Function_Node::symbol+0x54
!ACE_Service_Type_Factory::make_service_type+0x90
!ACE_Service_Gestalt::initialize+0x1bc
!ACE_Dynamic_Node::apply+0x3b
!ace_yyparse+0x303
!ACE_Service_Gestalt::process_directives_i+0xf6
!ACE_Service_Gestalt::process_directive+0xb3
!TAO_ORB_Core::orbinitializer_registry_i+0x59
!TAO_ORB_Core::orbinitializer_registry+0x98
!CORBA::ORB_init+0x67c
!Server_Worker::test_main+0xba
!Service_Config_ORB_DLL::svc+0x19f
!ACE_Task_Base::svc_run+0x47
!ACE_Thread_Adapter::invoke_i+0xdd
!ACE_Thread_Adapter::invoke+0xc3
!ace_thread_adapter+0x2a
!beginthreadex+0x221
!beginthreadex+0x1c7
!GetModuleFileNameA+0x1b4
...



And for T2:

...
ntoskrnl.exe!ExReleaseResourceLite+0x206
ntoskrnl.exe!RtlAppendUnicodeToString+0x2b8
ntoskrnl.exe!CcSetDirtyPinnedData+0x3f0
ntoskrnl.exe!NtWaitForSingleObject+0x94
ntoskrnl.exe!ZwYieldExecution+0xb78
ntdll.dll!KiFastSystemCallRet
ntdll.dll!RtlEnterCriticalSection+0x46
!ACE_Thread_Mutex::acquire+0x17
!ACE_Guard<ACE_Thread_Mutex>::acquire+0x18
!ACE_Guard<ACE_Thread_Mutex>::ACE_Guard<ACE_Thread_Mutex>+0x28
!ACE_DLL_Handle::open+0x50
!ACE_DLL_Manager::open_dll+0x1be
!ACE_DLL::open_i+0x11e
!ACE_DLL::open+0x24
!ACE_Location_Node::open_dll+0x8f
!ACE_Function_Node::symbol+0x28
!ACE_Service_Type_Factory::make_service_type+0x90
!ACE_Service_Gestalt::initialize+0x1bc
!ACE_Dynamic_Node::apply+0x3b
!ace_yyparse+0x303
!ACE_Service_Gestalt::process_directives_i+0xf6
!ACE_Service_Gestalt::process_directive+0xb3
!ACE_Service_Config::process_directive+0x13
!PortableInterceptor::register_orb_initializer+0x13a
!TAO_CSD_Framework_Loader::init+0x15e
!TAO_CSD_ThreadPool::init+0x23
!`dynamic initializer for 'TAO_Requires_CSD_Threadpool''+0x8
!initterm+0x1a
!_CRT_INIT+0xdf
!__DllMainCRTStartup+0xac
!_DllMainCRTStartup+0x1f
ntdll.dll!LdrInitializeThunk+0x29
ntdll.dll!LdrFindResourceDirectory_U+0x276
ntdll.dll!RtlValidateUnicodeString+0x506
ntdll.dll!LdrLoadDll+0x110
!LoadLibraryExW+0xc8
!LoadLibraryExA+0x1f
!LoadLibraryA+0x2d
!ACE_OS::dlopen+0x18
!ACE_DLL_Handle::open+0x1cc
!ACE_DLL_Manager::open_dll+0x1be
!ACE_DLL::open_i+0x11e
!ACE_DLL::open+0x24
!ACE_Location_Node::open_dll+0x8f
!ACE_Function_Node::symbol+0x28
!ACE_Service_Type_Factory::make_service_type+0x90
!ACE_Service_Gestalt::initialize+0x1bc
!ACE_Dynamic_Node::apply+0x3b
!ace_yyparse+0x303
!ACE_Service_Gestalt::process_directives_i+0xf6
!ACE_Service_Gestalt::process_file+0x1a2
!ACE_Service_Gestalt::process_directives+0x6e
!ACE_Service_Gestalt::open_i+0x175
!ACE_Service_Gestalt::open+0x85
!`anonymous namespace'::open_private_services_i+0xab
!TAO::ORB::open_services+0xc26
!CORBA::ORB_init+0x593
!Client_Worker::test_main+0x97
!Service_Config_ORB_DLL::svc+0x19f
!ACE_Task_Base::svc_run+0x47
!ACE_Thread_Adapter::invoke_i+0xdd
!ACE_Thread_Adapter::invoke+0xc3
!ace_thread_adapter+0x2a
!beginthreadex+0x221
!beginthreadex+0x1c7
!GetModuleFileNameA+0x1b4
...

Does this make more sense to anybody?


--Iliyan Jeliazkov
Comment 15 Adam Mitz 2007-07-31 13:31:51 CDT
(In reply to comment #14)
> Okay, fine :) ... What I see in the debugger when the test deadlocks doesn't
> make sense to me. Perhaps an Windows expert can help? 

I won't claim to be an expert, but how about attaching a system debugger and running
!critsec ed077c
to see what it tells you about the critical section?  Maybe it's not even valid anymore?  If it is valid this command will show you if it's locked or not, the owning thread, etc.

-Adam

Comment 16 Iliyan Jeliazkov 2007-08-02 22:52:13 CDT
Mine
Comment 17 Iliyan Jeliazkov 2007-08-03 00:12:15 CDT
I think I figured it out why Two_DLL_ORB test is deadlocking! It has _nothing_ to do with the ORB configuration, but nevertheless the issue can only be demonstrated with one of the ORB_Local_Config tests:

The DLL-related system primitives on Windows (GetProcedureAddress, LoadLibrary, etc.) obtain lock(s) on their own, which can happen out-of-order with ACE_DLL_Handle's locking.

If I could refer to these locks: L1 (internal) and L2 (ACE_DLL_Handle) respectively, what happens is:

  T1
    loads TAO_CSD_ThreadPool ...
    loads TAO_PI
    call ACE_DLL_Handle::open()
    lock L2
    call ::GetProcedureAddress
    lock L1


  T2
    call ::LoadLibrary
    lock L1
    loads TAO_PI
    call ACE_DLL_Handle::symbol()
    lock L2 



What can we do? 

Perhaps we can not hold the lock (L2) in symbol? But then it will make it possible for symbol() to fail just because another thread is in the middle of opening the DLL. Perhaps, not on Windows? The OS seems to be protected from retrieving a symbol of a library that's still loading, but is ACE?

(I dare not enduring similar reverse engineering session on Linux. And on Solaris. And on ... A-a-a-a-a! :)

Maybe we can employ a sort of double-checked-locking pattern in symbol(): check the handle_ first and if it is not okay the get the lock_ and check again?
  

I'm interested in getting opinions.





Here's the gory details for those who'd like to check my sanity (please!):


Stack trace for T1 (0xc74)                        (Warning: LONG!):

011edb20 7c90e9c0 ntdll!KiFastSystemCallRet
011edb24 7c91901b ntdll!ZwWaitForSingleObject+0xc
011edbac 7c90104b ntdll!RtlpWaitForCriticalSection+0x132
011edbb4 100d8c50 ntdll!RtlEnterCriticalSection+0x46
011edbc4 1014c207 ACEd!ACE_OS::thread_mutex_lock(struct _RTL_CRITICAL_SECTION * m = 0x00ec923c)+0x10
011edbd4 10025a38 ACEd!ACE_Thread_Mutex::acquire(void)+0x17
011edbe0 10025898 ACEd!ACE_Guard<ACE_Thread_Mutex>::acquire(void)+0x18
011edbec 100770d0 ACEd!ACE_Guard<ACE_Thread_Mutex>::ACE_Guard<ACE_Thread_Mutex>(class ACE_Thread_Mutex * l = 0x00ec923c)+0x28
011edcb8 100789fe ACEd!ACE_DLL_Handle::open(char * dll_name = 0x00edac70 "TAO_PI", int open_mode = 0, struct HINSTANCE__ * handle = 0x00000000)+0x50
011edd48 10076a5e ACEd!ACE_DLL_Manager::open_dll(char * dll_name = 0x00edac70 "TAO_PI", int open_mode = 0, struct HINSTANCE__ * handle = 0x00000000)+0x1be
011edd6c 10076924 ACEd!ACE_DLL::open_i(char * dll_filename = 0x00edaa80 "TAO_PI", int open_mode = 0, int close_handle_on_destruction = 1, struct HINSTANCE__ * handle = 0x00000000)+0x11e
011edd88 100e7d1f ACEd!ACE_DLL::open(char * dll_filename = 0x00edaa80 "TAO_PI", int open_mode = 0, int close_handle_on_destruction = 1)+0x24
011eddb8 100e8148 ACEd!ACE_Location_Node::open_dll(int * yyerrno = 0x011ede4c)+0x8f
011eddf8 100e8a40 ACEd!ACE_Function_Node::symbol(class ACE_Service_Gestalt * __formal = 0x00eb9d20, int * yyerrno = 0x011ede4c, <function> ** gobbler = 0x011ede40)+0x28
011ede68 10118e2c ACEd!ACE_Service_Type_Factory::make_service_type(class ACE_Service_Gestalt * cfg = 0x00eb9d20)+0x90
011edee8 100e773b ACEd!ACE_Service_Gestalt::initialize(class ACE_Service_Type_Factory * stf = 0x00edaab8, char * parameters = 0x00eca860 "")+0x1bc
011edf04 1013b233 ACEd!ACE_Dynamic_Node::apply(class ACE_Service_Gestalt * config = 0x00eb9d20, int * yyerrno = 0x011ee5d0)+0x3b
011ee55c 10119726 ACEd!ace_yyparse(void * ace_svc_conf_parameter = 0x011ee5c8)+0x303
011ee5ac 10119b23 ACEd!ACE_Service_Gestalt::process_directives_i(class ACE_Svc_Conf_Param * param = 0x011ee5c8)+0xf6
011ee60c 10114343 ACEd!ACE_Service_Gestalt::process_directive(char * directive = 0x006dbe60 "dynamic ORBInitializer_Registry Service_Object * TAO_PI:_make_ORBInitializer_Registry() """)+0xb3
011ee618 00636a0a ACEd!ACE_Service_Config::process_directive(char * directive = 0x006dbe60 "dynamic ORBInitializer_Registry Service_Object * TAO_PI:_make_ORBInitializer_Registry() """)+0x13
011ee678 01302dae TAOd!PortableInterceptor::register_orb_initializer(class PortableInterceptor::ORBInitializer * init = 0x00ed5924)+0x13a
011ee6ec 012c1f13 TAO_CSD_Frameworkd!TAO_CSD_Framework_Loader::init(void)+0x15e
011ee6f8 012d34b8 TAO_CSD_ThreadPoold!TAO_CSD_ThreadPool::init(void)+0x23
011ee700 0042246a TAO_CSD_ThreadPoold!`dynamic initializer for 'TAO_Requires_CSD_Threadpool''(void)+0x8
011ee708 012cdf8f MSVCR80D!_initterm+0x1a
011ee730 012ce1dc TAO_CSD_ThreadPoold!_CRT_INIT(void * hDllHandle = 0x012c0000, unsigned long dwReason = 1, void * lpreserved = 0x00000000)+0xdf
011ee774 012ce11f TAO_CSD_ThreadPoold!__DllMainCRTStartup(void * hDllHandle = 0x012c0000, unsigned long dwReason = 1, void * lpreserved = 0x00000000)+0xac
011ee788 7c9011a7 TAO_CSD_ThreadPoold!_DllMainCRTStartup(void * hDllHandle = 0x012c0000, unsigned long dwReason = 1, void * lpreserved = 0x00000000)+0x1f
011ee7a8 7c91cbab ntdll!LdrpCallInitRoutine+0x14
011ee8b0 7c916178 ntdll!LdrpRunInitializeRoutines+0x344
011eeb5c 7c9162da ntdll!LdrpLoadDll+0x3e5
011eee04 7c801bb9 ntdll!LdrLoadDll+0x230
011eee6c 7c801d6e kernel32!LoadLibraryExW+0x18e
011eee80 7c801da4 kernel32!LoadLibraryExA+0x1f
011eee9c 100cc748 kernel32!LoadLibraryA+0x94
011eeeb0 1007724c ACEd!ACE_OS::dlopen(char * fname = 0x00ed0f58 "TAO_CSD_ThreadPoold.dll", int mode = 0)+0x18
011eef80 100789fe ACEd!ACE_DLL_Handle::open(char * dll_name = 0x00ed0c48 "TAO_CSD_ThreadPool", int open_mode = 0, struct HINSTANCE__ * handle = 0x00000000)+0x1cc
011ef010 10076a5e ACEd!ACE_DLL_Manager::open_dll(char * dll_name = 0x00ed0c48 "TAO_CSD_ThreadPool", int open_mode = 0, struct HINSTANCE__ * handle = 0x00000000)+0x1be
011ef034 10076924 ACEd!ACE_DLL::open_i(char * dll_filename = 0x00ed0a10 "TAO_CSD_ThreadPool", int open_mode = 0, int close_handle_on_destruction = 1, struct HINSTANCE__ * handle = 0x00000000)+0x11e
011ef050 100e7d1f ACEd!ACE_DLL::open(char * dll_filename = 0x00ed0a10 "TAO_CSD_ThreadPool", int open_mode = 0, int close_handle_on_destruction = 1)+0x24
011ef080 100e8148 ACEd!ACE_Location_Node::open_dll(int * yyerrno = 0x011ef114)+0x8f
011ef0c0 100e8a40 ACEd!ACE_Function_Node::symbol(class ACE_Service_Gestalt * __formal = 0x00eb9d20, int * yyerrno = 0x011ef114, <function> ** gobbler = 0x011ef108)+0x28
011ef130 10118e2c ACEd!ACE_Service_Type_Factory::make_service_type(class ACE_Service_Gestalt * cfg = 0x00eb9d20)+0x90
011ef1b0 100e773b ACEd!ACE_Service_Gestalt::initialize(class ACE_Service_Type_Factory * stf = 0x00ed0a50, char * parameters = 0x00ed0b70 "-CSDtp RootPOA:2")+0x1bc
011ef1cc 1013b233 ACEd!ACE_Dynamic_Node::apply(class ACE_Service_Gestalt * config = 0x00eb9d20, int * yyerrno = 0x011ef89c)+0x3b
011ef824 10119726 ACEd!ace_yyparse(void * ace_svc_conf_parameter = 0x011ef894)+0x303
011ef874 101199b2 ACEd!ACE_Service_Gestalt::process_directives_i(class ACE_Svc_Conf_Param * param = 0x011ef894)+0xf6
011ef940 1011a5be ACEd!ACE_Service_Gestalt::process_file(char * file = 0x00eca890 "secondary-csd.conf")+0x1a2
011ef978 10119e65 ACEd!ACE_Service_Gestalt::process_directives(void)+0x6e
011ef9c8 1011a935 ACEd!ACE_Service_Gestalt::open_i(char * __formal = 0x00eca6e0 "-ORBDebugLevel", char * __formal = 0x006e9aa8 "localhost:20012", bool __formal = false, bool __formal = false, bool ignore_debug_flag = false)+0x175
011ef9f0 00667c2b ACEd!ACE_Service_Gestalt::open(int argc = 3, char ** argv = 0x00eca6a4, char * logger_key = 0x006e9aa8 "localhost:20012", bool ignore_static_svcs = false, bool ignore_default_svc_conf = false, bool ignore_debug_flag = false)+0x85
011efa3c 006677a6 TAOd!`anonymous namespace'::open_private_services_i(class ACE_Service_Gestalt * pcfg = 0x00eb9d20, int * argc = 0x011efac8, char ** argv = 0x00eca6a4, bool skip_service_config_open = false, bool ignore_default_svc_conf_file = false)+0xab
011efbdc 0062076c TAOd!TAO::ORB::open_services(class ACE_Service_Gestalt * pcfg = 0x00eb9d20, int * argc = 0x011efe3c, char ** argv = 0x00ec23a8)+0xc26
011efd58 010b1e57 TAOd!CORBA::ORB_init(int * argc = 0x011efe3c, char ** argv = 0x00ec23a8, char * orbid = 0x00000000 "")+0x5dc
011efe34 010b356f ORB_DLL_Clientd!Client_Worker::test_main(int argc = 0, char ** argv = 0x00ec23a8)+0x97
011efea0 101402d7 ORB_DLL_Clientd!Service_Config_ORB_DLL::svc(void)+0x19f
011efebc 101413ad ACEd!ACE_Task_Base::svc_run(void * args = 0x00ebb528)+0x47
011eff18 10141263 ACEd!ACE_Thread_Adapter::invoke_i(void)+0xdd
011eff58 1004940a ACEd!ACE_Thread_Adapter::invoke(void)+0xc3


Stack trace for T2 (0x144c):

010af0cc 7c90e9c0 ntdll!KiFastSystemCallRet
010af0d0 7c91901b ntdll!ZwWaitForSingleObject+0xc
010af158 7c90104b ntdll!RtlpWaitForCriticalSection+0x132
010af160 7c919b30 ntdll!RtlEnterCriticalSection+0x46
010af210 7c919ba0 ntdll!LdrpGetProcedureAddress+0x117
010af22c 7c80adde ntdll!LdrGetProcedureAddress+0x18
010af254 100cc7ab kernel32!GetProcAddress+0x43
010af270 100779d7 ACEd!ACE_OS::dlsym(struct HINSTANCE__ * handle = 0x01220000, char * sname = 0x00ec92c0 "_make_ORBInitializer_Registry")+0x2b
010af2dc 10076ae4 ACEd!ACE_DLL_Handle::symbol(char * sym_name = 0x00ec8f48 "_make_ORBInitializer_Registry", int ignore_errors = 0)+0xb7
010af2f4 100e8174 ACEd!ACE_DLL::symbol(char * sym_name = 0x00ec8f48 "_make_ORBInitializer_Registry", int ignore_errors = 0)+0x44
010af338 100e8a40 ACEd!ACE_Function_Node::symbol(class ACE_Service_Gestalt * __formal = 0x00ebdcf8, int * yyerrno = 0x010af38c, <function> ** gobbler = 0x010af380)+0x54
010af3a8 10118e2c ACEd!ACE_Service_Type_Factory::make_service_type(class ACE_Service_Gestalt * cfg = 0x00ebdcf8)+0x90
010af428 100e773b ACEd!ACE_Service_Gestalt::initialize(class ACE_Service_Type_Factory * stf = 0x00ec8f98, char * parameters = 0x00ec90b8 "")+0x1bc
010af444 1013b233 ACEd!ACE_Dynamic_Node::apply(class ACE_Service_Gestalt * config = 0x00ebdcf8, int * yyerrno = 0x010afb10)+0x3b
010afa9c 10119726 ACEd!ace_yyparse(void * ace_svc_conf_parameter = 0x010afb08)+0x303
010afaec 10119b23 ACEd!ACE_Service_Gestalt::process_directives_i(class ACE_Svc_Conf_Param * param = 0x010afb08)+0xf6
010afb4c 006296a9 ACEd!ACE_Service_Gestalt::process_directive(char * directive = 0x006d96f8 "dynamic ORBInitializer_Registry Service_Object * TAO_PI:_make_ORBInitializer_Registry() """)+0xb3
010afb64 00624fa8 TAOd!TAO_ORB_Core::orbinitializer_registry_i(void)+0x59
010afb9c 00620855 TAOd!TAO_ORB_Core::orbinitializer_registry(void)+0x98
010afd0c 003c568a TAOd!CORBA::ORB_init(int * argc = 0x010afe3c, char ** argv = 0x00eb0ec8, char * orbid = 0x00000000 "")+0x6c5
010afe34 003c356f ORB_DLL_Serverd!Server_Worker::test_main(int argc = 0, char ** argv = 0x00eb0ec8)+0xba
010afea0 101402d7 ORB_DLL_Serverd!Service_Config_ORB_DLL::svc(void)+0x19f
010afebc 101413ad ACEd!ACE_Task_Base::svc_run(void * args = 0x00eb0930)+0x47
010aff18 10141263 ACEd!ACE_Thread_Adapter::invoke_i(void)+0xdd
010aff58 1004940a ACEd!ACE_Thread_Adapter::invoke(void)+0xc3
010aff6c 004248d1 ACEd!ace_thread_adapter(void * args = 0x00ebb288)+0x2a
010affa8 00424877 MSVCR80D!_beginthreadex+0x221
010affb4 7c80b683 MSVCR80D!_beginthreadex+0x1c7
010affec 00000000 kernel32!BaseThreadStart+0x37





Note that the thread (0xc74) in ACE_DLL_Handle::open () is blocked on the following critical section:

0:002> !critsec ec923c

CritSec +ec923c at 00ec923c
LockCount          1
RecursionCount     1
OwningThread       144c
EntryCount         1
ContentionCount    1
*** Locked




The thread (0x144c), in ACE_DLL_Handle::symbol() owns the critical section shown above. What it is trying to do is this:

...
7c919b26 68d8c0977c      push    offset ntdll!LdrpLoaderLock (7c97c0d8)
7c919b2b e8d574feff      call    ntdll!RtlEnterCriticalSection (7c901005)
7c919b30 e915ffffff      jmp     ntdll!LdrpGetProcedureAddress+0x117 (7c919a4a)
7c919b35 68d8c0977c      push    offset ntdll!LdrpLoaderLock (7c97c0d8)
7c919b3a e8ae75feff      call    ntdll!RtlLeaveCriticalSection (7c9010ed)
...

As far as my assembly goes it means that GetProcedureAddress tries to enter a critical section, pointed to by ntdll!LdrpLoaderLock:

0:001> !critsec 7c97c0d8

CritSec ntdll!LdrpLoaderLock+0 at 7c97c0d8
LockCount          3
RecursionCount     2
OwningThread       c74
EntryCount         5
ContentionCount    5
*** Locked


Clearly, that one is owned by T1 (0xc74). I is a classic case of obtaining the locks in the wrong order. I was curious to determine where exactly does T1 grab the ntdll!LdrpLoaderLock.

Some poking around with windbg later, I noticed that this call (at the bottom):

...
011ee8b0 7c916178 ntdll!LdrpRunInitializeRoutines+0x344
011eeb5c 7c9162da ntdll!LdrpLoadDll+0x3e5
011eee04 7c801bb9 ntdll!LdrLoadDll+0x230
...


does the following:

...
7c916293 e8d9ceffff      call    ntdll!LdrLockLoaderLock (7c913171)
7c916298 895dfc          mov     dword ptr [ebp-4],ebx
7c91629b 8b357cc1977c    mov     esi,dword ptr [ntdll!LdrpTopLevelDllBeingLoaded (7c97c17c)]
7c9162a1 89b5bcfdffff    mov     dword ptr [ebp-244h],esi
7c9162a7 3bf3            cmp     esi,ebx
7c9162a9 0f8599450100    jne     ntdll!LdrLoadDll+0xed (7c92a848)
...


So, I followed what the ntdll!LdrLockLoaderLock does:


...
ntdll!LdrLockLoaderLock+0x96:
7c9131c2 84cb            test    bl,cl
7c9131c4 0f847a160000    je      ntdll!LdrLockLoaderLock+0x117 (7c914844)
7c9131ca 68d8c0977c      push    offset ntdll!LdrpLoaderLock (7c97c0d8)
7c9131cf 3bc2            cmp     eax,edx
7c9131d1 0f8541540200    jne     ntdll!LdrLockLoaderLock+0xa3 (7c938618)
7c9131d7 e829defeff      call    ntdll!RtlEnterCriticalSection (7c901005)
7c9131dc 85f6            test    esi,esi
7c9131de 0f85a69f0200    jne     ntdll!LdrLockLoaderLock+0xee (7c93d18a)
...


I English: it gets the pointer stored at ntdll!LdrpLoaderLock (offset 7c97c0d8) and calls ntdll!RtlEnterCriticalSection.


Thus we have the proof. Now we need to figure out what should be done about it.  


--Iliyan Jeliazkov
Comment 18 Johnny Willemsen 2007-08-03 02:20:47 CDT
> The DLL-related system primitives on Windows (GetProcedureAddress, LoadLibrary,
> etc.) obtain lock(s) on their own, which can happen out-of-order with
> ACE_DLL_Handle's locking.
> 
> If I could refer to these locks: L1 (internal) and L2 (ACE_DLL_Handle)
> respectively, what happens is:
> 
>   T1
>     loads TAO_CSD_ThreadPool ...
>     loads TAO_PI
>     call ACE_DLL_Handle::open()
>     lock L2
>     call ::GetProcedureAddress
>     lock L1
> 
> 
>   T2
>     call ::LoadLibrary
>     lock L1
>     loads TAO_PI
>     call ACE_DLL_Handle::symbol()
>     lock L2 
> 
> 
> 
> What can we do? 
> 
> Perhaps we can not hold the lock (L2) in symbol? But then it will make it
> possible for symbol() to fail just because another thread is in the middle of
> opening the DLL. Perhaps, not on Windows? The OS seems to be protected from
> retrieving a symbol of a library that's still loading, but is ACE?

I have checked the test stats, it seems this problem is only happening on windows, this test runs perfect on other platforms now. 
 
> (I dare not enduring similar reverse engineering session on Linux. And on
> Solaris. And on ... A-a-a-a-a! :)
> 
> Maybe we can employ a sort of double-checked-locking pattern in symbol(): check
> the handle_ first and if it is not okay the get the lock_ and check again?
> 
> 
> I'm interested in getting opinions.

What about just not getting the lock on windows in symbol. As you pointed out it seems windows seems to guard its state. 
Comment 19 Adam Mitz 2007-08-03 07:29:27 CDT
Iliyan,
Please show where in ACE the lock (ACE's, not the OS's) is obtained in T2.  I can't find a ACE_DLL_Handle::open() in the stack so I guess some other method is guarded with the same lock.

Thanks,
Adam
Comment 20 Iliyan Jeliazkov 2007-08-03 09:08:39 CDT
Hi Adam,

Thanks for the windbg tip! I couldn't set the symbol server correctly for some reason, but I didn't try very hard. It also looks like the VC8 debugger is at least as good as windbg.
 
(In reply to comment #19)
> Iliyan,
> Please show where in ACE the lock (ACE's, not the OS's) is obtained in T2.  I
> can't find a ACE_DLL_Handle::open() in the stack so I guess some other method
> is guarded with the same lock.

It's in:

...
010af2dc 10076ae4 ACEd!ACE_DLL_Handle::symbol(char * sym_name = 0x00ec8f48
"_make_ORBInitializer_Registry", int ignore_errors = 0)+0xb7
...


The third other place L2 is obtained is in close().

In terms of what to do Johnny's suggestion is mostly along the lines of what I'm thinking as it seems L2 is there to do pretty much what Windows does internally with L1. 

I think we need to keep the lock there, but grab it only when changing the ACE_DLL_Handle's state. I did some experiments with ACE_Reverse_Lock and the results were good, but in no way conclusive enough. 

Perhaps we need an ACE test reproducing the situation - it is bound to be very common as the complexity of DLL dependencies increase over time. Something like, two threads - one loading DLL1 and another loading DLL2. DLL2 will have to in turn load DLL1 as part of its initialization. The tricky part is to ensure T1 asks for a symbol() (or close()!) _after_ T2 has loaded DLL2 and _before_ loading DLL1.
 
Comment 21 Adam Mitz 2007-08-03 15:18:11 CDT
(In reply to comment #20)
> Hi Adam,
> 
> Thanks for the windbg tip! 

No problem.

> I couldn't set the symbol server correctly for some
> reason, but I didn't try very hard. 

Next time you're in the office I'll show you...

> It also looks like the VC8 debugger is at
> least as good as windbg.

In what respect?  It won't do !critsec for you, right?  Anyway that's a tangent, you can email me off the bugzilla entry.

> 
> (In reply to comment #19)
> > Iliyan,
> > Please show where in ACE the lock (ACE's, not the OS's) is obtained in T2.  I
> > can't find a ACE_DLL_Handle::open() in the stack so I guess some other method
> > is guarded with the same lock.
> 
> It's in:
> 
> ...
> 010af2dc 10076ae4 ACEd!ACE_DLL_Handle::symbol(char * sym_name = 0x00ec8f48
> "_make_ORBInitializer_Registry", int ignore_errors = 0)+0xb7
> ...
> 
> 
> The third other place L2 is obtained is in close().
> 
> In terms of what to do Johnny's suggestion is mostly along the lines of what
> I'm thinking as it seems L2 is there to do pretty much what Windows does
> internally with L1. 
> 
> I think we need to keep the lock there, but grab it only when changing the
> ACE_DLL_Handle's state. I did some experiments with ACE_Reverse_Lock and the
> results were good, but in no way conclusive enough. 

ACE_DLL_Handle::symbol() may not even need the lock if it's not modifying the state or examining any other state that could possibly change concurrently (this->handle_ can't change after construction, can it?).

Even with reducing the scope of ACE_DLL_Handle::lock_, I think this particular situation is not legal (at least in Windows).  The problem is that in T1 the initialization of the dynamic lib for CSD is itself loading the TAO_PI lib dynamically.  This is explicitly disallowed by the Windows documentation:

"Threads in DllMain hold the loader lock so no additional DLLs can be dynamically loaded or initialized.  [Side note -- this is kind of strange since Windows critical sections are recursive]  The entry-point function (DllMain, here it's called _DllMainCRTStartup) should perform only simple initialization or termination tasks. It must not call the LoadLibrary or LoadLibraryEx function (or a function that calls these functions), because this may create dependency loops in the DLL load order. This can result in a DLL being used before the system has executed its initialization code."


Comment 22 Iliyan Jeliazkov 2007-08-03 19:08:45 CDT
(In reply to comment #21)
> (In reply to comment #20)
> > (In reply to comment #19)
> >
> > ... 
> > In terms of what to do Johnny's suggestion is mostly along the lines of what
> > I'm thinking as it seems L2 is there to do pretty much what Windows does
> > internally with L1. 
> > 
> > I think we need to keep the lock there, but grab it only when changing the
> > ACE_DLL_Handle's state. I did some experiments with ACE_Reverse_Lock and the
> > results were good, but in no way conclusive enough. 
> 
> ACE_DLL_Handle::symbol() may not even need the lock if it's not modifying the
> state or examining any other state that could possibly change concurrently
> (this->handle_ can't change after construction, can it?).

In a separate email Johnny reported some issues with removing that lock, but I haven't done sufficient testing to confirm any of it.

Johnny, could you please elaborate?

> Even with reducing the scope of ACE_DLL_Handle::lock_, I think this particular
> situation is not legal (at least in Windows).  The problem is that in T1 the
> initialization of the dynamic lib for CSD is itself loading the TAO_PI lib
> dynamically.  This is explicitly disallowed by the Windows documentation:
> 
> "Threads in DllMain hold the loader lock so no additional DLLs can be
> dynamically loaded or initialized.  [Side note -- this is kind of strange since
> Windows critical sections are recursive]  

Right, if this were true we wouldn't make it to initializing TAO_PI, correct?

> The entry-point function (DllMain,
> here it's called _DllMainCRTStartup) should perform only simple initialization
> or termination tasks. It must not call the LoadLibrary or LoadLibraryEx
> function (or a function that calls these functions), because this may create
> dependency loops in the DLL load order. This can result in a DLL being used
> before the system has executed its initialization code."

Yes, ugly! But for some people it is important to be able to package the whole ORB in a DLL and I think it is a valid use case.

I hate to say it, but TAO's service config can perhaps be made to queue requests for loading DLL's from services that are being still loaded themselves.  Or borrow the idea from java's manifest files.  No idea right now how exactly any of this might be accomplished, though ...
Comment 23 Johnny Willemsen 2007-08-05 13:41:50 CDT
> Even with reducing the scope of ACE_DLL_Handle::lock_, I think this particular
> situation is not legal (at least in Windows).  The problem is that in T1 the
> initialization of the dynamic lib for CSD is itself loading the TAO_PI lib
> dynamically.  This is explicitly disallowed by the Windows documentation:
> 
> "Threads in DllMain hold the loader lock so no additional DLLs can be
> dynamically loaded or initialized.  [Side note -- this is kind of strange since
> Windows critical sections are recursive]  The entry-point function (DllMain,
> here it's called _DllMainCRTStartup) should perform only simple initialization
> or termination tasks. It must not call the LoadLibrary or LoadLibraryEx
> function (or a function that calls these functions), because this may create
> dependency loops in the DLL load order. This can result in a DLL being used
> before the system has executed its initialization code."
 
Ok, this is important info. If Windows says you shouldn't do this, let us then zap this code and rewrite it. On my system I moved to registration of the ORB Initializer to a new init method on a loader that is then called from TAO_Internal.cpp in method register_additional_services_i. I added there a call to look for the CSD_Framework_Loader, if it is there, call init on it. So the registration is not done in a static initializer anymore. Now the deadlock is gone. It looks some more core libraries have orbinitializer code in their static initialization, let us also refactor that. 

With the rewrite of the code we follow the windows guidelines, the orb can be loaded again as service. We can always try to fix this bugzilla issue but I doubt whether it is something we should do when windows says don't do this (and this is a windows specific issue)
Comment 24 Johnny Willemsen 2007-08-05 13:43:20 CDT
(In reply to comment #22)

> In a separate email Johnny reported some issues with removing that lock, but I
> haven't done sufficient testing to confirm any of it.
> 
> Johnny, could you please elaborate?

WHen I removed the lock I ended up with multiple ORBInitializer Registries which resulted in bug 2995

> I hate to say it, but TAO's service config can perhaps be made to queue
> requests for loading DLL's from services that are being still loaded
> themselves.  Or borrow the idea from java's manifest files.  No idea right now
> how exactly any of this might be accomplished, though ...

Maybe ACE can do this, also no idea how todo that at this moment, but that can be an extension. 

Comment 25 Adam Mitz 2007-08-05 18:21:38 CDT
(In reply to comment #22)
> > Even with reducing the scope of ACE_DLL_Handle::lock_, I think this particular
> > situation is not legal (at least in Windows).  The problem is that in T1 the
> > initialization of the dynamic lib for CSD is itself loading the TAO_PI lib
> > dynamically.  This is explicitly disallowed by the Windows documentation:
> > 
> > "Threads in DllMain hold the loader lock so no additional DLLs can be
> > dynamically loaded or initialized.  [Side note -- this is kind of strange since
> > Windows critical sections are recursive]  
> 
> Right, if this were true we wouldn't make it to initializing TAO_PI, correct?

Well, we haven't anyway since in the stack traces I've seen ACE's lock is preventing us from getting there.  But I guess you're talking about the general case (not the deadlock).  Right, after reading a *ton* of articles about the loader locking in Windows, I think this documentation is a little misleading but still correct.  The locking that the loader itself performs does not itself make it illegal to call LoadLibrary from dllmain (remember, constructors of static objects in a dll run within the context of dllmain).  It won't deadlock (since the critsec is recursive), but it *may* result in code running in some other DLL before that DLL has been initialized.  This will only happen if a loop in the loading order is created.  The situation is that before your dllmain is called a loading plan is created so that dlls are initialized in correct dependency order.  Doing a LoadLibrary in the midst of this initialization may invalidate this loading plan.  With a loop in library dependencies, one library will see LoadLibrary() and GetProcAddress() succeed for a second library even though that second library hasn't been initialized yet (static ctors haven't run).  The first library can then call into the second, which very well might crash.


> 
> > The entry-point function (DllMain,
> > here it's called _DllMainCRTStartup) should perform only simple initialization
> > or termination tasks. It must not call the LoadLibrary or LoadLibraryEx
> > function (or a function that calls these functions), because this may create
> > dependency loops in the DLL load order. This can result in a DLL being used
> > before the system has executed its initialization code."
> 
> Yes, ugly! But for some people it is important to be able to package the whole
> ORB in a DLL and I think it is a valid use case.

This has nothing to do with the loading issue.  You can still package whatever you want in a DLL (one good idea would be to explicitly link -- i.e. at build time -- to all needed ACE+TAO libs).  In the fully-dynamic implicit linking case (counting on LoadLibrary() for PI, etc.) the app may need to call separate initialization functions instead of expecting all the initialization to be done during LoadLibrary() of the "ORB package" DLL.

> 
> I hate to say it, but TAO's service config can perhaps be made to queue
> requests for loading DLL's from services that are being still loaded
> themselves.  Or borrow the idea from java's manifest files.  No idea right now
> how exactly any of this might be accomplished, though ...
> 

I'm not sure how this fixes it.  The rule is to avoid calling LoadLibrary() from any code within the context of dllmain (aka static constructors).
Comment 26 Adam Mitz 2007-08-05 18:42:02 CDT
(In reply to comment #23)
> Ok, this is important info. If Windows says you shouldn't do this, let us then
> zap this code and rewrite it. On my system I moved to registration of the ORB
> Initializer to a new init method on a loader that is then called from
> TAO_Internal.cpp in method register_additional_services_i. I added there a call
> to look for the CSD_Framework_Loader, if it is there, call init on it. So the
> registration is not done in a static initializer anymore. Now the deadlock is
> gone. It looks some more core libraries have orbinitializer code in their
> static initialization, let us also refactor that. 

I can't claim to follow all the details (maybe you can attach a patch?) but I'm not sure I need to.  The approach sounds valid anyway.  I think the telltale sign would be anything that invokes a "dynamic" operation on the ACE svc config from a static object constructor, unless we can show that the library must be loaded already.

> 
> With the rewrite of the code we follow the windows guidelines, the orb can be
> loaded again as service. We can always try to fix this bugzilla issue but I
> doubt whether it is something we should do when windows says don't do this (and
> this is a windows specific issue)
> 

Besides changes to TAO as you outlined above, ACE still must change to prevent uses from hitting this issue and deadlocking.  

One way to do this would be to (on Win32) replace the per-instance lock in ACE_DLL_Handle with a static lock.  This would prevent the deadlock seen in this bug since all calls to the OS-level locking functions would be guarded by the same lock in ACE, yielding a consistent locking order.

Another approach would be to use the lock in ACE_DLL_Handle to just guard its internal state, but not hold it across any calls to the system.  I will look at 2995 to see what you've done with this so far.
Comment 27 Johnny Willemsen 2007-08-06 02:30:37 CDT
lowering priority and severity, the deadlock can't appear now anymore because I changed the way the PI library is initialized. There are ideas to help users to prevent these kind of problems when they have similar setups, but that is more an enhancement. Also removed the target milestone of 1.6