Summary: | Timeouts in multi-threaded apps can dead-lock the ORB | ||
---|---|---|---|
Product: | TAO | Reporter: | Carlos O'Ryan <coryan> |
Component: | ORB | Assignee: | Michael Kircher <Michael.Kircher> |
Status: | RESOLVED FIXED | ||
Severity: | blocker | ||
Priority: | P3 | ||
Version: | 1.1.13 | ||
Hardware: | All | ||
OS: | All | ||
Bug Depends on: | |||
Bug Blocks: | 132 |
Description
Carlos O'Ryan
2001-03-30 14:11:19 CST
Many of the changes needed to fix bug 132 cannot be made until this bug is addressed.

A few more details about this bug: I believe Wait_On_Leader_Follower does not consider the scenario where thread A is a follower waiting for a reply with some timeout. Meanwhile thread B, the leader, completes its work and signals A to become the new leader. However, before A notices the signal, it wakes up because of the timeout. The remove_follower() operation fails because thread B has already removed A from the follower set (as part of the promotion to the leader role). Even worse, thread A now simply returns without electing a new leader. The ORB is then deadlocked unless another thread joins the leader/follower set.

This is just a guess based on a brief inspection of the code, so take it with a grain of salt. If it is true, probably the best fix is to use a guard idiom to ensure that a new leader is elected, no matter which exit path is taken from the wait() operation.

Assigned to Michael; he seems to have fixed it already.

Carlos and I think we resolved this bug, though we were not able to reproduce the deadlock without this fix.

Thu Apr 5 13:50:00 2001 Michael Kircher <Michael.Kircher@mchp.siemens.de>

* tao/Wait_On_Leader_Follower.cpp: Carlos and I worked on a fix for [Bug 842]. This involves electing a new leader if a follower was elected leader but concurrently also received a timeout. In such cases a deadlock could occur. The fix prevents it: the departing follower cannot fulfill the leader role itself, so it elects a new leader before returning. Such situations are detected by the condition that a timeout occurred, remove_follower fails, and reply_received equals 0. The fix applies only in exactly such situations.
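The guard idiom suggested in the description can be sketched as follows. This is only an illustration under stated assumptions, not the code that went into tao/Wait_On_Leader_Follower.cpp: `LeaderFollowerSet`, `LeaderHandoffGuard`, `wait_for_reply` and `simulate_timeout_race` are hypothetical names, threads are modelled as integer ids, and the race is replayed in a single thread so that the outcome is deterministic. A real implementation would run the same logic while holding the leader/follower lock.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical model of the leader/follower set; not TAO's real classes.
struct LeaderFollowerSet {
    std::vector<int> followers;  // ids of threads waiting as followers
    int leader = -1;             // id of the current leader, -1 if none

    // Returns false if the thread is no longer in the follower list,
    // e.g. because the outgoing leader already promoted it.
    bool remove_follower(int id) {
        auto it = std::find(followers.begin(), followers.end(), id);
        if (it == followers.end()) return false;
        followers.erase(it);
        return true;
    }

    // Promote the longest-waiting follower, or record that no leader exists.
    void elect_new_leader() {
        if (followers.empty()) { leader = -1; return; }
        leader = followers.front();
        followers.erase(followers.begin());
    }
};

// Guard: whichever path leaves the wait, the destructor passes leadership on
// if the departing thread currently holds it.
class LeaderHandoffGuard {
public:
    LeaderHandoffGuard(LeaderFollowerSet& set, int id) : set_(set), id_(id) {}
    ~LeaderHandoffGuard() {
        // Normal case: still a follower, so simply leave the list.
        if (set_.remove_follower(id_)) return;
        // Otherwise this thread was promoted while waking up; hand the role on.
        if (set_.leader == id_) set_.elect_new_leader();
    }
    LeaderHandoffGuard(const LeaderHandoffGuard&) = delete;
    LeaderHandoffGuard& operator=(const LeaderHandoffGuard&) = delete;

private:
    LeaderFollowerSet& set_;
    int id_;
};

// Models thread `id` returning from a timed wait.
// Returns true if a reply arrived, false on timeout.
bool wait_for_reply(LeaderFollowerSet& set, int id, bool reply_received) {
    LeaderHandoffGuard guard(set, id);
    if (!reply_received) return false;  // timeout path: the guard still runs
    return true;
}

// Replays the race from the description: leader 1 (thread B) promotes
// follower 2 (thread A), then A times out. Returns the leader afterwards.
int simulate_timeout_race() {
    LeaderFollowerSet set;
    set.leader = 1;
    set.followers = {2, 3};
    set.elect_new_leader();         // B finishes and promotes A (id 2)
    assert(set.leader == 2);
    wait_for_reply(set, 2, false);  // A wakes up on the timeout instead
    return set.leader;
}
```

In this model, calling `simulate_timeout_race()` leaves thread 3 as leader rather than leaving the set with a leader that has already returned. Putting the hand-off in a destructor means an early return on timeout, a normal return, or an exception all go through the same election step.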