Please report new issues at https://github.com/DOCGroup
This entry captures thoughts and ideas on Symantec-sponsored work to add functionality to the middleware for applications to monitor and control various things. This sort of functionality (tailored to CORBA event channels) has already been added by OCI to the TAO Notification Service. Our intention is to leverage as much of this effort as possible while generalizing its application, in part by moving some existing classes and data structures to a lower level (ACE and/or TAO) from their current location in ORB services. New code will of course be added as well.
Created attachment 859 [details] Original statement of work from Symantec
I guess it makes the most sense to list stuff that I don't see in the OCI stuff:

- toggling monitors
- late binding of monitors (possibly trivial)
- grouping monitors
- resetting counters (possibly trivial)
- outputting via ACE logging
- trigger rules and constraint language

Probably the last item will be the most work, particularly the constraint language. The rest of the stuff in the statement of work just seems to specify the particular things that are to be monitored.

The running of arbitrary code doesn't look like such a big job - there is already a control API in the OCI design, which has an 'execute' method that checks for a match with a string command, then executes code that the developer writes in a subclass. So we would have to fill in the 'execute' method appropriately and implement the code that glues a monitor to a control (I guess this is where the trigger and constraint language stuff comes in). We might also want to come up with a faster way of matching the command than a string compare ;-). For the extended Notification Service, OCI has implemented only a 'shutdown' subclass of its control API, but it shows how simple it is to do.

Fetching and modifying monitors shouldn't be too hard either. The OCI design puts all monitors and controls into a global registry that is an ACE singleton, each stored by a string name. The extended Notification Service then has each event channel keep a list of the names of the monitors it has created (the underlying container is an ACE hash map). There are several monitors and a single control (shutdown) subclassed for this extended NS. We can come up with more generic sets for ACE, TAO, etc.

I guess the above is a good starting point for discussion.
Some general questions about the design. Johnny and I discussed these already, but I'm capturing the questions here for posterity.

- Threading model? The existing OCI implementation for the Notification Service has a Monitor Manager that creates each Monitor or Control object in a separate thread. We may not want to do that, or we may want to make it flexible.

- Reuse of IDL? The existing version defines everything in IDL - interfaces for monitor/control objects and monitor managers, plus associated data structures. Obviously if we move stuff to ACE, we would just have to duplicate the corresponding C++ types. Johnny agrees, but also suggests that some of the IDL may be relocated to a TAO library. Such relocation will require some refactoring of the IDL, since, for example, the monitor/control interface contains an event-channel-specific operation. I don't think this refactoring will be hard to do.

- Exceptions? Some of the IDL data structures mentioned above are exceptions. I'd personally like to follow this style even in the ACE code, but I'd also like more input on it.
A couple of more specific questions to keep in mind as we work on the design.

- The existing orbsvcs version uses a class MonitorControl_Notify_Service, which is a subclass of CosNotify_Service, subclassed in turn from Notify_Service, which inherits from ACE_Service_Object. So in the related test code, the executable is started from a conf file called notify.conf. The executable then creates an event channel factory and (in the monitor/control version) a Monitor Manager, which is an ACE singleton. Disregarding the event-channel-related stuff, do we want to use this mechanism in our generalized version?

- The existing orbsvcs version has a hierarchy of stats-related classes. There is a generic base class, from which the stats and control classes both inherit, and which basically just stores a string name. The subclass of this class is the meat of the statistics functionality. Then there is a template class that inherits from the 'meat' class. The single template parameter is an interface type, an instance of which is passed to the constructor and stored as a class member. The existing version creates specialized stats classes by subclassing the template class, with the event channel as the template parameter and a pointer to it passed to the constructor. This subclass then overrides the calculate() method, the body of which makes calls to the event channel. Will this technique be of use to us, that is, in a more general context?
Created attachment 863 [details] Design document for Notification Service MC extensions

I'm attaching the original design document for the Notification Service Monitoring and Control Capabilities (Notification MC) work done by OCI. This may help answer the perennial "What Were They Thinking" question about the work that this project is planning to generalize.

A couple of points I'll mention:

Using a single data type -- double -- for statistics eliminates a lot of complexity. Doubles behave well in extreme conditions -- losing precision rather than meaning.

The issue of naming things is important and deserves a lot of thought -- especially if we want to have a general-purpose monitor and control program that does not need hard-coded information about the target system.

HTH Dale
Two additional comments in response to Jeff's analysis of the existing OCI work. Jeff didn't see:

- grouping monitors

This is handled via a hierarchical naming system for statistics -- see the MC design doc for details.

- resetting counters (possibly trivial)

Two reset functions are provided: read-and-reset (atomic), and a simple reset.
Some old feedback from Symantec on the SOW:

1) Do we really need a complex rule/constraint language?

Good question. I don't think we do - I would like this to be pretty simplistic / easy to implement and use. Initially we might just support simple X > Y threshold crossings. I don't think we want anything nearly as complex as the ETCL grammar.

2) Do we really need to run arbitrary code?

Another good question. I think we do, and I think this will likely be easy to implement. Let's make sure we implement a simple logging action for the first pass.

3) I've seen things like this cause a 10% hit on performance - can we limit this to just debug builds?

I think the key value for this is in production builds - so while we do want to support conditional compilation of this, I would need the performance hit to be very low (say 1-2%). We may need to conditionally compile the code based on the area being monitored to hit this performance target (i.e., enable compilation in the notify service and connection cache, disable in the ORB). This means there cannot be a single ON/OFF switch - we will need a per-area or per-type means of enabling this. I think there should be some preference given to using templates instead of macros...
Idea on implementing the policies, from Ossama:

I'm not sure what others think, but I was thinking more along the lines of Andrei Alexandrescu's "policy-based design": http://en.wikipedia.org/wiki/Policy-based_design

Conceivably you could have a "null counter" policy by default. However, I suppose we'd need to find out how much overhead, if any, is added by instantiating a no-op counter. If it's non-trivial, I suppose we could wrap it within a macro similar to what we do with the ACE_MT macro. Another option could be to leverage FOCUS.

Alexandrescu's book, "Modern C++ Design", is excellent, by the way.
What are 'the policies'? The things being monitored?
By 'policy' we mean whether we do monitoring or not - use templates instead of defines.
More questions regarding vagueness in the SOW:

- Runtime monitor toggling - do they want interactive or programmatic only?
- Logging - they want remotely accessible logging info. Does that require ACE distributed logging, or will remotely accessible log files be enough?
(In reply to comment #11) > More questions regarding vagueness in the SOW: > > Runtime monitor toggling - do they want interactive or programmatic only? Not sure > Logging - they want remotely accessible logging info - does that require ACE > distributed logging or will remotely accessible log files be enough? I think we just have to use ACE logging; the ACE logging framework then gives the user the flexibility to redirect the output.
Right but there are two levels of ACE logging - the simpler way of using the singleton class plus the associated macros, or the full-blown logging server and logging client/proxy. Sorry, I should have made that distinction clearer in my question. On an unrelated note, I notice that the SOW talks about monitoring CPU utilization and memory usage. I also notice that there is nothing provided in ACE to do this directly. However, Will pointed me to a paper on a platform-independent API for doing this kind of stuff. Although the tool implemented by the authors is in Python, the paper talks at some length about the underlying system interfaces (for Windows, Linux and Solaris) used by the tool. This information should be enough to guide the implementation of classes in ACE to accomplish the same things in C++. I think it would be a nice addition to the library.
Created attachment 867 [details] summary of issues, resolutions, additional points in 11/12/07 telecon
Comment on attachment 867 [details] summary of issues, resolutions, additional points in 11/12/07 telecon

Resolutions:

1. No interactive toggling of the monitor itself is needed. However, logging and constraint checking can be interactive (or controlled by a cron job) to keep monitors as lightweight as possible by default. Logging could be tied to constraint checking, maybe even be the default trigger action, overridden when there is custom action code.

2. Web services was mentioned, but there's no need to address that use case specifically at this time. We can divide our predefined counters into three groups: ACE (low-level resources), TAO (CORBA-specific) and Notification Service-specific.

3. Since ETCL depends only on ACE, it's acceptable to use it as a ready-made constraint creator/checker.

4. Distributed logging is not needed. Symantec has their own similar mechanism that they will plug in to the simpler version of ACE logging, using the class ACE_Log_Msg and associated macros.

Other points:

- The physical memory location of a monitor should be accessible given its string name, so core dumps can be analyzed by a debugger or some other offline tool.
- 64-bit counters across the board are acceptable; the memory overhead won't be a problem. In the future, we might think about an overflow-proof counter, however.
If the logging output of all monitors is to go to a single sink (otherwise each monitor point would need its own instantiation of ACE_Log_Msg), I don't see how we can further simplify the ACE logging API. The application might as well use it directly - no need to integrate it with the monitoring API. Instead of global vs per-monitor, we might be asked to make the granularity per-application, but if each application doesn't run in a separate thread, that would be very tricky.
Regarding the SOW requirement that a disabled monitor point have no performance overhead: if a monitor point's data never needs to be passed directly to the application, all accessor methods that return a value can be eliminated. If all remaining methods return void, the no-op versions of these methods (for a template specialization of the class where a boolean 'enabled' parameter is FALSE) may be optimized away by the compiler. This is doable if data is sent out only via ACE logging macros to some sink available to ACE_Log_Msg.
Summary of 12/18 telecon with sponsor:

Design suggestions:

- All string names limited to ASCII identifiers (I've restricted this to CORBA-compliant ASCII identifiers in the requirements doc - DONE).
- Add ready-made TAO monitor for the depth of a nested upcall (requirements doc, class diagram and doxygen files have been updated - DONE).
- Add monitor point lifecycle diagram (TODO).
- Improve some class name choices in class & sequence diagrams (TODO).

Implementation suggestions:

- Lock only writes to the repository - reads don't lock; they check a counter/dirty bit before & after, repeating the read as necessary.
- Create a mechanism to deal with hysteresis (flood of triggered actions due to jitter in the monitored value around a constraint threshold).

Next steps:

- Begin implementation of a simple end-to-end use case, for example CPU load monitoring.
- Have another telecon sometime after the holidays.
Here's new and pertinent stuff that came out of the telecon with Andrew Schnable on 3/5/08:

- desired monitor in TAO: frequency of connection cache flush
- desired monitors in the Notification Service: # of admins, # of proxies (I haven't checked, these may already exist)
- ORB-level monitors may involve some statistics (for example an average - we can get more input from Symantec when needed). Of course Notification Service monitors already do stats.
- both updates & queries of monitor values must be thread-safe
- need support for periodic queries, similar to periodic updates
- need support for multiple constraints on a monitor point, each with its associated control action (constraint lists are already supported in NS filters, but not yet in the ACE ETCL subset)
- constraints should be evaluated at query time, rather than at update time as we have now
- performance tests should use a simple (maybe already existing) TAO benchmark and compare results with and without monitors, to get the % performance hit incurred by monitors