Bug 3112 - move monitor and control capabilities to ACE
Summary: move monitor and control capabilities to ACE
Status: ASSIGNED
Alias: None
Product: TAO
Classification: Unclassified
Component: other
Version: 1.6.1
Hardware: All All
Importance: P5 enhancement
Assignee: DOC Center Support List (internal)
URL:
Depends on:
Blocks:
 
Reported: 2007-10-25 10:32 CDT by Jeff Parsons
Modified: 2018-01-15 11:21 CST
CC List: 2 users

See Also:


Attachments
Original statement of work from Symantec (6.29 KB, text/plain)
2007-10-25 10:44 CDT, Jeff Parsons
Design document for Notification Service MC extensions (40.00 KB, application/octet-stream)
2007-10-26 10:37 CDT, Dale Wilson
summary of issues, resolutions, additional points in 11/12/07 telecon (1.37 KB, text/plain)
2007-11-09 14:46 CST, Jeff Parsons

Description Jeff Parsons 2007-10-25 10:32:40 CDT
This entry captures thoughts and ideas on Symantec-sponsored work to add functionality to the middleware for applications to monitor and control various things. This sort of functionality (tailored to CORBA event channels) has already been added by OCI to the TAO Notification Service. Our intention is to leverage as much of this effort as possible while generalizing its application, in part by moving some existing classes and data structures to a lower level (ACE and/or TAO) from their current location in ORB services. New code will of course be added as well.
Comment 1 Jeff Parsons 2007-10-25 10:44:48 CDT
Created attachment 859
Original statement of work from Symantec
Comment 2 Jeff Parsons 2007-10-25 10:47:03 CDT
I guess it makes the most sense to list stuff that I don't see
in the OCI stuff:

- toggling monitors

- late binding of monitors (possibly trivial)

- grouping monitors

- resetting counters (possibly trivial)

- outputting via ACE logging

- trigger rules and constraint language

Probably the last item will be the most work,
particularly the constraint language.

The rest of the stuff in the statement of work just seems
to specify the particular things that are to be monitored.

The running of arbitrary code doesn't look like such a big
job - there is already a control API in the OCI design, 
which has an 'execute' method which checks for a match
with a string command, then executes code that the 
developer writes in a subclass. So we would have to fill
in the 'execute' method appropriately and implement
the code that glues a monitor to a control (I guess this
will be where the trigger and constraint language
stuff come in). We might also want to come up with a
faster way of matching the command than a string
compare ;-). For the
extended Notification Service, OCI has implemented only
a 'shutdown' subclass of its control API, but it shows
how simple it is to do.
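To make the shape of that control API concrete, here is a minimal C++ sketch. The class names (Control_Action, Shutdown_Control, Control_Dispatcher) are hypothetical, not the OCI classes. It assumes each control is identified by the command string it handles, and replaces asking every control in turn with one hash-map lookup, which is one possible answer to the faster-matching question.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical base class: a named control whose subclass supplies the code.
class Control_Action
{
public:
  explicit Control_Action (std::string name) : name_ (std::move (name)) {}
  virtual ~Control_Action () = default;

  const std::string &name () const { return name_; }

  // Returns true if the command matched and the action ran.
  virtual bool execute (const std::string &command) = 0;

private:
  std::string name_;
};

// Mirrors the idea of OCI's 'shutdown' control, not its actual code.
class Shutdown_Control : public Control_Action
{
public:
  Shutdown_Control () : Control_Action ("shutdown") {}

  bool execute (const std::string &command) override
  {
    if (command != "shutdown")   // the string compare mentioned above
      return false;
    shutdown_requested = true;   // real code would stop the service here
    return true;
  }

  bool shutdown_requested = false;
};

// Looks up the control by command name with a single hashed lookup
// instead of calling execute() on every registered control.
class Control_Dispatcher
{
public:
  void add (std::shared_ptr<Control_Action> c)
  {
    const std::string key = c->name ();
    controls_[key] = std::move (c);
  }

  bool dispatch (const std::string &command)
  {
    auto it = controls_.find (command);
    return it != controls_.end () && it->second->execute (command);
  }

private:
  std::unordered_map<std::string, std::shared_ptr<Control_Action>> controls_;
};
```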

Fetching and modifying monitors shouldn't be too hard
either. The OCI design puts all monitors and controls
into a global registry that is an ACE singleton, each stored
by a string name. Then the extended Notification Service
has each event channel keep a list of the names of the
monitors it has created (the underlying container is an
ACE hash map). There are several monitors and
a single control (shutdown) subclassed for this extended
NS. We can come up with more generic sets for ACE, TAO,
etc.
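A minimal sketch of that registry arrangement, assuming a std::map and a function-local static in place of the ACE hash map and ACE singleton. All class names here are hypothetical. The owner registers each monitor under a string name and keeps only the names, so any other code can fetch a monitor by name alone.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal monitor type, for illustration only.
struct Monitor
{
  explicit Monitor (std::string n) : name (std::move (n)) {}
  std::string name;
  double value = 0.0;
};

// Process-wide registry keyed by string name. A function-local static
// stands in for ACE_Singleton so the sketch has no ACE dependency.
class Monitor_Registry
{
public:
  static Monitor_Registry &instance ()
  {
    static Monitor_Registry registry;
    return registry;
  }

  // Returns false if the name is already registered.
  bool add (Monitor *m)
  {
    std::lock_guard<std::mutex> guard (lock_);
    return map_.emplace (m->name, m).second;
  }

  // Returns nullptr if no monitor has that name.
  Monitor *find (const std::string &name)
  {
    std::lock_guard<std::mutex> guard (lock_);
    auto it = map_.find (name);
    return it == map_.end () ? nullptr : it->second;
  }

  void remove (const std::string &name)
  {
    std::lock_guard<std::mutex> guard (lock_);
    map_.erase (name);
  }

private:
  Monitor_Registry () = default;
  std::mutex lock_;
  std::map<std::string, Monitor *> map_;
};

// An owner such as an event channel creates and registers its monitors,
// remembering only their names; it unregisters them before destruction.
class Channel_Monitors
{
public:
  explicit Channel_Monitors (const std::string &id)
  {
    create (id + "/queue_size");
    create (id + "/supplier_count");
  }

  ~Channel_Monitors ()
  {
    for (const std::string &n : names_)
      Monitor_Registry::instance ().remove (n);
  }

  Channel_Monitors (const Channel_Monitors &) = delete;
  Channel_Monitors &operator= (const Channel_Monitors &) = delete;

  const std::vector<std::string> &names () const { return names_; }

private:
  void create (const std::string &name)
  {
    std::unique_ptr<Monitor> m (new Monitor (name));
    if (Monitor_Registry::instance ().add (m.get ()))
      {
        names_.push_back (name);
        owned_.push_back (std::move (m));
      }
  }

  std::vector<std::string> names_;
  std::vector<std::unique_ptr<Monitor>> owned_;
};
```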

I guess the above is a good starting point for discussion.

Comment 3 Jeff Parsons 2007-10-25 11:03:04 CDT
Some general questions about the design. Johnny and I discussed these already, but I'm capturing the questions here for posterity.

- Threading model? The existing OCI implementation for the Notification Service has a Monitor Manager that creates each Monitor or Control object in a separate thread. We may not want to do that, or we may want to make it flexible.

- Reuse of IDL? The existing version defines everything in IDL - interfaces for monitor/control objects and monitor managers, plus associated data structures. Obviously if we move stuff to ACE, we would just have to duplicate the corresponding C++ types. Johnny agrees, but also suggests that some of the IDL may be relocated to a TAO library. Such relocation will require some refactoring of the IDL, since, for example, the monitor/control interface contains an event-channel-specific operation. I don't think this refactoring will be hard to do.

- Exceptions? Some of the IDL data structures mentioned above are exceptions. I'd personally like to follow this style even in the ACE code, but I'd also like more input on it.
Comment 4 Jeff Parsons 2007-10-25 11:19:57 CDT
A couple more specific questions to keep in mind as we work on the design.

- existing orbsvcs version uses a class MonitorControl_Notify_Service, which is a subclass of CosNotify_Service, subclassed in turn from Notify_Service, which inherits from ACE_Service_Object. So in the related test code, the executable is started from a conf file called notify.conf. The executable then creates an event channel factory and (in the monitor/control version) a Monitor Manager which is an ACE singleton. Disregarding the event-channel-related stuff, do we want to use this mechanism in our generalized version?

- existing orbsvcs version has a hierarchy of stats-related classes. There is a generic base class, from which the stats and control classes both inherit, and which basically just stores a string name. The subclass of this class is the meat of the statistics functionality. Then there is a template class that inherits from the 'meat' class. The single template parameter is an interface type, an instance of which is passed to the constructor and stored as a class member. The existing version creates specialized stats classes by subclassing the template class with the event channel type as the template parameter and passing in a pointer to the event channel. This subclass then overrides the calculate() method, the body of which makes calls to the event channel. Will this technique be of use to us in a more general context?
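For reference, the layering described above can be sketched roughly as follows. This is a reconstruction from the description, not the orbsvcs code: the class names, the update()/average() helpers and the stand-in Event_Channel type are all hypothetical. Only the last subclass knows about the observed type, so in a generalized version an ORB or connection cache type could be the template argument instead.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Generic base: just a name, shared by statistics and controls.
class Named
{
public:
  explicit Named (std::string name) : name_ (std::move (name)) {}
  virtual ~Named () = default;
  const std::string &name () const { return name_; }
private:
  std::string name_;
};

// The "meat": samples a value and keeps simple running statistics.
class Statistic : public Named
{
public:
  explicit Statistic (std::string name) : Named (std::move (name)) {}

  // Subclasses compute the current value from whatever they observe.
  virtual double calculate () = 0;

  // Take one sample and fold it into the statistics.
  void update ()
  {
    const double v = calculate ();
    last_ = v;
    sum_ += v;
    ++count_;
  }

  double last () const { return last_; }
  double average () const { return count_ ? sum_ / count_ : 0.0; }

private:
  double last_ = 0.0;
  double sum_ = 0.0;
  unsigned long count_ = 0;
};

// Template layer: stores a pointer to the observed object.
template <typename INTERFACE>
class Statistic_T : public Statistic
{
public:
  Statistic_T (std::string name, INTERFACE *target)
    : Statistic (std::move (name)), target_ (target) {}
protected:
  INTERFACE *target_;
};

// Hypothetical stand-in for an event channel.
struct Event_Channel
{
  int consumer_count () const { return consumers; }
  int consumers = 0;
};

// Specialized statistic: overrides calculate() to query the channel.
class Consumer_Count : public Statistic_T<Event_Channel>
{
public:
  explicit Consumer_Count (Event_Channel *ec)
    : Statistic_T<Event_Channel> ("consumer_count", ec) {}

  double calculate () override { return target_->consumer_count (); }
};
```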
Comment 5 Dale Wilson 2007-10-26 10:37:52 CDT
Created attachment 863
Design document for Notification Service MC extensions

I'm attaching the original design document for the Notification Service Monitoring And Control Capabilities (Notification MC) work done by OCI. This may help answer the perennial "What Were They Thinking" question about the work that this project is planning to generalize. A couple of points I'll mention:

Using a single data type -- double -- for statistics eliminates a lot of complexity.  Doubles behave well in extreme conditions -- losing precision rather than meaning.
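A small illustration of that trade-off, assuming a 64-bit unsigned integer as the alternative counter type. At the top of its range the integer wraps to zero, while a large double simply absorbs the increment.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// 64-bit unsigned counter at its maximum: one more increment wraps to 0,
// so the reported value no longer means anything.
inline std::uint64_t bump (std::uint64_t c) { return c + 1; }

// Double counter at a large magnitude: above 2^53 consecutive doubles are
// more than 1 apart, so "+1" is rounded away. Precision is lost, but the
// value still reads as "a very large count".
inline double bump (double c) { return c + 1.0; }

const std::uint64_t max_u64 = std::numeric_limits<std::uint64_t>::max ();
const double large = 1e17;   // exactly representable; neighbours are 16 apart
```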

The issue of naming things is important and deserves a lot of thought -- especially if we want to have a general-purpose monitor and control program that does not need hard-coded information about the target system.

HTH
Dale
Comment 6 Dale Wilson 2007-10-26 15:58:29 CDT
Two additional comments in response to Jeff's analysis of the existing OCI work:
Jeff didn't see:

- grouping monitors
This is handled via a hierarchical naming system for statistics -- see the MC design doc for details.

- resetting counters (possibly trivial)
Two reset functions are provided:  Read and reset (atomic), and a simple reset.
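A sketch of the difference between the two resets, assuming the statistic is held in a std::atomic<double> (the Counter class is hypothetical, not the MC design's type). Read-and-reset is a single exchange, so no increment can land between the read and the reset.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical counter; the MC design stores statistics as double.
class Counter
{
public:
  void increment (double by = 1.0)
  {
    double cur = value_.load ();
    // CAS loop: std::atomic<double>::fetch_add only exists from C++20.
    // On failure, cur is reloaded with the current value and we retry.
    while (!value_.compare_exchange_weak (cur, cur + by))
      {
      }
  }

  double read () const { return value_.load (); }

  // Simple reset: increments made after a separate read() are discarded.
  void reset () { value_.store (0.0); }

  // Read-and-reset in one atomic step: nothing can slip in between.
  double read_and_reset () { return value_.exchange (0.0); }

private:
  std::atomic<double> value_ {0.0};
};
```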

Comment 7 Johnny Willemsen 2007-10-29 14:12:22 CDT
Some old feedback from Symantec on the SOW

	1) Do we really need a complex rule / constraint language?  

  	   Good question.  I don't think we do - I would like this to be
pretty simplistic / easy to implement and use.  Initially we might just
support simple X > Y threshold crossings.  I don't think we want
anything nearly as complex as the ETCL grammar.

	2) Do we really need to run arbitrary code?  

	   Another good question.  I think we do, and I think this will
likely be easy to implement.  Let's make sure we implement a simple
logging action for the first pass. 

	3) I've seen things like this cause a 10% hit on performance -
can we limit this to just debug builds?

	   I think the key value for this is in production builds - so
while we do want to support conditional compilation of this, I would
need the performance hit to be very low (say 1-2%).  We may need to
conditionally compile the code based on the area being monitored to hit
this performance target... (i.e., enable compilation in notify service
and connection cache, disable in orb).  This means there cannot be a
single ON/OFF switch - we will need a per-area or per-type means of enabling
this.  I think there should be some preference given to using templates
instead of macros... 
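A first-pass sketch of the simple X > Y threshold with a logging action, along the lines suggested above. Threshold_Trigger is a hypothetical name, and std::clog stands in for ACE logging so the example builds without ACE; passing a custom action replaces the default logging one.

```cpp
#include <cassert>
#include <functional>
#include <iostream>
#include <string>
#include <utility>

// Hypothetical trigger: runs its action when the value exceeds a threshold.
class Threshold_Trigger
{
public:
  using Action = std::function<void (const std::string &, double)>;

  Threshold_Trigger (std::string monitor, double threshold,
                     Action action = log_action)
    : monitor_ (std::move (monitor)), threshold_ (threshold),
      action_ (std::move (action)) {}

  // Returns true if value > threshold held and the action ran.
  bool check (double value) const
  {
    if (!(value > threshold_))
      return false;
    action_ (monitor_, value);
    return true;
  }

  // Default action: just log (ACE_DEBUG would be used inside ACE).
  static void log_action (const std::string &name, double value)
  {
    std::clog << "monitor " << name << " crossed threshold: "
              << value << '\n';
  }

private:
  std::string monitor_;
  double threshold_;
  Action action_;
};
```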

Comment 8 Johnny Willemsen 2007-10-29 14:14:44 CDT
Idea on implementing the policies from Ossama

I'm not sure what others think, but I was thinking more along the lines
of Andrei Alexandrescu's "policy-based design":

	http://en.wikipedia.org/wiki/Policy-based_design

Conceivably you could have a "null counter" policy by default.  However,
I suppose we'd need to find out how much overhead, if any, is added by
instantiating a no-op counter.  If it's non-trivial, I suppose we could
wrap it within a macro similar to what we do with the ACE_MT macro.
Another option could be to leverage FOCUS.

Alexandrescu's book, "Modern C++ Design" is excellent, by the way.
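A minimal sketch of the policy-based idea, assuming the only thing being toggled is a counter. The policy and host class names are hypothetical; whether the null policy really costs nothing is exactly the overhead question raised above and would still need measuring.

```cpp
#include <cassert>
#include <cstdint>

// Null policy: does nothing, and its empty inline body can be optimized out.
struct Null_Count_Policy
{
  void increment () {}
  std::uint64_t value () const { return 0; }
};

// Real policy: actually counts.
struct Real_Count_Policy
{
  void increment () { ++count_; }
  std::uint64_t value () const { return count_; }
  std::uint64_t count_ = 0;
};

// Host class: the counting policy is chosen at compile time, so no macro
// is needed to switch monitoring on or off. Null is the default.
template <typename COUNT_POLICY = Null_Count_Policy>
class Connection_Cache
{
public:
  void flush ()
  {
    // ... real flushing work would go here ...
    flushes_.increment ();
  }

  std::uint64_t flush_count () const { return flushes_.value (); }

private:
  COUNT_POLICY flushes_;
};
```

Each area could then pick its policy independently (for example a counting cache in the Notification Service and the null policy in the ORB), which is the per-area enabling Symantec asked for without a single ON/OFF switch.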
Comment 9 Jeff Parsons 2007-10-30 12:27:54 CDT
What are 'the policies'? The things being monitored?
Comment 10 Johnny Willemsen 2007-10-31 08:23:31 CDT
By 'policy' we mean whether or not monitoring is done at all; instead of selecting that with #defines, use templates.
Comment 11 Jeff Parsons 2007-11-05 12:08:32 CST
More questions regarding vagueness in the SOW:

Runtime monitor toggling - do they want interactive or programmatic only?

Logging - they want remotely accessible logging info - does that require ACE distributed logging or will remotely accessible log files be enough?
Comment 12 Johnny Willemsen 2007-11-06 08:31:26 CST
(In reply to comment #11)
> More questions regarding vagueness in the SOW:
> 
> Runtime monitor toggling - do they want interactive or programmatic only?

Not sure

> Logging - they want remotely accessible logging info - does that require ACE
> distributed logging or will remotely accessible log files be enough?

I think we just have to use ACE logging, then the ACE logging framework delivers the user the flexibility to redirect the output
Comment 13 Jeff Parsons 2007-11-06 08:43:44 CST
Right but there are two levels of ACE logging - the simpler way of using the singleton class plus the associated macros, or the full-blown logging server and logging client/proxy. Sorry, I should have made that distinction clearer in my question.

On an unrelated note, I notice that the SOW talks about monitoring CPU utilization and memory usage. I also notice that there is nothing provided in ACE to do this directly. However, Will pointed me to a paper on a platform-independent API for doing this kind of stuff. Although the tool implemented by the authors is in Python, the paper talks at some length about the underlying system interfaces (for Windows, Linux and Solaris) used by the tool. This information should be enough to guide the implementation of classes in ACE to accomplish the same things in C++. I think it would be a nice addition to the library.
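As a sense of scale for the Linux side, CPU utilization can be derived from two samples of the aggregate "cpu" line in /proc/stat. This is a minimal sketch, not an ACE API, and the names are hypothetical. It assumes idle and iowait count as not busy, and it reads only the first eight fields because the guest fields are already included in user and nice. Windows and Solaris would need their own system interfaces.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// One sample of the aggregate "cpu" line from Linux /proc/stat.
struct Cpu_Sample
{
  std::uint64_t busy = 0;
  std::uint64_t total = 0;
};

// Parses "cpu  user nice system idle iowait irq softirq steal [guest ...]".
// Returns false if the line is not the aggregate cpu line.
inline bool parse_cpu_line (const std::string &line, Cpu_Sample &out)
{
  std::istringstream in (line);
  std::string label;
  if (!(in >> label) || label != "cpu")
    return false;

  std::uint64_t field = 0, idle = 0, total = 0;
  // Only the first 8 fields: guest/guest_nice are already in user/nice.
  for (int i = 0; i < 8 && in >> field; ++i)
    {
      total += field;
      if (i == 3 || i == 4)     // idle, iowait
        idle += field;
    }
  out.busy = total - idle;
  out.total = total;
  return total > 0;
}

// Fraction of time the CPUs were busy between two samples, in [0, 1].
inline double cpu_load (const Cpu_Sample &before, const Cpu_Sample &after)
{
  const std::uint64_t dt = after.total - before.total;
  return dt == 0 ? 0.0
                 : static_cast<double> (after.busy - before.busy) / dt;
}
```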
Comment 14 Jeff Parsons 2007-11-09 14:46:36 CST
Created attachment 867
summary of issues, resolutions, additional points in 11/12/07 telecon
Comment 15 Jeff Parsons 2007-11-12 10:43:12 CST
Comment on attachment 867
summary of issues, resolutions, additional points in 11/12/07 telecon

Resolutions

1. No interactive toggling of monitor itself is needed. However, logging and constraint checking can be interactive (or controlled by a cron job) to keep monitors as lightweight as possible by default. Logging could be tied to constraint checking, maybe even be the default trigger action, overridden when there is custom action code.

2. Web services was mentioned, but there's no need to address that use case specifically at this time. We can divide our predefined counters into three groups: ACE (low level resources), TAO (CORBA-specific) and Notification Service-specific.

3. Since ETCL depends only on ACE, it's acceptable to use it as a ready-made constraint creator/checker.

4. Distributed logging is not needed; Symantec has their own similar mechanism that they will plug in to the simpler version of ACE logging, using the class ACE_Log_Msg and associated macros.

Other points

- Physical memory location of a monitor should be accessible given its string name, so core dumps can be analyzed by a debugger or some other offline tool.

- 64-bit counters across the board are acceptable; the memory overhead won't be a problem. In the future, however, we might think about an overflow-proof counter.
Comment 16 Jeff Parsons 2007-11-28 10:58:13 CST
If the logging output of all monitors is to go to a single sink (otherwise each monitor point would need its own instantiation of ACE_Log_Msg), I don't see how we can further simplify the ACE logging API. The application might as well use it directly - no need to integrate it with the monitoring API. Instead of global vs per-monitor, we might be asked to make the granularity per-application, but if each application doesn't run in a separate thread, that would be very tricky.
Comment 17 Jeff Parsons 2007-11-28 13:42:20 CST
Regarding the SOW requirement that a disabled monitor point have no performance overhead - if a monitor point's data need never be passed directly to the application, all accessor methods that return a value can be eliminated. If all remaining methods return void, the no-op version of these methods (for a template specialization of the class where a boolean 'enabled' parameter is FALSE) may be optimized away by the compiler. This situation is doable if data is sent out only via ACE logging macros to some sink available to ACE_Log_Msg.
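A sketch of that idea, assuming output goes only through logging (here an std::ostream standing in for an ACE_Log_Msg sink). Monitor_Point is a hypothetical name. The false specialization has no data and only empty inline void methods, which a compiler can typically remove entirely; that is likely rather than guaranteed, and worth confirming in the performance tests.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>
#include <sstream>

// Enabled monitor point: accumulates a total and can only report it
// through a log stream. No method returns the value to the application.
template <bool ENABLED>
class Monitor_Point
{
public:
  explicit Monitor_Point (const char *name) : name_ (name) {}
  void add (std::uint64_t n) { total_ += n; }
  void log (std::ostream &os = std::clog) const
  {
    os << name_ << " = " << total_ << '\n';
  }
private:
  const char *name_;
  std::uint64_t total_ = 0;
};

// Disabled monitor point: no data members, and every method is an empty
// inline void function that the optimizer can usually drop.
template <>
class Monitor_Point<false>
{
public:
  explicit Monitor_Point (const char *) {}
  void add (std::uint64_t) {}
  void log (std::ostream & = std::clog) const {}
};
```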
Comment 18 Jeff Parsons 2007-12-18 11:23:54 CST
Summary of 12/18 telecon with sponsor:

Design suggestions:

- All string names limited to ASCII identifiers (I've
  restricted this to CORBA-compliant ASCII identifiers
  in the requirements doc - DONE).

- Add readymade TAO monitor for the depth of a nested
  upcall (requirements doc, class diagram and doxygen
  files have been updated - DONE).

- Add monitor point lifecycle diagram (TODO).

- Improve some class name choices in class & sequence
  diagrams (TODO).

Implementation suggestions:

- Lock only writes to the repository - reads don't lock
  and check a counter/dirty bit before & after, repeat
  read as necessary.

- Create mechanism to deal with hysteresis (flood of 
  triggered actions due to jitter in monitored value
  around a constraint threshold).
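The lock-only-writes suggestion is essentially a sequence lock. A minimal sketch, assuming one mutex for writers and the hypothetical Seq_Locked_Stat and Snapshot names: a writer makes the sequence number odd while it updates, and a reader retries until it sees the same even sequence number before and after copying. The fields are std::atomic with relaxed ordering so the retried reads are not data races under the C++ memory model.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <limits>
#include <mutex>

// Hypothetical set of fields that readers need to see consistently.
struct Snapshot
{
  double last;
  double max;
  std::uint64_t samples;
};

class Seq_Locked_Stat
{
public:
  // Writers serialize on the mutex; the sequence is odd during the update.
  void record (double v)
  {
    std::lock_guard<std::mutex> guard (write_lock_);
    const std::uint64_t s = seq_.load (std::memory_order_relaxed);
    seq_.store (s + 1, std::memory_order_relaxed);
    std::atomic_thread_fence (std::memory_order_release);
    last_.store (v, std::memory_order_relaxed);
    if (v > max_.load (std::memory_order_relaxed))
      max_.store (v, std::memory_order_relaxed);
    samples_.store (samples_.load (std::memory_order_relaxed) + 1,
                    std::memory_order_relaxed);
    seq_.store (s + 2, std::memory_order_release);
  }

  // Readers take no lock; they repeat the read until it was not disturbed.
  Snapshot read () const
  {
    for (;;)
      {
        const std::uint64_t s1 = seq_.load (std::memory_order_acquire);
        if (s1 & 1)
          continue;                        // a writer is active; retry
        Snapshot snap { last_.load (std::memory_order_relaxed),
                        max_.load (std::memory_order_relaxed),
                        samples_.load (std::memory_order_relaxed) };
        std::atomic_thread_fence (std::memory_order_acquire);
        if (seq_.load (std::memory_order_relaxed) == s1)
          return snap;                     // unchanged, so consistent
      }
  }

private:
  std::mutex write_lock_;
  std::atomic<std::uint64_t> seq_ {0};
  std::atomic<double> last_ {0.0};
  std::atomic<double> max_ {std::numeric_limits<double>::lowest ()};
  std::atomic<std::uint64_t> samples_ {0};
};
```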

Next steps:

- Begin implementation of a simple end-to-end use case,
  for example CPU load monitoring.

- Have another telecon sometime after the holidays.

Comment 19 Jeff Parsons 2008-03-05 10:56:37 CST
Here's new and pertinent stuff that came out of
the telecon with Andrew Schnable on 3/5/08

- desired monitor in TAO: frequency of connection 
  cache flush

- desired monitors in Notification Service:
  # of admins, # of proxies (I haven't checked,
  these may already exist)

- ORB-level monitors may involve some
  statistics (for example an average - we can get
  more input from Symantec when needed). Of course
  Notification Service monitors already do stats

- both updates & queries of monitor values must be
  thread-safe

- need support for periodic queries, similar to
  periodic updates

- need support for multiple constraints on a monitor
  point, each with its associated control action 
  (constraint lists are already supported in NS filters,
  but not yet in the ACE ETCL subset)

- constraints should be evaluated at query time,
  rather than at update time as we now have

- performance tests should use a simple (maybe already
  existing) TAO benchmark and compare results with
  and without monitors, to get % performance hit
  incurred by monitors
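A sketch combining several of these points: multiple constraints on one monitor point, each with its own action, evaluated when the value is queried rather than when it is updated. It also shows one simple answer to the hysteresis item from the 12/18 telecon: a constraint that has fired stays disarmed until the value drops back to a lower re-arm level. The names are hypothetical, and the sketch is single-threaded; the thread-safety requirement above would still need, for example, a mutex around update() and query().

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical constraint with a hysteresis band [low, high].
struct Constraint
{
  Constraint (double hi, double lo, std::function<void (double)> act)
    : high (hi), low (lo), action (std::move (act)) {}

  double high;                          // fire when value > high
  double low;                           // re-arm when value <= low
  std::function<void (double)> action;  // control action for this constraint
  bool armed = true;
};

class Queried_Monitor
{
public:
  // Updates only store the value; no constraint is checked here.
  void update (double v) { value_ = v; }

  void add_constraint (double high, double low,
                       std::function<void (double)> action)
  {
    constraints_.emplace_back (high, low, std::move (action));
  }

  // Every constraint is evaluated at query time.
  double query ()
  {
    for (Constraint &c : constraints_)
      {
        if (c.armed && value_ > c.high)
          {
            c.armed = false;            // ignore jitter until re-armed
            c.action (value_);
          }
        else if (!c.armed && value_ <= c.low)
          c.armed = true;
      }
    return value_;
  }

private:
  double value_ = 0.0;
  std::vector<Constraint> constraints_;
};
```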