Xymon Mailing List Archive

Thoughts

Dan Vande More
Wed, 2 May 2007 15:09:11 -0600
Message-Id: <user-c2666650f758@xymon.invalid>

Indeed, it seems to me that the whole group concept is a good way to work
with us humans but breaks down badly when dealing with computers. This is
fine, because most of us use groups to save space on the screen and to
keep the configuration files short.

If you want tests for each process and ultimately different behaviours for
each process, you need to be prepared to do the work and make the tests for
each process.

Please don't overcomplicate Hobbit for this - it's a corner case, and
handling it will ultimately make the program more unwieldy.

On 5/2/07, Henrik Stoerner <user-ce4a2c883f75@xymon.invalid> wrote:
On Wed, May 02, 2007 at 02:06:34PM -0500, Kruse, Jason K. wrote:
Grouped items, such as the process check and log monitors, are issues.
A single process down causes the whole check to go red.  A process
listed as alerting only operators can then mask another process on the
same system from notifying the DBAs.  Setting the alert repeat interval
to 0 shows the other problem: a recovery message is not generated for
each process that recovers, only when the whole group of processes
recovers.

This will be difficult to handle - it's a very basic thing in the Hobbit
design that it only tracks the color of each status, not the details of
which rule (out of many) causes e.g. the "procs" column to go red.

To do that, you would need to associate some "event ID" with each of the
settings that can cause a red/yellow status; e.g. you'd have

   HOST=myhost
       PROC tnslistener 1 ID=100
       PROC httpd 4 ID=200

The "procs" status would then store the set of IDs that had been
triggered for a status, and whenever there was a change in the set of
triggered rules it would pass this information to some process.
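
To make the bookkeeping concrete, here is a minimal sketch of comparing
the set of triggered IDs between two reports. It is only an
illustration: ID= is the proposed setting from the example above, and
diff_triggered is a hypothetical helper, not anything that exists in
Hobbit.

```python
# Hypothetical sketch: the rule IDs mirror the proposed ID= settings
# (100 = tnslistener, 200 = httpd). Nothing here is Hobbit code.

def diff_triggered(previous, current):
    """Return (newly_failed, recovered) rule IDs between two reports."""
    newly_failed = current - previous   # triggered now, not before
    recovered = previous - current      # triggered before, not now
    return sorted(newly_failed), sorted(recovered)


previous = {100}   # last report: tnslistener was down
current = {200}    # this report: tnslistener is back, httpd is down

failed, recovered = diff_triggered(previous, current)
print("failed:", failed)        # failed: [200]
print("recovered:", recovered)  # recovered: [100]
```

The "procs" column is red in both reports, so tracking only the color
sees no change, while the set difference shows one recovery and one new
failure.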

It can be done, but I am not particularly happy with it; it seems a bit
too complex for my taste. If anyone has a better idea, please speak up.

(And just in case you wonder why I've used a new "event ID" instead of
re-using the existing "group" definition: I can easily imagine a
scenario where you have e.g. multiple processes monitored with alerts
going to one group of people (i.e. several PROC rules have the same
GROUP setting), but you still want to track exactly which processes are
up or down - and then you need a unique ID for each PROC rule).
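
The difference between the two settings can be sketched as follows,
assuming a hypothetical rule table in which several rules share a GROUP
but each has its own ID. The process names, group names and the
failures_by_group helper are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rule table: (rule ID, process name, alert group).
# Two rules share the "dba" group but keep distinct IDs.
RULES = [
    (100, "tnslistener", "dba"),
    (110, "oracle", "dba"),
    (200, "httpd", "operators"),
]


def failures_by_group(triggered_ids):
    """Map each alert group to the names of its failing processes."""
    result = defaultdict(list)
    for rule_id, name, group in RULES:
        if rule_id in triggered_ids:
            result[group].append(name)
    return dict(result)


print(failures_by_group({110, 200}))
# {'dba': ['oracle'], 'operators': ['httpd']}
```

The group decides who is told; the ID is what lets the alert say that
oracle is down while tnslistener is still up.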


Regards,
Henrik