I think that whatever solution is decided on should work for all other tests (in some form or another). My need for this would be the Disk report. I have a good number of database servers whose disks fill up regularly. These disks are located on SAN, and we either clean up the disk or put in a request for the SAN storage to be expanded. A storage expansion request can take 2-6 weeks to be fulfilled, so ack'ing disk for that long 'blinds' you to any other disk issue that may crop up. A way to ack just one volume would be very desirable. I have actually written an ext test module to do this; I am still in the process of bringing it over from BigBrother.
Another disk scenario I have is similar to the point raised with ports: a server shared between 2 groups (or more). Being able to have multiple disk reports would be very welcome, so that groupA has a dedicated report for their volumes and so does groupB. I realize alerting can already be split up this way, but a way to split the reports would be nice to have as well. We use Alternative Pagesets a lot, which could be the reason I like the idea of being able to split one report into multiple reports. We create a Pageset for GroupA, with just the devices & reports that GroupA cares about. But with a report like disk, the status could be red due to a GroupB volume, and this sometimes confuses GroupA :(
I think the simplest solution would be to have a parameter in the hobbit-clients.cfg:
DISK %(/mnt/Vol1|/mnt/Vol3|/mnt/Vol4) 90 95 REPORTALIAS=disk_a
DISK %(/mnt/Vol2|/mnt/Vol5|/mnt/Vol6) 80 95 REPORTALIAS=disk_b
DISK %!(disk_a|disk_b) 96 98
The last DISK rule sets alert values for all other volumes, except those covered by disk_a & disk_b. The same REPORTALIAS feature could be used for MSGS, PORTS, PROCS, FILES, etc. And these alias names could be used in the alert rules, instead of GROUP=.
Now the above suggestion still does not help when a report already has an alert status (red|yellow) and more alert items are added or removed. I would love to be alerted when a report has more or fewer items in it than it did previously. The simplest way I see to do that is by including an alertstate field when the status is sent in to hobbit. I would imagine that this could be added to the first line of the status report, i.e.
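A rough sketch of how the REPORTALIAS split might behave, written in Python purely for illustration. REPORTALIAS is the proposal above, not an existing hobbit-clients.cfg option, and the rule table, `classify` and `build_reports` are hypothetical names. The sketch skips config parsing and models the three DISK lines directly as data, with the last rule acting as the catch-all for volumes not claimed by disk_a or disk_b.

```python
import re

# Hypothetical rule table mirroring the three DISK lines above:
# (pattern, warn %, panic %, report name). A pattern of None is the catch-all.
DISK_RULES = [
    (re.compile(r"^(/mnt/Vol1|/mnt/Vol3|/mnt/Vol4)$"), 90, 95, "disk_a"),
    (re.compile(r"^(/mnt/Vol2|/mnt/Vol5|/mnt/Vol6)$"), 80, 95, "disk_b"),
    (None, 96, 98, "disk"),
]

SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def classify(volume, used_pct):
    """Return (report name, color) for one volume, using the first matching rule."""
    for pattern, warn, panic, report in DISK_RULES:
        if pattern is None or pattern.match(volume):
            if used_pct >= panic:
                return report, "red"
            if used_pct >= warn:
                return report, "yellow"
            return report, "green"
    # Only reached if DISK_RULES has no catch-all entry.
    raise ValueError("no rule matched %s" % volume)


def build_reports(usage):
    """Group volumes by report name and compute each report's worst color."""
    reports = {}
    for volume, used_pct in usage.items():
        report, color = classify(volume, used_pct)
        entry = reports.setdefault(report, {"color": "green", "volumes": {}})
        entry["volumes"][volume] = color
        if SEVERITY[color] > SEVERITY[entry["color"]]:
            entry["color"] = color
    return reports


if __name__ == "__main__":
    usage = {"/mnt/Vol1": 96, "/mnt/Vol2": 70, "/mnt/Vol5": 85, "/": 50}
    for name, report in sorted(build_reports(usage).items()):
        print(name, report["color"], report["volumes"])
```

With this split, a full GroupB volume only turns disk_b red, so GroupA's pageset showing disk_a stays unaffected.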
bin/bb 127.0.0.1 "status server1.disk red (red:/mnt/Vol1:/mnt/Vol2 yellow:/mnt/vol3)
<rest of disk report>"
So in the above example there are 2 volumes with a red status & one with a yellow. When the next status report comes in with (red:/mnt/Vol1 yellow:/mnt/vol3), hobbit would be able to determine that the report had a state change, even though the disk report still has a red status. If reports do not provide this extra 'alertstate' field, it really shouldn't break anything; Hobbit would just behave as it does presently.
Also, a new alert parameter could be added: UPDATES. People who want to receive emails whenever a report's alertstate changes can, and people who just want alerts when reports have an alert status or recover still can. The update alert emails can be as simple as "server1's disk alert status has changed.", or more informative: "server1's disk /mnt/Vol2 alert status has cleared, but there are still disks that have met alert thresholds."
Something else to consider is how this would affect acknowledgments. When acknowledging reports, I think a new option would be needed: ack for the alert status, or ack for the present alertstate. It all depends on how you want to implement it.
Sorry for the very long-winded email, just trying to do a braindump of my thoughts. ~Steve
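To make the alertstate comparison concrete, here is a minimal Python sketch. It assumes the first-line format suggested above, "status host.test color (color:item:item color:item)", which is a proposal and not something hobbit parses today. The function names are hypothetical, and the colon-separated format would need rethinking for item names that themselves contain a colon.

```python
import re

# Assumed first-line format from the proposal; the parenthesised field is optional.
FIRST_LINE = re.compile(r"^status\s+(\S+)\s+(\w+)(?:\s+\(([^)]*)\))?")


def parse_alertstate(first_line):
    """Return (color, {item color: set of items}); empty dict if no alertstate field."""
    match = FIRST_LINE.match(first_line)
    if not match:
        raise ValueError("not a status line: %r" % first_line)
    color, field = match.group(2), match.group(3)
    state = {}
    for token in (field or "").split():
        item_color, *items = token.split(":")
        state[item_color] = set(items)
    return color, state


def describe_change(old_line, new_line):
    """Compare two status first lines and list the items added or cleared."""
    old_color, old_state = parse_alertstate(old_line)
    new_color, new_state = parse_alertstate(new_line)
    old_items = {(c, i) for c, items in old_state.items() for i in items}
    new_items = {(c, i) for c, items in new_state.items() for i in items}
    return {
        "color_changed": old_color != new_color,
        "added": sorted(new_items - old_items),
        "cleared": sorted(old_items - new_items),
    }


if __name__ == "__main__":
    before = "status server1.disk red (red:/mnt/Vol1:/mnt/Vol2 yellow:/mnt/vol3)"
    after = "status server1.disk red (red:/mnt/Vol1 yellow:/mnt/vol3)"
    print(describe_change(before, after))
```

A status line without the parenthesised field parses to an empty alertstate, so two such reports never show a difference and only the overall color matters, which matches the "shouldn't break anything" behaviour described above.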
On Wednesday 02 May 2007 17:24, Kruse, Jason K. wrote:
Actually, you just indirectly mentioned what feels like a fairly elegant
solution. What would be nice in this particular case would be to be able
to attach a service label to the PROCS tests for groups of processes. The
service could then be monitored without custom tests being created for each
one. New columns can be created from the service tag without really
cluttering the lines.
I'll have to think about how the log files are processed to see if
something like that works or not.
Jason
From: Dan Vande More [mailto:user-f3c4c62d9d50@xymon.invalid]
Sent: Wed 5/2/2007 4:09 PM
To: user-ae9b8668bcde@xymon.invalid
Subject: Re: [hobbit] Thoughts
Indeed, it seems to me that the whole group concept is a good way to work
with us humans but breaks down wildly when dealing with computers. This is
fine because most of us use the groups to save space on the screens, and
configuration in the conf files.
If you want tests for each process and ultimately different behaviours for
each process, you need to be prepared to do the work and make the tests for
each process.
Please don't overcomplicate hobbit for this - it's a corner case and will
ultimately make the program more unwieldy.
On 5/2/07, Henrik Stoerner <user-ce4a2c883f75@xymon.invalid> wrote:
On Wed, May 02, 2007 at 02:06:34PM -0500, Kruse, Jason K. wrote:
Grouped items, such as the process check and log monitors, are issues.
A single process down causes the whole check to go red. A process
listed as alerting only operators can then mask another process on the
same system from notifying the DBAs. Setting the alert repeat interval
to 0 shows the other problem: a recovery message is not generated for
each process that recovers, only when the whole group of processes
recovers.
This will be difficult to handle - it's a very basic thing in the Hobbit
design that it only tracks the color of each status, not the details of
which rule (out of many) causes e.g. the "procs" column to go red.
To do that, you would need to associate some "event ID" with each of the
settings that can cause a red/yellow status; e.g. you'd have
HOST=myhost
PROC tnslistener 1 ID=100
PROC httpd 4 ID=200
The "procs" status would then store the set of ID's that had been
triggered for a status, and whenever there was a change in the set of
triggered rules it would pass this information to some process.
It can be done, but I am not particularly happy with it; it seems a bit
too complex for my taste. If anyone has a better idea, please speak up.
(And just in case you wonder why I've used a new "event ID" instead of
re-using the existing "group" definition: I can easily imagine a
scenario where you have e.g. multiple processes monitored with alerts
going to one group of people (i.e. several PROC rules have the same
GROUP setting), but you still want to track exactly which processes are
up or down - and then you need a unique ID for each PROC rule).
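The event-ID idea could be sketched roughly like this in Python. This is an illustration only: the ID= setting is the suggestion above, not an existing option, and `PROC_RULES`, `triggered_ids` and `ProcsStatus` are hypothetical names. The status remembers the set of triggered IDs from the previous report and reports which IDs were added or removed, even while the overall color stays red.

```python
# Hypothetical rule table: event ID -> (process name, minimum count),
# mirroring the two PROC lines in the example above.
PROC_RULES = {100: ("tnslistener", 1), 200: ("httpd", 4)}


def triggered_ids(process_counts):
    """Return the set of rule IDs whose process count is below the minimum."""
    return {
        rule_id
        for rule_id, (name, minimum) in PROC_RULES.items()
        if process_counts.get(name, 0) < minimum
    }


class ProcsStatus:
    """Keeps the triggered-ID set between reports and returns the changes."""

    def __init__(self):
        self.triggered = set()

    def update(self, process_counts):
        current = triggered_ids(process_counts)
        newly_failed = current - self.triggered
        recovered = self.triggered - current
        self.triggered = current
        color = "red" if current else "green"
        return color, sorted(newly_failed), sorted(recovered)


if __name__ == "__main__":
    status = ProcsStatus()
    print(status.update({"tnslistener": 0, "httpd": 4}))
    print(status.update({"tnslistener": 0, "httpd": 2}))
    print(status.update({"tnslistener": 1, "httpd": 2}))
    print(status.update({"tnslistener": 1, "httpd": 5}))
```

In the third update the column is still red because of httpd, yet the recovery of ID 100 is visible on its own, which is the per-process recovery that the color-only tracking cannot express.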
Regards,
Henrik