Thoughts
list Jason K. Kruse
I'm going to be tasked with integrating Hobbit with an event aggregator such as IBM Netcool or EMC Smarts sometime in the near future. Most items will be cake, such as ping, port, and filesystem tests. Certain items, however, will be problematic from what I've seen. I'll describe the problems to see if anyone has ideas about how they can be addressed.
The event aggregator does not want all status information for each item monitored, only triggered events that cause a state change. This is easy to configure using a script in the alerts file that matches all hosts for all colors. When a test recovers, a recovery is sent and the event disappears from our view.
Grouped items, such as the process check and log monitors, are the issue. A single process down causes the whole check to go red. A process listed as alerting only operators can then mask another process on the same system from notifying the DBAs. Setting the alert repeat interval to 0 shows the other problem: a recovery message is not generated for each process that recovers, only when the whole group of processes recovers.
The only way I've been able to wrap my head around this is to use a database for the current state of all monitored processes and log files. I tried separate alert groups with rules to distinguish each target of a process separately, but the rules match multiple times and lead to confusion. I tried using a channel on the hobbitd alert, but the GROUP= items from the configuration file do not get passed as associated to the process, only on the page line.
Any ideas or feedback are appreciated.
Jason
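The "database for the current state" idea above could look roughly like the following. This is only a minimal sketch, not anything Hobbit provides: the table name `item_state`, the functions `open_state_db` and `record`, and the event dictionary layout are all assumptions made for illustration. It keeps one row per host/test/item and returns an event only when that item's color changes, so each process gets its own problem and recovery.

```python
import sqlite3


def open_state_db(path=":memory:"):
    """Open (or create) the hypothetical per-item state store."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS item_state ("
        " host TEXT, test TEXT, item TEXT, color TEXT,"
        " PRIMARY KEY (host, test, item))"
    )
    return db


def record(db, host, test, item, color):
    """Store the item's color; return an event dict only on a state change."""
    row = db.execute(
        "SELECT color FROM item_state WHERE host=? AND test=? AND item=?",
        (host, test, item),
    ).fetchone()
    # An item never seen before is assumed to have been green.
    previous = row[0] if row else "green"
    if previous == color:
        return None
    db.execute(
        "INSERT OR REPLACE INTO item_state VALUES (?, ?, ?, ?)",
        (host, test, item, color),
    )
    db.commit()
    kind = "recovery" if color == "green" else "problem"
    return {"kind": kind, "host": host, "test": test,
            "item": item, "color": color}
```

A script called from the alerts file would still have to work out which individual process or log entry changed before calling `record`; that parsing step is not shown here.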
list Henrik Størner
On Wed, May 02, 2007 at 02:06:34PM -0500, Kruse, Jason K. wrote:
Grouped items, such as the process check and log monitors, are issues. A single process down causes the whole check to go red. A process listed as alerting only operators can then mask another process on the same system from notifying the DBA's. Setting the alert repeat interval to 0 shows the other problem, a recovery message is not generated for each process that recovers, only when the whole group of processes recovers.
This will be difficult to handle - it's a very basic thing in the Hobbit
design that it only tracks the color of each status, not the details of
which rule (out of many) causes e.g. the "procs" column to go red.
To do that, you would need to associate some "event ID" with each of the
settings that can cause a red/yellow status; e.g. you'd have
HOST=myhost
PROC tnslistener 1 ID=100
PROC httpd 4 ID=200
The "procs" status would then store the set of IDs that had been triggered
for a status, and whenever there was a change in the set of triggered
rules it would pass this information to some process.
It can be done, but I am not particularly happy with it; it seems a bit too
complex for my taste. If anyone has a better idea, please speak up.
(And just in case you wonder why I've used a new "event ID" instead of
re-using the existing "group" definition: I can easily imagine a
scenario where you have e.g. multiple processes monitored with alerts
going to one group of people (i.e. several PROC rules have the same
GROUP setting), but you still want to track exactly which processes are
up or down - and then you need a unique ID for each PROC rule).
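The event-ID idea described above amounts to comparing the set of triggered IDs before and after each status update. The following is a rough sketch of that comparison only, not an existing Hobbit feature: the function `diff_triggered`, the `RULES` table, and the group name `dba` are hypothetical, while the IDs 100 and 200 mirror the example rules.

```python
# Hypothetical rule table: two PROC rules share one GROUP but have unique IDs.
RULES = {
    100: {"proc": "tnslistener", "group": "dba"},
    200: {"proc": "httpd", "group": "dba"},
}


def diff_triggered(previous_ids, current_ids):
    """Return which rule IDs newly triggered and which recovered."""
    previous_ids, current_ids = set(previous_ids), set(current_ids)
    return {
        "triggered": sorted(current_ids - previous_ids),
        "recovered": sorted(previous_ids - current_ids),
    }


# Both rules were failing; now only httpd (ID 200) still fails.
changes = diff_triggered({100, 200}, {200})
recovered_procs = [RULES[i]["proc"] for i in changes["recovered"]]
```

Because the comparison works on IDs rather than on group names, a recovery for `tnslistener` can be reported even though the "procs" column as a whole stays red.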
Regards,
Henrik
list Dan Vande More
Indeed, it seems to me that the whole group concept is a good way to work with us humans but breaks down wildly when dealing with computers. This is fine because most of us use the groups to save space on the screens, and configuration in the conf files. If you want tests for each process and ultimately different behaviours for each process, you need to be prepared to do the work and make the tests for each process. Please don't overcomplicate hobbit for this - it's a corner case and will ultimately make the program more unwieldy.
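One way to read "make the tests for each process" is a small client-side extension that reports one status column per process instead of a single grouped "procs" column. The sketch below only builds the message text and does not send anything. It assumes the usual `status HOST.TEST COLOR text` message form with dots in the hostname written as commas; the column prefix `proc_` and the function name `per_process_status` are made up for illustration.

```python
def per_process_status(host, expected, running_counts):
    """Build one status message per monitored process.

    expected:       {process name: minimum number of instances}
    running_counts: {process name: instances currently running}
    """
    messages = []
    wire_host = host.replace(".", ",")  # dots in hostnames become commas
    for proc, minimum in sorted(expected.items()):
        count = running_counts.get(proc, 0)
        color = "green" if count >= minimum else "red"
        column = "proc_" + proc  # hypothetical per-process column name
        messages.append(
            f"status {wire_host}.{column} {color} "
            f"{proc}: {count} running, {minimum} required"
        )
    return messages
```

Each column then has its own color, alert rules, and recovery, at the cost of more columns on the display and more configuration to maintain.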
list Scott Walters
On 5/2/07, Henrik Stoerner <user-ce4a2c883f75@xymon.invalid> wrote:
To do that, you would need to associate some "event ID" with each of the
settings that can cause a red/yellow status; e.g. you'd have
HOST=myhost
PROC tnslistener 1 ID=100
PROC httpd 4 ID=200
Hmmmm.... this is a bit of a bummer, and inspirational. I was hoping some of the features of the bb-hosts grouping functions would allow "groups" of "individual checks." In big shops where you have many groups of people supporting systems it would be a big help, and logically I think it would be a great benefit to "monitoring the right thing" and sharing configs.
Example: ssh should get monitored. What does that mean? The sshd proc is up (and of course being trended for instances ;), a listener is on port 22, /var/log/secure does not have "sshd cannot bind" or "check_pass", and file permissions/integrity on sshd and ssh-keysign are XYZ.
Henrik, I like the idea of every "individual check" being assigned a "primary key"/event ID, because then you could potentially do all this aggregation/grouping on the server. You are absolutely correct that an individual check could be needed by multiple "services" or groups, for example the security team and the DBA team.
If the unique ID was "hidden" and automatic that would be nice too, so that us humans wouldn't need to keep the mappings in our heads. So "host proc" could be referenced and not "event id".
So in the end you would maybe end up with something like this (please forgive any syntax sloppiness/incorrectness):
SERVICE=ssh
PROC sshd
PORT 22
LOGCHECK /var/log/secure "check_pass"
SERVICE=mysql
PROC safe_mysqld mysqld
HOST=dbserver
SERVICE ssh mysql
Does this make any sense?
Scott
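The service grouping proposed above, with "hidden" automatic IDs, could be expanded on the server roughly as follows. This is a sketch of the proposal only; no such SERVICE syntax is described as existing in Hobbit. The `SERVICES` and `HOSTS` tables mirror the example configuration, while the function `expand` and the `host/kind/target` key format are assumptions chosen so that a check can be referenced by readable names rather than a numeric event ID.

```python
# Hypothetical in-memory form of the SERVICE/HOST example.
SERVICES = {
    "ssh": [
        ("PROC", "sshd"),
        ("PORT", "22"),
        ("LOGCHECK", '/var/log/secure "check_pass"'),
    ],
    "mysql": [("PROC", "safe_mysqld mysqld")],
}
HOSTS = {"dbserver": ["ssh", "mysql"]}


def expand(hosts, services):
    """Expand service names into individual checks keyed by an automatic ID."""
    checks = {}
    for host, names in hosts.items():
        for service in names:
            for kind, target in services[service]:
                # The key doubles as the "hidden" unique ID for the check.
                key = f"{host}/{kind}/{target}"
                entry = checks.setdefault(
                    key,
                    {"host": host, "kind": kind, "target": target,
                     "services": []},
                )
                # One check may belong to several services or teams.
                entry["services"].append(service)
    return checks


checks = expand(HOSTS, SERVICES)
```

If two services list the same check on a host, it is stored once and both service names are attached to it, which matches the case of one check mattering to both the security team and the DBA team.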