Thoughts

4 messages in this thread

list Jason K. Kruse · Wed, 2 May 2007 14:06:34 -0500 ·

I'm going to be tasked to integrate hobbit with an event aggregator such
as IBM netcool or EMC Smarts sometime in the near future.  Most items
will be cake, such as ping, port, and filesystem tests.  Certain items,
however, will be problematic from what I've seen.  I'll describe the
problems to see if anyone has ideas about how they can be addressed.

The event aggregator does not want all status information for each item
monitored, only triggered events that cause a state change.  This is
easy to configure using a script in the alerts file that matches all
hosts for all colors.  When a test recovers, a recovery is sent and the
event disappears from our view.

Grouped items, such as the process check and log monitors, are issues.
A single process down causes the whole check to go red.  A process
listed as alerting only operators can then mask another process on the
same system from notifying the DBAs.  Setting the alert repeat interval
to 0 shows the other problem: a recovery message is not generated for
each process that recovers, only when the whole group of processes
recovers.

The only way I've been able to wrap my head around this is to use a
database for the current state of all monitored processes and log files.
I tried separate alert groups with rules to distinguish each target of a
process separately, but the rules match multiple times and lead to
confusion.  I tried using a channel on the hobbitd alert, but the GROUP=
items from the configuration file are not passed along associated with
the process, only on the page line.
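
A minimal sketch of the state-database idea described above, assuming the per-process (or per-log-file) up/down states have already been extracted from the status message by some other means. The table name `item_state` and the functions `open_state_db` and `record_states` are hypothetical and are not part of hobbit; the point is only that comparing each item against its stored state yields one alert or recovery per item instead of one per column.

```python
import sqlite3


def open_state_db(path=":memory:"):
    """Open the database holding the last known state of each monitored item."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS item_state ("
        " host TEXT, test TEXT, item TEXT, state TEXT,"
        " PRIMARY KEY (host, test, item))"
    )
    return db


def record_states(db, host, test, current):
    """Store the current per-item states and return only the transitions.

    `current` maps an item name (e.g. a process name) to "up" or "down".
    Returns a list of (host, test, item, event) tuples, where event is
    "alert" or "recovery". Items never seen before are assumed to be "up".
    """
    events = []
    for item, state in sorted(current.items()):
        row = db.execute(
            "SELECT state FROM item_state WHERE host=? AND test=? AND item=?",
            (host, test, item),
        ).fetchone()
        previous = row[0] if row else "up"
        if previous != state:
            event = "alert" if state == "down" else "recovery"
            events.append((host, test, item, event))
        db.execute(
            "INSERT OR REPLACE INTO item_state (host, test, item, state)"
            " VALUES (?, ?, ?, ?)",
            (host, test, item, state),
        )
    db.commit()
    return events


if __name__ == "__main__":
    db = open_state_db()
    # Both processes go down: two separate alerts.
    print(record_states(db, "myhost", "procs",
                        {"httpd": "down", "tnslistener": "down"}))
    # httpd comes back while tnslistener stays down: one recovery,
    # even though the "procs" column as a whole would still be red.
    print(record_states(db, "myhost", "procs",
                        {"httpd": "up", "tnslistener": "down"}))
```

Each returned tuple could then be forwarded to the event aggregator as its own event, which keeps one process from masking another.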

Any ideas or feedback are appreciated.

Jason

list Henrik Størner · Wed, 2 May 2007 23:01:21 +0200 ·

On Wed, May 02, 2007 at 02:06:34PM -0500, Kruse, Jason K. wrote:

Grouped items, such as the process check and log monitors, are issues.
A single process down causes the whole check to go red.  A process
listed as alerting only operators can then mask another process on the
same system from notifying the DBAs.  Setting the alert repeat interval
to 0 shows the other problem: a recovery message is not generated for
each process that recovers, only when the whole group of processes
recovers.

This will be difficult to handle - it's a very basic thing in the Hobbit
design that it only tracks the color of each status, not the details of
which rule (out of many) causes e.g. the "procs" column to go red.

To do that, you would need to associate some "event ID" with each of the
settings that can cause a red/yellow status; e.g. you'd have

   HOST=myhost
       PROC tnslistener 1 ID=100
       PROC httpd 4 ID=200

The "procs" status would then store the set of IDs that had been triggered
for a status, and whenever there was a change in the set of triggered
rules it would pass this information to some process.
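
The bookkeeping this implies can be sketched as a set difference between the previously and currently triggered IDs. This is only an illustration of the idea, not existing Hobbit code: the `ID=` setting is a proposal, and the names `diff_triggered`, `ProcsStatus` and `notify` are hypothetical.

```python
def diff_triggered(previous, current):
    """Return (newly_triggered, recovered) rule IDs between two evaluations."""
    previous, current = set(previous), set(current)
    return sorted(current - previous), sorted(previous - current)


class ProcsStatus:
    """Holds the set of rule IDs currently triggered for one status."""

    def __init__(self):
        self.triggered = set()

    def update(self, triggered_now, notify):
        """Replace the stored set; call notify() only if the set changed."""
        new, recovered = diff_triggered(self.triggered, triggered_now)
        if new or recovered:
            notify(new, recovered)
        self.triggered = set(triggered_now)


if __name__ == "__main__":
    status = ProcsStatus()
    report = lambda new, recovered: print("new:", new, "recovered:", recovered)
    status.update({100, 200}, report)  # tnslistener and httpd rules trigger
    status.update({200}, report)       # ID 100 recovers, status is still red
    status.update({200}, report)       # no change in the set, nothing reported
```

The color of the status can stay red throughout; what gets passed on is the change in the set.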

It can be done, but I am not particularly happy with it; it seems a bit too
complex for my taste. If anyone has a better idea, please speak up.

(And just in case you wonder why I've used a new "event ID" instead of
re-using the existing "group" definition: I can easily imagine a
scenario where you have e.g. multiple processes monitored with alerts
going to one group of people (i.e. several PROC rules have the same
GROUP setting), but you still want to track exactly which processes are
up or down - and then you need a unique ID for each PROC rule).


Regards,
Henrik

list Dan Vande More · Wed, 2 May 2007 15:09:11 -0600 ·

Indeed, it seems to me that the whole group concept is a good way to work
with us humans but breaks down wildly when dealing with computers. This is
fine because most of us use the groups to save space on the screens, and
configuration in the conf files.

If you want tests for each process and ultimately different behaviours for
each process, you need to be prepared to do the work and make the tests for
each process.

Please don't overcomplicate hobbit for this - it's a corner case and will
ultimately make the program more unwieldy.

list Scott Walters · Thu, 3 May 2007 07:28:01 -0400 ·

On 5/2/07, Henrik Stoerner <user-ce4a2c883f75@xymon.invalid> wrote:

To do that, you would need to associate some "event ID" with each of the
settings that can cause a red/yellow status; e.g. you'd have

   HOST=myhost
       PROC tnslistener 1 ID=100
       PROC httpd 4 ID=200

Hmmmm....this is a bit of a bummer and inspirational.  I was hoping
some of the features of the bb-hosts grouping functions would allow
"groups" of "individual checks."  In big shops where you have many
groups of people supporting systems it would be a big help, as well as
logically I think a great benefit to "monitoring the right thing" and
sharing configs.

Example, ssh should get monitored.  What does that mean?  The sshd
proc is up (and of course being trended for instances ;), a listener
is on port 22, /var/log/secure does not have "sshd cannot bind" or
"check_pass", file permissions/integrity on sshd, ssh-keysign are XYZ.

Henrik, I like the idea of every "individual check" being assigned a
"primary key"/eventid, because then you could potentially do all this
aggregation/grouping on the server.  You are absolutely correct that
individual checks could be needed for multiple "services" or groups,
e.g. the security team and the DBA team.  If the unique id was
"hidden" and automatic that would be nice too, so that us humans
wouldn't need to keep the mappings in our heads.  So "host proc" could
be referenced and not "event id".

So in the end we might end up with something like this (please forgive
any syntax sloppiness/incorrectness):

SERVICE=ssh
  PROC sshd
  PORT 22
  LOGCHECK /var/log/secure "check_pass"

SERVICE=mysql
  PROC safe_mysqld mysqld

HOST=dbserver
  SERVICE ssh mysql

Does this make any sense?
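
To make the "hidden and automatic" ID idea concrete, here is a rough sketch of how such SERVICE definitions might be expanded into individual checks on the server. Everything in it is an assumption for illustration: the SERVICE syntax does not exist, the function names `check_id` and `expand` are made up, and "PROC safe_mysqld mysqld" is read as two separate processes. The ID is derived from host, check type and target, so nobody has to maintain a mapping by hand.

```python
import hashlib

# Hypothetical in-memory form of the SERVICE/HOST definitions above.
SERVICES = {
    "ssh": [("PROC", "sshd"), ("PORT", "22"),
            ("LOGCHECK", '/var/log/secure "check_pass"')],
    "mysql": [("PROC", "safe_mysqld"), ("PROC", "mysqld")],
}
HOSTS = {"dbserver": ["ssh", "mysql"]}


def check_id(host, kind, target):
    """Derive a stable ID from what the check is, so it never needs assigning."""
    key = "\x00".join((host, kind, target)).encode()
    return hashlib.sha1(key).hexdigest()[:12]


def expand(hosts, services):
    """Expand each host's services into individual checks keyed by ID.

    A check that belongs to several services on the same host appears once,
    with every owning service listed, so each team can be notified about it.
    """
    checks = {}
    for host, names in hosts.items():
        for name in names:
            for kind, target in services[name]:
                cid = check_id(host, kind, target)
                entry = checks.setdefault(
                    cid,
                    {"host": host, "kind": kind, "target": target,
                     "services": []},
                )
                entry["services"].append(name)
    return checks


if __name__ == "__main__":
    for cid, check in expand(HOSTS, SERVICES).items():
        print(cid, check["host"], check["kind"], check["target"],
              check["services"])
```

Because the ID depends only on the host and the check itself, it stays the same when services are regrouped, and humans can keep referring to "host proc".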

Scott
