Yellow->red escalation, bug or feature?

21 messages in this thread

list Elizabeth Schwartz · Mon, 9 Jan 2012 10:07:06 -0500 ·

I think this is a bug, but maybe it's a feature I haven't figured out yet:

Many of our alerts are set to email on yellow and page with escalation
on red: alert1 after 10 minutes (repeat every 10), alert2 after 20
minutes, alert3 after 40 minutes, alert4 after an hour. When an
alert is yellow, it sometimes sits around for a while. When an alert
goes red, the alert1 person acks or fixes the alert and the alert4
person should never be woken up.

However, when an alert has been yellow for over an hour and *then*
turns red, we are seeing that the entire escalation group is paged, as
though the alert has been red for over an hour.

I think this is a bug - when the alert first goes red it should be
treated as a NEW alert and not go waking up everyone.

Thoughts? Am I missing something?
Our tier4 person is getting rather annoyed at being woken up for
things that the tier1 person can handle.

thanks Betsy

list Josh Luthman · Mon, 9 Jan 2012 10:11:58 -0500 ·

You're saying yellow for an hour and red for a few seconds triggers
like it was red for an hour?

Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX



list Elizabeth Schwartz · Mon, 9 Jan 2012 11:15:26 -0500 ·

On Mon, Jan 9, 2012 at 10:11 AM, Josh Luthman
<user-4c45a83f15cb@xymon.invalid> wrote:

You're saying yellow for an hour and red for a few seconds triggers
like it was red for an hour?

Exactly. Red for five minutes, anyway :-)  At least some of the time,
I think there's a counter that isn't reset.

list Josh Luthman · Mon, 9 Jan 2012 11:20:29 -0500 ·

What version is this?  I don't think I've got that bug.



Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX



list Elizabeth Schwartz · Mon, 9 Jan 2012 13:56:19 -0500 ·

I am on 4.3.7 now; saw the behavior on earlier 4.3.x versions.

On Mon, Jan 9, 2012 at 11:20 AM, Josh Luthman
<user-4c45a83f15cb@xymon.invalid> wrote:

What version is this?  I don't think I've got that bug.

Here's the most recent test that got everyone annoyed

Sat Jan 07 09:02:06 2012 	green 	2 days 4:48:10
Sat Jan 07 08:16:59 2012 	red 	0:45:07
Sat Jan 07 07:16:50 2012 	yellow 	1:00:09
Sat Jan 07 03:16:16 2012 	green 	4:00:34
Sat Jan 07 02:56:13 2012 	red 	0:20:03
Sat Jan 07 01:56:04 2012 	yellow 	1:00:09

notifications sent:

Sat Jan  7 01:56:04 2012 edprocs3.example.com.watch_oelogs
(10.100.4.57) user-e49400df80ec@xymon.invalid[139] 1325919364 0
Sat Jan  7 02:57:17 2012 edprocs3.example.com.watch_oelogs
(10.100.4.57) user-e49400df80ec@xymon.invalid[139] 1325923037 0
Sat Jan  7 02:57:17 2012 edprocs3.example.com.watch_oelogs
(10.100.4.57) alert1[149] 1325923037 0
Sat Jan  7 02:57:17 2012 edprocs3.example.com.watch_oelogs
(10.100.4.57) alert2[152] 1325923037 0
Sat Jan  7 02:57:17 2012 edprocs3.example.com.watch_oelogs
(10.100.4.57) alert3[153] 1325923037 0
Sat Jan  7 02:57:17 2012 edprocs3.example.com.watch_oelogs
(10.100.4.57) alert4[154] 1325923037 0
Sat Jan  7 03:07:21 2012 edprocs3.example.com.watch_oelogs
(10.100.4.57) alert1[149] 1325923641 0
Sat Jan  7 03:07:21 2012 edprocs3.example.com.watch_oelogs
(10.100.4.57) alert2[152] 1325923641 0
Sat Jan  7 03:07:21 2012 edprocs3.example.com.watch_oelogs
(10.100.4.57) alert3[153] 1325923641 0
Sat Jan  7 03:07:21 2012 edprocs3.example.com.watch_oelogs
(10.100.4.57) alert4[154] 1325923641 0
Sat Jan  7 07:16:53 2012 edprocs3.example.com.watch_oelogs
(10.100.4.57) user-e49400df80ec@xymon.invalid[139] 1325938613 0
Sat Jan  7 08:17:13 2012 edprocs3.example.com.watch_oelogs
(10.100.4.57) user-e49400df80ec@xymon.invalid[139] 1325942233 0
Sat Jan  7 08:17:13 2012 edprocs3.example.com.watch_oelogs
(10.100.4.57) alert1[149] 1325942233 0
Sat Jan  7 08:17:13 2012 edprocs3.example.com.watch_oelogs
(10.100.4.57) alert2[152] 1325942233 0
Sat Jan  7 08:17:13 2012 edprocs3.example.com.watch_oelogs
(10.100.4.57) alert3[153] 1325942233 0
Sat Jan  7 08:17:13 2012 edprocs3.example.com.watch_oelogs
(10.100.4.57) alert4[154] 1325942233 0


You can see that on Saturday it went yellow at 1:56, emailing
"user-e49400df80ec@xymon.invalid", which is our email alert, and then an hour later
it went red and started emailing the world. Then the same thing happened
again, going yellow at 7:16 am and red at 8:16 am.

I note that all of these servers are on EDT, and this test went red
exactly an hour after going yellow because it's a custom test that
goes yellow after so many seconds and red an hour later.
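
The timing in the first incident above can be checked with a little arithmetic. The following is only a sketch: the timestamps are copied from the logs above, the escalation delays are the ones described at the start of the thread, and `tiers_due` is a hypothetical helper (including its `>=` comparison), not anything taken from Xymon itself.

```python
from datetime import datetime

# Timestamps copied from the history and notification logs above.
YELLOW_START = datetime(2012, 1, 7, 1, 56, 4)   # status left green
RED_START = datetime(2012, 1, 7, 2, 56, 13)     # status turned red
FIRST_PAGE = datetime(2012, 1, 7, 2, 57, 17)    # alert1..alert4 all paged

# Escalation delays in minutes, as described at the start of the thread.
ESCALATION = {"alert1": 10, "alert2": 20, "alert3": 40, "alert4": 60}


def tiers_due(elapsed_seconds):
    """Hypothetical helper: tiers whose delay has been reached (assumed >=)."""
    return [tier for tier, minutes in ESCALATION.items()
            if elapsed_seconds >= minutes * 60]


since_yellow = int((FIRST_PAGE - YELLOW_START).total_seconds())
since_red = int((FIRST_PAGE - RED_START).total_seconds())

# Measured from the start of yellow, every tier is already due.
print(since_yellow, tiers_due(since_yellow))
# Measured from the start of red, no tier is due yet.
print(since_red, tiers_due(since_red))
```

The 3673-second gap since yellow also matches the difference between the two epoch values in the notification log (1325923037 - 1325919364), which is consistent with the duration being measured from the moment the status left green rather than from the moment it turned red.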

list Elizabeth Schwartz · Mon, 9 Jan 2012 14:12:38 -0500 ·

Josh Luthman wrote:

You're saying yellow for an hour and red for a few seconds triggers
like it was red for an hour?

I note that the previous example was for a custom test, but I have also
seen this for the disk test (set to email every 8 hours when yellow):


Sat Dec 24 10:53:27 2011 	red 	0:49:09
Sun Dec 18 03:01:51 2011 	yellow 	6 days 7:51:36

Thu Dec 22 17:54:39 2011 jumpstart.example.com.disk (10.100.4.33)
user-e49400df80ec@xymon.invalid[139] 1324594479 100
Fri Dec 23 01:54:40 2011 jumpstart.example.com.disk (10.100.4.33)
user-e49400df80ec@xymon.invalid[139] 1324623280 100
Fri Dec 23 09:54:47 2011 jumpstart.example.com.disk (10.100.4.33)
user-e49400df80ec@xymon.invalid[139] 1324652087 100
Sat Dec 24 10:54:27 2011 jumpstart.example.com.disk (10.100.4.33)
user-e49400df80ec@xymon.invalid[139] 1324742067 100
Sat Dec 24 10:54:27 2011 jumpstart.example.com.disk (10.100.4.33)
alert1[149] 1324742067 100
Sat Dec 24 10:54:27 2011 jumpstart.example.com.disk (10.100.4.33)
alert2[152] 1324742067 100
Sat Dec 24 10:54:27 2011 jumpstart.example.com.disk (10.100.4.33)
alert3[153] 1324742067 100
Sat Dec 24 10:54:27 2011 jumpstart.example.com.disk (10.100.4.33)
alert4[154] 1324742067 100

list Ryan Skadberg · Mon, 9 Jan 2012 15:23:18 -0500 ·

I've seen this exact same issue going all the way back to Hobbit, so this
is not a new issue with 4.3.  I would love to see it fixed though, as it's
very annoying to get paged when you are second or third on call and
everyone gets notified on the first red.

Skadz



list Sebastian Auriol · Tue, 10 Jan 2012 11:24:11 -0000 ·

I agree that this is not a new issue.  I have discussed this before
(http://lists.xymon.com/archive/2009-January/023201.html (Henrik's reply:
http://lists.xymon.com/oldarchive/2009/02/msg00133.html) and
http://lists.xymon.com/archive/2008-September/020998.html).
 
But now that we have flap detection, I'm not sure that Henrik's listed
problem with changing it is really an issue.  So I hope it can be changed!
 
BTW, The oldarchive is better for following threads (provided they don't
cross month boundaries):
http://lists.xymon.com/oldarchive/2008/09/msg00057.html
Compare with the previous link.  However, the new archive keeps attachments.
It would be nice if the functionality of both archives were merged...

Kind regards, 

SebA



list Elizabeth Schwartz · Tue, 10 Jan 2012 14:15:17 -0500 ·

Me too. It's a very serious problem for us.

We need to avoid waking up the tier4 people!

It attracts a ton of attention when we have a small issue waking up
everyone in the house.

list Carl Melgaard · Wed, 11 Jan 2012 10:56:23 +0100 ·

Hi,

It would be interesting to see if this bug could be squashed, now that flap-detection is in the game. But I haven't seen Henrik on this list for a good time now - he's active on the developer-list, tho - so I'm crossposting it there.

Regards,

Carl Melgaard


list Ryan Novosielski · Wed, 11 Jan 2012 10:30:19 -0500 ·


I've seen him post in the last week, so I know he does read this list
periodically at least.


- -- - ---- _  _ _  _ ___  _  _  _


|Y#| |  | |\/| |  \ |\ |  | |Ryan Novosielski - Sr. Systems Programmer
|$&| |__| |  | |__/ | \| _| |user-ae4522577e16@xymon.invalid - 973/972.0922 (2-0922)
\__/ Univ. of Med. and Dent.|IST/EI-Academic Svcs. - ADMC 450, Newark

list David W David Gore · Wed, 11 Jan 2012 14:53:23 -0500 ·

Since it has been argued that it is not exactly a bug, I would only humbly request that the current behavior not be changed but enhanced for those who want it to work differently. If an alert has been alarming for x time and then goes red, do you want to wait even longer to be alerted? Yellow time + red time, or yellow time and now it's red so alert, provided the yellow time exceeds the red threshold.


~David


list Josh Luthman · Wed, 11 Jan 2012 14:55:57 -0500 ·

I think we need a new argument for this new condition, something like
DURATIONWHILERED



Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX



list Elizabeth Schwartz · Wed, 11 Jan 2012 15:03:33 -0500 ·

If an alert's been yellow for a while and then goes red I do want to
be alerted - but I only want the tier1 person to be alerted.

The current behavior of immediately paging all the way up the food
chain to the tier4 people, the minute it goes red,  seems wrong to me
- and it is REALLY upsetting our tier4 people who are getting woken up
at 3am for stuff the tier1 person can handle.

(This is happening for us most often with disk space. People are not
super-fast at cleaning up disk space. But I'm waking up managers for
disks that have hit 90% full and that's just not cool)

If other people like the behavior, making it a knob we can turn is
fine. Just something I can do to keep from waking the whole crew up.


On Wed, Jan 11, 2012 at 2:55 PM, Josh Luthman

▸ quoted from Josh Luthman

<user-4c45a83f15cb@xymon.invalid> wrote:

I think we need a new argument for this new condition, something like
DURATIONWHILERED

Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX


On Wed, Jan 11, 2012 at 2:53 PM, Gore, David W (David)
<user-368fd67cc6bd@xymon.invalid> wrote:

Since it has been argued that it is not exactly a bug I would only humbly
request that the current behavior is not changed but enhanced for those who
want it to work differently.   If an alert has been alarming for x time and
then goes red do you want to wait even longer to be alerted.  Yellow time +
red time or yellow time and now its red so alert, provided the yellow time
exceeds the red threshold.


~David


From: xymon-bounces at xymon.com [mailto:xymon-bounces at xymon.com] On Behalf Of
Carl Melgaard
Sent: Wednesday, January 11, 2012 04:56
To: 'xymon at xymon.com'
Cc: 'user-834d44be5e50@xymon.invalid'

Subject: Re: [Xymon] Yellow->red escalation, bug or feature?


Hi,


It would be interesting to see if this bug could be squashed, now that
flap-detection is in the game. But I haven’t seen Henrik on this list for a
good time now – he’s active on the developer-list, tho – so I’m crossposting
it there.


Regards,


Carl Melgaard


Fra: xymon-bounces at xymon.com [mailto:xymon-bounces at xymon.com] På vegne af
SebA
Sendt: 10. januar 2012 12:24
Til: Xymon at xymon.com
Emne: Re: [Xymon] Yellow->red escalation, bug or feature?


I agree that this is not a new issue.  I have discussed this before
(http://lists.xymon.com/archive/2009-January/023201.html (Henrik's reply:
http://lists.xymon.com/oldarchive/2009/02/msg00133.html) and
http://lists.xymon.com/archive/2008-September/020998.html).


But now that we have flap detection, I'm not sure that Henrik's listed
problem with changing it is really an issue.  So I hope it can be changed!


BTW, The oldarchive is better for following threads (provided they don't
cross month boundaries):

http://lists.xymon.com/oldarchive/2008/09/msg00057.html

Compare with the previous link.  However, the new archive keeps
attachments.  It would be nice if the functionality of both archives were
merged...

Kind regards,

SebA


From: xymon-bounces at xymon.com [mailto:xymon-bounces at xymon.com] On Behalf Of
Ryan Skadberg
Sent: 09 January 2012 20:23
To: Xymon at xymon.com
Subject: Re: [Xymon] Yellow->red escalation, bug or feature?

I've seen this exact same issue going all the way back to hobbit, so this is
not a new issue with 4.3.  I would love to see it fixed though, as it's very
annoying to get paged when you are second or third on call and everyone gets
notified on the first red.


Skadz


On Mon, Jan 9, 2012 at 2:12 PM, Elizabeth Schwartz
<user-c61747246f66@xymon.invalid> wrote:

You're saying yellow for an hour and red for a few seconds triggers
like it was red for an hour?

I note that the previous example was for a custom test but I also have
seen this for the disk test:
(set to email  every 8 hours when yellow)


Sat Dec 24 10:53:27 2011        red     0:49:09
Sun Dec 18 03:01:51 2011        yellow  6 days 7:51:36

Thu Dec 22 17:54:39 2011 jumpstart.example.com.disk (10.100.4.33)
user-e49400df80ec@xymon.invalid[139] 1324594479 100
Fri Dec 23 01:54:40 2011 jumpstart.example.com.disk (10.100.4.33)
user-e49400df80ec@xymon.invalid[139] 1324623280 100
Fri Dec 23 09:54:47 2011 jumpstart.example.com.disk (10.100.4.33)
user-e49400df80ec@xymon.invalid[139] 1324652087 100
Sat Dec 24 10:54:27 2011 jumpstart.example.com.disk (10.100.4.33)
user-e49400df80ec@xymon.invalid[139] 1324742067 100
Sat Dec 24 10:54:27 2011 jumpstart.example.com.disk (10.100.4.33)
alert1[149] 1324742067 100
Sat Dec 24 10:54:27 2011 jumpstart.example.com.disk (10.100.4.33)
alert2[152] 1324742067 100
Sat Dec 24 10:54:27 2011 jumpstart.example.com.disk (10.100.4.33)
alert3[153] 1324742067 100
Sat Dec 24 10:54:27 2011 jumpstart.example.com.disk (10.100.4.33)
alert4[154] 1324742067 100
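
The history and notification log above can be checked with a little arithmetic. This is only a sketch: the timestamps are copied from the log, the escalation delays (10/20/40/60 minutes) are the ones described in the first post of the thread, and the two "clocks" are illustrative names, not anything in the Xymon code.

```python
from datetime import datetime, timedelta

# Timestamps copied from the status history and notification log above.
yellow_start = datetime(2011, 12, 18, 3, 1, 51)
red_start = datetime(2011, 12, 24, 10, 53, 27)
alert_time = datetime(2011, 12, 24, 10, 54, 27)

# Escalation delays in minutes, as described in the first post of the thread.
ESCALATION = {"alert1": 10, "alert2": 20, "alert3": 40, "alert4": 60}


def paged(duration):
    """Return the recipients whose delay has been exceeded by `duration`."""
    return [
        name
        for name, minutes in ESCALATION.items()
        if duration > timedelta(minutes=minutes)
    ]


# One clock for yellow and red together (what the log suggests is happening).
combined_clock = alert_time - yellow_start
# A clock that only starts when the status turns red (what was expected).
red_only_clock = alert_time - red_start

print("combined:", combined_clock, paged(combined_clock))
print("red only:", red_only_clock, paged(red_only_clock))
```

With the combined clock the event is already more than six days old when it turns red, so every escalation level fires at once. With a red-only clock it is one minute old and nobody on the escalation list would have been paged yet.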

list Mark Hinkle · Wed, 11 Jan 2012 12:16:24 -0800 ·

I think there's a counter that isn't reset.

Just guessing, but I would say you are close. Seems more like there is a counter missing. As mentioned in the old discussion included in a previous email, there is a single alert duration clock when there really needs to be both yellow and red clocks. Alert state issue again, maybe? See my comments at the bottom about another long-standing "lack of alert state" issue.

One possible non-pretty, non-scalable work-around for your issue would be to create a "red" test, i.e. diskred, that only has red-level thresholds and alerts config, and take the red alerts config off of the non-red test (but leave the red threshold). This would give you the correct red duration for your red-level paging alerts. You could use bb-hosts tricks like NOPROPRED, etc. to not show this "red" test on the web pages if you didn't want to. The non-red test would still go yellow and red so you would see it on the web, it just wouldn't be doing the red paging. Like I said, not pretty, but possibly better than the false positives you are getting. Possibly.

If the powers-that-be are willing to open the question of "alert state", then please, please also look into the long standing recovery message issue. Specifically, if you are emailing on yellow and paging on red, a test that goes green->yellow->red->yellow->green will result in a red page but only an email recovery. See http://lists.xymon.com/archive/2008-July/020107.html and http://lists.xymon.com/archive/2008-July/020152.html. Apologies if this seems like a thread hijack, that is not the intent at all, but rather these issues seem very closely related with respect to maintaining alert state and to what degree.

--
Mark L. Hinkle
user-9816e24cee8c@xymon.invalid

list Henrik Størner · Wed, 11 Jan 2012 21:28:01 +0100 ·

▸ quoted from Carl Melgaard

On 11-01-2012 10:56, Carl Melgaard wrote:

It would be interesting to see if this bug could be squashed, now that
flap-detection is in the game. But I haven’t seen Henrik on this list
for a good time now – he’s active on the developer-list, tho – so I’m
crossposting it there.

As I wrote to a couple of others, I would appreciate it if you do not crosspost to the developer list - it really isn't on-topic. Mail me directly if you feel there is some discussion on the mailing list that I may have missed.


Regards,
Henrik

list Henrik Størner · Wed, 11 Jan 2012 22:39:02 +0100 ·

▸ quoted from Elizabeth Schwartz

On 11-01-2012 20:53, Gore, David W (David) wrote:

Since it has been argued that it is not exactly a bug I would only
humbly request that the current behavior is not changed but enhanced for
those who want it to work differently.   If an alert has been alarming
for x time and then goes red do you want to wait even longer to be
alerted.  Yellow time + red time or yellow time and now its red so
alert, provided the yellow time exceeds the red threshold.

If I understand it correctly, then the unhappiness with the current setup is that the DURATION setting in alerts.cfg counts both yellow and red time. So when a status goes yellow, stays there for a few hours time before going red - then a rule such as

MAIL user-cb34797ee457@xymon.invalid COLOR=RED DURATION>3h

will trigger immediately.

Some would argue that if you haven't fixed a problem before it goes critical, then your CIO *should* be notified.

The other school of thought argues that this rule means the CIO only wants to be informed when something has been really hosed for at least three hours. So the yellow warning-time shouldn't count when evaluating the DURATION setting for that rule - only the critical time counts.

Is that a correct understanding of the arguments here ?

Let's say I implement the 3-hour delay before sending an escalation notice. What should happen if the status is yellow for two hours, then goes red for 2h50m, dips back into yellow for 10 minutes and then goes back to red ? Should the 2h50m count after the status was yellow for a while? Or does a 10 minute yellow status completely reset the duration counter for the almost-3-hours red status?

I'm not trying to be too pedantic here, but it is the sort of things that do happen. So let's discuss how it can best be handled.
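
To make the flapping question concrete, here is a small sketch of that timeline at the moment the status returns to red. The three policy functions are hypothetical names for the options under discussion (the current combined clock, counting all red time, counting only unbroken red time); none of them is taken from the alert module.

```python
from datetime import timedelta

THRESHOLD = timedelta(hours=3)  # the DURATION>3h rule from the example

# (colour, time spent in it), oldest first, up to the instant it goes red again.
HISTORY = [
    ("yellow", timedelta(hours=2)),
    ("red", timedelta(hours=2, minutes=50)),
    ("yellow", timedelta(minutes=10)),
]


def combined(history):
    """Every non-green period counts (the single-clock behaviour)."""
    return sum((length for _, length in history), timedelta())


def red_total(history):
    """All red time counts; a dip into yellow pauses the clock."""
    return sum((length for colour, length in history if colour == "red"), timedelta())


def red_unbroken(history):
    """Only the current unbroken run of red counts; yellow resets the clock."""
    total = timedelta()
    for colour, length in reversed(history):
        if colour != "red":
            break
        total += length
    return total


def wait_before_alert(duration):
    """How much longer the status must stay red before the rule fires."""
    return max(THRESHOLD - duration, timedelta())


for policy in (combined, red_total, red_unbroken):
    print(policy.__name__, wait_before_alert(policy(HISTORY)))
```

Under these assumptions the combined clock fires immediately, counting all red time fires after another 10 minutes, and resetting on yellow fires only after a further full 3 hours of red.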

I think Josh is right that changing this will require some sort of additional configuration setting to indicate that "this duration value applies to the time it's been red only". It's for curbing escalation notices. And therefore it is obviously only an issue for those statuses that can be yellow - not those that can only be red or green.

It's been quite some time since I last dug into the alert-module code, so I cannot say how much effort it will take to add this. Right now I am not sure if the alert module has enough information about an alert to be able to implement it.

Meanwhile, may I draw your attention to the "SCRIPT" way of sending alerts. It's not an ideal solution, but I think it's a usable work-around for this problem:

The alert script gets triggered just the same as your MAIL alerts do. But your script can query xymond to see when the status last changed (to red, presumably) - it's the "lastchange" field stored for a status. So you could put something like this in your alert script:

#!/bin/sh

# This script only handles red
if test "$BBCOLORLEVEL" != "red"
then
    exit 0
fi

# Ask xymond when the status last changed colour (epoch seconds)
REDSTART=`xymon 127.0.0.1 "xymondlog $BBHOSTNAME.$BBSVCNAME fields=lastchange" | head -n 1`
NOW=`date +%s`
REDDURATION=`expr "$NOW" - "$REDSTART"`
if test "$REDDURATION" -lt 10800 # 3-hour (10800 secs) delay
then
    exit 0
fi

... send the alert ...

(the "head -n 1" is needed, because xymondlog also sends you the full status message. On the other hand, that might be useful when generating the alert message).

Regards,
Henrik

list Henrik Størner · Wed, 11 Jan 2012 22:47:19 +0100 ·

▸ quoted from Mark Hinkle

On 11-01-2012 21:16, Mark Hinkle wrote:

I think there's a counter that isn't reset.

Just guessing, but I would say you are close. Seems more like there is a
counter missing. As mentioned in the old discussion included in a
previous email, there is a single alert duration clock when there really
needs to be both yellow and red clocks. Alert state issue again, maybe?

[snip]

If the powers-that-be are willing to open the question of "alert state",
then please, please also look into the long standing recovery message
issue. Specifically, if you are emailing on yellow and paging on red, a
test that goes green->yellow->red->yellow->green will result in a red
page but only an email recovery. See
http://lists.xymon.com/archive/2008-July/020107.html and
http://lists.xymon.com/archive/2008-July/020152.html. Apologies if this
seems like a thread hijack, that is not the intent at all, but rather
these issues seem very closely related with respect to maintaining alert
state and to what degree.

You're probably quite correct that there is some state information in 
the alert module that does not keep track of "enough" state to handle 
both of these feature requests. So it would make sense to look at them 
at the same time.


Regards,
Henrik

list Sebastian Auriol · Thu, 12 Jan 2012 12:07:14 -0000 ·

▸ quoted from Henrik Størner

xymon-bounces at xymon.com wrote:

On 11-01-2012 20:53, Gore, David W (David) wrote:

Since it has been argued that it is not exactly a bug I would only
humbly request that the current behavior is not changed but enhanced
for those who want it to work differently.   If an alert has been
alarming for x time and then goes red do you want to wait even
longer to be alerted.  Yellow time + red time or yellow time and now
its red so alert, provided the yellow time exceeds the red threshold.

Yes, I do want to wait even longer.  I want to wait for the duration that
was specified in the alert rule, for the colour that was specified in the
alert rule.  And I think this is how one would expect xymond_alert to behave
given the syntax of the rule, with no prior knowledge of Xymon (and not
having read the documentation).

▸ quoted from Henrik Størner

If I understand it correctly, then the unhappiness with the current
setup is that the DURATION setting in alerts.cfg counts both yellow and
red time. So when a status goes yellow, stays there for a few hours time
before going red - then a rule such as

    MAIL user-cb34797ee457@xymon.invalid COLOR=RED DURATION>3h

will trigger immediately.


Some would argue that if you haven't fixed a problem before it goes
critical, then your CIO *should* be notified.

Sounds like, for people who want that behaviour, they need a (yet to be
implemented) WARNINGDURATION> rule.  This implies that tier1 support
probably get alerts on yellows, which I expect could result in a lot of
false positive alerts for them!  But if that's how they want it, that's
their affair.

▸ quoted from Henrik Størner

The other school of thought argues that this rule means the CIO only
wants to be informed when something has been really hosed for at least
three hours. So the yellow warning-time shouldn't count when
evaluating the DURATION setting for that rule - only the critical
time counts.


Is that a correct understanding of the arguments here ?

Yes.

▸ quoted from Henrik Størner

Let's say I implement the 3-hour delay before sending an escalation
notice. What should happen if the status is yellow for two hours, then
goes red for 2h50m, dips back into yellow for 10 minutes and then goes
back to red ? Should the 2h50m count after the status was yellow for a
while? Or does a 10 minute yellow status completely reset the duration
counter for the almost-3-hours red status?

I already responded to this issue in my old post here:
http://lists.xymon.com/oldarchive/2009/02/msg00145.html, but I'll quote the
relevant part:

"...since this test can flap between yellow and red and I consider
yellow to be a sufficient degree of recovery that I don't want another alert
as soon as it goes red again. If we look at disk in particular though,
surely if it is flapping between yellow and red the problem isn't too
serious. If one does want an alert for this, one can eliminate the DURATION
rule. If one does not, the DURATION rule should be a way of preventing
getting alerts for the flapping behaviour. This is what I've always
considered the use of the DURATION rule (although I was wrong given the way
it is currently working)."

▸ quoted from Henrik Størner

I'm not trying to be too pedantic here, but it is the sort of things
that do happen. So let's discuss how it can best be handled.


I think Josh is right that changing this will require some sort of
additional configuration setting to indicate that "this duration value
applies to the time it's been red only". It's for curbing escalation
notices. And therefore it is obviously only an issue for those statuses
that can be yellow - not those that can only be red or green.

Continuing my quote from my old post:
"Perhaps a more flexible and useful solution, while
still remaining easy to use, is to incorporate the change you suggest
[which was (quote Henrik): "What would probably be best was for Xymon to
calculate the duration based on the COLOR-settings defined for the alert"]
with a RECOVERY= rule in the alerts. So each rule can specify what colour
constitutes a recovery. This means that some tests can have yellow while
others have green, allowing for different alerting behaviour for flapping
depending on the test, and it also allows those who get notified of
recoveries to have this information when they want. :)"

<snip>

Regards,
Henrik

And, at the risk of dirtying this thread, a closely related issue is my
original post in the same thread:
http://lists.xymon.com/oldarchive/2009/01/msg00364.html
Quote:
"It seems the combination of TIME=W:0845:2355 and DURATION>15 in
hobbit-alerts.cfg means the earliest an alert can be sent out is 9 am.  Is
this what you would expect?  I would have expected these two rules to mean
the test should be in an alarm colour for more than 15 minutes and be
between the times of 08:45 and 23:55, weekdays.  Instead it seems to be
relating the DURATION with the time such that the DURATION only applies
_during_ the TIME."

So, if the CIO has a DURATION > 3 hours for a particular alert and a global
TIME=W:0845:2355 (to retain their beauty sleep) he (or she) will only get
the alert after 11:45 am.  Might not be what they want.
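
The arithmetic behind those two times can be sketched as follows. It assumes the behaviour is as reported in the quoted post (the DURATION clock effectively starts when the TIME window opens); the function names, the example date and the 03:00 start of the problem are made up for illustration.

```python
from datetime import datetime, timedelta

# TIME=W:0845:2355 on an arbitrary weekday; the date itself is illustrative.
window_open = datetime(2012, 1, 12, 8, 45)
# Hypothetical: the status went into an alarm colour well before the window.
problem_start = datetime(2012, 1, 12, 3, 0)


def earliest_as_reported(duration):
    """DURATION only counts while inside the TIME window (reported behaviour)."""
    return window_open + duration


def earliest_as_expected(duration):
    """DURATION counts from the start of the problem; TIME only gates delivery."""
    return max(window_open, problem_start + duration)


for label, duration in (("DURATION>15", timedelta(minutes=15)),
                        ("DURATION>3h", timedelta(hours=3))):
    print(label,
          "reported:", earliest_as_reported(duration).strftime("%H:%M"),
          "expected:", earliest_as_expected(duration).strftime("%H:%M"))
```

For a problem that started at 03:00, the reported behaviour gives 09:00 and 11:45, while the interpretation I expected would deliver both alerts at 08:45, as soon as the window opens.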

Kind regards,

SebA

list Elizabeth Schwartz · Sat, 14 Jan 2012 17:49:47 -0500 ·

Exactly. If something is yellow, by definition, we've said it's NOT critical.

Our most frequent example is disk space. A disk which fills up 100%
will cause a critical disruption to production. On many disks we go
yellow at 80%, to give ourselves plenty of warning, and red at 95%.
Now when a disk goes red, I do want someone to look at it immediately,
but it doesn't really matter that it's been yellow for a long time. In
fact, the LONGER it's been yellow the LESS urgent it is, because it's
not filling up very quickly. Our senior team does NOT want to be paged
for this!


If I wanted something to page when it's been yellow for three hours,
I've already got the capability of paging after it's been yellow for
three hours.

When something turns red, I want to follow the rules and timing for reds.

Let's say I implement the 3-hour delay before sending an escalation notice. What should happen if the status is yellow for two hours, then goes red for 2h50m, dips back into yellow for 10 minutes and then goes back to red ? Should the 2h50m count after the status was yellow for a while? Or does a 10 minute yellow status completely reset the duration counter for the almost-3-hours red status?

This case doesn't make a lot of sense to me. If something's been red
for 2h50, I've probably already escalated it up to the hilt. The above
scenario is only a problem in the case where a red alert is set to be
ignored for the first three hours. I don't think that's a common
scenario. Anything we could ignore for 3 hours is probably a yellow.

Having to write a custom test for every single red in our environment
doesn't seem like a good alternative, especially for the built-in
tests.

list Elizabeth Schwartz · Mon, 6 Feb 2012 10:07:04 -0500 ·

▸ quoted from Elizabeth Schwartz

Let's say I implement the 3-hour delay before sending an escalation notice. What should happen
if the status is yellow for two hours, then goes red for 2h50m, dips back into yellow for 10 minutes
and then goes back to red ? Should the 2h50m count after the status was yellow for a while? Or
does a 10 minute yellow status completely reset the duration counter for the almost-3-hours red status?

Thinking about this again (since xymon woke everyone up again this
morning) I'm liking the idea of a RECOVERY= flag.

Seems like there are two kinds of alerts, those where yellow->
red->yellow means things are not so bad (like disk space, which is the
one I keep hitting) and those where yellow->red->yellow means you are
looking at a larger performance problem (like, say, CPU load or other
performance metrics). Being able to treat those two situations
separately would be the biggest win.

thanks Betsy
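
One way to picture the RECOVERY= idea for the two kinds of tests is sketched below. RECOVERY= does not exist in Xymon today; the Rule class, its fields and resets_clock() are hypothetical names used only to illustrate a per-rule choice of which colour counts as a recovery.

```python
from dataclasses import dataclass

# Colours ordered by severity, lowest first.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


@dataclass
class Rule:
    test: str
    color: str     # colour the rule alerts on
    recovery: str  # hypothetical RECOVERY= value for this rule


def resets_clock(rule, new_colour):
    """True if changing to new_colour counts as a recovery for this rule."""
    return SEVERITY[new_colour] <= SEVERITY[rule.recovery]


# Disk space: dropping back to yellow is good enough, so restart the clock.
disk = Rule(test="disk", color="red", recovery="yellow")
# CPU load: yellow still means trouble, so only green restarts the clock.
cpu = Rule(test="cpu", color="red", recovery="green")

for rule in (disk, cpu):
    print(rule.test, "red->yellow resets duration:", resets_clock(rule, "yellow"))
```

With rules like these, a disk test that flaps yellow->red->yellow would start its red duration from zero each time, while a CPU test would keep counting until it is green again.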
