Xymon Mailing List Archive

Feature request - thresholds for CPU utilisation (not load average)

10 messages in this thread

list Buchan Milne · Thu, 28 Feb 2008 20:44:20 +0200 ·
Something I have been wondering about for a while is whether it would be possible to have thresholds on the CPU utilisation. While we have thresholds for load averages, in some cases these have to be relatively high (e.g. 2 to 4 times the number of CPUs) due to the impact of IO wait on load average (e.g. our SAN-attached NFS servers often have a load average of over 10, with a CPU utilisation of 50%, when reading over 10k blocks/sec). However, that makes it difficult to catch a process in a CPU race (as much less IO gets done, IO wait is low, and the load average is almost exactly 1 × the number of CPUs).

The CPU utilisation is already reported (in the vmstat data), which is how I know the above about our NFS servers (vmstat/vmstat1 graph).

This would also remove the complication of thresholds differing between servers with different numbers of CPUs, and maybe work better for Windows clients (which don't seem to have a concept of load average).

(I don't mean thresholds for load average should be removed ... I would love to have thresholds for both load average and CPU utilisation).
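
The point about CPU counts can be illustrated with a minimal Python sketch. It is not part of Xymon; the function name and the sample hosts are made up for illustration. It only shows that the same load average means something different on hosts with different numbers of CPUs, which is why a load threshold has to be tuned per server while a utilisation percentage does not.

```python
def load_per_cpu(load_average, cpu_count):
    """Return the load average divided by the number of CPUs."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


# Hypothetical hosts: identical load average, very different headroom.
for name, load, cpus in [("small", 4.0, 2), ("large", 4.0, 16)]:
    per_cpu = load_per_cpu(load, cpus)
    print(f"{name}: load {load} on {cpus} CPUs = {per_cpu:.2f} per CPU")
```

A load of 4.0 is twice the CPU count on the two-CPU host but only a quarter of it on the sixteen-CPU host, so a single load threshold cannot suit both.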

Regards,
Buchan
list Taylor Lewick · Thu, 28 Feb 2008 13:32:27 -0600 ·
Funny you brought this up just now, because today I noticed that the
Windows clients, both bbwin and bbnt, let you set alerts for CPU
utilization, but Big Brother and Hobbit only understand load average.
So I keep getting alerts saying load is very high when the CPUs are
only around 20-50%. A load of 20 on a Linux/Unix server would be very
high, but Windows boxes don't really have the load average concept,
just CPU utilization. So if you are monitoring utilization on Windows
clients, you have to change the load thresholds to something like
70 90 to avoid getting red pages.

list Tom Kauffman · Thu, 28 Feb 2008 14:33:20 -0500 ·
I'll second that.

I just found out we had a test system that has had an Oracle process using 99% of one CPU for the past (drumroll!) two months, and we didn't notice it!

Tom Kauffman
NIBCO, Inc
list Josh Luthman · Thu, 28 Feb 2008 14:57:21 -0500 ·
Thirdsies!
list Bill Richardson · Wed, 7 Dec 2011 16:13:22 +0000 ·
I see that Buchan asked for this a few years back. Has anyone done this? I would like to start alerting on %CPU, not LOAD. I would still like to graph LOAD and have that show up under Trends. The %CPU is already being graphed in Trends; it would be nice just to pull that over to the CPU column.

Here is the first request:
http://lists.xymon.com/archive/2008-February/017968.html

Thanks

Bill Richardson
list Henrik Størner · Wed, 07 Dec 2011 22:50:39 +0100 ·
In 4.3.x, add this to your analysis.cfg:

HOST=foo
    DS cpu vmstat.rrd:cpu_idl >=25 COLOR=green TEXT="CPU load normal"
    DS cpu vmstat.rrd:cpu_idl <25  COLOR=yellow TEXT="High CPU load"
    DS cpu vmstat.rrd:cpu_idl <10  COLOR=red TEXT="Critical CPU load"
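
What the three rules above are meant to express can be sketched in Python. This is an illustration only, not Xymon code: the names `RULES`, `SEVERITY` and `classify` are hypothetical, and the sketch assumes that when several rules match, the most severe colour is the one reported. How Xymon itself resolves overlapping DS rules is not stated in this thread.

```python
# Severity ranking used to pick the worst matching colour.
# ASSUMPTION: "most severe match wins"; the thread does not confirm
# how Xymon combines overlapping DS rules.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}

# (test, colour, text) triples mirroring the three DS lines,
# applied to the cpu_idl value (percent idle CPU).
RULES = [
    (lambda idle: idle >= 25, "green", "CPU load normal"),
    (lambda idle: idle < 25, "yellow", "High CPU load"),
    (lambda idle: idle < 10, "red", "Critical CPU load"),
]


def classify(cpu_idle):
    """Return (colour, text) for a given idle percentage."""
    matches = [(colour, text) for test, colour, text in RULES if test(cpu_idle)]
    # Pick the match whose colour ranks highest in SEVERITY.
    return max(matches, key=lambda match: SEVERITY[match[0]])


for idle in (60, 15, 5):
    colour, text = classify(idle)
    print(f"idle={idle}% -> {colour}: {text}")
```

Note that the thresholds are on idle time, so they read inverted compared with utilisation: below 25% idle means more than 75% busy, and below 10% idle means more than 90% busy.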


Regards,
Henrik
list Ralph Mitchell · Thu, 8 Dec 2011 08:32:59 -0500 ·
FYI: The column name is missing in the DS example in the docs:

Example: Flag "conn" status a yellow if responsetime exceeds 100 msec.
.br
        DS tcp.conn.rrd:sec >0.1 COLOR=yellow TEXT="Response time &V exceeds &U seconds"


Ralph Mitchell
list Henrik Størner · Thu, 08 Dec 2011 14:41:46 +0100 ·
quoted from Ralph Mitchell
On Thu, 8 Dec 2011 08:32:59 -0500, Ralph Mitchell <user-00a5e44c48c0@xymon.invalid> wrote:
FYI: The column name is missing in the DS example in the docs:

Thanks - fixed.


Regards,
Henrik
list Ralph Mitchell · Thu, 8 Dec 2011 08:51:37 -0500 ·
Also, when I put in

     TEXT="cpu load....."

the opening double-quote shows in the display.  Putting the double-quote
before the TEXT:

     "TEXT=cpu load......"

makes it come out OK.  I don't know if that's a documentation issue or
something in the code that processes analysis.cfg.

Thanks!

Ralph Mitchell
list Bill Richardson · Thu, 8 Dec 2011 14:55:24 +0000 ·
Great info... Having the ability to alert on the rrd data is great!

Thank you!