Xymon Mailing List Archive

Linux load question

4 messages in this thread

list John Rothlisberger · Thu, 8 Mar 2018 13:24:52 +0000 ·
I have a Linux server that is alerting on high load, but the load average is lower than my threshold.  My question: why is it going red?

analysis.cfg:
HOST=serverA
        LOAD    89.0 90.0

Here are the top results. I would expect the alert to be driven by the load average of 21.75, which is far lower than the thresholds.
top - 07:58:55 up 17 days, 19:31, 20 users,  load average: 21.75, 25.16, 25.32
...

However, there is a single process using a lot of CPU on one of the multiple cores. Does this factor into the alert?  If so, why is it not documented?
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
121978 user1+    20   0 46.839g 0.040t  50388 S  2774 17.2  18504:34 python

Thanks,
John
Upcoming PTO:
John Rothlisberger
IT Strategy, Infrastructure & Security - Technology Growth Platform
TGP for Business Process Outsourcing
Accenture
XXX.XXX.XXXX office


This message is for the designated recipient only and may contain privileged, proprietary, or otherwise confidential information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the e-mail by you is prohibited. Where allowed by local law, electronic communications with Accenture and its affiliates, including e-mail and instant messaging (including content), may be scanned by our systems for the purposes of information security and assessment of internal compliance with Accenture policy.

www.accenture.com
list Galen Johnson · Thu, 8 Mar 2018 10:11:47 -0500 ·
Any chance you have another entry that could be overriding that setting?
Or maybe it's not matching the entry and is falling back to the default?

=G=

list Larry Bonham · Thu, 8 Mar 2018 15:57:26 +0000 ·
John,

You can test that with:

./bin/xymoncmd xymond_alert --test serverA load

And confirm that the alerts.cfg line you think is handling it really is.

Also, the LOAD check looks at the 5-minute load average, not the 1-minute one.  So in your example it is evaluating 25.16, which is still well below the thresholds you set.
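
To make the point about the 5-minute value concrete, here is a minimal Python sketch that pulls the three load averages out of the top header line quoted earlier and compares the middle one against the LOAD thresholds from analysis.cfg. The function names and the ">=" comparison are assumptions made for illustration; this is not Xymon's own code.

```python
import re


def parse_load_averages(line):
    """Return the (1-min, 5-min, 15-min) load averages from a top/uptime line."""
    match = re.search(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", line)
    if match is None:
        raise ValueError("no load averages found in: %r" % line)
    return tuple(float(value) for value in match.groups())


def load_color(load5, warn, panic):
    """Hypothetical threshold check; the exact operator Xymon uses is assumed."""
    if load5 >= panic:
        return "red"
    if load5 >= warn:
        return "yellow"
    return "green"


line = ("top - 07:58:55 up 17 days, 19:31, 20 users,  "
        "load average: 21.75, 25.16, 25.32")
one, five, fifteen = parse_load_averages(line)

# Thresholds taken from the "LOAD 89.0 90.0" line in the original post.
print(five, load_color(five, 89.0, 90.0))
```

With the values from this thread, the 5-minute figure (25.16) stays below both thresholds, which is why a red status points to a different rule or a different input being applied.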

Larry

CONFIDENTIALITY NOTICE:
This electronic mail message is intended exclusively for
recipient to which it is addressed. The contents of this message
and any attachments may contain confidential and privileged
information. Any unauthorized review, use, print, storage, copy,
disclosure or distribution is strictly prohibited. If you have
received this message in error, please advise the sender
immediately by replying to the message's sender and delete all
copies of this message and its attachments without disclosing
the contents to anyone, or using the contents for any purpose.
list John Rothlisberger · Fri, 9 Mar 2018 01:22:45 +0000 ·
Defaults are similar (not below 25 anyway).

So, is it because there are multiple CPUs?  Is there some setting that I could use?

Or, is there a way to find the exact value/setting that Xymon is using to change this to red?  The CPU graph shows that the total cpu % doesn’t really go above 30%.
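
As a rough aid for relating the numbers in this thread, the following Python sketch converts the %CPU figure that top reports for one process into a count of busy cores and into a share of the whole machine. In top's default mode, %CPU is relative to a single core, so a multithreaded process can exceed 100%. The core count used here (96) is purely hypothetical, since the thread never states how many cores serverA has.

```python
def busy_cores(top_cpu_percent):
    """Approximate number of cores kept busy by a process, from top's %CPU."""
    return top_cpu_percent / 100.0


def share_of_machine(top_cpu_percent, n_cores):
    """Percentage of total machine CPU capacity used by the process."""
    return top_cpu_percent / n_cores


N_CORES = 96  # hypothetical value; the real core count is not given in the thread

print(busy_cores(2774))
print(round(share_of_machine(2774, N_CORES), 1))
```

Under that assumption, a process showing 2774% in top occupies roughly 28 cores rather than one, and a load average in the low twenties can sit alongside a modest total CPU percentage on a machine with many cores.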

Thanks,
John