Linux load question
list John Rothlisberger
I have a Linux server that is alerting on high load even though the load average is lower than my threshold. My question: why is it going red?
analysis.cfg
HOST=serverA
LOAD 89.0 90.0
Here are the top results. I would expect the check to be evaluated against the load average of 21.75, which is far lower than the thresholds.
top - 07:58:55 up 17 days, 19:31, 20 users, load average: 21.75, 25.16, 25.32
...
But there is a single process using a lot of CPU on one of the server's many cores. Is this factoring into the alert? If so, why is it not documented?
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
121978 user1+ 20 0 46.839g 0.040t 50388 S 2774 17.2 18504:34 python
Thanks,
John
Upcoming PTO:
John Rothlisberger
IT Strategy, Infrastructure & Security - Technology Growth Platform
TGP for Business Process Outsourcing
Accenture
XXX.XXX.XXXX office
This message is for the designated recipient only and may contain privileged, proprietary, or otherwise confidential information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the e-mail by you is prohibited. Where allowed by local law, electronic communications with Accenture and its affiliates, including e-mail and instant messaging (including content), may be scanned by our systems for the purposes of information security and assessment of internal compliance with Accenture policy.
www.accenture.com
list Galen Johnson
Any chance you have another entry that could be overriding that setting? Or maybe it's not matching the entry and is falling back to the default?
=G=
On Thu, Mar 8, 2018 at 8:24 AM, Rothlisberger, John R. <user-7adce57665bb@xymon.invalid> wrote:
list Larry Bonham
John,
You can test that with:
./bin/xymoncmd xymond_alert --test serverA load
and confirm that the alerts.cfg line you think is handling it really is the one being used.
Also, the LOAD check looks at the 5-minute load average, not the 1-minute one. So in your example it is evaluating 25.16, which is still well below the thresholds you configured.
Larry
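To make Larry's point concrete, here is a minimal Python sketch (not Xymon's actual code) of the comparison he describes: take the load-average line, keep the 5-minute value, and compare it with the two numbers on the LOAD 89.0 90.0 line, read as the warning (yellow) and panic (red) levels. The function names are hypothetical, and treating a threshold as crossed only when the value strictly exceeds it is an assumption.

```python
import re
from typing import NamedTuple


class LoadAverages(NamedTuple):
    one: float      # 1-minute load average
    five: float     # 5-minute load average
    fifteen: float  # 15-minute load average


def parse_load_averages(line: str) -> LoadAverages:
    """Extract the three load averages from a top or uptime header line."""
    match = re.search(r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", line)
    if match is None:
        raise ValueError(f"no load average found in: {line!r}")
    return LoadAverages(*(float(v) for v in match.groups()))


def load_color(load: LoadAverages, warn: float, panic: float) -> str:
    """Color for a 'LOAD <warn> <panic>' rule, judged on the 5-minute value.

    Assumption: a threshold counts as crossed only when the value exceeds it.
    """
    value = load.five  # per Larry, the check uses the 5-minute average
    if value > panic:
        return "red"
    if value > warn:
        return "yellow"
    return "green"


if __name__ == "__main__":
    header = ("top - 07:58:55 up 17 days, 19:31, 20 users, "
              "load average: 21.75, 25.16, 25.32")
    load = parse_load_averages(header)
    print(load.five)                                # 25.16
    print(load_color(load, warn=89.0, panic=90.0))  # green
```

With the numbers from this thread the sketch stays green, so under these assumptions the red status would have to come from some rule or setting other than the one shown, which is what Galen's and Larry's suggestions are meant to rule in or out.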
From: Xymon [mailto:xymon-bounces at xymon.com] On Behalf Of Galen Johnson
Sent: Thursday, March 8, 2018 9:12 AM
To: Rothlisberger, John R.
Cc: xymon >> xymon at xymon.com
Subject: Re: [Xymon] Linux load question
list John Rothlisberger
The defaults are similar (not below 25, anyway). So, is it because there are multiple CPUs? Is there some setting I could use? Or is there a way to find the exact value/setting that Xymon is using to turn this red? The CPU graph shows that the total CPU % doesn't really go above 30%.
Thanks,
John
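On the multiple-CPU question: the %CPU column in the top output above is per process and, in top's default Irix mode, is measured against a single core, so the 2774 shown for the python process works out to roughly 27.7 cores' worth of CPU rather than one core. That figure only becomes comparable to a machine-wide "total cpu %" graph once it is divided by the core count. A minimal Python sketch of that arithmetic follows; the function names are hypothetical, and the core count is an input because the thread does not say how many cores serverA has (the 64 below is purely illustrative).

```python
def cores_busy(process_cpu_percent: float) -> float:
    """Convert top's per-process %CPU (Irix mode: 100% = one full core) to cores."""
    return process_cpu_percent / 100.0


def share_of_machine(process_cpu_percent: float, n_cores: int) -> float:
    """Fraction of total machine capacity the process uses on n_cores cores."""
    if n_cores <= 0:
        raise ValueError("n_cores must be positive")
    return cores_busy(process_cpu_percent) / n_cores


if __name__ == "__main__":
    print(cores_busy(2774))                       # 27.74
    # Hypothetical core count, for illustration only.
    print(f"{share_of_machine(2774, 64):.1%}")    # 43.3%
```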