Xymon Mailing List Archive

Improving memory monitoring

5 messages in this thread

list Steve Hill · Tue, 14 Apr 2015 12:56:07 +0100 ·
I'm working on improving my Xymon configuration to reduce the number of false alerts that we get.  In particular, memory monitoring is a bit of a problem so I'm hoping someone will be able to offer some advice.

At the moment, Xymon is set up with something like:

MEMPHYS 100 101
MEMSWAP 20 40
MEMACT 95 97

I pretty much don't care about MEMPHYS.  The problem with MEMSWAP and MEMACT is that they work independently of each other - i.e. the above will give me an alert if > 97% of the RAM is used OR > 40% of swap is used.

However, this results in warnings for systems that have a lot of idle data in memory.  The Linux kernel will page out idle data (increasing swap usage and reducing RAM usage) and use that space for buffers/caches, and this is a very sensible strategy.  Unfortunately, then Xymon comes along and notices that there's lots of swap in use and throws an alert, even though there's plenty of RAM free.

Basically, I don't care that a machine is 4GB into swap if it has 5GB of free RAM - that isn't a problem; it just means there's quite a lot of idle data that the kernel has decided can be paged out.  I do care if it's 4GB into swap and only has 0.5GB of free RAM, since this would indicate that it's actually short of memory.

What I really need is to warn if > x% of the RAM is used AND > y% of swap is used - is there a way to do that?
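
To make the requested behaviour concrete, here is a minimal Python sketch of the combined check. It is not something Xymon provides: the function names are hypothetical, the default thresholds simply reuse the MEMACT and MEMSWAP numbers quoted above, and the 16GB RAM / 8GB swap totals in the usage lines are assumed purely for illustration.

```python
def used_pct(total, free):
    """Percentage of a resource in use, given its total and free amounts."""
    return 100.0 * (total - free) / total if total else 0.0


def memory_color(ram_used_pct, swap_used_pct,
                 ram_warn=95.0, ram_crit=97.0,
                 swap_warn=20.0, swap_crit=40.0):
    """Alert only when BOTH RAM and swap exceed their thresholds.

    Hypothetical helper: the defaults mirror the MEMACT (95/97) and
    MEMSWAP (20/40) values from the message, combined with AND
    instead of the independent (OR) behaviour.
    """
    if ram_used_pct > ram_crit and swap_used_pct > swap_crit:
        return "red"
    if ram_used_pct > ram_warn and swap_used_pct > swap_warn:
        return "yellow"
    return "green"


# Assumed machine: 16 GB RAM, 8 GB swap, 4 GB of swap in use.
swap = used_pct(total=8.0, free=4.0)
plenty_free = memory_color(used_pct(16.0, 5.0), swap)   # 5 GB RAM free
short_of_ram = memory_color(used_pct(16.0, 0.5), swap)  # 0.5 GB RAM free
```

With these assumed totals, the machine with 5GB of free RAM stays green despite being 4GB into swap, while the one with 0.5GB free goes yellow. The RAM figure should be the "actual" usage, excluding buffers and caches, as MEMACT measures.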

Thanks.

-- 
  - Steve Hill
    Technical Director
    Opendium Limited     http://www.opendium.com

Direct contacts:
    Instant messager: xmpp:user-8cda31fbea61@xymon.invalid
    Email:            user-8cda31fbea61@xymon.invalid
    Phone:            sip:user-8cda31fbea61@xymon.invalid

Sales / enquiries contacts:
    Email:            user-2675bcaab7d4@xymon.invalid
    Phone:            +XX-XXXX-XXXXXX / sip:user-2675bcaab7d4@xymon.invalid

Support contacts:
    Email:            user-126f03e2871f@xymon.invalid
    Phone:            +XX-XXXX-XXXXXX / sip:user-126f03e2871f@xymon.invalid
list Mark Felder · Tue, 14 Apr 2015 08:37:36 -0500 ·
I agree -- this would have been very valuable to me in the past. I'm not
currently aware of a way to do this, but I've recently discovered so
many Xymon features I was unaware of that maybe it's possible...
list Mike Burger · Tue, 14 Apr 2015 10:11:32 -0400 ·
Forgot to reply all.
On 2015-04-14 10:00 am, Mike Burger wrote:
I'll say that I've never run into this...I've never had a system swap
memory out to disk unless active memory was utilized at a high
percentage...in either AIX or Linux.

In AIX, there is some sort of algorithm in place where, if a process'
memory has been swapped out and then swapped back in, the memory
manager holds onto the paging space until either something else needs
paging space or the previously swapped out process ends, but I don't
think I've ever seen a situation in Linux where idle memory pages were
swapped to disk and physical/active memory had some large percentage
free.

Now, on the other side of this, to take a stab at the question, I'd
wager that, at present, you'd need to script such a test/alert, but I
would agree that it would be useful to be able to set an "alarm if
this or this" or an "alarm if this and this" type scenario. At
present, the only tests I can think of that allow this "out of the
box" are the process monitors, where you can set minimum and maximum
thresholds.
-- 
Mike Burger
http://www.bubbanfriends.org

"It's always suicide-mission this, save-the-planet that. No one ever 
just stops by to say 'hi' anymore." --Colonel Jack O'Neill, SG1
list Steve Hill · Tue, 14 Apr 2015 15:56:57 +0100 ·
On 14/04/15 15:11, Mike Burger wrote:
> I'll say that I've never run into this...I've never had a system swap
> memory out to disk unless active memory was utilized at a high
> percentage...in either AIX or Linux.

It does happen spontaneously from time to time for me. It may be down to the type of workload these machines run - they tend to have a fair amount of idle data in memory, and the kernel quite rightly decides that using that RAM for caches/buffers would be a better use.

Also, in situations where something _has_ used up a lot of RAM and therefore pushed stuff out to swap, Xymon continues to warn of high swap usage after that process has ended because the kernel obviously won't bother paging stuff back into the newly emptied RAM until it needs to.

> Now, on the other side of this, to take a stab at the question, I'd
> wager that, at present, you'd need to script such a test/alert..but I
> would agree that it would be useful to be able to set an "alarm if
> this or this" or an "alarm if this and this" type scenario. At
> present, the only tests I can think of that allow this, "out of the
> box" are the process monitors, where you can set minimum and maximum
> thresholds.

Is there a way of setting analysis.cfg to use a script instead of the MEM* directives, or would that need to be a completely external job of some kind?
list Japheth Cleaver · Tue, 14 Apr 2015 09:01:58 -0700 ·
On Tue, April 14, 2015 7:56 am, Steve Hill wrote:
> Is there a way of setting analysis.cfg to use a script instead of the
> MEM* directives, or would that need to be a completely external job of
> some kind?

There's no built-in way to support this via analysis.cfg, or -- more
specifically -- xymond_client, and launching a script to do this
per-report would probably run into scaling issues for larger installs.

We've run into the same "things in swap alerting when they don't really
cause problems" issue, although to some extent we were able to work
around it by business policy (e.g., "if transient, then please clear swap
out"), but that's not really the best solution.

Even the "out of the box" monitors handling things at the RRD level (the
'DS' directives) won't let you cross compare two distinct thresholds,
although that would be a nice feature.

About the best thing I can think of for immediate use would be to set the
MEM* alerts to 100/101 and write a new channel listener that reads in
incoming messages, does calculations and either a) issues a new type of
status message, or b) issues a "modify" message for the existing 'memory'
status when a certain threshold is crossed.
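
As a rough illustration of option a), the Python sketch below shows the calculation side of such a listener. It is a sketch under several assumptions, not a tested Xymon component: it assumes the listener is fed the "client" channel on standard input, that each message starts with an `@@client...` header line whose fourth `|`-separated field is the hostname and ends with a line containing only `@@`, and that the Linux client's `[free]` section uses the older `free` layout with a `-/+ buffers/cache:` line. The test name `memcombo`, the function names and the thresholds are all hypothetical.

```python
def extract_section(message, name):
    """Return the text of one "[name]" section of a client message."""
    lines, inside = [], False
    for line in message.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            inside = (stripped == "[%s]" % name)
            continue
        if inside:
            lines.append(line)
    return "\n".join(lines)


def parse_free(section):
    """Return (ram_used_pct, swap_used_pct); either may be None."""
    ram_pct = swap_pct = None
    for line in section.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "-/+":
            # Assumed layout: "-/+ buffers/cache: <used> <free>"
            used, free = int(fields[2]), int(fields[3])
            if used + free > 0:
                ram_pct = 100.0 * used / (used + free)
        elif len(fields) >= 3 and fields[0] == "Swap:":
            total, used = int(fields[1]), int(fields[2])
            swap_pct = 100.0 * used / total if total else 0.0
    return ram_pct, swap_pct


def build_status(hostname, ram_pct, swap_pct, ram_limit=95.0, swap_limit=20.0):
    """Build a status message that is red only if BOTH limits are exceeded."""
    color = "red" if ram_pct > ram_limit and swap_pct > swap_limit else "green"
    # Xymon status messages use commas in place of dots in the hostname.
    return "status %s.memcombo %s RAM %.1f%% used, swap %.1f%% used" % (
        hostname.replace(".", ","), color, ram_pct, swap_pct)


def statuses_from_channel(lines):
    """Yield one status message per complete client message in `lines`."""
    header, body = None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("@@client"):
            header, body = line.split("|"), []
        elif line == "@@" and header is not None:
            ram, swap = parse_free(extract_section("\n".join(body), "free"))
            if len(header) > 3 and ram is not None and swap is not None:
                yield build_status(header[3], ram, swap)
            header = None
        elif header is not None:
            body.append(line)
```

A real listener would loop over `statuses_from_channel(sys.stdin)` and pass each message to the Xymon server by whatever means the installation normally uses. Hosts whose `free` output lacks the `-/+ buffers/cache:` line are skipped here rather than guessed at.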


HTH,

-jc