Xymon Mailing List Archive

MEMPHYS went nuts

7 messages in this thread

list Jaime Kikpole · Sat, 19 Dec 2009 23:47:50 -0500 ·
I've been setting up Xymon since I heard about it a couple of days
ago.  So far, it's fantastic.  Unfortunately, after a UPS malfunction
that caused a whole room to restart, I have an inexplicable alarm.
Take a look at this red alarm:

 Memory              Used       Total  Percentage
 Physical     4294964640M       4084M 4294967231%
 Swap                  0M       4096M          0%

This is from a FreeBSD 7.x box that has 8GB of physical RAM and 4GB of
swap.  According to "top", the physical memory is still about 6.75GB
free.

I've restarted the daemon on the host in question as well as on
Xymon's GUI/web server.  I've even tried a bin/bb 127.0.0.1 "drop
HOSTNAME memory" just to see if it would help.

I'm not even sure where to start on this.  It was working well for a
day or so before going bad like this.  My best guess right now is some
kind of "wrap-around" problem with an integer or something like that
causing bad data in memory.

Suggestions?

Thanks in advance,
Jaime

P.S. - The alarm is at the following URL, if that helps:
http://cns.cairodurham.org/hobbit-cgi/bb-hostsvc.sh?HOST=cerberus.cairodurham.org&SERVICE=memory

-- 
Network Administrator
Cairo-Durham Central School District
http://cns.cairodurham.org
list Xymon User in Richmond · Sun, 20 Dec 2009 07:37:34 -0500 ·
I have not touched BSD in about two years, so this is just a stab with a rusty
fork.  My gut feel is the same as yours: an integer wrap or memory mismap sort
of thing.

How is this host maintained WRT kernel and ports?  Any chance the kernel
or related components have been updated since the previous reboot
and/or since the Xymon build, and that this reboot loaded the changes?  If
you're using freebsd-update, could you be in the stage 1 reboot after a
kernel update now, i.e. needing to do another "freebsd-update install" to
complete?  I'd look at that, and also at possibly rebuilding/installing
Xymon.
list Jaime Kikpole · Sun, 20 Dec 2009 09:46:30 -0500 ·
On Sunday, December 20, 2009, Xymon User in Richmond
<user-24d6f8323faa@xymon.invalid> wrote:
> Any chance the kernel
> or related components have been updated since the last previous reboot
> and/or since the Xymon build, and that this reboot loaded the changes?
Not a bad question, but no.  There have been no changes in the kernel
or OS for a little while now.  In fact I am hoping to have a chance to
do an update in about two weeks.

The system did reboot unexpectedly, though.  I wouldn't have expected
that to have an effect.  What do you think?

I tried a "controlled" restart just now via shutdown -r now.  After
giving the system a few minutes to talk to itself, it is still
reporting strangely high numbers.

Would it make sense to pkg_delete the Xymon daemon on the observed
host and reinstall it?  Or could that make things worse?

Thanks,
Jaime

-- 
Network Administrator
Cairo-Durham Central School District
http://cns.cairodurham.org
list Ralph Mitchell · Sun, 20 Dec 2009 10:27:03 -0500 ·
If you click through the "client data available" link, you'll see this near
the top:

     [meminfo]
     Total:4084
     Free:6920
     [swapinfo]
     Device          1K-blocks     Used    Avail Capacity
     /dev/da0s1b       4194304        0  4194304     0%


That "Total:4084" is supposed to be the total physical memory in the system,
if I'm reading the freebsd-meminfo.c source correctly.  If you have the
xymon source, that's under the "client" directory.

If that system is supposed to have 8G of memory, I think the kernel may not
be seeing half of it...

Ralph Mitchell
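
A quick sanity check on the client data quoted above: Free (6920) is larger
than Total (4084), which a consistent report cannot show.  Below is a minimal
Python sketch of that check.  It assumes the [meminfo] section uses the
"Key:value" layout shown and that both values are in MB; the helper name
parse_meminfo is hypothetical and not part of Xymon.

```python
def parse_meminfo(client_data: str) -> dict:
    """Collect Key:integer pairs from the [meminfo] section of a client report."""
    values = {}
    in_section = False
    for raw in client_data.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new section header; only [meminfo] is of interest here.
            in_section = (line == "[meminfo]")
            continue
        if in_section and ":" in line:
            key, _, value = line.partition(":")
            values[key.strip()] = int(value)
    return values


# Sample copied from the client data quoted in this message.
SAMPLE = """\
[meminfo]
Total:4084
Free:6920
[swapinfo]
Device          1K-blocks     Used    Avail Capacity
/dev/da0s1b       4194304        0  4194304     0%
"""

mem = parse_meminfo(SAMPLE)
# Free memory can never legitimately exceed total memory.
inconsistent = mem["Free"] > mem["Total"]
print(mem, "inconsistent" if inconsistent else "ok")
```

With the values from this host the check reports the data as inconsistent,
which points at the Total figure rather than at the Xymon server.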
list Jaime Kikpole · Sun, 20 Dec 2009 13:43:16 -0500 ·
On Sun, Dec 20, 2009 at 10:27 AM, Ralph Mitchell
<user-00a5e44c48c0@xymon.invalid> wrote:
> That "Total:4084" is supposed to be the total physical memory in the system,
> if I'm reading the freebsd-meminfo.c source correctly.  If you have the
> xymon source, that's under the "client" directory.
Thanks.  I'm going to check on that now.

> If that system is supposed to have 8G of memory, I think the kernel may not
> be seeing half of it...
That makes some sense.  It doesn't explain why it thinks that it's
using about 4 billion percent of capacity, though.  That would be
40,000,000 times capacity.  Is there someplace that the unit of
measurement is set?

Jaime

-- 
Network Administrator
Cairo-Durham Central School District
http://cns.cairodurham.org
list Jaime Kikpole · Sun, 20 Dec 2009 14:03:51 -0500 ·
On Sun, Dec 20, 2009 at 10:27 AM, Ralph Mitchell
<user-00a5e44c48c0@xymon.invalid> wrote:
> If that system is supposed to have 8G of memory, I think the kernel may not
> be seeing half of it...
I just checked at the command line:

cerberus# dmesg | grep memory
real memory  = 9126805504 (8704 MB)
avail memory = 8403456000 (8014 MB)
cerberus# sysctl -a | grep -e '^hw\.' | grep 'mem: '
hw.physmem: 4282679296
hw.usermem: 4162433024
hw.realmem: 536870912

You may be on to something.

Thanks.  I'll check this out in the kernel options, etc.

Jaime

-- 
Network Administrator
Cairo-Durham Central School District
http://cns.cairodurham.org
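
The byte counts in the message above can be converted to MB for comparison
with the Xymon report.  The Python sketch below uses only the numbers quoted
there.  hw.physmem works out to 4084 MB, the same figure as the "Total:4084"
in the client data, while dmesg's "avail memory" is 8014 MB.  That the client
takes its total from hw.physmem is an inference from the matching numbers, not
something confirmed from the source.

```python
MIB = 1024 * 1024  # dmesg's "MB" figures correspond to 1048576-byte units

# Byte counts copied from the dmesg and sysctl output in this thread.
readings = {
    "dmesg real memory": 9126805504,
    "dmesg avail memory": 8403456000,
    "hw.physmem": 4282679296,
    "hw.usermem": 4162433024,
    "hw.realmem": 536870912,
}

# Integer division drops the fractional MB, as the dmesg figures do.
in_mb = {name: count // MIB for name, count in readings.items()}

for name, mb in in_mb.items():
    print(f"{name:20s} {mb:6d} MB")
```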
list Xymon User in Richmond · Sun, 20 Dec 2009 15:00:43 -0500 ·
Can't get my head around the math on a Sunday afternoon, but if it's
trying to do calculations using what it can see is used and what it's been
told is the total available, which may be less, the results would be some
kind of hosed.
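
The numbers in the original alarm are consistent with the wrap-around guess.
The Python sketch below works through the arithmetic under these assumptions,
none of them confirmed from the Xymon source: used is computed as total minus
free, the percentage as 100 * used / total with C-style truncation, and both
results are displayed as unsigned 32-bit values.  The free figure of 6740 MB
is inferred from the "about 6.75GB free" that top reported; it is not taken
from a client report.

```python
MASK32 = 0xFFFFFFFF  # used to emulate an unsigned 32-bit integer


def as_u32(n: int) -> int:
    """Reinterpret a (possibly negative) integer as unsigned 32-bit."""
    return n & MASK32


def c_div(a: int, b: int) -> int:
    """Integer division that truncates toward zero, as C does."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


total_mb = 4084   # "Total" as reported by the client
free_mb = 6740    # assumption: roughly what top showed as free

used_mb = total_mb - free_mb          # negative, because free > total
pct = c_div(100 * used_mb, total_mb)  # also negative

print(as_u32(used_mb), as_u32(pct))   # 4294964640 4294967231
```

Those two values match the "Used" and "Percentage" columns of the red alarm,
so an under-reported total combined with unsigned arithmetic would be enough
to produce it.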