strange graph behavior - random machines & graphs

60 messages in this thread

list Gary Baluha · Wed, 28 Nov 2007 10:08:56 -0500 ·

I have recently noticed a strange thing happening with some of the rrd
graphs generated by Hobbit.  When you look at the graph, it looks as though
the rrd data is one one format (gauge), but the graph is generating it in a
different format (derive).  I can't seem to find any pattern to the hosts or
tests that are exhibiting this strange behavior, and it is only happening on
a handful of graphs.  I have attached a picture of one of these graphs,
since I'm not really sure how to describe it.  Note the huge numbers
displayed on the curr/min/avg/max line.

Any idea what's going on here?  When I dump the RRD file manually,
everything looks okay.  I'm running Hobbit 4.2.0 with the 2007-02-09
allinone patch (I believe the latest).  This has only happened in the past
few weeks, though when exactly it started, I don't know.  Any ideas?

list Gary Baluha · Wed, 28 Nov 2007 17:39:11 -0500 ·

Actually, it looks like this is affecting more hosts than I originally
thought.

▸ quoted from Gary Baluha


On Nov 28, 2007 10:08 AM, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

I have recently noticed a strange thing happening with some of the rrd
graphs generated by Hobbit.  When you look at the graph, it looks as though
the rrd data is one one format (gauge), but the graph is generating it in a
different format (derive).  I can't seem to find any pattern to the hosts or
tests that are exhibiting this strange behavior, and it is only happening on
a handful of graphs.  I have attached a picture of one of these graphs,
since I'm not really sure how to describe it.  Note the huge numbers
displayed on the curr/min/avg/max line.

Any idea what's going on here?  When I dump the RRD file manually,
everything looks okay.  I'm running Hobbit 4.2.0 with the 2007-02-09
allinone patch (I believe the latest).  This has only happened in the past
few weeks, though when exactly it started, I don't know.  Any ideas?

list Gary Baluha · Thu, 29 Nov 2007 11:25:34 -0500 ·

Hmm, I just noticed that two completely separate Hobbit servers are having
the exact same problem.  Again, there is no pattern to which graphs are
having this problem, and which ones are not.  I even upgraded to the latest
version of rrdtool (from 1.2.23 to 1.2.26) and that didn't seem to help.

▸ quoted from Gary Baluha


On Nov 28, 2007 5:39 PM, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

Actually, it looks like this is affecting more hosts than I originally
thought.


On Nov 28, 2007 10:08 AM, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

I have recently noticed a strange thing happening with some of the rrd
graphs generated by Hobbit.  When you look at the graph, it looks as though
the rrd data is one one format (gauge), but the graph is generating it in a
different format (derive).  I can't seem to find any pattern to the hosts or
tests that are exhibiting this strange behavior, and it is only happening on
a handful of graphs.  I have attached a picture of one of these graphs,
since I'm not really sure how to describe it.  Note the huge numbers
displayed on the curr/min/avg/max line.

Any idea what's going on here?  When I dump the RRD file manually,
everything looks okay.  I'm running Hobbit 4.2.0 with the 2007-02-09
allinone patch (I believe the latest).  This has only happened in the past
few weeks, though when exactly it started, I don't know.  Any ideas?

list Josh Luthman · Thu, 29 Nov 2007 12:01:35 -0500 ·

This is completely beyond my knowledge, but the first place I would look at
is any hardware problems, any recent changes (obviously =) and the
similarities between those two Hobbit servers issues.

▸ quoted from Gary Baluha


On 11/29/07, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

Hmm, I just noticed that two completely separate Hobbit servers are having
the exact same problem.  Again, there is no pattern to which graphs are
having this problem, and which ones are not.  I even upgraded to the latest
version of rrdtool (from 1.2.23 to 1.2.26) and that didn't seem to help.

On Nov 28, 2007 5:39 PM, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

Actually, it looks like this is affecting more hosts than I originally
thought.


On Nov 28, 2007 10:08 AM, Gary Baluha < user-ae3e15c22de1@xymon.invalid> wrote:

I have recently noticed a strange thing happening with some of the rrd
graphs generated by Hobbit.  When you look at the graph, it looks as though
the rrd data is one one format (gauge), but the graph is generating it in a
different format (derive).  I can't seem to find any pattern to the hosts or
tests that are exhibiting this strange behavior, and it is only happening on
a handful of graphs.  I have attached a picture of one of these graphs,
since I'm not really sure how to describe it.  Note the huge numbers
displayed on the curr/min/avg/max line.

Any idea what's going on here?  When I dump the RRD file manually,
everything looks okay.  I'm running Hobbit 4.2.0 with the 2007-02-09
allinone patch (I believe the latest).  This has only happened in the past
few weeks, though when exactly it started, I don't know.  Any ideas?

--


Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

list Gary Baluha · Thu, 29 Nov 2007 14:15:56 -0500 ·

▸ quoted from Josh Luthman

On Nov 29, 2007 12:01 PM, Josh Luthman <user-4c45a83f15cb@xymon.invalid> wrote:

This is completely beyond my knowledge, but the first place I would look
at is any hardware problems, any recent changes (obviously =) and the
similarities between those two Hobbit servers issues.

That's the thing, there aren't any similarities between these two machines.
They are different hardware, different OS, different network segment, and
different hosts being monitored.

There were some recent changes in the past month to one of the hobbit
servers, with a bunch of custom RRD graphs added.  But this wasn't done on
the other hobbit server.  The only thing changed on the other hobbit server
is more html web checks added; nothing out of the ordinary.

list Josh Luthman · Thu, 29 Nov 2007 14:33:22 -0500 ·

Is this problem not showing up on another Hobbit server?  Do the two Hobbit
servers with this problem communicate at all (share data/SNMP traffic/etc)?

▸ quoted from Gary Baluha


On 11/29/07, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

On Nov 29, 2007 12:01 PM, Josh Luthman <user-4c45a83f15cb@xymon.invalid>
wrote:

This is completely beyond my knowledge, but the first place I would look
at is any hardware problems, any recent changes (obviously =) and the
similarities between those two Hobbit servers issues.

That's the thing, there aren't any similarities between these two
machines.  They are different hardware, different OS, different network
segment, and different hosts being monitored.

There were some recent changes in the past month to one of the hobbit
servers, with a bunch of custom RRD graphs added.  But this wasn't done on
the other hobbit server.  The only thing changed on the other hobbit server
is more html web checks added; nothing out of the ordinary.

-- 
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

list Gary Baluha · Thu, 29 Nov 2007 14:37:50 -0500 ·

We only have two Hobbit servers, and it is affecting both machines.  No,
these two Hobbit machines do _not_ communicate with each other in any way.

▸ quoted from Josh Luthman


On Nov 29, 2007 2:33 PM, Josh Luthman <user-4c45a83f15cb@xymon.invalid> wrote:

Is this problem not showing up on another Hobbit server?  Do the two
Hobbit servers with this problem communicate at all (share data/SNMP
traffic/etc)?

On 11/29/07, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

On Nov 29, 2007 12:01 PM, Josh Luthman <user-4c45a83f15cb@xymon.invalid>
wrote:

This is completely beyond my knowledge, but the first place I would
look at is any hardware problems, any recent changes (obviously =) and the
similarities between those two Hobbit servers issues.

That's the thing, there aren't any similarities between these two
machines.  They are different hardware, different OS, different network
segment, and different hosts being monitored.

There were some recent changes in the past month to one of the hobbit
servers, with a bunch of custom RRD graphs added.  But this wasn't done on
the other hobbit server.  The only thing changed on the other hobbit server
is more html web checks added; nothing out of the ordinary.

--
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

list Gary Baluha · Thu, 29 Nov 2007 14:39:10 -0500 ·

Actually, that's not entirely true.  The Hobbit server I run at home, which
is obviously in no way at all related, is _not_ having this problem.  It
does have in common the same version of Hobbit and rrdtool, however.

▸ quoted from Gary Baluha


On Nov 29, 2007 2:37 PM, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

We only have two Hobbit servers, and it is affecting both machines.  No,
these two Hobbit machines do _not_ communicate with each other in any way.


On Nov 29, 2007 2:33 PM, Josh Luthman < user-4c45a83f15cb@xymon.invalid>
wrote:

Is this problem not showing up on another Hobbit server?  Do the two
Hobbit servers with this problem communicate at all (share data/SNMP
traffic/etc)?

On 11/29/07, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

On Nov 29, 2007 12:01 PM, Josh Luthman <user-4c45a83f15cb@xymon.invalid>
wrote:

This is completely beyond my knowledge, but the first place I would
look at is any hardware problems, any recent changes (obviously =) and the
similarities between those two Hobbit servers issues.

That's the thing, there aren't any similarities between these two
machines.  They are different hardware, different OS, different network
segment, and different hosts being monitored.

There were some recent changes in the past month to one of the hobbit
servers, with a bunch of custom RRD graphs added.  But this wasn't done on
the other hobbit server.  The only thing changed on the other hobbit server
is more html web checks added; nothing out of the ordinary.

--
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

list Josh Luthman · Thu, 29 Nov 2007 14:53:28 -0500 ·

Do they monitor the same devices?  I think there has to be some similarity
between the two as they had the same problem at the same time (though this
isn't 100%, it's logically the first place to look).  Hardware isn't of much
concern here as they don't communicate and the chances of both servers going
bad on the same date is simply astronomical.

Are there any kind of auto updating services running on them?

▸ quoted from Gary Baluha


On 11/29/07, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

We only have two Hobbit servers, and it is affecting both machines.  No,
these two Hobbit machines do _not_ communicate with each other in any way.

On Nov 29, 2007 2:33 PM, Josh Luthman < user-4c45a83f15cb@xymon.invalid>
wrote:

Is this problem not showing up on another Hobbit server?  Do the two
Hobbit servers with this problem communicate at all (share data/SNMP
traffic/etc)?

On 11/29/07, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

On Nov 29, 2007 12:01 PM, Josh Luthman <user-4c45a83f15cb@xymon.invalid>
wrote:

This is completely beyond my knowledge, but the first place I would
look at is any hardware problems, any recent changes (obviously =) and the
similarities between those two Hobbit servers issues.

That's the thing, there aren't any similarities between these two
machines.  They are different hardware, different OS, different network
segment, and different hosts being monitored.

There were some recent changes in the past month to one of the hobbit
servers, with a bunch of custom RRD graphs added.  But this wasn't done on
the other hobbit server.  The only thing changed on the other hobbit server
is more html web checks added; nothing out of the ordinary.

--
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

-- 
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

list Josh Luthman · Thu, 29 Nov 2007 15:11:55 -0500 ·

Same OS at home?

Not sure if you mentioned this or not but does that weird value show up in
all RRD graphs or just a few hosts?

▸ quoted from Josh Luthman


On 11/29/07, Josh Luthman <user-4c45a83f15cb@xymon.invalid> wrote:

Do they monitor the same devices?  I think there has to be some similarity
between the two as they had the same problem at the same time (though this
isn't 100%, it's logically the first place to look).  Hardware isn't of much
concern here as they don't communicate and the chances of both servers going
bad on the same date is simply astronomical.

Are there any kind of auto updating services running on them?

On 11/29/07, Gary Baluha <user-ae3e15c22de1@xymon.invalid > wrote:

We only have two Hobbit servers, and it is affecting both machines.  No,
these two Hobbit machines do _not_ communicate with each other in any way.

On Nov 29, 2007 2:33 PM, Josh Luthman < user-4c45a83f15cb@xymon.invalid>
wrote:

Is this problem not showing up on another Hobbit server?  Do the two
Hobbit servers with this problem communicate at all (share data/SNMP
traffic/etc)?

On 11/29/07, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

On Nov 29, 2007 12:01 PM, Josh Luthman <user-4c45a83f15cb@xymon.invalid>
wrote:

This is completely beyond my knowledge, but the first place I
would look at is any hardware problems, any recent changes (obviously =) and
the similarities between those two Hobbit servers issues.

That's the thing, there aren't any similarities between these two
machines.  They are different hardware, different OS, different network
segment, and different hosts being monitored.

There were some recent changes in the past month to one of the
hobbit servers, with a bunch of custom RRD graphs added.  But this wasn't
done on the other hobbit server.  The only thing changed on the other hobbit
server is more html web checks added; nothing out of the ordinary.

--
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

--
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

-- 
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

list Gary Baluha · Thu, 29 Nov 2007 15:19:58 -0500 ·

They have nothing whatsoever in common.  There are no updating services
either.

▸ quoted from Josh Luthman


On Nov 29, 2007 2:53 PM, Josh Luthman <user-4c45a83f15cb@xymon.invalid> wrote:

Do they monitor the same devices?  I think there has to be some similarity
between the two as they had the same problem at the same time (though this
isn't 100%, it's logically the first place to look).  Hardware isn't of much
concern here as they don't communicate and the chances of both servers going
bad on the same date is simply astronomical.

Are there any kind of auto updating services running on them?


On 11/29/07, Gary Baluha <user-ae3e15c22de1@xymon.invalid > wrote:

We only have two Hobbit servers, and it is affecting both machines.  No,
these two Hobbit machines do _not_ communicate with each other in any way.

On Nov 29, 2007 2:33 PM, Josh Luthman < user-4c45a83f15cb@xymon.invalid>
wrote:

Is this problem not showing up on another Hobbit server?  Do the two
Hobbit servers with this problem communicate at all (share data/SNMP
traffic/etc)?

On 11/29/07, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

On Nov 29, 2007 12:01 PM, Josh Luthman <user-4c45a83f15cb@xymon.invalid>
wrote:

This is completely beyond my knowledge, but the first place I
would look at is any hardware problems, any recent changes (obviously =) and
the similarities between those two Hobbit servers issues.

That's the thing, there aren't any similarities between these two
machines.  They are different hardware, different OS, different network
segment, and different hosts being monitored.

There were some recent changes in the past month to one of the
hobbit servers, with a bunch of custom RRD graphs added.  But this wasn't
done on the other hobbit server.  The only thing changed on the other hobbit
server is more html web checks added; nothing out of the ordinary.

--
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

--
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

list Gary Baluha · Thu, 29 Nov 2007 15:20:50 -0500 ·

I don't know how many hosts are affected, percentage wise, but it's
definitely not every host.  And for the hosts having the problem, it's not
even the same graphs that are having the problem.

▸ quoted from Josh Luthman


On Nov 29, 2007 3:11 PM, Josh Luthman <user-4c45a83f15cb@xymon.invalid> wrote:

Same OS at home?

Not sure if you mentioned this or not but does that weird value show up in
all RRD graphs or just a few hosts?


On 11/29/07, Josh Luthman <user-4c45a83f15cb@xymon.invalid> wrote:

Do they monitor the same devices?  I think there has to be some
similarity between the two as they had the same problem at the same time
(though this isn't 100%, it's logically the first place to look).  Hardware
isn't of much concern here as they don't communicate and the chances of both
servers going bad on the same date is simply astronomical.

Are there any kind of auto updating services running on them?

On 11/29/07, Gary Baluha < user-ae3e15c22de1@xymon.invalid > wrote:

We only have two Hobbit servers, and it is affecting both machines.
No, these two Hobbit machines do _not_ communicate with each other in any
way.

On Nov 29, 2007 2:33 PM, Josh Luthman < user-4c45a83f15cb@xymon.invalid>
wrote:

Is this problem not showing up on another Hobbit server?  Do the two
Hobbit servers with this problem communicate at all (share data/SNMP
traffic/etc)?

On 11/29/07, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

On Nov 29, 2007 12:01 PM, Josh Luthman <
user-4c45a83f15cb@xymon.invalid> wrote:

This is completely beyond my knowledge, but the first place I
would look at is any hardware problems, any recent changes (obviously =) and
the similarities between those two Hobbit servers issues.

That's the thing, there aren't any similarities between these two
machines.  They are different hardware, different OS, different network
segment, and different hosts being monitored.

There were some recent changes in the past month to one of the
hobbit servers, with a bunch of custom RRD graphs added.  But this wasn't
done on the other hobbit server.  The only thing changed on the other hobbit
server is more html web checks added; nothing out of the ordinary.

--
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it,
poorly.
--- Henry Spencer

--
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

--
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

list Josh Luthman · Thu, 29 Nov 2007 15:41:11 -0500 ·

Can you do a dd if=/dev/sda of=/dev/null from the disk in which the stuff is
stored?  If it is so random I'm curious to see if the fs is having
problems.  I have my money on a bug in the software or bad
disk/fs/controller.

▸ quoted from Gary Baluha


On 11/29/07, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

I don't know how many hosts are affected, percentage wise, but it's
definitely not every host.  And for the hosts having the problem, it's not
even the same graphs that are having the problem.

On Nov 29, 2007 3:11 PM, Josh Luthman <user-4c45a83f15cb@xymon.invalid> wrote:

Same OS at home?

Not sure if you mentioned this or not but does that weird value show up
in all RRD graphs or just a few hosts?


On 11/29/07, Josh Luthman <user-4c45a83f15cb@xymon.invalid> wrote:

Do they monitor the same devices?  I think there has to be some
similarity between the two as they had the same problem at the same time
(though this isn't 100%, it's logically the first place to look).  Hardware
isn't of much concern here as they don't communicate and the chances of both
servers going bad on the same date is simply astronomical.

Are there any kind of auto updating services running on them?

On 11/29/07, Gary Baluha < user-ae3e15c22de1@xymon.invalid > wrote:

We only have two Hobbit servers, and it is affecting both machines.
No, these two Hobbit machines do _not_ communicate with each other in any
way.

On Nov 29, 2007 2:33 PM, Josh Luthman < user-4c45a83f15cb@xymon.invalid>
wrote:

Is this problem not showing up on another Hobbit server?  Do the
two Hobbit servers with this problem communicate at all (share data/SNMP
traffic/etc)?

On 11/29/07, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

On Nov 29, 2007 12:01 PM, Josh Luthman <
user-4c45a83f15cb@xymon.invalid> wrote:

This is completely beyond my knowledge, but the first place I
would look at is any hardware problems, any recent changes (obviously =) and
the similarities between those two Hobbit servers issues.

That's the thing, there aren't any similarities between these
two machines.  They are different hardware, different OS, different network
segment, and different hosts being monitored.

There were some recent changes in the past month to one of the
hobbit servers, with a bunch of custom RRD graphs added.  But this wasn't
done on the other hobbit server.  The only thing changed on the other hobbit
server is more html web checks added; nothing out of the ordinary.

--
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it,
poorly.
--- Henry Spencer

--
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

--
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

-- 
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

list Gary Baluha · Thu, 29 Nov 2007 16:24:10 -0500 ·

Unfortunately, no, I can't do this as our Hobbit server monitors production
machines.  The data directory for the rrd files are SAN-mounted, and we
haven't had disk corruption issues before with this type of setup.

The strange thing is, this only started within the past week, and
unfortunately it seems to be spreading to more and more RRD graphs.

▸ quoted from Josh Luthman


On Nov 29, 2007 3:41 PM, Josh Luthman <user-4c45a83f15cb@xymon.invalid> wrote:

Can you do a dd if=/dev/sda of=/dev/null from the disk in which the stuff
is stored?  If it is so random I'm curious to see if the fs is having
problems.  I have my money on a bug in the software or bad
disk/fs/controller.


On 11/29/07, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

I don't know how many hosts are affected, percentage wise, but it's
definitely not every host.  And for the hosts having the problem, it's not
even the same graphs that are having the problem.

On Nov 29, 2007 3:11 PM, Josh Luthman <user-4c45a83f15cb@xymon.invalid>
wrote:

Same OS at home?

Not sure if you mentioned this or not but does that weird value show
up in all RRD graphs or just a few hosts?


On 11/29/07, Josh Luthman <user-4c45a83f15cb@xymon.invalid> wrote:

Do they monitor the same devices?  I think there has to be some
similarity between the two as they had the same problem at the same time
(though this isn't 100%, it's logically the first place to look).  Hardware
isn't of much concern here as they don't communicate and the chances of both
servers going bad on the same date is simply astronomical.

Are there any kind of auto updating services running on them?

On 11/29/07, Gary Baluha < user-ae3e15c22de1@xymon.invalid > wrote:

We only have two Hobbit servers, and it is affecting both
machines.  No, these two Hobbit machines do _not_ communicate with each
other in any way.

On Nov 29, 2007 2:33 PM, Josh Luthman <
user-4c45a83f15cb@xymon.invalid> wrote:

Is this problem not showing up on another Hobbit server?  Do the
two Hobbit servers with this problem communicate at all (share data/SNMP
traffic/etc)?

On 11/29/07, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

On Nov 29, 2007 12:01 PM, Josh Luthman <
user-4c45a83f15cb@xymon.invalid> wrote:

This is completely beyond my knowledge, but the first place
I would look at is any hardware problems, any recent changes (obviously =)
and the similarities between those two Hobbit servers issues.

That's the thing, there aren't any similarities between these
two machines.  They are different hardware, different OS, different network
segment, and different hosts being monitored.

There were some recent changes in the past month to one of the
hobbit servers, with a bunch of custom RRD graphs added.  But this wasn't
done on the other hobbit server.  The only thing changed on the other hobbit
server is more html web checks added; nothing out of the ordinary.

--
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it,
poorly.
--- Henry Spencer

--
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it,
poorly.
--- Henry Spencer

--
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

--
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

list Gary Baluha · Thu, 29 Nov 2007 16:26:24 -0500 ·

Some additional information...  If I delete the RRD files that become
corrupted like this and let them rebuild, not only does it rebuild it
corrupted again, but if you click on the graph for the full history, all
four of the historical graphs show the exact same corrupted data.  That is,
it has a full 576 days worth of corrupted, invalid data, even though the RRD
file has only had data going to it for a handful of Hobbit polling cycles.

list Josh Luthman · Thu, 29 Nov 2007 16:53:13 -0500 ·

Is it possible that hosts are getting combined?

▸ quoted from Gary Baluha


On 11/29/07, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

Some additional information...  If I delete the RRD files that become
corrupted like this and let them rebuild, not only does it rebuild it
corrupted again, but if you click on the graph for the full history, all
four of the historical graphs show the exact same corrupted data.  That is,
it has a full 576 days worth of corrupted, invalid data, even though the RRD
file has only had data going to it for a handful of Hobbit polling cycles.

-- 
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

list Greg L Hubbard · Thu, 29 Nov 2007 16:06:04 -0600 ·

Sounds like your SAN is needs some love.


	From: Gary Baluha [mailto:user-ae3e15c22de1@xymon.invalid] 
	Sent: Thursday, November 29, 2007 3:26 PM
	To: user-ae9b8668bcde@xymon.invalid
	Subject: Re: [hobbit] strange graph behavior - random machines &
graphs

▸ quoted from Josh Luthman

	
	
	Some additional information...  If I delete the RRD files that
become corrupted like this and let them rebuild, not only does it
rebuild it corrupted again, but if you click on the graph for the full
history, all four of the historical graphs show the exact same corrupted
data.  That is, it has a full 576 days worth of corrupted, invalid data,
even though the RRD file has only had data going to it for a handful of
Hobbit polling cycles.

list Josh Luthman · Thu, 29 Nov 2007 17:14:58 -0500 ·

Is Hobbit the only thing running on your SAN?  Have you had any other issues
with it by chance?

You can ship it to my house for extensive testing if needed.

▸ quoted from Greg L Hubbard


On 11/29/07, Hubbard, Greg L <user-d970b5e56ec9@xymon.invalid> wrote:

 Sounds like your SAN is needs some love.

*From:* Gary Baluha [mailto:user-ae3e15c22de1@xymon.invalid]
*Sent:* Thursday, November 29, 2007 3:26 PM
*To:* user-ae9b8668bcde@xymon.invalid
*Subject:* Re: [hobbit] strange graph behavior - random machines & graphs

Some additional information...  If I delete the RRD files that become
corrupted like this and let them rebuild, not only does it rebuild it
corrupted again, but if you click on the graph for the full history, all
four of the historical graphs show the exact same corrupted data.  That is,
it has a full 576 days worth of corrupted, invalid data, even though the RRD
file has only had data going to it for a handful of Hobbit polling cycles.

-- 
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

list Gary Baluha · Fri, 30 Nov 2007 10:25:18 -0500 ·

Now this appears it is becoming a more serious problem.  It seems more and
more graphs are starting to be affected, and I still have no explanation for
what is going on here.  It also seems that almost any new graph that is
created (such as if I delete/rename/move an existing .rrd file), it
immediately starts off being corrupted. :-(

▸ quoted from Gary Baluha


On Nov 28, 2007 10:08 AM, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

I have recently noticed a strange thing happening with some of the rrd
graphs generated by Hobbit.  When you look at the graph, it looks as though
the rrd data is one one format (gauge), but the graph is generating it in a
different format (derive).  I can't seem to find any pattern to the hosts or
tests that are exhibiting this strange behavior, and it is only happening on
a handful of graphs.  I have attached a picture of one of these graphs,
since I'm not really sure how to describe it.  Note the huge numbers
displayed on the curr/min/avg/max line.

Any idea what's going on here?  When I dump the RRD file manually,
everything looks okay.  I'm running Hobbit 4.2.0 with the 2007-02-09
allinone patch (I believe the latest).  This has only happened in the past
few weeks, though when exactly it started, I don't know.  Any ideas?

list Josh Luthman · Fri, 30 Nov 2007 10:34:33 -0500 ·

Like I said before, I don't know enough to really help you and I'm shooting
in the dark to try and help!

I would try to get a whole new box together and try to replicate it.
Install the same OS/software and copy the whole ~hobbit and see if it starts
up there.

▸ quoted from Gary Baluha


On 11/30/07, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

Now this appears it is becoming a more serious problem.  It seems more and
more graphs are starting to be affected, and I still have no explanation for
what is going on here.  It also seems that almost any new graph that is
created (such as if I delete/rename/move an existing .rrd file), it
immediately starts off being corrupted. :-(

On Nov 28, 2007 10:08 AM, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

I have recently noticed a strange thing happening with some of the rrd
graphs generated by Hobbit.  When you look at the graph, it looks as though
the rrd data is one one format (gauge), but the graph is generating it in a
different format (derive).  I can't seem to find any pattern to the hosts or
tests that are exhibiting this strange behavior, and it is only happening on
a handful of graphs.  I have attached a picture of one of these graphs,
since I'm not really sure how to describe it.  Note the huge numbers
displayed on the curr/min/avg/max line.

Any idea what's going on here?  When I dump the RRD file manually,
everything looks okay.  I'm running Hobbit 4.2.0 with the 2007-02-09
allinone patch (I believe the latest).  This has only happened in the past
few weeks, though when exactly it started, I don't know.  Any ideas?

-- 
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

list Greg L Hubbard · Fri, 30 Nov 2007 09:53:37 -0600 ·

Gary,
 
This is pretty hard to decipher from "afar".
 
I think I remember you saying that when you dump the data it is always
okay?
 
Some wild thoughts:
 
a) could there be two different processes updating the same RRD files?
 
b) are all servers using the same version of rrdtool?
 
c) are the hobbitgraph files okay?  I have proven to my satisfaction
that hobbitgraph definition errors can make the graphs act funny.
 
d) if this stuff is on a SAN, can it be moved to local storage?
 
I am just "fishing."  Sometimes, when I am at my wit's end, I just
change SOMETHING to see if it makes a difference. Even WORSE can help
get me started.
 
GLH

▸ quoted from Gary Baluha



	From: Gary Baluha [mailto:user-ae3e15c22de1@xymon.invalid] 
	Sent: Friday, November 30, 2007 9:25 AM
	To: user-ae9b8668bcde@xymon.invalid
	Subject: Re: [hobbit] strange graph behavior - random machines &
graphs
	
	
	Now this appears it is becoming a more serious problem.  It
seems more and more graphs are starting to be affected, and I still have
no explanation for what is going on here.  It also seems that almost any
new graph that is created (such as if I delete/rename/move an existing
.rrd file), it immediately starts off being corrupted. :-( 
	
	
	On Nov 28, 2007 10:08 AM, Gary Baluha <user-ae3e15c22de1@xymon.invalid>
wrote:
	

		I have recently noticed a strange thing happening with
some of the rrd graphs generated by Hobbit.  When you look at the graph,
it looks as though the rrd data is one one format (gauge), but the graph
is generating it in a different format (derive).  I can't seem to find
any pattern to the hosts or tests that are exhibiting this strange
behavior, and it is only happening on a handful of graphs.  I have
attached a picture of one of these graphs, since I'm not really sure how
to describe it.  Note the huge numbers displayed on the curr/min/avg/max
line. 
		 
		Any idea what's going on here?  When I dump the RRD file
manually, everything looks okay.  I'm running Hobbit 4.2.0 with the
2007-02-09 allinone patch (I believe the latest).  This has only
happened in the past few weeks, though when exactly it started, I don't
know.  Any ideas?

list Gary Baluha · Fri, 30 Nov 2007 11:14:41 -0500 ·

▸ quoted from Greg L Hubbard

On Nov 30, 2007 10:53 AM, Hubbard, Greg L <user-d970b5e56ec9@xymon.invalid> wrote:

 Gary,

This is pretty hard to decipher from "afar".

I think I remember you saying that when you dump the data it is always
okay?

Actually, it turns out this is not true.  The rrd file does indeed have the
bad data.  I just didn't notice it before, but now that it appears to be
getting worse, it is quite obvious to see the bad data.

▸ quoted from Greg L Hubbard


Some wild thoughts:

a) could there be two different processes updating the same RRD files?

I don't believe so.  The strange thing is, all of the graphs that become
corrupted have the exact same large number that is being input into the rrd
data files.

b) are all servers using the same version of rrdtool?

No.  One is running 1.2.23, the other 1.2.26.  Both have the problem.

▸ quoted from Greg L Hubbard

c) are the hobbitgraph files okay?  I have proven to my satisfaction that
hobbitgraph definition errors can make the graphs act funny.

They haven't changed since before the graphs were having this problem.

d) if this stuff is on a SAN, can it be moved to local storage?

It is on the SAN on one of the machines, and locally on the other.  I was
thinking of temporarily moving the data directory and have Hobbit regenerate
all the data from scratch.  I'm trying to avoid this, since that would mean
losing a year's worth of trend data that has proven itself very useful.
Still, if it helps me narrow down the problem, I'll consider this (and move
the data back once I get my answer).

▸ quoted from Greg L Hubbard

I am just "fishing."  Sometimes, when I am at my wit's end, I just change
SOMETHING to see if it makes a difference. Even WORSE can help get me
started.

GLH

*From:* Gary Baluha [mailto:user-ae3e15c22de1@xymon.invalid]
*Sent:* Friday, November 30, 2007 9:25 AM
*To:* user-ae9b8668bcde@xymon.invalid
*Subject:* Re: [hobbit] strange graph behavior - random machines & graphs

Now this appears it is becoming a more serious problem.  It seems more and
more graphs are starting to be affected, and I still have no explanation for
what is going on here.  It also seems that almost any new graph that is
created (such as if I delete/rename/move an existing .rrd file), it
immediately starts off being corrupted. :-(

On Nov 28, 2007 10:08 AM, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

I have recently noticed a strange thing happening with some of the rrd
graphs generated by Hobbit.  When you look at the graph, it looks as though
the rrd data is one one format (gauge), but the graph is generating it in a
different format (derive).  I can't seem to find any pattern to the hosts or
tests that are exhibiting this strange behavior, and it is only happening on
a handful of graphs.  I have attached a picture of one of these graphs,
since I'm not really sure how to describe it.  Note the huge numbers
displayed on the curr/min/avg/max line.

Any idea what's going on here?  When I dump the RRD file manually,
everything looks okay.  I'm running Hobbit 4.2.0 with the 2007-02-09
allinone patch (I believe the latest).  This has only happened in the past
few weeks, though when exactly it started, I don't know.  Any ideas?

list Gary Baluha · Fri, 30 Nov 2007 11:55:51 -0500 ·

Hmm, this is getting curiouser and curiouser.  Apparently at least _some_ of
the graphs that appear corrupted still have some valid data.  If I use the
graph zoom feature (clicking on the magnifying glass) and select certain
portions of the graph, the graph data shows up as normal.  It appears that
the problem is related to periodic data artifacts (the huge numbers) that
cause the scale of the graph to resize to show it within bounds, and this
causes the valid data to essentially disappear.

I realized this when I looked at the graph, and saw that the (curr) and
(min) data points were showing normal values.  It's just the (max) and (avg)
values that are way off, which causes the rest of the graph to be incorrect.

list Ralph Mitchell · Fri, 30 Nov 2007 11:18:45 -0600 ·

▸ quoted from Gary Baluha

On Nov 30, 2007 10:55 AM, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

Hmm, this is getting curiouser and curiouser.  Apparently at least _some_
of the graphs that appear corrupted still have some valid data.  If I use
the graph zoom feature (clicking on the magnifying glass) and select certain
portions of the graph, the graph data shows up as normal.  It appears that
the problem is related to periodic data artifacts (the huge numbers) that
cause the scale of the graph to resize to show it within bounds, and this
causes the valid data to essentially disappear.

I realized this when I looked at the graph, and saw that the (curr) and
(min) data points were showing normal values.  It's just the (max) and (avg)
values that are way off, which causes the rest of the graph to be incorrect.

Have you tried running hobbitd_rrd with the "--debug" option??  Add it to
the various hobbitd_rrd entries in server/etc/hobbitlaunch.cfg.  I haven't
tried it myself, so I don't know how verbose it gets.  I seem to recall
Henrik saying it's OK to just kill hobbitd_rrd processes because they get
respawned.

I guess the debug output shows up in the rrd-status.log in your Hobbit logs
directory.  Is there anything interesting in that log already??  Or any
other log??

Ralph Mitchell

list Greg L Hubbard · Fri, 30 Nov 2007 12:15:51 -0600 ·

It sounds like you are zeroing in on the problem.  Based on your other
post (and this) it seems that the data is getting logged okay in the
RRD, and that data is being faithfully reproduced by the graphs.  The
problem is that the data itself has unexpected values.  So whatever is
providing that data to the RRD is either faulty, or is in turn being
misled by something else further upstream.
 
I don't remember where you said that this data was coming from.  I know
there can be a problem with "rollovers" when a signed integer is used as
a counter and it grows to the point where the sign bit flips.  This can
cause a big jump in a reading if the software cannot handle the switch
from 2,147,483,647 (hex 7FFFFFF) to the next value (hex 80000000) which
flips the sign bit for a signed 32 bit integer.  This has been a problem
in the SNMP world for YEARS.
 
There, I knew some of the computer science 101 stuff I learned in the
70's might be useful some day...

▸ quoted from Gary Baluha

 
GLH


	From: Gary Baluha [mailto:user-ae3e15c22de1@xymon.invalid] 
	Sent: Friday, November 30, 2007 10:15 AM
	To: user-ae9b8668bcde@xymon.invalid
	Subject: Re: [hobbit] strange graph behavior - random machines &
graphs
	
	
	On Nov 30, 2007 10:53 AM, Hubbard, Greg L <user-d970b5e56ec9@xymon.invalid>
wrote:
	

		Gary,
		 
		This is pretty hard to decipher from "afar".
		 
		I think I remember you saying that when you dump the
data it is always okay?


	Actually, it turns out this is not true.  The rrd file does
indeed have the bad data.  I just didn't notice it before, but now that
it appears to be getting worse, it is quite obvious to see the bad data.

	
		Some wild thoughts:
		 
		a) could there be two different processes updating the
same RRD files?


	I don't believe so.  The strange thing is, all of the graphs
that become corrupted have the exact same large number that is being
input into the rrd data files.
	 

		b) are all servers using the same version of rrdtool? 


	No.  One is running 1.2.23, the other 1.2.26.  Both have the
problem.
	 

		c) are the hobbitgraph files okay?  I have proven to my
satisfaction that hobbitgraph definition errors can make the graphs act
funny.


	They haven't changed since before the graphs were having this
problem.
	 

		d) if this stuff is on a SAN, can it be moved to local
storage?


	It is on the SAN on one of the machines, and locally on the
other.  I was thinking of temporarily moving the data directory and have
Hobbit regenerate all the data from scratch.  I'm trying to avoid this,
since that would mean losing a year's worth of trend data that has
proven itself very useful.  Still, if it helps me narrow down the
problem, I'll consider this (and move the data back once I get my
answer). 
	 

		I am just "fishing."  Sometimes, when I am at my wit's
end, I just change SOMETHING to see if it makes a difference. Even WORSE
can help get me started.
		 
		GLH


			From: Gary Baluha [mailto:user-ae3e15c22de1@xymon.invalid] 
			
			Sent: Friday, November 30, 2007 9:25 AM 

			To: user-ae9b8668bcde@xymon.invalid
			Subject: Re: [hobbit] strange graph behavior -
random machines & graphs
			

			Now this appears it is becoming a more serious
problem.  It seems more and more graphs are starting to be affected, and
I still have no explanation for what is going on here.  It also seems
that almost any new graph that is created (such as if I
delete/rename/move an existing .rrd file), it immediately starts off
being corrupted. :-( 
			
			
			On Nov 28, 2007 10:08 AM, Gary Baluha
<user-ae3e15c22de1@xymon.invalid> wrote:
			

				I have recently noticed a strange thing
happening with some of the rrd graphs generated by Hobbit.  When you
look at the graph, it looks as though the rrd data is one one format
(gauge), but the graph is generating it in a different format (derive).
I can't seem to find any pattern to the hosts or tests that are
exhibiting this strange behavior, and it is only happening on a handful
of graphs.  I have attached a picture of one of these graphs, since I'm
not really sure how to describe it.  Note the huge numbers displayed on
the curr/min/avg/max line. 
				 
				Any idea what's going on here?  When I
dump the RRD file manually, everything looks okay.  I'm running Hobbit
4.2.0 with the 2007-02-09 allinone patch (I believe the latest).  This
has only happened in the past few weeks, though when exactly it started,
I don't know.  Any ideas?

list Gary Baluha · Fri, 30 Nov 2007 13:27:03 -0500 ·

▸ quoted from Ralph Mitchell

On Nov 30, 2007 12:18 PM, Ralph Mitchell <user-00a5e44c48c0@xymon.invalid> wrote:

On Nov 30, 2007 10:55 AM, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

Hmm, this is getting curiouser and curiouser.  Apparently at least
_some_ of the graphs that appear corrupted still have some valid data.  If I
use the graph zoom feature (clicking on the magnifying glass) and select
certain portions of the graph, the graph data shows up as normal.  It
appears that the problem is related to periodic data artifacts (the huge
numbers) that cause the scale of the graph to resize to show it within
bounds, and this causes the valid data to essentially disappear.

I realized this when I looked at the graph, and saw that the (curr) and
(min) data points were showing normal values.  It's just the (max) and (avg)
values that are way off, which causes the rest of the graph to be incorrect.

Have you tried running hobbitd_rrd with the "--debug" option??  Add it to
the various hobbitd_rrd entries in server/etc/hobbitlaunch.cfg.  I haven't
tried it myself, so I don't know how verbose it gets.  I seem to recall
Henrik saying it's OK to just kill hobbitd_rrd processes because they get
respawned.

I guess the debug output shows up in the rrd-status.log in your Hobbit
logs directory.  Is there anything interesting in that log already??  Or any
other log??


There wasn't anything useful in any of the logs, besides the usual stuff.  I
turned on the --debug option, and here is a sample of the data for one of
the affected machines:

 2007-11-30 13:14:07 hobbitd_rrd: Got message 562165
@@status#562165|1196446447.724393|192.168.232.110||danno|disk|1196448247|yellow||yellow|1196053505|0||0||1196446447
2007-11-30 13:14:07 startpos 343968, fillpos 343968, endpos -1
2007-11-30 13:14:07 RRD update param 00: 'rrdupdate'
2007-11-30 13:14:07 RRD update param 01:
'/var/hobbit/data/rrd/danno/disk,dev,odm.rrd'
2007-11-30 13:14:07 RRD update param 02: '-t'
2007-11-30 13:14:07 RRD update param 03: 'pct:used'
2007-11-30 13:14:07 RRD update param 04: '1196446447:0:0'

I'm afraid I don't know how to interpret all of this, unfortunately.  I get
that the "param 03" means the graph is showing "percentage [disk space]
used", and that "param 01" means it is updating that specific rrd file.  And
I remember that "-t" in "param 02" is some rrdtool flag.  But I don't know
what the numbers in "param 04" mean.  I assume the first number is the #
seconds since 1970, and the second number is the current value, but I don't
know what the last number means.  Also, I'm not sure how to interpret all of
the data in the "@@status" line.

By the way, this excerpt is from a machine that is having the graph display
problems.  In this case, the data it is receiving is normal and correct.
I'm waiting for another update when the data is incorrect.

list Gary Baluha · Fri, 30 Nov 2007 13:31:01 -0500 ·

▸ quoted from Greg L Hubbard

On Nov 30, 2007 1:15 PM, Hubbard, Greg L <user-d970b5e56ec9@xymon.invalid> wrote:

 It sounds like you are zeroing in on the problem.  Based on your other
post (and this) it seems that the data is getting logged okay in the RRD,
and that data is being faithfully reproduced by the graphs.  The problem is
that the data itself has unexpected values.  So whatever is providing that
data to the RRD is either faulty, or is in turn being misled by something
else further upstream.

Yeah, I'm fairly confident now that it is the initial data being fed into
the rrd file that is faulty.  I'm still not sure what the initial "entry
point" of this bad data is, though, nor why it is happening.  I have a
feeling that once I determine where the entry point is, that will lead me to
the "why".

▸ quoted from Greg L Hubbard

I don't remember where you said that this data was coming from.  I know
there can be a problem with "rollovers" when a signed integer is used as a
counter and it grows to the point where the sign bit flips.  This can cause
a big jump in a reading if the software cannot handle the switch from
2,147,483,647 (hex 7FFFFFF) to the next value (hex 80000000) which flips the
sign bit for a signed 32 bit integer.  This has been a problem in the SNMP
world for YEARS.

Hrm, that has been something vaguely on my mind.  But I haven't really
thought of that as _the_ reason why, since I don't know why there would be
some sort of data rollover.  We're talking about load average and disk space
usage graphs that are showing invalid data.  I'm also curious why it would
have started all of a sudden, on two separate machines.  But it does seem
more and more like something like an integer rollover, or similar situation.

list Josh Luthman · Fri, 30 Nov 2007 13:38:55 -0500 ·

This is on the hobbitd incoming messages service being watched, correct
(judging by the screen capture)?  You said it started happening on two
different servers but now it is happening on additional graphs; what data
are these new graphs showing?

▸ quoted from Gary Baluha


On 11/30/07, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

On Nov 30, 2007 1:15 PM, Hubbard, Greg L <user-d970b5e56ec9@xymon.invalid> wrote:

 It sounds like you are zeroing in on the problem.  Based on your other
post (and this) it seems that the data is getting logged okay in the RRD,
and that data is being faithfully reproduced by the graphs.  The problem is
that the data itself has unexpected values.  So whatever is providing that
data to the RRD is either faulty, or is in turn being misled by something
else further upstream.

Yeah, I'm fairly confident now that it is the initial data being fed into
the rrd file that is faulty.  I'm still not sure what the initial "entry
point" of this bad data is, though, nor why it is happening.  I have a
feeling that once I determine where the entry point is, that will lead me to
the "why".

I don't remember where you said that this data was coming from.  I know
there can be a problem with "rollovers" when a signed integer is used as a
counter and it grows to the point where the sign bit flips.  This can cause
a big jump in a reading if the software cannot handle the switch from
2,147,483,647 (hex 7FFFFFF) to the next value (hex 80000000) which flips the
sign bit for a signed 32 bit integer.  This has been a problem in the SNMP
world for YEARS.

Hrm, that has been something vaguely on my mind.  But I haven't really
thought of that as _the_ reason why, since I don't know why there would be
some sort of data rollover.  We're talking about load average and disk space
usage graphs that are showing invalid data.  I'm also curious why it would
have started all of a sudden, on two separate machines.  But it does seem
more and more like something like an integer rollover, or similar situation.

-- 
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

list Ralph Mitchell · Fri, 30 Nov 2007 12:45:53 -0600 ·

▸ quoted from Gary Baluha

On Nov 30, 2007 12:27 PM, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

There wasn't anything useful in any of the logs, besides the usual stuff.
I turned on the --debug option, and here is a sample of the data for one of
the affected machines:

 2007-11-30 13:14:07 hobbitd_rrd: Got message 562165
@@status#562165|1196446447.724393|192.168.232.110||danno|disk|1196448247|yellow||yellow|1196053505|0||0||1196446447
2007-11-30 13:14:07 startpos 343968, fillpos 343968, endpos -1
2007-11-30 13:14:07 RRD update param 00: 'rrdupdate'
2007-11-30 13:14:07 RRD update param 01:
'/var/hobbit/data/rrd/danno/disk,dev,odm.rrd'
2007-11-30 13:14:07 RRD update param 02: '-t'
2007-11-30 13:14:07 RRD update param 03: 'pct:used'
2007-11-30 13:14:07 RRD update param 04: '1196446447:0:0'

I'm afraid I don't know how to interpret all of this, unfortunately.  I
get that the "param 03" means the graph is showing "percentage [disk space]
used", and that "param 01" means it is updating that specific rrd file.  And
I remember that "-t" in "param 02" is some rrdtool flag.  But I don't know
what the numbers in "param 04" mean.  I assume the first number is the #
seconds since 1970, and the second number is the current value, but I don't
know what the last number means.  Also, I'm not sure how to interpret all of
the data in the "@@status" line.

By the way, this excerpt is from a machine that is having the graph
display problems.  In this case, the data it is receiving is normal and
correct.  I'm waiting for another update when the data is incorrect.

The "-t" option specifies the template to use, which is in param03 -
"pct:used".  Param 04 is the actual data to insert, starting with the
date/time in seconds (i.e. 1196446447), then zero for the "pct" value, then
zero for the "used" value.

This stuff may not help much, but maybe it will show where the data goes
weird - i.e. is hobbitd_rrd being handed bad data, or does it get corrupted
later on.

Ralph Mitchell

list Gary Baluha · Fri, 30 Nov 2007 13:46:46 -0500 ·

▸ quoted from Josh Luthman

On Nov 30, 2007 1:38 PM, Josh Luthman <user-4c45a83f15cb@xymon.invalid> wrote:

This is on the hobbitd incoming messages service being watched, correct
(judging by the screen capture)?  You said it started happening on two
different servers but now it is happening on additional graphs; what data
are these new graphs showing?


The log except was from rrd-status.rrd, so from the hobbitd_rrd process.  We
have two physically separate Hobbit servers, on two physically different and
non-connected networks.  Let's call one of them Hobbit A, and the other
Hobbit B.  I first noticed a few graphs on Hobbit A were showing artifacts.
This was happening for a few hosts, and of those hosts, only a few of their
trend graphs, and not always the same trend graphs across the different
hosts.  I then noticed that the same thing was happening on Hobbit B.  Now
it appears more graphs are being affected on Hobbit A (I don't know about
Hobbit B, as A has the most monitored hosts, and so I'm first concerned with
that machine).  Graphs that were previous unaffected are now being
affected.  They are all showing the same symptom.  That is, spikes of data
with the exact same number:
5177668251... (it's a very big number, at least 50 digits).

list Gary Baluha · Fri, 30 Nov 2007 13:51:01 -0500 ·

▸ quoted from Ralph Mitchell

On Nov 30, 2007 1:45 PM, Ralph Mitchell <user-00a5e44c48c0@xymon.invalid> wrote:

On Nov 30, 2007 12:27 PM, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

There wasn't anything useful in any of the logs, besides the usual
stuff.  I turned on the --debug option, and here is a sample of the data for
one of the affected machines:

 2007-11-30 13:14:07 hobbitd_rrd: Got message 562165
@@status#562165|1196446447.724393|192.168.232.110||danno|disk|1196448247|yellow||yellow|1196053505|0||0||1196446447
2007-11-30 13:14:07 startpos 343968, fillpos 343968, endpos -1
2007-11-30 13:14:07 RRD update param 00: 'rrdupdate'
2007-11-30 13:14:07 RRD update param 01:
'/var/hobbit/data/rrd/danno/disk,dev,odm.rrd'
2007-11-30 13:14:07 RRD update param 02: '-t'
2007-11-30 13:14:07 RRD update param 03: 'pct:used'
2007-11-30 13:14:07 RRD update param 04: '1196446447:0:0'

I'm afraid I don't know how to interpret all of this, unfortunately.  I
get that the "param 03" means the graph is showing "percentage [disk space]
used", and that "param 01" means it is updating that specific rrd file.  And
I remember that "-t" in "param 02" is some rrdtool flag.  But I don't know
what the numbers in "param 04" mean.  I assume the first number is the #
seconds since 1970, and the second number is the current value, but I don't
know what the last number means.  Also, I'm not sure how to interpret all of
the data in the "@@status" line.

By the way, this excerpt is from a machine that is having the graph
display problems.  In this case, the data it is receiving is normal and
correct.  I'm waiting for another update when the data is incorrect.

The "-t" option specifies the template to use, which is in param03 -
"pct:used".  Param 04 is the actual data to insert, starting with the
date/time in seconds (i.e. 1196446447), then zero for the "pct" value,
then zero for the "used" value.

This stuff may not help much, but maybe it will show where the data goes
weird - i.e. is hobbitd_rrd being handed bad data, or does it get
corrupted later on.

That's what I'm hoping.  One other thing I noticed is that for the hosts
that have bad graphs, but where some graphs are still okay, the good graphs
have a gap of data precisely when the bad graphs have another data spike.

list Josh Luthman · Fri, 30 Nov 2007 13:57:49 -0500 ·

Another idea!  Since the two networks/Hobbit servers are unrelated theres
little in common.  What about the time?  Maybe some data from one set got
mixed in and it doesn't cooperate with the date and time?

▸ quoted from Gary Baluha


On 11/30/07, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

On Nov 30, 2007 1:45 PM, Ralph Mitchell <user-00a5e44c48c0@xymon.invalid> wrote:

On Nov 30, 2007 12:27 PM, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

There wasn't anything useful in any of the logs, besides the usual
stuff.  I turned on the --debug option, and here is a sample of the data for
one of the affected machines:

 2007-11-30 13:14:07 hobbitd_rrd: Got message 562165
@@status#562165|1196446447.724393|192.168.232.110||danno|disk|1196448247|yellow||yellow|1196053505|0||0||1196446447
2007-11-30 13:14:07 startpos 343968, fillpos 343968, endpos -1
2007-11-30 13:14:07 RRD update param 00: 'rrdupdate'
2007-11-30 13:14:07 RRD update param 01:
'/var/hobbit/data/rrd/danno/disk,dev,odm.rrd'
2007-11-30 13:14:07 RRD update param 02: '-t'
2007-11-30 13:14:07 RRD update param 03: 'pct:used'
2007-11-30 13:14:07 RRD update param 04: '1196446447:0:0'

I'm afraid I don't know how to interpret all of this, unfortunately.
I get that the "param 03" means the graph is showing "percentage [disk
space] used", and that "param 01" means it is updating that specific rrd
file.  And I remember that "-t" in "param 02" is some rrdtool flag.  But I
don't know what the numbers in "param 04" mean.  I assume the first number
is the # seconds since 1970, and the second number is the current value, but
I don't know what the last number means.  Also, I'm not sure how to
interpret all of the data in the "@@status" line.

By the way, this excerpt is from a machine that is having the graph
display problems.  In this case, the data it is receiving is normal and
correct.  I'm waiting for another update when the data is incorrect.

The "-t" option specifies the template to use, which is in param03 -
"pct:used".  Param 04 is the actual data to insert, starting with the
date/time in seconds ( i.e. 1196446447), then zero for the "pct" value,
then zero for the "used" value.

This stuff may not help much, but maybe it will show where the data goes
weird - i.e. is hobbitd_rrd being handed bad data, or does it get
corrupted later on.

That's what I'm hoping.  One other thing I noticed is that for the hosts
that have bad graphs, but where some graphs are still okay, the good graphs
have a gap of data precisely when the bad graphs have another data spike.

-- 
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

list Greg L Hubbard · Fri, 30 Nov 2007 13:09:18 -0600 ·

You know what -- it almost looks like you are getting a timestamp where
another data value is suspected.  It could be that the client is not
sending data reliably, and the field positions are off by one?

▸ quoted from Gary Baluha



	From: Gary Baluha [mailto:user-ae3e15c22de1@xymon.invalid] 
	Sent: Friday, November 30, 2007 12:31 PM
	To: user-ae9b8668bcde@xymon.invalid
	Subject: Re: [hobbit] strange graph behavior - random machines &
graphs
	
	
	On Nov 30, 2007 1:15 PM, Hubbard, Greg L <user-d970b5e56ec9@xymon.invalid>
wrote:
	

		It sounds like you are zeroing in on the problem.  Based
on your other post (and this) it seems that the data is getting logged
okay in the RRD, and that data is being faithfully reproduced by the
graphs.  The problem is that the data itself has unexpected values.  So
whatever is providing that data to the RRD is either faulty, or is in
turn being misled by something else further upstream.


	Yeah, I'm fairly confident now that it is the initial data being
fed into the rrd file that is faulty.  I'm still not sure what the
initial "entry point" of this bad data is, though, nor why it is
happening.  I have a feeling that once I determine where the entry point
is, that will lead me to the "why". 
	 

		I don't remember where you said that this data was
coming from.  I know there can be a problem with "rollovers" when a
signed integer is used as a counter and it grows to the point where the
sign bit flips.  This can cause a big jump in a reading if the software
cannot handle the switch from 2,147,483,647 (hex 7FFFFFF) to the next
value (hex 80000000) which flips the sign bit for a signed 32 bit
integer.  This has been a problem in the SNMP world for YEARS.


	Hrm, that has been something vaguely on my mind.  But I haven't
really thought of that as _the_ reason why, since I don't know why there
would be some sort of data rollover.  We're talking about load average
and disk space usage graphs that are showing invalid data.  I'm also
curious why it would have started all of a sudden, on two separate
machines.  But it does seem more and more like something like an integer
rollover, or similar situation.

list Gary Baluha · Fri, 30 Nov 2007 14:14:46 -0500 ·

Hmm, now this is interesting.  I have the Hobbit server (Hobbit A, from a
previous post) monitoring my work laptop (mostly so I can test out
client-side external scripts).  I have been taking my laptop home with me
this week, and I noticed that the time period while I'm *at* work, the
graphs are plotting valid data.  However, during the time that I turn my
laptop off and bring it home, to the time that I bring my laptop in the next
day and power it on, the graphs are showing the same invalid bogus data that
the other bad graphs are showing.

In other words, the rrd graphs are getting bogus data for a machine that
isn't even reporting to the Hobbit server!  Interesting, isn't it?

list Gary Baluha · Fri, 30 Nov 2007 15:07:23 -0500 ·

▸ quoted from Gary Baluha

On Nov 30, 2007 2:14 PM, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

Hmm, now this is interesting.  I have the Hobbit server (Hobbit A, from a
previous post) monitoring my work laptop (mostly so I can test out
client-side external scripts).  I have been taking my laptop home with me
this week, and I noticed that the time period while I'm *at* work, the
graphs are plotting valid data.  However, during the time that I turn my
laptop off and bring it home, to the time that I bring my laptop in the next
day and power it on, the graphs are showing the same invalid bogus data that
the other bad graphs are showing.

In other words, the rrd graphs are getting bogus data for a machine that
isn't even reporting to the Hobbit server!  Interesting, isn't it?

I'm definitely on to something with this.  I intentionally stopped the
Hobbit client process on one of the machines that has the bad RRD graphs for
about 20 minutes, and then started it back up.  Once the client reported the
latest data back, the RRD graph had another spike in it!

The other interesting thing is, the hobbitd-rrd --debug logging (
rrd-status.log) does *not* show any abnormal data.  It appears that Hobbit
is logging valid data to "rrdupdate".  So the bogus data appears to be
down-stream of this.

So it seems these data spikes *do* correspond to something: they correspond
to a lack of data reported back from the clients.  Furthermore, when I do an
rrd dump, I can see the bogus data in the "secondary_value" field:

-----Start of RRD dump-----
<!-- Round Robin Archives -->   <rra>
                <cf> AVERAGE </cf>
                <pdp_per_row> 1 </pdp_per_row> <!-- 300 seconds -->

                <params>
                <xff> 5.0000000000e-01 </xff>
                </params>
                <cdp_prep>
                        <ds>
                        <primary_value> 2.6110000000e+01 </primary_value>
                        <secondary_value> 5.1776682516e+170</secondary_value>
                        <value> 5.1776682516e+170 </value>
                        <unknown_datapoints> 0 </unknown_datapoints>
                        </ds>
                </cdp_prep>
                <database>
                        <!-- 2007-11-28 15:05:00 EST / 1196280300 -->
<row><v> 5.1776682516e+170 </v></row>
-----SNIP-----
-----End of RRD dump-----

The number 5.1776682516e+170 corresponds to the "517768..." large number
that the GPRINT portion of the rrd graphs are displaying.

Anyone have any ideas of what else to turn logging up on?

list Gary Baluha · Mon, 3 Dec 2007 13:55:06 -0500 ·

I'm almost certain now that this is a problem when Hobbit/rrdtool doesn't
receive data when it is expecting it.  In short, it seems that when a lack
of data occurs, instead of assigning NaN or 0, this huge number is inserted
into rrd database instead.  I'm still not sure where this number is
generated from, though.

list Josh Luthman · Mon, 3 Dec 2007 14:06:14 -0500 ·

Possibly the maximum value for that data entry type?

Josh

▸ quoted from Gary Baluha


On 12/3/07, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

I'm almost certain now that this is a problem when Hobbit/rrdtool doesn't
receive data when it is expecting it.  In short, it seems that when a lack
of data occurs, instead of assigning NaN or 0, this huge number is inserted
into rrd database instead.  I'm still not sure where this number is
generated from, though.

-- 
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

list Gary Baluha · Wed, 5 Dec 2007 11:53:26 -0500 ·

I am now completely convinced that the strange behavior of the graphs is due
to some bad data getting inserted into the .rrd database files.  The bad
data is always the same value: 5.1776682516e+170.  That's what the value
looks like when you do an rrddump on the .rrd database file.

I still have no idea where this value is coming from, but I have at least
determined how to fix these graphs.  I'm working on a script to do this, but
for now, I manually do an rrddump of the file, change all bogus values to
NaN (basically, searching for "e+1", since none of the values I trend
generally get that large, so I know these entries are just averaged values
of correct data and the 5.17... number), and then do an rrdrestore from the
modified xml file.

It would be nice to determine where this problem is coming from, though.

list Thomas Kern · Wed, 5 Dec 2007 11:57:42 -0500 ·

Could you give a short example of a bogus and a changed (NaN) entry,
just in case that is also happening to some of my data files?
 

/Thomas Kern
/XXX-XXX-XXXX (O)
/XXX-XXX-XXXX (M)

▸ quoted from Gary Baluha


 
	From: Gary Baluha [mailto:user-ae3e15c22de1@xymon.invalid] 
	Sent: Wednesday, December 05, 2007 11:53 AM
	To: user-ae9b8668bcde@xymon.invalid
	Subject: Re: [hobbit] strange graph behavior - random machines &
graphs
	
	
	I am now completely convinced that the strange behavior of the
graphs is due to some bad data getting inserted into the .rrd database
files.  The bad data is always the same value: 5.1776682516e+170.
That's what the value looks like when you do an rrddump on the .rrd
database file. 
	
	I still have no idea where this value is coming from, but I have
at least determined how to fix these graphs.  I'm working on a script to
do this, but for now, I manually do an rrddump of the file, change all
bogus values to NaN (basically, searching for "e+1", since none of the
values I trend generally get that large, so I know these entries are
just averaged values of correct data and the 5.17... number), and then
do an rrdrestore from the modified xml file.
	
	It would be nice to determine where this problem is coming from,
though.

list Gary Baluha · Wed, 5 Dec 2007 14:05:51 -0500 ·

cd hobbit_data_dir/host_machine
rrdtool dump clock.rrd > clock.xml

I know any number that shows up greater than "e+1nn" is bogus, so I search
for "e+1".

One of several bogus data lines:
<!-- 2007-11-26 19:00:00 EST / 1196121600 --> <row><v>
3.9551632477e+169</v></row>

Same line, changed to NaN (repeat for all affected lines):
<!-- 2007-11-26 19:00:00 EST / 1196121600 --> <row><v> NaN </v></row>

rrdtool restore clock.xml clock.rrd

▸ quoted from Thomas Kern


On Dec 5, 2007 11:57 AM, Kern, Thomas <user-f1ebafb19faf@xymon.invalid> wrote:

 Could you give a short example of a bogus and a changed (NaN) entry, just
in case that is also happening to some of my data files?


/Thomas Kern
/XXX-XXX-XXXX (O)
/XXX-XXX-XXXX (M)


*From:* Gary Baluha [mailto:user-ae3e15c22de1@xymon.invalid]
*Sent:* Wednesday, December 05, 2007 11:53 AM
*To:* user-ae9b8668bcde@xymon.invalid
*Subject:* Re: [hobbit] strange graph behavior - random machines & graphs

I am now completely convinced that the strange behavior of the graphs is
due to some bad data getting inserted into the .rrd database files.  The bad
data is always the same value: 5.1776682516e+170.  That's what the value
looks like when you do an rrddump on the .rrd database file.

I still have no idea where this value is coming from, but I have at least
determined how to fix these graphs.  I'm working on a script to do this, but
for now, I manually do an rrddump of the file, change all bogus values to
NaN (basically, searching for "e+1", since none of the values I trend
generally get that large, so I know these entries are just averaged values
of correct data and the 5.17... number), and then do an rrdrestore from
the modified xml file.

It would be nice to determine where this problem is coming from, though.

list Gary Baluha · Wed, 5 Dec 2007 15:55:38 -0500 ·

I wrote a script to clean up these bogus values.  Of course, if there are
trend graphs where numbers large enough for NNNe+1NN to be valid, the script
will have unexpected results.  To run the script, you need to "cd" into the
directory with the rrd files to be fixed.

▸ quoted from Gary Baluha


On Dec 5, 2007 2:05 PM, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

cd hobbit_data_dir/host_machine
rrdtool dump clock.rrd > clock.xml

I know any number that shows up greater than "e+1nn" is bogus, so I search
for "e+1".

One of several bogus data lines:
<!-- 2007-11-26 19:00:00 EST / 1196121600 --> <row><v> 3.9551632477e+169</v></row>

Same line, changed to NaN (repeat for all affected lines):
<!-- 2007-11-26 19:00:00 EST / 1196121600 --> <row><v> NaN </v></row>

rrdtool restore clock.xml clock.rrd


On Dec 5, 2007 11:57 AM, Kern, Thomas <user-f1ebafb19faf@xymon.invalid> wrote:

 Could you give a short example of a bogus and a changed (NaN) entry,
just in case that is also happening to some of my data files?


/Thomas Kern
/XXX-XXX-XXXX (O)
/XXX-XXX-XXXX (M)


*From:* Gary Baluha [mailto:user-ae3e15c22de1@xymon.invalid]
*Sent:* Wednesday, December 05, 2007 11:53 AM
*To:* user-ae9b8668bcde@xymon.invalid
*Subject:* Re: [hobbit] strange graph behavior - random machines &
graphs

I am now completely convinced that the strange behavior of the graphs is
due to some bad data getting inserted into the .rrd database files.  The bad
data is always the same value: 5.1776682516e+170.  That's what the value
looks like when you do an rrddump on the .rrd database file.

I still have no idea where this value is coming from, but I have at
least determined how to fix these graphs.  I'm working on a script to do
this, but for now, I manually do an rrddump of the file, change all bogus
values to NaN (basically, searching for "e+1", since none of the values I
trend generally get that large, so I know these entries are just averaged
values of correct data and the 5.17... number), and then do an
rrdrestore from the modified xml file.

It would be nice to determine where this problem is coming from, though.

Attachments (1)

attachment.sh application/octet-stream · 1.0 KB

list Josh Luthman · Wed, 5 Dec 2007 18:01:09 -0500 ·

Very cool, Gary.  Awesome to the max!

Thank you very much for sharing your experience and a fix!

Just for the record - when using this double check the rrdtool dir and the
user:group permissions - not everyone uses those same settings!

Again, thanks a lot!

▸ quoted from Gary Baluha


On 12/5/07, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

I wrote a script to clean up these bogus values.  Of course, if there are
trend graphs where numbers large enough for NNNe+1NN to be valid, the script
will have unexpected results.  To run the script, you need to "cd" into the
directory with the rrd files to be fixed.

On Dec 5, 2007 2:05 PM, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

cd hobbit_data_dir/host_machine
rrdtool dump clock.rrd > clock.xml

I know any number that shows up greater than "e+1nn" is bogus, so I
search for "e+1".

One of several bogus data lines:
<!-- 2007-11-26 19:00:00 EST / 1196121600 --> <row><v> 3.9551632477e+169</v></row>

Same line, changed to NaN (repeat for all affected lines):
<!-- 2007-11-26 19:00:00 EST / 1196121600 --> <row><v> NaN </v></row>

rrdtool restore clock.xml clock.rrd


On Dec 5, 2007 11:57 AM, Kern, Thomas <user-f1ebafb19faf@xymon.invalid > wrote:

 Could you give a short example of a bogus and a changed (NaN) entry,
just in case that is also happening to some of my data files?


/Thomas Kern
/XXX-XXX-XXXX (O)
/XXX-XXX-XXXX (M)


*From:* Gary Baluha [mailto:user-ae3e15c22de1@xymon.invalid]
*Sent:* Wednesday, December 05, 2007 11:53 AM
*To:* user-ae9b8668bcde@xymon.invalid
*Subject:* Re: [hobbit] strange graph behavior - random machines &
graphs

I am now completely convinced that the strange behavior of the graphs
is due to some bad data getting inserted into the .rrd database files.  The
bad data is always the same value: 5.1776682516e+170.  That's what the
value looks like when you do an rrddump on the .rrd database file.

I still have no idea where this value is coming from, but I have at
least determined how to fix these graphs.  I'm working on a script to do
this, but for now, I manually do an rrddump of the file, change all bogus
values to NaN (basically, searching for "e+1", since none of the values I
trend generally get that large, so I know these entries are just averaged
values of correct data and the 5.17... number), and then do an
rrdrestore from the modified xml file.

It would be nice to determine where this problem is coming from,
though.

-- 
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

list Gary Baluha · Thu, 6 Dec 2007 10:45:40 -0500 ·

▸ quoted from Josh Luthman

On Dec 5, 2007 6:01 PM, Josh Luthman <user-4c45a83f15cb@xymon.invalid> wrote:

Very cool, Gary.  Awesome to the max!

Thank you very much for sharing your experience and a fix!

Just for the record - when using this double check the rrdtool dir and the
user:group permissions - not everyone uses those same settings!

Yeah, I thought of that after I uploaded the script.  It's a simple enough
script that people can just modify the appropriate sections of code to suit
their environments.

▸ quoted from Josh Luthman

Again, thanks a lot!

On 12/5/07, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

I wrote a script to clean up these bogus values.  Of course, if there
are trend graphs where numbers large enough for NNNe+1NN to be valid, the
script will have unexpected results.  To run the script, you need to "cd"
into the directory with the rrd files to be fixed.

On Dec 5, 2007 2:05 PM, Gary Baluha < user-ae3e15c22de1@xymon.invalid> wrote:

cd hobbit_data_dir/host_machine
rrdtool dump clock.rrd > clock.xml

I know any number that shows up greater than "e+1nn" is bogus, so I
search for "e+1".

One of several bogus data lines:
<!-- 2007-11-26 19:00:00 EST / 1196121600 --> <row><v>


3.9551632477e+169 </v></row>

▸ signature


Same line, changed to NaN (repeat for all affected lines):
<!-- 2007-11-26 19:00:00 EST / 1196121600 --> <row><v> NaN </v></row>

rrdtool restore clock.xml clock.rrd


On Dec 5, 2007 11:57 AM, Kern, Thomas < user-f1ebafb19faf@xymon.invalid >

▸ quoted from Josh Luthman

wrote:

 Could you give a short example of a bogus and a changed (NaN)
entry, just in case that is also happening to some of my data files?


/Thomas Kern
/XXX-XXX-XXXX (O)
/XXX-XXX-XXXX (M)


*From:* Gary Baluha [mailto:user-ae3e15c22de1@xymon.invalid ]
*Sent:* Wednesday, December 05, 2007 11:53 AM
*To:* user-ae9b8668bcde@xymon.invalid
*Subject:* Re: [hobbit] strange graph behavior - random machines &
graphs

I am now completely convinced that the strange behavior of the
graphs is due to some bad data getting inserted into the .rrd database
files.  The bad data is always the same value: 5.1776682516e+170.
That's what the value looks like when you do an rrddump on the .rrd database
file.

I still have no idea where this value is coming from, but I have at
least determined how to fix these graphs.  I'm working on a script to do
this, but for now, I manually do an rrddump of the file, change all bogus
values to NaN (basically, searching for "e+1", since none of the values I
trend generally get that large, so I know these entries are just averaged
values of correct data and the 5.17... number), and then do an
rrdrestore from the modified xml file.

It would be nice to determine where this problem is coming from,
though.

--
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

list Josh Luthman · Thu, 6 Dec 2007 11:02:31 -0500 ·

You're fine - I just wanted it noted in the mailing list archive =)

▸ quoted from Gary Baluha


On 12/6/07, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

On Dec 5, 2007 6:01 PM, Josh Luthman <user-4c45a83f15cb@xymon.invalid> wrote:

Very cool, Gary.  Awesome to the max!

Thank you very much for sharing your experience and a fix!

Just for the record - when using this double check the rrdtool dir and
the user:group permissions - not everyone uses those same settings!

Yeah, I thought of that after I uploaded the script.  It's a simple enough
script that people can just modify the appropriate sections of code to suit
their environments.

Again, thanks a lot!

On 12/5/07, Gary Baluha <user-ae3e15c22de1@xymon.invalid > wrote:

I wrote a script to clean up these bogus values.  Of course, if there
are trend graphs where numbers large enough for NNNe+1NN to be valid, the
script will have unexpected results.  To run the script, you need to "cd"
into the directory with the rrd files to be fixed.

On Dec 5, 2007 2:05 PM, Gary Baluha < user-ae3e15c22de1@xymon.invalid> wrote:

cd hobbit_data_dir/host_machine
rrdtool dump clock.rrd > clock.xml

I know any number that shows up greater than "e+1nn" is bogus, so I
search for "e+1".

One of several bogus data lines:
<!-- 2007-11-26 19:00:00 EST / 1196121600 --> <row><v>
3.9551632477e+169 </v></row>

Same line, changed to NaN (repeat for all affected lines):
<!-- 2007-11-26 19:00:00 EST / 1196121600 --> <row><v> NaN
</v></row>

rrdtool restore clock.xml clock.rrd


On Dec 5, 2007 11:57 AM, Kern, Thomas < user-f1ebafb19faf@xymon.invalid >
wrote:

 Could you give a short example of a bogus and a changed (NaN)
entry, just in case that is also happening to some of my data files?


/Thomas Kern
/XXX-XXX-XXXX (O)
/XXX-XXX-XXXX (M)


*From:* Gary Baluha [mailto:user-ae3e15c22de1@xymon.invalid ]
*Sent:* Wednesday, December 05, 2007 11:53 AM
*To:* user-ae9b8668bcde@xymon.invalid
*Subject:* Re: [hobbit] strange graph behavior - random machines &
graphs

I am now completely convinced that the strange behavior of the
graphs is due to some bad data getting inserted into the .rrd database
files.  The bad data is always the same value: 5.1776682516e+170.
That's what the value looks like when you do an rrddump on the .rrd database
file.

I still have no idea where this value is coming from, but I have
at least determined how to fix these graphs.  I'm working on a script to do
this, but for now, I manually do an rrddump of the file, change all bogus
values to NaN (basically, searching for "e+1", since none of the values I
trend generally get that large, so I know these entries are just averaged
values of correct data and the 5.17... number), and then do an
rrdrestore from the modified xml file.

It would be nice to determine where this problem is coming from,
though.

--
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

-- 
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

list Gary Baluha · Fri, 7 Dec 2007 15:13:00 -0500 ·

This problem is definitely due to a lack of data.
There are two hobbitd_rrd processes that run, as we all know.  Why are there
two processes, again?

list Greg L Hubbard · Fri, 7 Dec 2007 14:19:49 -0600 ·

One listens to the "data" channel and the other listens to the "status"
channel?

▸ quoted from Gary Baluha



	From: Gary Baluha [mailto:user-ae3e15c22de1@xymon.invalid] 
	Sent: Friday, December 07, 2007 2:13 PM
	To: user-ae9b8668bcde@xymon.invalid
	Subject: Re: [hobbit] strange graph behavior - random machines &
graphs
	
	
	This problem is definitely due to a lack of data.
	There are two hobbitd_rrd processes that run, as we all know.
Why are there two processes, again?

list Gary Baluha · Tue, 11 Dec 2007 09:42:49 -0500 ·

I'm curious, is anyone else having this issue?  For reference, I have
attached a typical screen shot of one of the broken graphs; I think it's a
better representation than an earlier screen shot I provided.  If it's some
bug in hobbit and/or rrdtool, I would think I'm not the only one with this
problem.

Attachments (1)

attachment.png image/png · 17.4 KB

list Gary Baluha · Wed, 12 Dec 2007 10:14:48 -0500 ·

To try and narrow down the problem some more, I got rid of all the old,
unused rrd data files and directories in Hobbit's /data directory, and
removed all the bogus data from all of them.  Since the problem re-occurred,
I can fairly confidently say it has nothing to do with having too many rrd
files (hey, you never know).

▸ quoted from Gary Baluha


On Dec 11, 2007 9:42 AM, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

I'm curious, is anyone else having this issue?  For reference, I have
attached a typical screen shot of one of the broken graphs; I think it's a
better representation than an earlier screen shot I provided.  If it's some
bug in hobbit and/or rrdtool, I would think I'm not the only one with this
problem.

list Galen Johnson · Wed, 12 Dec 2007 10:23:15 -0500 ·

Gary,

 
Are you cross-posting to the rrd lists as well?  Someone there may be
better able to tell you where to look..or give some idea of something
Henrik may need to check in the code.  Most of us are rrd consumers only
and even fewer see the same issue.  I wish I were since I like debugging
troublesome issues...much more fun than just clicking "next" when
prompted or './configure && make && make install'.

 
=G=

▸ quoted from Gary Baluha

From: Gary Baluha [mailto:user-ae3e15c22de1@xymon.invalid] 
Sent: Wednesday, December 12, 2007 10:15 AM
To: user-ae9b8668bcde@xymon.invalid
Subject: Re: [hobbit] strange graph behavior - random machines & graphs

To try and narrow down the problem some more, I got rid of all the old,
unused rrd data files and directories in Hobbit's /data directory, and
removed all the bogus data from all of them.  Since the problem
re-occurred, I can fairly confidently say it has nothing to do with
having too many rrd files (hey, you never know). 

On Dec 11, 2007 9:42 AM, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

I'm curious, is anyone else having this issue?  For reference, I have
attached a typical screen shot of one of the broken graphs; I think it's
a better representation than an earlier screen shot I provided.  If it's
some bug in hobbit and/or rrdtool, I would think I'm not the only one
with this problem.

list Josh Luthman · Wed, 12 Dec 2007 10:28:35 -0500 ·

I think a lot of us are thinking "speak for yourself, Gary" =)

I, too, enjoy a good challenge some times...but not when I'm too busy with
other tasks - almost all the time =(

▸ quoted from Galen Johnson


On 12/12/07, Galen Johnson <user-87f955643e3d@xymon.invalid> wrote:

 Gary,


Are you cross-posting to the rrd lists as well?  Someone there may be
better able to tell you where to look..or give some idea of something Henrik
may need to check in the code.  Most of us are rrd consumers only and even
fewer see the same issue.  I wish I were since I like debugging troublesome
issues…much more fun than just clicking "next" when prompted or './configure
&& make && make install'.


=G=


*From:* Gary Baluha [mailto:user-ae3e15c22de1@xymon.invalid]
*Sent:* Wednesday, December 12, 2007 10:15 AM
*To:* user-ae9b8668bcde@xymon.invalid
*Subject:* Re: [hobbit] strange graph behavior - random machines & graphs


To try and narrow down the problem some more, I got rid of all the old,
unused rrd data files and directories in Hobbit's /data directory, and
removed all the bogus data from all of them.  Since the problem re-occurred,
I can fairly confidently say it has nothing to do with having too many rrd
files (hey, you never know).

On Dec 11, 2007 9:42 AM, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

I'm curious, is anyone else having this issue?  For reference, I have
attached a typical screen shot of one of the broken graphs; I think it's a
better representation than an earlier screen shot I provided.  If it's some
bug in hobbit and/or rrdtool, I would think I'm not the only one with this
problem.

-- 
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

list Galen Johnson · Wed, 12 Dec 2007 10:45:05 -0500 ·

No, no...you misunderstood...I'm not saying user-31c7974c1875@xymon.invalid asking if
he was pursuing that path as well.

 
=G=

▸ quoted from Josh Luthman

From: Josh Luthman [mailto:user-4c45a83f15cb@xymon.invalid] 
Sent: Wednesday, December 12, 2007 10:29 AM
To: user-ae9b8668bcde@xymon.invalid
Subject: Re: [hobbit] strange graph behavior - random machines & graphs

I think a lot of us are thinking "speak for yourself, Gary" =)

I, too, enjoy a good challenge some times...but not when I'm too busy
with other tasks - almost all the time =(

On 12/12/07, Galen Johnson <user-87f955643e3d@xymon.invalid> wrote:

Gary,

Are you cross-posting to the rrd lists as well?  Someone there may be
better able to tell you where to look..or give some idea of something
Henrik may need to check in the code.  Most of us are rrd consumers only
and even fewer see the same issue.  I wish I were since I like debugging
troublesome issues...much more fun than just clicking "next" when
prompted or './configure && make && make install'.

=G=

From: Gary Baluha [mailto:user-ae3e15c22de1@xymon.invalid] 
Sent: Wednesday, December 12, 2007 10:15 AM
To: user-ae9b8668bcde@xymon.invalid
Subject: Re: [hobbit] strange graph behavior - random machines & graphs

To try and narrow down the problem some more, I got rid of all the old,
unused rrd data files and directories in Hobbit's /data directory, and
removed all the bogus data from all of them.  Since the problem
re-occurred, I can fairly confidently say it has nothing to do with
having too many rrd files (hey, you never know). 

On Dec 11, 2007 9:42 AM, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

I'm curious, is anyone else having this issue?  For reference, I have
attached a typical screen shot of one of the broken graphs; I think it's
a better representation than an earlier screen shot I provided.  If it's
some bug in hobbit and/or rrdtool, I would think I'm not the only one
with this problem. 

-- 
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly. 
--- Henry Spencer

list Josh Luthman · Wed, 12 Dec 2007 11:01:55 -0500 ·

I see.

So from what I can tell we're up to the developer or an aid to fix this
issue?

Gary did provide a workaround, still...

Josh

On 12/12/07, Galen Johnson <user-87f955643e3d@xymon.invalid> wrote:

 No, no…you misunderstood…I'm not saying that at all…just asking if he was

▸ quoted from Galen Johnson

pursuing that path as well.


=G=


*From:* Josh Luthman [mailto:user-4c45a83f15cb@xymon.invalid]
*Sent:* Wednesday, December 12, 2007 10:29 AM
*To:* user-ae9b8668bcde@xymon.invalid
*Subject:* Re: [hobbit] strange graph behavior - random machines & graphs


I think a lot of us are thinking "speak for yourself, Gary" =)

I, too, enjoy a good challenge some times...but not when I'm too busy with
other tasks - almost all the time =(

On 12/12/07, *Galen Johnson* <user-87f955643e3d@xymon.invalid> wrote:

Gary,


Are you cross-posting to the rrd lists as well?  Someone there may be
better able to tell you where to look..or give some idea of something Henrik
may need to check in the code.  Most of us are rrd consumers only and even
fewer see the same issue.  I wish I were since I like debugging troublesome
issues…much more fun than just clicking "next" when prompted or './configure
&& make && make install'.


=G=


*From:* Gary Baluha [mailto:user-ae3e15c22de1@xymon.invalid]
*Sent:* Wednesday, December 12, 2007 10:15 AM
*To:* user-ae9b8668bcde@xymon.invalid
*Subject:* Re: [hobbit] strange graph behavior - random machines & graphs


To try and narrow down the problem some more, I got rid of all the old,
unused rrd data files and directories in Hobbit's /data directory, and
removed all the bogus data from all of them.  Since the problem re-occurred,
I can fairly confidently say it has nothing to do with having too many rrd
files (hey, you never know).

On Dec 11, 2007 9:42 AM, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

I'm curious, is anyone else having this issue?  For reference, I have
attached a typical screen shot of one of the broken graphs; I think it's a
better representation than an earlier screen shot I provided.  If it's some
bug in hobbit and/or rrdtool, I would think I'm not the only one with this
problem.


--
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

-- 
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

list Gary Baluha · Wed, 12 Dec 2007 13:45:35 -0500 ·

▸ quoted from Josh Luthman

 No, no…you misunderstood…I'm not saying that at all…just asking if he was
pursuing that path as well.

Yes, I have posted a thread on the RRD forum (though I'm not doing a full
cross-posting).  I haven't gotten any responses back yet, and besides I'm
not convinced it is a problem with RRD (or for that matter, Hobbit).

My main reason for updating this thread in the hobbit list is because much
of it is Hobbit related (even if Hobbit itself isn't the cause).

▸ quoted from Josh Luthman

=G=


*From:* Josh Luthman [mailto:user-4c45a83f15cb@xymon.invalid]
*Sent:* Wednesday, December 12, 2007 10:29 AM
*To:* user-ae9b8668bcde@xymon.invalid
*Subject:* Re: [hobbit] strange graph behavior - random machines & graphs


I think a lot of us are thinking "speak for yourself, Gary" =)

I, too, enjoy a good challenge some times...but not when I'm too busy with
other tasks - almost all the time =(

Heheh, since I seem to be the only one with this problem so far (or at
least, the only one to post about it), I'm doing most of the responding to
my own thread.  Oh well, I've learned quite a bit about how Hobbit and RRD
both work.

This is certainly a good challenge, since so many factors are involved in
this problem.  Unfortunately for me, we have people here that might try to
represent this problem as a reason to get off Hobbit and use an inferior
pay-ware monitoring solution.  Thankfully I figured out a work-around, so
I'm hoping that will buy me enough time to finally figure this one out.

...And hopefully in the mean time, I don't cause too much irrelevant noise
on this mailing list, responding to my own problem ;-)

▸ quoted from Galen Johnson

 On 12/12/07, *Galen Johnson* <user-87f955643e3d@xymon.invalid> wrote:

Gary,


Are you cross-posting to the rrd lists as well?  Someone there may be
better able to tell you where to look..or give some idea of something Henrik
may need to check in the code.  Most of us are rrd consumers only and even
fewer see the same issue.  I wish I were since I like debugging troublesome
issues…much more fun than just clicking "next" when prompted or './configure
&& make && make install'.


=G=


*From:* Gary Baluha [mailto:user-ae3e15c22de1@xymon.invalid]
*Sent:* Wednesday, December 12, 2007 10:15 AM
*To:* user-ae9b8668bcde@xymon.invalid
*Subject:* Re: [hobbit] strange graph behavior - random machines & graphs


To try and narrow down the problem some more, I got rid of all the old,
unused rrd data files and directories in Hobbit's /data directory, and
removed all the bogus data from all of them.  Since the problem re-occurred,
I can fairly confidently say it has nothing to do with having too many rrd
files (hey, you never know).

On Dec 11, 2007 9:42 AM, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

I'm curious, is anyone else having this issue?  For reference, I have
attached a typical screen shot of one of the broken graphs; I think it's a
better representation than an earlier screen shot I provided.  If it's some
bug in hobbit and/or rrdtool, I would think I'm not the only one with this
problem.


--
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

list Galen Johnson · Wed, 12 Dec 2007 14:06:02 -0500 ·

We appreciate that...too many list users forget to reply with their fix
(this is a global phenomena) and the debugging steps they did so the
next guy that goes looking is left wondering.

▸ quoted from Gary Baluha


 
=G=

 
From: Gary Baluha [mailto:user-ae3e15c22de1@xymon.invalid] 
Sent: Wednesday, December 12, 2007 1:46 PM
To: user-ae9b8668bcde@xymon.invalid
Subject: Re: [hobbit] strange graph behavior - random machines & graphs

 
	No, no...you misunderstood...I'm not saying user-31c7974c1875@xymon.invalid
asking if he was pursuing that path as well.


Yes, I have posted a thread on the RRD forum (though I'm not doing a
full cross-posting).  I haven't gotten any responses back yet, and
besides I'm not convinced it is a problem with RRD (or for that matter,
Hobbit). 

My main reason for updating this thread in the hobbit list is because
much of it is Hobbit related (even if Hobbit itself isn't the cause).
 

	=G=

	 
	From: Josh Luthman [mailto:user-4c45a83f15cb@xymon.invalid] 
	Sent: Wednesday, December 12, 2007 10:29 AM

	
	To: user-ae9b8668bcde@xymon.invalid
	Subject: Re: [hobbit] strange graph behavior - random machines &
graphs

	 
	I think a lot of us are thinking "speak for yourself, Gary" =)
	
	I, too, enjoy a good challenge some times...but not when I'm too
busy with other tasks - almost all the time =(

Heheh, since I seem to be the only one with this problem so far (or at
least, the only one to post about it), I'm doing most of the responding
to my own thread.  Oh well, I've learned quite a bit about how Hobbit
and RRD both work. 

This is certainly a good challenge, since so many factors are involved
in this problem.  Unfortunately for me, we have people here that might
try to represent this problem as a reason to get off Hobbit and use an
inferior pay-ware monitoring solution.  Thankfully I figured out a
work-around, so I'm hoping that will buy me enough time to finally
figure this one out. 

...And hopefully in the mean time, I don't cause too much irrelevant
noise on this mailing list, responding to my own problem ;-)
 

	On 12/12/07, Galen Johnson <user-87f955643e3d@xymon.invalid> wrote:

	Gary,

	 
	Are you cross-posting to the rrd lists as well?  Someone there
may be better able to tell you where to look..or give some idea of
something Henrik may need to check in the code.  Most of us are rrd
consumers only and even fewer see the same issue.  I wish I were since I
like debugging troublesome issues...much more fun than just clicking
"next" when prompted or './configure && make && make install'.

	 
	=G=

	 
	From: Gary Baluha [mailto:user-ae3e15c22de1@xymon.invalid] 
	Sent: Wednesday, December 12, 2007 10:15 AM
	To: user-ae9b8668bcde@xymon.invalid
	Subject: Re: [hobbit] strange graph behavior - random machines &
graphs

	 
	To try and narrow down the problem some more, I got rid of all
the old, unused rrd data files and directories in Hobbit's /data
directory, and removed all the bogus data from all of them.  Since the
problem re-occurred, I can fairly confidently say it has nothing to do
with having too many rrd files (hey, you never know). 

	On Dec 11, 2007 9:42 AM, Gary Baluha <user-ae3e15c22de1@xymon.invalid>
wrote:

	I'm curious, is anyone else having this issue?  For reference, I
have attached a typical screen shot of one of the broken graphs; I think
it's a better representation than an earlier screen shot I provided.  If
it's some bug in hobbit and/or rrdtool, I would think I'm not the only
one with this problem. 

	 
	-- 
	Josh Luthman
	Office: XXX-XXX-XXXX
	Direct: XXX-XXX-XXXX
	XXXX Wayne St
	Suite XXXX
	Troy, OH XXXXX
	
	Those who don't understand UNIX are condemned to reinvent it,
poorly. 
	--- Henry Spencer

list Josh Luthman · Wed, 12 Dec 2007 14:42:32 -0500 ·

I'm with Galen.  Thanks for the documentation =)

▸ quoted from Galen Johnson


On 12/12/07, Galen Johnson <user-87f955643e3d@xymon.invalid> wrote:

 We appreciate that…too many list users forget to reply with their fix
(this is a global phenomena) and the debugging steps they did so the next
guy that goes looking is left wondering.


=G=


*From:* Gary Baluha [mailto:user-ae3e15c22de1@xymon.invalid]
*Sent:* Wednesday, December 12, 2007 1:46 PM
*To:* user-ae9b8668bcde@xymon.invalid
*Subject:* Re: [hobbit] strange graph behavior - random machines & graphs


No, no…you misunderstood…I'm not saying that at all…just asking if he was
pursuing that path as well.


Yes, I have posted a thread on the RRD forum (though I'm not doing a full
cross-posting).  I haven't gotten any responses back yet, and besides I'm
not convinced it is a problem with RRD (or for that matter, Hobbit).

My main reason for updating this thread in the hobbit list is because much
of it is Hobbit related (even if Hobbit itself isn't the cause).


=G=


*From:* Josh Luthman [mailto:user-4c45a83f15cb@xymon.invalid]
*Sent:* Wednesday, December 12, 2007 10:29 AM


*To:* user-ae9b8668bcde@xymon.invalid
*Subject:* Re: [hobbit] strange graph behavior - random machines & graphs


I think a lot of us are thinking "speak for yourself, Gary" =)

I, too, enjoy a good challenge some times...but not when I'm too busy with
other tasks - almost all the time =(

 Heheh, since I seem to be the only one with this problem so far (or at
least, the only one to post about it), I'm doing most of the responding to
my own thread.  Oh well, I've learned quite a bit about how Hobbit and RRD
both work.

This is certainly a good challenge, since so many factors are involved in
this problem.  Unfortunately for me, we have people here that might try to
represent this problem as a reason to get off Hobbit and use an inferior
pay-ware monitoring solution.  Thankfully I figured out a work-around, so
I'm hoping that will buy me enough time to finally figure this one out.

...And hopefully in the mean time, I don't cause too much irrelevant noise
on this mailing list, responding to my own problem ;-)


  On 12/12/07, *Galen Johnson* <user-87f955643e3d@xymon.invalid> wrote:

Gary,


Are you cross-posting to the rrd lists as well?  Someone there may be
better able to tell you where to look..or give some idea of something Henrik
may need to check in the code.  Most of us are rrd consumers only and even
fewer see the same issue.  I wish I were since I like debugging troublesome
issues…much more fun than just clicking "next" when prompted or './configure
&& make && make install'.


=G=


*From:* Gary Baluha [mailto:user-ae3e15c22de1@xymon.invalid]
*Sent:* Wednesday, December 12, 2007 10:15 AM
*To:* user-ae9b8668bcde@xymon.invalid
*Subject:* Re: [hobbit] strange graph behavior - random machines & graphs


To try and narrow down the problem some more, I got rid of all the old,
unused rrd data files and directories in Hobbit's /data directory, and
removed all the bogus data from all of them.  Since the problem re-occurred,
I can fairly confidently say it has nothing to do with having too many rrd
files (hey, you never know).

On Dec 11, 2007 9:42 AM, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

I'm curious, is anyone else having this issue?  For reference, I have
attached a typical screen shot of one of the broken graphs; I think it's a
better representation than an earlier screen shot I provided.  If it's some
bug in hobbit and/or rrdtool, I would think I'm not the only one with this
problem.


--
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

-- 
Josh Luthman
Office: XXX-XXX-XXXX
Direct: XXX-XXX-XXXX
XXXX Wayne St
Suite XXXX
Troy, OH XXXXX

Those who don't understand UNIX are condemned to reinvent it, poorly.
--- Henry Spencer

list Gary Baluha · Wed, 19 Dec 2007 09:59:39 -0500 ·

It's interesting that it seems the CPU Load and Users and Processes graphs
are the graphs that are most likely to have this strange corruption.  I have
also seen it on a few Disk graphs, but not nearly as many as the other two
graphs.  Interestingly, the CPU Utilization, Network I/O, and TCP Connection
Times graphs have _never_ had this corruption.  I'd also like to say the
Memory Utilization graph hasn't had this issue either, though I can't recall
with complete certainty that that is the case.

I wonder what the main difference between the 3 graphs that do have the
issue is, and the 3 (possibly 4) graphs that have never exhibited this
issue.  There must be some physical difference, as I can't imagine it is all
do purely to luck...

list Gary Baluha · Mon, 7 Apr 2008 14:14:44 -0400 ·

I thought I'd revisit this issue again.  A new thought has occurred to
me...  Where does Hobbit generate the RRD files?  I wonder what parameters
Hobbit is using to pass to rrdtool, and if something there might be acting
funny with some of the data I'm providing to that Hobbit module.

▸ quoted from Gary Baluha


On Wed, Dec 19, 2007 at 10:59 AM, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

It's interesting that it seems the CPU Load and Users and Processes graphs
are the graphs that are most likely to have this strange corruption.  I have
also seen it on a few Disk graphs, but not nearly as many as the other two
graphs.  Interestingly, the CPU Utilization, Network I/O, and TCP Connection
Times graphs have _never_ had this corruption.  I'd also like to say the
Memory Utilization graph hasn't had this issue either, though I can't recall
with complete certainty that that is the case.

I wonder what the main difference between the 3 graphs that do have the
issue is, and the 3 (possibly 4) graphs that have never exhibited this
issue.  There must be some physical difference, as I can't imagine it is all


due purely to luck...

list Greg L Hubbard · Mon, 7 Apr 2008 13:27:00 -0500 ·

If you look through the source, the RRD support modules are easy to
spot.  If you are using custom graphs, then you need to review the ncv
method, or the "roll your own" method.
 
But, since "garbage in -> garbage out" you might be on the right path.

▸ quoted from Gary Baluha

 
GLH


	From: Gary Baluha [mailto:user-ae3e15c22de1@xymon.invalid] 
	Sent: Monday, April 07, 2008 1:15 PM
	To: user-ae9b8668bcde@xymon.invalid
	Subject: Re: [hobbit] strange graph behavior - random machines &
graphs
	
	
	I thought I'd revisit this issue again.  A new thought has
occurred to me...  Where does Hobbit generate the RRD files?  I wonder
what parameters Hobbit is using to pass to rrdtool, and if something
there might be acting funny with some of the data I'm providing to that
Hobbit module.
	
	
	On Wed, Dec 19, 2007 at 10:59 AM, Gary Baluha
<user-ae3e15c22de1@xymon.invalid> wrote:
	

		It's interesting that it seems the CPU Load and Users
and Processes graphs are the graphs that are most likely to have this
strange corruption.  I have also seen it on a few Disk graphs, but not
nearly as many as the other two graphs.  Interestingly, the CPU
Utilization, Network I/O, and TCP Connection Times graphs have _never_
had this corruption.  I'd also like to say the Memory Utilization graph
hasn't had this issue either, though I can't recall with complete
certainty that that is the case. 
		
		I wonder what the main difference between the 3 graphs
that do have the issue is, and the 3 (possibly 4) graphs that have never
exhibited this issue.  There must be some physical difference, as I
can't imagine it is all due purely to luck...

list Gary Baluha · Mon, 7 Apr 2008 14:35:46 -0400 ·

In my case, it seems the "garbage" that is going into the graphs is caused
by a lack of data, rather than actual bad data.  I'm specifically wondering
if there's some time interval mix-up that is causing the issue.  If anyone
would like to see a current example of one of the graphs with bad data, I'll
gladly provide a screenshot.

On Mon, Apr 7, 2008 at 2:27 PM, Hubbard, Greg L <user-d970b5e56ec9@xymon.invalid>

▸ quoted from Greg L Hubbard

wrote:

 If you look through the source, the RRD support modules are easy to
spot.  If you are using custom graphs, then you need to review the ncv
method, or the "roll your own" method.

But, since "garbage in -> garbage out" you might be on the right path.

GLH

*From:* Gary Baluha [mailto:user-ae3e15c22de1@xymon.invalid]
*Sent:* Monday, April 07, 2008 1:15 PM
*To:* user-ae9b8668bcde@xymon.invalid
*Subject:* Re: [hobbit] strange graph behavior - random machines & graphs

I thought I'd revisit this issue again.  A new thought has occurred to
me...  Where does Hobbit generate the RRD files?  I wonder what parameters
Hobbit is using to pass to rrdtool, and if something there might be acting
funny with some of the data I'm providing to that Hobbit module.

On Wed, Dec 19, 2007 at 10:59 AM, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

It's interesting that it seems the CPU Load and Users and Processes
graphs are the graphs that are most likely to have this strange corruption.
I have also seen it on a few Disk graphs, but not nearly as many as the
other two graphs.  Interestingly, the CPU Utilization, Network I/O, and TCP
Connection Times graphs have _never_ had this corruption.  I'd also like to
say the Memory Utilization graph hasn't had this issue either, though I
can't recall with complete certainty that that is the case.

I wonder what the main difference between the 3 graphs that do have the
issue is, and the 3 (possibly 4) graphs that have never exhibited this
issue.  There must be some physical difference, as I can't imagine it is all
due purely to luck...

list Gary Baluha · Mon, 7 Apr 2008 14:36:08 -0400 ·

Oh, and I'm using the NCV module.

▸ quoted from Gary Baluha


On Mon, Apr 7, 2008 at 2:35 PM, Gary Baluha <user-ae3e15c22de1@xymon.invalid> wrote:

In my case, it seems the "garbage" that is going into the graphs is caused
by a lack of data, rather than actual bad data.  I'm specifically wondering
if there's some time interval mix-up that is causing the issue.  If anyone
would like to see a current example of one of the graphs with bad data, I'll
gladly provide a screenshot.


On Mon, Apr 7, 2008 at 2:27 PM, Hubbard, Greg L <user-d970b5e56ec9@xymon.invalid>
wrote:

 If you look through the source, the RRD support modules are easy to
spot.  If you are using custom graphs, then you need to review the ncv
method, or the "roll your own" method.

But, since "garbage in -> garbage out" you might be on the right path.

GLH

*From:* Gary Baluha [mailto:user-ae3e15c22de1@xymon.invalid]
*Sent:* Monday, April 07, 2008 1:15 PM
*To:* user-ae9b8668bcde@xymon.invalid
*Subject:* Re: [hobbit] strange graph behavior - random machines &
graphs

I thought I'd revisit this issue again.  A new thought has occurred to
me...  Where does Hobbit generate the RRD files?  I wonder what parameters
Hobbit is using to pass to rrdtool, and if something there might be acting
funny with some of the data I'm providing to that Hobbit module.

On Wed, Dec 19, 2007 at 10:59 AM, Gary Baluha <user-ae3e15c22de1@xymon.invalid>
wrote:

It's interesting that it seems the CPU Load and Users and Processes
graphs are the graphs that are most likely to have this strange corruption.
I have also seen it on a few Disk graphs, but not nearly as many as the
other two graphs.  Interestingly, the CPU Utilization, Network I/O, and TCP
Connection Times graphs have _never_ had this corruption.  I'd also like to
say the Memory Utilization graph hasn't had this issue either, though I
can't recall with complete certainty that that is the case.

I wonder what the main difference between the 3 graphs that do have
the issue is, and the 3 (possibly 4) graphs that have never exhibited this
issue.  There must be some physical difference, as I can't imagine it is all
due purely to luck...

strange graph behavior - random machines & graphs 🔗 link

strange graph behavior - random machines & graphs