All Xymon rrd graphs suddenly haywire

list Ralph Mitchell
Mon, 3 Aug 2015 12:56:18 -0400
Message-Id: <CAAEjoCXM_JNwdCMK37D1gbZZTFyc1LtkgquqH=4kWDRfXC=user-5ae0cd2d1bbf@xymon.invalid>

I'm getting another set of massive spikes.  Everything that updated in the
last couple of days has this:

rra[3].cf = "AVERAGE"
rra[3].rows = 864
rra[3].pdp_per_row = 288
rra[3].xff = 5.0000000000e-01
rra[3].cdp_prep[0].value = 2.8577631848e+94
rra[3].cdp_prep[0].unknown_datapoints = 0


Yep, "e+94".  The above sample is the clock offset on my Linux desktop.
The same goes for disk, inode, memory, users, etc.
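
A quick way to find which files carry values like this is to scan saved "rrdtool info" output for absurd magnitudes.  The following is only a rough sketch: the function name and the 1e12 cutoff are made up for illustration, and it works on the text of the output rather than calling rrdtool itself.

```python
import math
import re

# Matches lines such as: rra[3].cdp_prep[0].value = 2.8577631848e+94
INFO_LINE = re.compile(r"^(?P<key>\S+)\s*=\s*(?P<val>\S+)\s*$")


def suspicious_values(info_text, limit=1e12):
    """Return (key, value) pairs whose magnitude exceeds `limit`.

    `limit` is an arbitrary, hypothetical cutoff; choose one that
    suits the data being graphed.
    """
    hits = []
    for line in info_text.splitlines():
        m = INFO_LINE.match(line.strip())
        if not m or not m.group("key").endswith(".value"):
            continue
        try:
            value = float(m.group("val"))
        except ValueError:
            continue  # quoted strings such as "AVERAGE" are not numbers
        # NaN (unknown) is normal in an RRD, so only finite values count.
        if math.isfinite(value) and abs(value) > limit:
            hits.append((m.group("key"), value))
    return hits


SAMPLE = """\
rra[3].cf = "AVERAGE"
rra[3].rows = 864
rra[3].pdp_per_row = 288
rra[3].xff = 5.0000000000e-01
rra[3].cdp_prep[0].value = 2.8577631848e+94
rra[3].cdp_prep[0].unknown_datapoints = 0
"""

if __name__ == "__main__":
    for key, value in suspicious_values(SAMPLE):
        print(f"{key}: {value:.4e}")
```

Feeding it the output of "rrdtool info" for each file under the Xymon rrd directory would give a list of affected files to look at.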

This is also happening with homegrown tests.  For example, a script that
does this:

     /usr/bin/time -p openssl s_client -connect $COSERVER

to get the timing statistics for a connection.  It shows the same
ridiculously big numbers, but only for a few samples:

                        <!-- 2015-08-03 09:50:00 EDT / 1438609800 -->
<row><v> 1.2500000000e-02 </v><v> 1.0000000000e-02 </v><v> 0.0000000000e+00 </v></row>
                        <!-- 2015-08-03 09:55:00 EDT / 1438610100 -->
<row><v> 2.0411919169e+93 </v><v> 2.0411919169e+93 </v><v> 2.0411919169e+93 </v></row>
                        <!-- 2015-08-03 10:00:00 EDT / 1438610400 -->
<row><v> 2.0411919169e+93 </v><v> 2.0411919169e+93 </v><v> 2.0411919169e+93 </v></row>
                        <!-- 2015-08-03 10:05:00 EDT / 1438610700 -->
<row><v> 2.0411919169e+93 </v><v> 2.0411919169e+93 </v><v> 2.0411919169e+93 </v></row>
                        <!-- 2015-08-03 10:10:00 EDT / 1438611000 -->
<row><v> 5.6024046835e+89 </v><v> 5.6024046835e+89 </v><v> 5.6024046835e+89 </v></row>
                        <!-- 2015-08-03 10:15:00 EDT / 1438611300 -->
<row><v> 1.8600000000e-02 </v><v> 1.0000000000e-02 </v><v> 0.0000000000e+00 </v></row>
                        <!-- 2015-08-03 10:20:00 EDT / 1438611600 -->
<row><v> 2.0000000000e-02 </v><v> 1.0000000000e-02 </v><v> 0.0000000000e+00 </v></row>

I'm going to try the spike removal technique and see what happens.
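
For reference, here is a minimal sketch of one way spike removal can be done on dumped XML.  It is not necessarily the technique meant above: it simply replaces any <v> entry above a cutoff with NaN so the sample becomes unknown.  The function name and the 1e12 cutoff are assumptions for illustration, and only <v> row entries are touched; other elements in the dump are left alone.

```python
import re

# One archived sample inside a <row>, e.g. "<v> 2.0411919169e+93 </v>".
# \s also matches newlines, so rows wrapped across lines still match.
VALUE = re.compile(r"<v>\s*([^<\s]+)\s*</v>")


def blank_spikes(dump_xml, limit=1e12):
    """Return (new_xml, count) with oversized <v> entries set to NaN.

    `limit` is a hypothetical cutoff, not something rrdtool defines.
    """
    count = 0

    def fix(match):
        nonlocal count
        try:
            value = float(match.group(1))
        except ValueError:
            return match.group(0)  # leave anything unparseable alone
        # NaN compares False here, so existing unknowns are untouched.
        if abs(value) > limit:
            count += 1
            return "<v> NaN </v>"
        return match.group(0)

    return VALUE.sub(fix, dump_xml), count


ROWS = (
    "<row><v> 1.2500000000e-02 </v><v> 1.0000000000e-02 </v>"
    "<v> 0.0000000000e+00 </v></row>\n"
    "<row><v> 2.0411919169e+93 </v><v> 2.0411919169e+93 </v>"
    "<v> 2.0411919169e+93 </v></row>\n"
)

if __name__ == "__main__":
    cleaned, replaced = blank_spikes(ROWS)
    print(cleaned, end="")
    print("replaced:", replaced)
```

The cleaned text would then be written back to a file and loaded with "rrdtool restore".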

I'm not getting alerts saying disks are umpteen bazillion % full, which is
good.  It also suggests the stupid numbers are creeping in somewhere in the
RRD backend.

I'm also getting graphs from one server showing up under another server.
I.e. on a disk page that shows just the standard df listing with /, /usr,
/var, /home, /tmp I'm seeing graphs for filesystems that exist on a
different machine.  I don't know if that's related though.

Ralph Mitchell


On Fri, Jul 10, 2015 at 5:17 AM, Steve B <user-df463d3c0721@xymon.invalid> wrote:

It's pretty much all the graphs, GAUGE or not. We were on an older rrdtool,
so we upgraded it, and things seemed OK for hours. Then in the morning there
were some massive spikes, it spread like wildfire, and we are back where we
started.  Not all graphs are affected, though. It seems random, but it's
probably not.
Still looking at stats and graphs for the VM from inside and out. This is
all very frustrating!
Thanks

On Wed, Jul 8, 2015 at 3:14 AM, Jeremy Laidman <user-71895fb2e44c@xymon.invalid>
wrote:

On 7 July 2015 at 22:13, Steve B <user-df463d3c0721@xymon.invalid> wrote:

ds[pct].min = 0.0000000000e+00
ds[pct].max = 1.0000000000e+02
ds[pct].last_ds = "89"
ds[pct].value = 7.9210000000e+03

Well, this is interesting.  The "max" is set at 100%, but rrdtool
accepted a value of 7921%.
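
One other possible reading of these numbers, offered as an assumption rather than a confirmed diagnosis: in "rrdtool info" output, ds[...].value may be the running accumulator for the current step (the reading multiplied by the seconds elapsed so far) rather than a stored sample.  The figures are at least consistent with that, as this small check shows.  The function name is made up for illustration.

```python
def implied_elapsed_seconds(ds_value, last_ds):
    """Seconds into the current step, IF value = reading * elapsed.

    That relationship is an assumption about how rrdtool reports
    ds[...].value; it is not confirmed anywhere in this thread.
    """
    return ds_value / last_ds


last_ds = 89.0      # ds[pct].last_ds from the output above
ds_value = 7921.0   # ds[pct].value from the output above

elapsed = implied_elapsed_seconds(ds_value, last_ds)

if __name__ == "__main__":
    # The samples earlier in the thread are 5 minutes (300 s) apart,
    # so anything under 300 fits inside a single step.
    print(elapsed, elapsed < 300)
```

If that reading is right, this particular value is not a range violation, and the real damage would show up in the RRA rows instead.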

I've had this happen in the past, but haven't found the cause.  I ended
up doing a dump/edit/restore on each RRD file affected.  However, it's
only happened here and there.  I've never seen a widespread problem across
lots of graphs all at the same time.  My first thought was a counter-wrap
problem, but as I recall, I quickly eliminated that as a possible cause.
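
A rough sketch of the per-file round trip, assuming the usual "rrdtool dump" and "rrdtool restore" commands.  The file names are hypothetical, and the --range-check option on restore is included on the assumption that the DS min/max in the file are sensible; it only helps where those limits are actually set.

```python
import subprocess
from pathlib import Path


def repair_commands(rrd_path):
    """Build the dump and restore command lines for one RRD file.

    Output goes to new files next to the original (hypothetical
    naming), so the original is never overwritten.
    """
    rrd = Path(rrd_path)
    xml = rrd.with_name(rrd.stem + ".xml")
    fixed = rrd.with_name(rrd.stem + ".fixed.rrd")
    dump = ["rrdtool", "dump", str(rrd), str(xml)]
    # --range-check asks restore to enforce each DS's min/max.
    restore = ["rrdtool", "restore", "--range-check", str(xml), str(fixed)]
    return dump, restore


def run(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    dump, restore = repair_commands("disk.rrd")
    print(" ".join(dump))
    print(" ".join(restore))
    # Edit the XML by hand between the two steps, then:
    # run([restore])
```

Stopping Xymon's RRD updates for the file while it is being edited avoids losing samples written in between.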

Are all affected graphs of type GAUGE?

J