All Xymon rrd graphs suddenly haywire

list Japheth Cleaver
Tue, 7 Jul 2015 09:02:30 -0700
Message-Id: <user-5e4895f8f0bc@xymon.invalid>


On Tue, July 7, 2015 5:13 am, Steve B wrote:

Hi all,

This weekend, something happened with all our graphs. Every host's graphs
are either corrupted or distorted and the history is unusable. I have
checked all the usual places for graph logging, rrd-data.log and
rrd-status.log, and other system log files, but I am stumped as to where to
start fixing this.  We are looking at restoring the RRDs from a previous
snapshot, which may or may not work, but we would still like to solve this
mystery.

I have attached 2 screens but I do not know if these are viewable on the
mailing list.  It is hard to explain without them, but essentially there are
huge numbers in our graphs such as
3945789385793485793847593847593847593847593847593845793485739 and lots of
'?', and there is no usable history, just a straight line along the base
with one peak (or two) around the time this all happened (give or take a
day or two). If you try to zoom in, you get to a screen that just says
'zoom source image' and it's a black screen, but if you hover your mouse
over the screen you can find an area that is selectable and this shows a
close-up of the zoom area.

rrdtool info example (for the same screenshot host test):

filename = "disk,C.rrd"
rrd_version = "0003"
step = 300
last_update = 1436270189
ds[pct].type = "GAUGE"
ds[pct].minimal_heartbeat = 600
ds[pct].min = 0.0000000000e+00
ds[pct].max = 1.0000000000e+02
ds[pct].last_ds = "89"
ds[pct].value = 7.9210000000e+03
ds[pct].unknown_sec = 0
ds[used].type = "GAUGE"
ds[used].minimal_heartbeat = 600
ds[used].min = 0.0000000000e+00
ds[used].max = NaN
ds[used].last_ds = "28436524"
ds[used].value = 2.5308506360e+09
ds[used].unknown_sec = 0
rra[0].cf = "AVERAGE"
rra[0].rows = 576
rra[0].pdp_per_row = 1
rra[0].xff = 5.0000000000e-01
rra[0].cdp_prep[0].value = NaN
rra[0].cdp_prep[0].unknown_datapoints = 0
rra[0].cdp_prep[1].value = NaN
rra[0].cdp_prep[1].unknown_datapoints = 0
rra[1].cf = "AVERAGE"
rra[1].rows = 576
rra[1].pdp_per_row = 6
rra[1].xff = 5.0000000000e-01
rra[1].cdp_prep[0].value = 4.4500000000e+02
rra[1].cdp_prep[0].unknown_datapoints = 0
rra[1].cdp_prep[1].value = 1.4218146600e+08
rra[1].cdp_prep[1].unknown_datapoints = 0
rra[2].cf = "AVERAGE"
rra[2].rows = 576
rra[2].pdp_per_row = 24
rra[2].xff = 5.0000000000e-01
rra[2].cdp_prep[0].value = 2.0470000000e+03
rra[2].cdp_prep[0].unknown_datapoints = 0
rra[2].cdp_prep[1].value = 6.5402986560e+08
rra[2].cdp_prep[1].unknown_datapoints = 0
rra[3].cf = "AVERAGE"
rra[3].rows = 576
rra[3].pdp_per_row = 288
rra[3].xff = 5.0000000000e-01
rra[3].cdp_prep[0].value = 1.2727000000e+04
rra[3].cdp_prep[0].unknown_datapoints = 0
rra[3].cdp_prep[1].value = 4.0657944878e+09
rra[3].cdp_prep[1].unknown_datapoints = 0

This weekend we had a network intervention in that we moved some network
connections in one of the 2 data centers but there was no downtime as we
switched the network connectivity to the other data room. Our Xymon server
is running on a virtual server (RHEL5) and the version we are using is
4.3.19.

All graphs were fine until this point.  Any ideas?


This is quite odd.

There aren't too many things within the code path that could affect all
RRDs at once like that. Is it the same type of RRD (e.g., disk) for all
hosts, or all RRDs for all hosts? Did you see anything unusual in the
status history snapshots (if any) taken around this time?

If it happened to RRDs on both the 'data' and 'status' channels at once,
that narrows down the possibilities even further. I'm assuming you've
checked syslog for host-level events for the VM, but did anything odd
happen with the hypervisor around this time? General host memory
corruption is about the only thing I can think of that might cause this,
and I haven't run into that before.


Regarding fixing the issue, restoring from backups might be the easiest
option. If you want to save the surrounding data, your best bet might be
to export/reimport the RRD to remove the "spike". I've used
http://www.serveradminblog.com/2010/11/remove-spikes-from-rrd-graphs-howto/
in the past for doing this. It's easiest to script around the various
types of RRD files, using a similar max setting for all "la" graphs, for
example.
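
As a rough illustration of the export/reimport idea, here is a minimal
Python sketch (not the script from the linked howto). It takes the XML text
produced by "rrdtool dump" and replaces any archived value outside a
per-data-source range with NaN. The function name clip_spikes and the
limits mapping are hypothetical names made up for this example. The 0-100
range for "pct" comes from the rrdtool info output above; a ceiling for
"used" would have to be chosen by hand, since that data source has no max
defined. The sketch only touches the <row> values and leaves the header
and cdp_prep state alone.

```python
import math
import xml.etree.ElementTree as ET


def clip_spikes(xml_text, limits):
    """Replace out-of-range values in `rrdtool dump` XML with NaN.

    limits maps a data source name to an inclusive (low, high) range.
    Data sources not listed in limits are left untouched.
    Returns (new_xml_text, number_of_values_replaced).
    """
    root = ET.fromstring(xml_text)

    # The <v> cells in every <row> follow the order of the top-level
    # <ds> definitions, so collect the names in that order.
    names = [ds.findtext("name").strip() for ds in root.findall("ds")]

    replaced = 0
    for row in root.iter("row"):
        for name, cell in zip(names, row.findall("v")):
            if name not in limits:
                continue
            value = float(cell.text)
            if math.isnan(value):
                continue  # already unknown, nothing to clean
            low, high = limits[name]
            if value < low or value > high:
                cell.text = "NaN"
                replaced += 1

    return ET.tostring(root, encoding="unicode"), replaced
```

One way to use it, assuming the file has been dumped with
"rrdtool dump disk,C.rrd > disk,C.xml": read that file, call
clip_spikes(text, {"pct": (0.0, 100.0)}), write the result to a new XML
file, and load it into a fresh RRD with "rrdtool restore". Keep the
original .rrd until the restored graphs look right. Scripting it per RRD
type means one limits mapping per type can be reused across all hosts.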

I seem to recall someone posting a script they had used for this in the
past, but a search of the list archives hasn't revealed anything for me.


HTH,

-jc