Xymon Mailing List Archive

Always purple history after time shift on server - how to fix

Andrey Chervonets · Thu, 10 Mar 2016 11:44:32 +0200
I would like to share some hints on resolving a history reporting problem after a big time shift (about 4 hours) on the monitoring server.
Maybe it will help someone else.

It happened some months ago, but I only found time to fix it today.
What happened:
1. The time on the monitoring host jumped forward by 4 hours.
2. As a result, all metrics reported Purple status (this is intended behaviour, but it would be nice if Xymon detected a big time shift and adapted its reporting in some way).
3. It was a problem at the virtual host provider. I reported it and the time was set back to the correct value.
4. To fix current reporting I cleaned some files under xymon/logs or acks (I really do not remember which ones right now). This reset the last status duration information, but the current values for all metrics became correct.
5. Everything became OK, except that when I checked the history for a metric ( ...xymon-cgi/history.sh? ...), for some metrics Xymon always reported Purple for the last event (since the time of the incident).


This affected only some metrics (not all). I also had a second monitoring server with the same information (without the time shift incident), so I was able to live with it for some months.

Solution: today I fixed the reporting problem with the following steps, which have to be executed for every host-metric pair that has the problem.

We have to work with 2 files:
1) The host history file, like hist/HOSTNAME
# here we should find records with negative duration values like:
svcs 1435410898 1435426055 -15157 gr pu 1
who 1435410899 1435426055 -15156 gr pu 1
msgs 1435410899 1435426055 -15156 gr pu 1
netstat 1435410899 1435426055 -15156 gr pu 1
memory 1435411034 1435426055 -15021 ye pu 2
uptime 1435411140 1435426055 -14915 gr pu 1
procs 1435411145 1435426055 -14910 gr pu 1
disk 1435411150 1435426055 -14905 ye pu 2
cpu 1435411222 1435426055 -14833 gr pu 1

# and drop them
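
As a minimal sketch (the function name is made up for illustration), the negative records in one host history file can be listed by testing the fourth field directly instead of grepping for " -". In the sample records above that field equals the second field minus the third, e.g. 1435410898 - 1435426055 = -15157 seconds, a bit over 4 hours, which matches the size of the time shift. The sketch only assumes the column layout shown in those samples:

```shell
# list_negative_host_records FILE   (hypothetical helper, not part of Xymon)
# Prints the records of a host history file whose 4th field (duration)
# is a negative integer.
list_negative_host_records() {
    awk 'NF >= 4 && $4 ~ /^-[0-9]+$/' "$1"
}

# Example call (HOSTNAME is a placeholder):
# list_negative_host_records hist/HOSTNAME
```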

2) The service history file, like hist/HOSTNAME.svc
# again, find records with negative duration values like:
Sat Jun 27 20:27:35 2015 purple 1435426055 -15157

# and drop the record(s); there should really be just one

To fix the reporting for just one service, it is actually enough to drop the negative duration records from the service history file only (tested).
But I do not see any reason to keep such records in the host history file, so I delete them from that file too.
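
For a single service history file the removal could look like the sketch below. It assumes the layout shown above (duration as the 8th field) and that nothing else is writing to the file at that moment. The function name and the backup directory are made up for illustration; the backup is placed outside hist/ so that no extra files appear there.

```shell
# drop_negative_svc_records FILE BACKUP_DIR   (hypothetical helper)
# Copies FILE into BACKUP_DIR, then rewrites FILE without the records
# whose 8th field (duration) is a negative integer.
drop_negative_svc_records() {
    file=$1
    backup_dir=$2
    backup="$backup_dir/$(basename "$file")"
    mkdir -p "$backup_dir" || return 1
    cp -p "$file" "$backup" || return 1
    # Writing with ">" truncates the existing file in place, so its
    # owner and permissions stay as they were.
    awk '!(NF >= 8 && $8 ~ /^-[0-9]+$/)' "$backup" > "$file"
}

# Example call (paths are placeholders):
# drop_negative_svc_records hist/HOSTNAME.svc /tmp/hist-backup
```

Lines with fewer than 8 fields are kept as they are, so a record without a duration value is not affected.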

How to automate the process:
# step 1: find the affected records in the host history files
find hist/ -type f -print0 | xargs -0 grep " -" | awk '{print $1" :"$4}' | grep ":-"

# output like:
...
hist/idc-oracle03.msc-sh.local:ssh :-14862
hist/idc-oracle03.msc-sh.local:dblock :-15012
hist/idc-oracle03.msc-sh.local:dbrec :-15012
hist/idc-oracle03.msc-sh.local:dbup :-15011
hist/idc-oracle03.msc-sh.local:dbext :-14989
...

# step 2: find the affected records in the service history files
find hist/ -type f -print0 | xargs -0 grep " -" | awk '{print $1" :"$8}' | grep ":-"

# output like:
..
hist/idc-oracle03,domain.com.dbrec:Sat :-15012
hist/gdc-oracle03,domain.com.dbup:Sat :-15136
hist/idc-oracle01,domain.com.disk:Sat :-14961
hist/gdc-oracle01,domain.com.dbaud:Thu :-26793
hist/gdc-oracle01,domain.com.dbaud:Sat :-14940
..

The record removal can then be automated too.
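
One possible way to automate the removal is sketched here; it is an illustration under stated assumptions, not a tested Xymon tool. It assumes that host history records have 7 fields with the duration in field 4 and service history records have 8 fields with the duration in field 8 (as in the samples above), that file names contain no newlines, that Xymon is not writing to the files while it runs, and that the backup directory lies outside hist/. The name clean_hist is made up.

```shell
# clean_hist HIST_DIR BACKUP_DIR   (hypothetical helper, not part of Xymon)
clean_hist() {
    hist_dir=$1
    backup_dir=$2
    mkdir -p "$backup_dir" || return 1
    tmp=$(mktemp) || return 1
    find "$hist_dir" -type f | while IFS= read -r file; do
        # Keep every line except those that look like a host history record
        # (7 fields, duration in field 4) or a service history record
        # (8 fields, duration in field 8) with a negative integer duration.
        awk '!((NF == 7 && $4 ~ /^-[0-9]+$/) || (NF == 8 && $8 ~ /^-[0-9]+$/))' \
            "$file" > "$tmp" || continue
        # Touch the file only if the filter actually removed something.
        if ! cmp -s "$file" "$tmp"; then
            cp -p "$file" "$backup_dir/" && cat "$tmp" > "$file" &&
                echo "cleaned: $file"
        fi
    done
    rm -f "$tmp"
}

# Example call (paths are placeholders):
# clean_hist hist /tmp/hist-backup
```

Files without negative duration records are left untouched, and every changed file is first copied to the backup directory, so the original state can be restored by copying the backups back.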


Best regards,

Andrey Chervonets
SIA CoMinder
http://www.cominder.eu/