dropping/making blue checks not persistent when restarting
list Sven Schuster
list Paul Root
So, there are a couple of things here.
First, how are you disabling (bluing out) a test (what you call a check)? Are you checking “until OK”, or are you providing a time limit for the disable? Also, if the test is green, why would you want it disabled?
Second, why are you restarting xymon after a config change? All configuration files (except local-client.cfg) are re-read every 5 minutes.
Next, you say dropped tests reappear. Well, of course. If the client is providing the test to the server, the server is going to display it. If you don’t want a test in xymon, it has to be disabled at the source.
I don’t understand your second paragraph. Are you saying that you disable a test, wait 5-10 minutes, and the disabled test then remains blue after restarting xymon?
▸
From: Xymon [mailto:xymon-bounces at xymon.com] On Behalf Of Sven Schuster
Sent: Friday, May 19, 2017 7:55 AM
To: xymon at xymon.com
Subject: [Xymon] dropping/making blue checks not persistent when restarting
Hello everybody,
recently I've been seeing a strange issue on xymon server. When I make a check blue and shortly after xymon gets restarted due to configuration updates, that blue check will be green again afterwards. The same thing happens when a check is dropped and xymon gets restarted directly after that: the dropped check reappears.
If you wait some amount of time before restarting, say 5-10 minutes, the problem won't appear and everything will be fine. I also sync'ed on the server directly after making a check blue and before restarting (to avoid data not being written to disk for some strange reason), which unfortunately did not help.
Environment is xymon 4.3.27 on Debian jessie. Because of this problem, xymon was recently updated to 4.3.28, but the problem appears in 4.3.28, too. The server was upgraded from Debian wheezy to jessie a few weeks ago. On wheezy, xymon 4.3.27 was in use but didn't show this behaviour.
Did anybody notice such an odd behaviour or maybe have any thoughts regarding possible causes?
Thanks in advance,
Sven
list Jeremy Laidman
Make sure xymond can (and does) update its checkpoint file. See the man page for more info about the checkpoint file.
J
list Sven Schuster
Sending USR1 to xymond before restarting makes disabling/dropping tests "survive" the restart in every case.
Kind regards,
list Sven Schuster
▸
Kind regards,
list Japheth Cleaver
▸
On 5/22/2017 1:55 AM, Sven Schuster wrote:
Sorry, I should have been a bit more precise in this regard:
- Tests are disabled via enable/disable from the Administration menu for some period of time, e.g. 2 hours, without "until OK" checked. It doesn't matter whether you're blueing out a green test (e.g. planned downtime) or a red one. The problem remains the same.
- The restart is done to make changes visible immediately, so the change can be checked right after applying it.
- Dropped tests belong to checks (or hosts) which don't exist anymore, so no data will be coming in for the dropped checks/hosts.
Yes, when waiting for some time before restarting after disabling or dropping a check, that change will "survive" the restart. As pointed out in Jeremy Laidman's post, this indeed seems to be due to the checkpoint interval, which is 600 seconds in the local configuration.
Kind regards,
Sven
*Sent:* Friday, 19 May 2017, 16:02
*From:* "Root, Paul T" <user-76fdb6883669@xymon.invalid>
*To:* "'Sven Schuster'" <user-ce24f0ed68fd@xymon.invalid>, "xymon at xymon.com" <xymon at xymon.com>
*Subject:* RE: [Xymon] dropping/making blue checks not persistent when restarting
Hi Sven,
This behavior would seem to point in the direction of the checkpoint file not being written out properly on shutdown, especially if it's working fine during the normal checkpointing process (e.g., waiting 600 seconds before the restart), and could be a latent bug (or at least a missing error message).
Can you set xymond to --debug mode (or send it a USR2 signal) and then shut down/restart the process after this change? If shutting down, can you take a quick look at the checkpoint file to see that it's been updated at the moment of shutdown? Depending on the host in question, you can also search for the test that should "no longer be there" (it's just a simple text file format).
The same routine is called at shutdown as is called during the periodic interval checkpointing, except that we wait synchronously for it to complete, precisely to avoid this type of concern. That doesn't mean there isn't still an issue there, though.
Regards,
-jc
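The two checks suggested above (was the checkpoint file updated at shutdown, and does it still mention the test) could be scripted roughly as follows. This is a sketch only: the helper names are hypothetical, `stat -c %Y` assumes GNU stat as shipped on Debian, and the search assumes that host and test names appear as plain text on the same line of the checkpoint file.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting a xymond checkpoint file after shutdown.

# Print how many seconds ago FILE was last modified (GNU stat assumed).
# Usage: checkpoint_age FILE
checkpoint_age() {
    now=$(date +%s)
    mtime=$(stat -c %Y "$1") || return 1
    echo $((now - mtime))
}

# Print the number of lines in FILE that mention both HOST and TEST.
# Assumption: both names appear literally on the same line.
# Usage: checkpoint_mentions FILE HOST TEST
checkpoint_mentions() {
    grep -F -- "$2" "$1" | grep -c -F -- "$3"
}
```

Right after stopping the service, `checkpoint_age` on the checkpoint file should print a small number if the file was rewritten at shutdown, and `checkpoint_mentions` should print 0 for a test that was dropped.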
list Sven Schuster
Hi Japheth,
▸
Hi Sven, This behavior would seem to point in the direction of the checkpoint file not being written out properly on shutdown, especially if it's working fine during the normal checkpointing process (eg, waiting 600 seconds before the restart) and could be a latent bug (or at least a missing error message).
that was exactly my thought when taking a look at the source code. The routine for writing the checkpoint file should be called at shutdown, too...
▸
Can you set xymond to --debug mode (or send it -USR2 signal) and then shutdown/restart the process after this change? If shutting down, you can take a quick poke at the checkpoint file to see that it's been updated at the moment of shutdown? Depending on the host in question, you can also search for the test that should "no longer be there" (it's just a simple text file format).
...and indeed, it *is* called:
10410 2017-05-23 08:00:48.870364 -> save_checkpoint
10410 2017-05-23 08:00:48.963874 <- save_checkpoint
These were the last lines of the logfile when stopping xymon. Note that in
this case, I *stopped* the xymon service (to be able to take a look at the
checkpoint file while xymon is not running). Timestamp of checkpoint file
was updated, the test I disabled still was disabled when I started xymon
again. Strange.
So I did some further testing. It revealed that on Debian with systemd being
used for starting/stopping services, the restart option of the default SysV
initscript isn't used. Instead, systemd calls the initscript with option
stop (which TERMs the xymonlaunch process), waits some amount of time (which
is probably given by the RestartSec or RestartUSec parameter, see
systemd.service(5)), and then calls the initscript again with option start.
Seems like the time between stop and start (which is 100ms in the local
environment, probably default value) is not long enough for the old,
terminating xymond process to completely write the checkpoint file (which is
roughly 35 MB here with config changes and disabling/dropping tests
happening quite often and independently). In xymond.c/save_checkpoint it
turns out that the checkpoint file is written to a temporary file with a
timestamp in the filename. That temp file is renamed to the real checkpoint
file later.
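The write-then-rename pattern described here can be illustrated in shell. This is only an analogy for what save_checkpoint does in C, not a copy of it. The function name and the temp file naming are hypothetical.

```shell
#!/bin/sh
# Write new content to a temporary file next to TARGET, then rename it over
# TARGET. Readers see either the complete old file or the complete new one.
# Usage: atomic_write TARGET      (new content is read from stdin)
atomic_write() {
    target=$1
    tmp="$target.$(date +%s).$$"        # timestamp (and PID) in the temp name
    cat > "$tmp" || { rm -f "$tmp"; return 1; }
    mv -f "$tmp" "$target"              # a rename within one filesystem is atomic
}
```

The consequence matters for the restart race: a process that opens the checkpoint file before the rename has happened reads the complete previous version, with no error to indicate that a newer one is still being written.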
With that short amount of time between stopping and starting it seems like
the new xymond process, which is starting in the meantime, just reads an old
version of the checkpoint file.
To solve this issue, on Linux systems using systemd one might (and of course
should ;)) use a real systemd service file with RestartSec set to a sane
amount (e.g. 1s like in the old SysV initscript).
As a quick fix I added a "sleep 1" in the initscript:
--- xymon.orig 2012-06-27 21:14:29.000000000 +0200
+++ xymon 2017-05-23 10:28:51.983171661 +0200
@@ -49,6 +49,7 @@
   "stop")
     log_daemon_msg "Stopping $DESC" "$NAME"
     start-stop-daemon --exec $DAEMON --pidfile $PIDFILE --stop --retry 5
+    sleep 1
     log_end_msg $?
     ;;
That way restarting xymon works as expected for me.
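One side effect of placing the sleep there: `log_end_msg $?` now receives the exit status of `sleep` instead of that of `start-stop-daemon`. A small sketch that keeps the original status, using a hypothetical wrapper name:

```shell
#!/bin/sh
# Run a stop command, remember its exit status, pause, then return that
# status so a following "log_end_msg $?" reports the stop result.
# Usage: stop_then_settle SECONDS COMMAND [ARGS...]
stop_then_settle() {
    delay=$1
    shift
    "$@"                # the actual stop command
    status=$?
    sleep "$delay"      # give the old process time to finish writing
    return "$status"
}
```

In the initscript this would read `stop_then_settle 1 start-stop-daemon --exec $DAEMON --pidfile $PIDFILE --stop --retry 5`, followed by the unchanged `log_end_msg $?`.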
Yet that might leave the (small) chance of that timespan not being long
enough in big installations or under high load. That in turn could be a
purely hypothetical problem, as this behaviour didn't occur with the old
initscript (or at least no one noticed).
A clean solution would be to provide a way to do a clean shutdown of the
xymon server which does not return before the old processes have really
exited (however that might be implemented), so that the asynchronous nature
of the current stop (sending a TERM to xymonlaunch) is no longer a concern.
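One possible shape for such a blocking stop, sketched in shell with a hypothetical helper name. It polls with `kill -0` until the given PID is gone. How the PID of the old xymond is found is left open here and would be specific to the installation.

```shell
#!/bin/sh
# Block until the process PID no longer exists, or until TIMEOUT seconds
# have passed. Returns 0 once the process is gone, 1 on timeout.
# Note: kill -0 also succeeds for a zombie not yet reaped by its parent.
# Usage: wait_for_exit PID [TIMEOUT_SECONDS]
wait_for_exit() {
    pid=$1
    timeout=${2:-30}
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A stop action built on this would send TERM to xymonlaunch as before and then call `wait_for_exit` on the old xymond PID, so that start only runs once the final checkpoint has been written and the process has ended.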
That's at least an explanation, with possible ways of solving the
behaviour, that seems to make sense based on some tests and a few
short looks at the source, so please correct me if I'm wrong ;)
Kind regards,
Sven