Xymon Mailing List Archive

dropping/making blue checks not persistent when restarting

7 messages in this thread

list Sven Schuster · Fri, 19 May 2017 14:55:17 +0200 ·
Hello everybody,

Recently I've been seeing a strange issue on a xymon server. When I make a check blue and xymon is restarted shortly afterwards due to configuration updates, that blue check is green again afterwards. The same thing happens when a check is dropped and xymon is restarted directly after that: the dropped check reappears.
If you wait some time before restarting, say 5-10 minutes, the problem doesn't appear and everything is fine. I also ran sync on the server directly after making a check blue and before restarting (to rule out data not being written to disk for some strange reason), which unfortunately did not help.
The environment is xymon 4.3.27 on Debian jessie. Xymon was recently updated to 4.3.28 because of this problem, but the problem appears in 4.3.28, too. This server was upgraded from Debian wheezy to jessie a few weeks ago. On wheezy, xymon 4.3.27 was in use but didn't show this behaviour.

Has anybody noticed such odd behaviour, or does anyone have any thoughts regarding possible causes?

Thanks in advance,
Sven
list Paul Root · Fri, 19 May 2017 14:02:43 +0000 ·
So, there’s a couple things here.

First, how are you disabling (bluing out) a test (what you call a check)? Are you checking the “until OK” box, or are you providing a time limit for the disable? Also, if the test is green, why would you want it disabled?

Second, why are you restarting xymon after a config change? All configuration files are re-read (except local-client.cfg) every 5 minutes.

Next, you say dropped tests reappear. Well of course. If the client is providing the test to the server, the server is going to display it. If you don’t want a test in xymon, it has to be disabled at the source.

I don’t understand your second paragraph. Are you saying that you disable a test and then wait 5-10 minutes, and the disabled test will remain blue after restarting xymon?
list Jeremy Laidman · Sat, 20 May 2017 01:18:38 +1000 ·
Make sure xymond can (and does) update its checkpoint file. See the man
page for more info about the checkpoint file.

J
list Sven Schuster · Mon, 22 May 2017 10:54:45 +0200 ·
Hi Jeremy,

thanks for pointing that out.
After some initial testing, it seems like it's indeed due to the checkpoint interval, which is 600 seconds in the local configuration.
Sending USR1 to xymond before restarting makes disabling/dropping tests "survive" the restart in every case.
Still curious why this happens just now after upgrading the OS from Debian wheezy to jessie with otherwise same configuration...

Kind regards,
Sven
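A minimal sketch of the USR1 workaround described in this message, assuming GNU `stat` and a checkpoint file path that is only a guess here (`/var/lib/xymon/tmp/xymond.chk`); the helper names `file_mtime` and `wait_for_update` are hypothetical. It asks xymond for a checkpoint and waits until the file's modification time changes before restarting.

```shell
#!/bin/sh
# Sketch only: force a checkpoint and wait for the file to change before
# restarting. The checkpoint path and helper names are assumptions.

# Print the modification time of file $1 in seconds since the epoch
# (GNU stat), or 0 if the file cannot be read.
file_mtime() {
    stat -c %Y "$1" 2>/dev/null || echo 0
}

# Wait until the mtime of file $1 differs from $2, for at most $3 seconds.
# Returns 0 once the file has changed, 1 on timeout.
wait_for_update() {
    file=$1 old=$2 remaining=$3
    while [ "$(file_mtime "$file")" = "$old" ]; do
        [ "$remaining" -le 0 ] && return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Possible usage (left as comments so nothing is signalled by accident):
#   CHKFILE=/var/lib/xymon/tmp/xymond.chk     # assumed path, check your setup
#   before=$(file_mtime "$CHKFILE")
#   pkill -USR1 -x xymond                     # ask xymond to checkpoint now
#   wait_for_update "$CHKFILE" "$before" 30 && service xymon restart
```

The mtime comparison has one-second resolution, so a checkpoint written in the same second as the `before` reading would go unnoticed; with a 600 second interval that should be rare.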
list Sven Schuster · Mon, 22 May 2017 10:55:05 +0200 ·
Sorry, I should have been a bit more precise in this regard:
- tests are disabled via enable/disable from the Administration menu for some period of time, e.g. 2 hours, without "until OK" checked. It doesn't matter whether you're blueing out a green test (e.g. planned downtime) or a red one. The problem remains the same.
- the restart is done to make changes visible immediately, so a change can be checked right after applying it
- dropped tests belong to checks (or hosts) which don't exist anymore, so no reports will be coming in for the dropped checks/hosts

Yes, when waiting for some time before restarting after disabling or dropping a check, that change will "survive" the restart. As pointed out in Jeremy Laidman's post, this indeed seems to be due to the checkpoint interval, which is 600 seconds in the local configuration.

Kind regards,
Sven
list Japheth Cleaver · Mon, 22 May 2017 09:46:39 -0700 ·
Hi Sven,

This behavior would seem to point in the direction of the checkpoint file not being written out properly on shutdown, especially if it's working fine during the normal checkpointing process (e.g., waiting 600 seconds before the restart). It could be a latent bug (or at least a missing error message).

Can you set xymond to --debug mode (or send it a USR2 signal) and then shutdown/restart the process after this change? After shutting down, can you take a quick look at the checkpoint file to see whether it was updated at the moment of shutdown? Depending on the host in question, you can also search for the test that should "no longer be there" (it's just a simple text file format).

The same routine is called at shutdown as is called during the periodic interval checkpointing, except for the fact that we wait synchronously for it to complete -- precisely to avoid this type of concern, but that doesn't mean there isn't an issue there still.

Regards,

-jc
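The check suggested in this message could look roughly like the sketch below. It assumes GNU `stat`; the checkpoint path, host name and test name in the usage comments are placeholders, and the helpers `checkpoint_age` and `count_entries` are hypothetical. The search is a plain text match and makes no assumption about the field layout of the checkpoint file.

```shell
#!/bin/sh
# Sketch only: inspect the checkpoint file after stopping xymon.
# Path, host and test names used with these helpers are placeholders.

# Print how many seconds ago file $1 was last modified (GNU stat).
# Returns 1 if the file cannot be read.
checkpoint_age() {
    now=$(date +%s)
    mtime=$(stat -c %Y "$1") || return 1
    echo $((now - mtime))
}

# Count the lines in file $1 that mention both host $2 and test $3.
count_entries() {
    grep -F -- "$2" "$1" | grep -c -F -- "$3"
}

# Possible usage after "service xymon stop":
#   CHKFILE=/var/lib/xymon/tmp/xymond.chk     # assumed path, check your setup
#   checkpoint_age "$CHKFILE"                 # a small value means it was just written
#   count_entries "$CHKFILE" myhost.example.com conn
```

A nonzero count for a dropped test only shows that some line still mentions it; whether that line marks the test as active has to be read from the line itself.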
list Sven Schuster · Tue, 23 May 2017 11:31:56 +0200 ·
Hi Japheth,
quoted from Japheth Cleaver
Hi Sven,
This behavior would seem to point in the direction of the
checkpoint file not being written out properly on shutdown, especially if
it's working fine during the normal checkpointing process (eg, waiting 600
seconds before the restart) and could be a latent bug (or at least a
missing error message).

That was exactly my thought when taking a look at the source code. The
routine for writing the checkpoint file should be called at shutdown, too...

quoted from Japheth Cleaver
Can you set xymond to --debug mode (or send it a USR2 signal) and then
shutdown/restart the process after this change? If shutting down, you can
take a quick poke at the checkpoint file to see that it's been updated at
the moment of shutdown? Depending on the host in question, you can also
search for the test that should "no longer be there" (it's just a simple
text file format).

...and indeed, it *is* called:

10410 2017-05-23 08:00:48.870364 -> save_checkpoint
10410 2017-05-23 08:00:48.963874 <- save_checkpoint

These were the last lines of the logfile when stopping xymon. Note that in
this case, I *stopped* the xymon service (to be able to take a look at the
checkpoint file while xymon was not running). The timestamp of the checkpoint
file was updated, and the test I had disabled was still disabled when I
started xymon again. Strange.

So I did some further testing. It revealed that on Debian with systemd being
used for starting/stopping services, the restart option to the default SysV
initscript isn't used. Instead, systemd will call the initscript with option
stop (which TERMs the xymonlaunch process), wait some amount of time (which
is probably given by the RestartSec or RestartUSec parameter, see
systemd.service(5)), then the initscript is called again with option start.

Seems like the time between stop and start (which is 100ms in the local
environment, probably default value) is not long enough for the old,
terminating xymond process to completely write the checkpoint file (which is
roughly 35 MB here with config changes and disabling/dropping tests
happening quite often and independently). In xymond.c/save_checkpoint it
turns out that the checkpoint file is written to a temporary file with a
timestamp in the filename. That temp file is renamed to the real checkpoint
file later.
With that short amount of time between stopping and starting it seems like
the new xymond process, which is starting in the meantime, just reads an old
version of the checkpoint file.
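The write-then-rename pattern described here can be illustrated with a small sketch. This is not xymond's code; `save_atomically` is a hypothetical helper that only shows why a process opening the checkpoint before the rename still gets the complete old contents.

```shell
#!/bin/sh
# Sketch only: the write-then-rename pattern, not xymond's implementation.

# Write stdin to a temporary file next to $1, then rename it over $1.
# A reader of $1 sees either the complete old file or the complete new one,
# depending on whether it opens the file before or after the rename.
save_atomically() {
    target=$1
    tmp="$target.$(date +%s).$$"    # timestamped temp name, for illustration
    cat > "$tmp" || { rm -f "$tmp"; return 1; }
    mv -f "$tmp" "$target"
}

# Possible usage:
#   echo "new state" | save_atomically /tmp/example.chk
```

If writing the temporary file takes longer than the gap between stop and start, a newly started reader opens the target before the `mv` happens and loads the previous version, which would match the behaviour observed above.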

To solve this issue, on Linux systems using systemd one might (and of course
should ;)) use a real systemd service file with RestartSec set to a sane
amount (e.g. 1s like in the old SysV initscript).
As a quick fix I added a "sleep 1" in the initscript:

--- xymon.orig  2012-06-27 21:14:29.000000000 +0200
+++ xymon       2017-05-23 10:28:51.983171661 +0200
@@ -49,6 +49,7 @@
    "stop")
        log_daemon_msg "Stopping $DESC" "$NAME"
        start-stop-daemon --exec $DAEMON --pidfile $PIDFILE --stop --retry 5
+       sleep 1
        log_end_msg $?
        ;;


That way restarting xymon works as expected for me.
Yet that might leave the (small) chance of that timespan not being long
enough in big installations under high load. That in turn could be just a
hypothetical problem, as that behaviour didn't occur with the old
initscript (or at least no one noticed).
A clean solution would be to provide a way to do a clean shutdown of the
xymon server which does not return before the old processes have really
exited (however that might be implemented), so that the asynchronous nature
of the current stop (sending a TERM to xymonlaunch) is no longer a concern.
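One way such a blocking stop might be sketched in shell, instead of a fixed sleep, is to poll until the process is gone. The helper names `wait_for_exit` and `stop_and_wait` are hypothetical, and which pid to wait for (xymonlaunch, xymond, or both) depends on the local setup.

```shell
#!/bin/sh
# Sketch only: send TERM and return only after the process has really exited.
# Must run as a user that is allowed to signal the target pid.

# Wait up to $2 seconds for process $1 to disappear.
# Returns 0 once the pid is gone, 1 if it is still alive after the timeout.
wait_for_exit() {
    pid=$1 remaining=$2
    while kill -0 "$pid" 2>/dev/null; do
        [ "$remaining" -le 0 ] && return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Send TERM to process $1, then block until it is gone or $2 seconds pass.
stop_and_wait() {
    kill -TERM "$1" 2>/dev/null || return 0    # nothing to stop
    wait_for_exit "$1" "$2"
}

# Possible usage in the "stop" branch of an initscript:
#   stop_and_wait "$(cat "$PIDFILE")" 30 || echo "still running after 30s" >&2
```

Compared with `sleep 1`, this returns as soon as the process is gone and keeps waiting when the checkpoint write takes longer, up to the chosen timeout.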

That's at least an explanation of the behaviour, along with possible ways
of solving it, that seems to make sense based on some tests and a few
short looks at the source, so please correct me if I'm wrong ;)


Kind regards,
Sven