Xymon server post-migration blues pt. 1 - "brown-outs"

Greg Earle
Tue, 23 Apr 2019 21:50:00 -0700
Message-Id: <user-45bb7d8c7bf0@xymon.invalid>
[Apologies in advance for the too-wordy message.  Part 1 of 2.]

Recently I violated one of the prime rules of being a SysAdmin - don't ever change two things at once.

At my work we were forced to migrate our data center to a new facility, so we bought a new monitoring PC to replace a RHEL 6.10 system that ran Xymon 4.3.12.  The new monitoring PC runs RHEL 7.6 and Xymon 4.3.28 (using the Terabithia RPMs).

Ever since then, I have been running into two big problems that did not exist before.  This is the first problem ("brown-outs"); I'll describe the 2nd in part 2.

Randomly, a group of systems (or some subset of the group) will report in as CRITICAL/RED due to failed xymonnet tests.  Mostly SSH, but some SMTP and FTP as well.  The hosts/services are all actually fine and the red alerts are incorrect/false positives.

The problem is getting worse - I'm now seeing several hundred red alerts a day from these "brown-outs".  The hosts involved are more-or-less random - different buildings/OSes, etc.  Sometimes all of them provoke alerts; most of the time it's just a subset of the list.

When they fail, the alert message is always

--
Service <service> on <host> is not OK : Service listening but unavailable (connect timeout)
--

To try and catch it in the act, I ran this test in a loop:

[root@mgmt xymon]# while true; do ( echo "["`date`"]" ; xymonnet --report --ping --checkresponse --timing --debug --no-update > /tmp/xymonnet.out 2>&1 ; grep 'err=[^0]' /tmp/xymonnet.out ); done
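
One drawback of that loop is that /tmp/xymonnet.out is overwritten on every pass, so the full debug output around a failure is lost. A minimal sketch of a variant that keeps a timestamped copy of any pass containing a failed probe follows. The helper name capture_pass and the /var/tmp/brownouts directory are made up for illustration; neither is part of Xymon.

```shell
# capture_pass OUTDIR CMD [ARGS...]
# Runs CMD once, capturing stdout and stderr together. If any line of the
# output matches 'err=[^0]' (a probe with a non-zero error flag), the full
# output is kept in OUTDIR under a timestamped name and 0 is returned.
# Otherwise the output is discarded and 1 is returned.
# Note: two captures within the same second would share a file name.
capture_pass() {
    outdir=$1
    shift
    tmp=$(mktemp /tmp/xymonnet.XXXXXX) || return 2
    "$@" > "$tmp" 2>&1
    if grep -q 'err=[^0]' "$tmp"; then
        mv "$tmp" "$outdir/xymonnet.$(date +%Y%m%dT%H%M%S).out"
        return 0
    fi
    rm -f "$tmp"
    return 1
}

# Example use (left commented out; the directory is hypothetical):
# mkdir -p /var/tmp/brownouts
# while true; do
#     capture_pass /var/tmp/brownouts \
#         xymonnet --report --ping --checkresponse --timing --debug --no-update
#     sleep 30
# done
```

The sleep between passes is an arbitrary choice to avoid hammering the monitored hosts; the original loop ran back-to-back.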

A couple of times I think I did catch it; here's an example:

--
Address=192.168.1.26:22, open=1, res=0, err=1, connecttime=0.004546, totaltime=11.653128,
Address=192.168.1.25:22, open=1, res=0, err=1, connecttime=0.004510, totaltime=11.653092,
Address=192.168.1.219:22, open=1, res=0, err=1, connecttime=0.003163, totaltime=11.651745,
Address=192.168.1.151:22, open=1, res=0, err=1, connecttime=0.002923, totaltime=11.651505,
Address=192.168.1.50:22, open=1, res=0, err=1, connecttime=0.002906, totaltime=11.651488,
Address=137.78.80.38:22, open=1, res=0, err=1, connecttime=0.002819, totaltime=11.651401,

[... another 10 elided ...]

Address=192.168.1.184:22, open=1, res=0, err=1, connecttime=0.001098, totaltime=12.393879,
Address=192.168.1.174:25, open=1, res=0, err=1, connecttime=0.000426, totaltime=12.364234,
Address=192.168.1.182:25, open=1, res=0, err=1, connecttime=0.000418, totaltime=12.364226,
Address=192.168.1.25:25, open=1, res=0, err=1, connecttime=0.000411, totaltime=12.364219,
Address=192.168.1.25:21, open=1, res=0, err=1, connecttime=0.022773, totaltime=12.364044,
--

Notice that the connecttime values are tiny (milliseconds), so the TCP connection itself succeeds right away (open=1), yet every totaltime value exceeds the timeout.
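
To see at a glance which services are affected in a saved debug file, the failed lines can be tallied per port. This is only a sketch: summarize_failures is a made-up helper name, and it assumes IPv4 addresses and exactly the line layout shown in the example above.

```shell
# summarize_failures [FILE...]
# Reads xymonnet --debug output (from FILEs or stdin) and, for each port,
# prints the number of probes with a non-zero err flag plus the largest
# connecttime and totaltime seen among them.
# Assumed line layout (IPv4 only):
#   Address=IP:PORT, open=N, res=N, err=N, connecttime=S, totaltime=S,
summarize_failures() {
    awk -F'[=:, ]+' '
        /^Address=/ && /err=[^0]/ {
            # After splitting: $3 is the port, $11 connecttime, $13 totaltime
            port = $3
            ct = $11 + 0
            tt = $13 + 0
            n[port]++
            if (ct > maxct[port]) maxct[port] = ct
            if (tt > maxtt[port]) maxtt[port] = tt
        }
        END {
            for (p in n)
                printf "port=%s failures=%d max_connect=%.6f max_total=%.6f\n",
                       p, n[p], maxct[p], maxtt[p]
        }
    ' "$@" | sort
}

# Example use against the file written by the loop:
# summarize_failures /tmp/xymonnet.out
```

A small max_connect next to a large max_total on every port would back up the pattern described here: the connection opens quickly and then stalls until the timeout.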

The services always immediately recover in the next test pass.

Are there any knobs I can turn to help debug this problem?  I'm assuming it's network/router/switch-related, but I need a smoking gun.

Failing that, is there a setting in one of the .cfg files that would turn these particular "Service listening but unavailable" statuses into a Yellow alert rather than Red?  (I'd rather not have to resort to this, but as a stop-gap I would.)

		- Greg