Xymon "port" check intermittent failures for ssh TCP port 22 state=LISTEN

6 messages in this thread

list Jeremy Laidman · Fri, 7 Jul 2017 14:47:07 +1000 ·

Hi

I'm getting what appear to be false-positives for the port test that is
monitoring the LISTEN socket for port 22, as opened by the sshd daemon. A
few times a month, Xymon will show that the server is not listening on port
22, and 5 minutes later, the listening port is back again. The sshd process
has never crashed or been reconfigured (eg with SIGHUP), and no other
listening ports are showing the same behaviour.  The client messages for
the server during these events are complete and uncorrupted.

The simplest fix is to use delayred to suppress alerts for 5 minutes.
However, I would like to work out what's causing this behaviour. I don't
believe this is a problem with Xymon at all; rather, the netstat output in
the client message is exactly what the OS provided the Xymon client. My
guess is that it's due to the way sshd works - perhaps it periodically
rebinds to the socket - but nothing in the sshd logs seems to correlate
with these events. If anyone can suggest what might be causing this, or how
to investigate further, I'd be grateful.

This problem happens for about a quarter of the servers in a pool, and no
others. All servers are identical in OS, software and general
configuration, but the servers affected by this tend to be the ones taking
the most traffic and under the most load (although there's plenty of spare
CPU cycles even on the most heavily-used server). I have two Xymon servers,
each monitoring independently of the other, and this problem is reported by
both Xymon servers, although at completely different dates and times.

Cheers
Jeremy

list Ryan Novosielski · Fri, 7 Jul 2017 05:05:14 +0000 ·

Any chance this is truncation happening? That test can have a lot of output.

--
____
|| \\UTGERS,       |---------------------------*O*---------------------------
||_// the State     |         Ryan Novosielski - user-46c89e614701@xymon.invalid
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ     | Office of Advanced Research Computing - MSB C630, Newark


list Jeremy Laidman · Fri, 7 Jul 2017 16:51:50 +1000 ·

Not much chance, really. This was my first guess at the cause. The [ports]
section appears complete (doesn't have its own limit as far as I know), the
[clock] section is present at the end, and the UTC: datestamp line is
present as the last line. Hence no artefacts I would expect to see when
truncation takes place.

Also, the client messages are less than 300kB, whereas the default limit is
512kB and I've bumped that up to 2MB.
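
The truncation checks described above could be scripted roughly as follows. This is only a sketch: it assumes the client message has already been saved to a file (how it is obtained is not covered here), the function name check_client_msg is made up for illustration, and the 524288-byte default assumes the 512kB limit means 512 x 1024 bytes.

```shell
#!/bin/sh
# check_client_msg FILE [LIMIT_BYTES]
# Hypothetical helper: looks for the signs of truncation mentioned above.
# Prints one line per problem found; returns 0 if none, 1 if any, 2 if the
# file cannot be read.
check_client_msg() {
    file=$1
    limit=${2:-524288}      # assumed: 512kB default limit expressed in bytes

    size=$(wc -c < "$file") || return 2
    size=$((size + 0))      # strip any padding some wc implementations add
    rc=0

    if [ "$size" -ge "$limit" ]; then
        echo "message is $size bytes, at or over the $limit byte limit"
        rc=1
    fi

    # Both sections should be present in a complete message.
    grep -q '^\[ports\]' "$file" || { echo "no [ports] section"; rc=1; }
    grep -q '^\[clock\]' "$file" || { echo "no [clock] section"; rc=1; }

    # The last non-empty line should be the UTC: datestamp.
    last=$(awk 'NF { l = $0 } END { print l }' "$file")
    case $last in
        UTC:*) ;;
        *) echo "last line is not the UTC: datestamp"; rc=1 ;;
    esac

    return $rc
}
```

For example, check_client_msg /tmp/client.txt 2097152 would test a saved message against the raised 2MB limit (the path is a placeholder).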


list Mike Burger · Fri, 07 Jul 2017 14:08:42 -0400 ·


Have you considered adding the SSH network test, in conjunction? 
-- 
Mike Burger
http://www.bubbanfriends.org

"It's always suicide-mission this, save-the-planet that. No one ever
just stops by to say 'hi' anymore." --Colonel Jack O'Neill, SG1

list Jeremy Laidman · Sat, 8 Jul 2017 08:32:30 +1000 ·

Yes, I do the network test also. This means I could just disable 22 in the
port test, and rely on the network test. It's an adequate work-around in
this case. Thanks.

I'd still like to know why it's a problem.

J


list Jeremy Laidman · Wed, 12 Jul 2017 16:51:42 +1000 ·

Just giving a follow-up for those interested or affected by this.

I believe I'm closer to understanding this problem. I've set up two "while"
loops on the server, one that runs "netstat -nl | grep :22" every second,
and the other that runs "ss -ln|grep :22" every second. In the former case,
I get output most of the time, but I get no output about 3-6 times every
couple of hours. In the latter case, I always get the expected output. This
suggests to me that netstat is not doing the right thing, possibly due to a
race condition that is exacerbated under load.
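
A comparison along these lines could be sketched as below. It is not the exact pair of loops used above: it samples both commands back to back in a single loop, matches the port a little more strictly than a plain "grep :22" (so that :2200 does not count), and logs only the misses. The function names, the RUN_PROBE switch and the log path are made up for illustration.

```shell
#!/bin/sh
# Sketch: sample netstat and ss once per interval and log every sample in
# which one of them shows no socket on the port of interest.

PORT="${PORT:-22}"                    # port to look for
INTERVAL="${INTERVAL:-1}"             # seconds between samples
LOG="${LOG:-/tmp/listen-probe.log}"   # hypothetical log path

# has_listener PORT: reads a socket listing on stdin and succeeds if some
# address in it ends in :PORT (followed by whitespace or end of line).
has_listener() {
    grep -Eq ":$1([[:space:]]|\$)"
}

# probe LABEL COMMAND [ARGS...]: run COMMAND; if its output shows nothing
# on $PORT, append a timestamped line to $LOG and return 1.
probe() {
    label=$1
    shift
    if ! "$@" 2>/dev/null | has_listener "$PORT"; then
        echo "$(date '+%Y-%m-%d %H:%M:%S') $label: no listener on :$PORT" >> "$LOG"
        return 1
    fi
}

# The loop only starts when RUN_PROBE=1, so the functions can be tried
# out on their own first.
if [ "${RUN_PROBE:-0}" = 1 ]; then
    while :; do
        probe netstat netstat -nl
        probe ss ss -ln
        sleep "$INTERVAL"
    done
fi
```

Running it with RUN_PROBE=1 for a few hours and then comparing the "netstat:" and "ss:" lines in the log should show whether only netstat misses the socket. Because both commands are sampled within the same second, a miss from one and not the other is easier to attribute to the tool than to sshd.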

Ultimately, it's not a Xymon problem at all, it would seem. A Xymon fix
might be to modify xymonclient-linux.sh to use "ss" instead of "netstat",
but the output formats are different, and it would require the parser to be
re-written or enhanced. Instead, I should get netstat fixed.

