xymonnet timeouts?

4 messages in this thread

list Richard Hamilton · Wed, 15 Feb 2017 10:50:51 -0500 ·

I noticed I was getting xymonnet timeouts when a host (marked dialup) was
down; turns out it's because there was an RPC test, and rpcinfo has no
option to choose a reasonable timeout; trying to run it against a host
that's down or unreachable takes nearly ten minutes to time out!

What I don't understand is why, given that the conn test was enabled and
not green or yellow, it was trying to run other network tests on that host.

Here's the host line:
192.168.0.56 lapple-sierra # dialup CLIENT:lapple-sierra.pri noflap=location ssh ntp rpc=mountd,nlockmgr,nfs,rpcbind,rquotad,status NOCOLUMNS:files multihomed NOPROPPURPLE:+location NOPROPYELLOW:+cpu,+location

(location is a client extension script, not relevant to the problem at
hand)

list Japheth Cleaver · Wed, 15 Feb 2017 08:35:51 -0800 ·

Interestingly, this appears to be intentional -- a failed conn test on a dialup host is not considered "down" internally (clear is N/A more than a down state), and so the host's other tests aren't bypassed later in the cycle when we get to running rpcinfo.

I'm not entirely certain of the history here. This smells like it should be a bug, for precisely the reason you're seeing: mass timeouts testing against things that are down. OTOH, there may be cases where things are intermittently unpingable and yet people expect other testing to continue. 'dialup' is used a bit less nowadays, which may be why this is hit less frequently.

There's logic in xymonnet that allows for internal flagging of something as actually up or down for purposes of testing (to handle things like badconn); this should probably become an option for control in the future.

Regards,
-jc

list Richard Hamilton · Wed, 15 Feb 2017 21:50:46 -0500 ·

In this case, "dialup" isn't literal; they're VMs under type II (hosted)
hypervisors (VirtualBox or Parallels).  Since the hosts don't have
gigantic amounts of RAM, the VMs are only brought up when needed
(testing, development, updates, or nostalgia for some other OS); but when
up, they should be healthy, with all their usual services running.

Another dialup is my laptop, which is usually where I am, not necessarily
back home with the xymon server. :-)  It has no built-in cellular, and I
don't have an always-on portable cellular hotspot (although the phone can
do that duty occasionally in the absence of a proper one), so there's no
way for it to be connected all the time, either.

Likewise, some non-infrastructure devices are dialup, because they're not
on all the time - like a WiFi picture frame, various iDevices, or a game
console.  If the printer didn't have energy saver mode, it would be a
dialup too, because it wouldn't be left on all the time.

Literal dialup with a modem may be rare enough nowadays, but there are
plenty of modern intermittently connected cases for which the functionality
is still useful, IMO.

One way or another, exposing a way to make network tests contingent on
basic connectivity, even when basic connectivity is optional (dialup),
would IMO help a lot, especially for external tests. Of those, rpc is the
worst; the ntp timeout is very quick by comparison. RPC libraries also come
in different enough flavors that rolling a portable version of rpcinfo with
a timeout option seems a bit tedious (I've looked at e.g. the Solaris and
Mac code for rpcinfo, and they're very different internally; the Mac's
seems derived from a really old BSD flavor, more or less).
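
To make the idea concrete, here is a rough sketch of gating rpcinfo on basic connectivity at the wrapper level. It is not a xymonnet feature. It assumes the target host is the last argument (as in "rpcinfo -p 192.168.0.56") and that ping accepts "-c 1", which holds on Linux, macOS and the BSDs but not on every platform (Solaris ping takes different arguments). The names gated_rpcinfo, PING_CMD and RPCINFO_CMD are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper, not part of Xymon: run rpcinfo only when the
# target host answers a single ping.
# Assumptions: the host is the last argument, and ping accepts "-c 1".
PING_CMD="${PING_CMD:-ping}"
RPCINFO_CMD="${RPCINFO_CMD:-/usr/bin/rpcinfo}"

gated_rpcinfo() {
    # Walk the arguments; after the loop, host holds the last one.
    for host in "$@"; do :; done
    if "$PING_CMD" -c 1 "$host" >/dev/null 2>&1
    then
        # Host answered: run the real test, reporting only 0 or 1.
        if "$RPCINFO_CMD" "$@"
        then
            return 0
        else
            return 1
        fi
    else
        echo "host $host did not answer ping; skipping rpcinfo"
        return 1
    fi
}
```

A wrapper script would end with gated_rpcinfo "$@" so that the function's status becomes the script's exit status. This only avoids the long wait when the host is completely unreachable; a host that answers ping but has a hung portmapper would still need a timeout.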



list Richard Hamilton · Sat, 18 Feb 2017 15:10:08 -0500 ·

Ok, I have a really disgusting workaround for xymonnet timeouts on
rpcinfo.  Set RPCINFO in xymonserver.cfg to point to a wrapper like the
following, which will kill off the rpcinfo process after 9 seconds, if it
hasn't already finished.  This seems to give the expected result whether
the host being tested is up or down, without causing xymonnet timeout
errors.  It does seem necessary to have the return code be 0 or 1, and not
let it default to the return code of "wait" (which could be e.g. 143 if the
process was killed, and would look like a different kind of error to
xymonnet).
======== cut here ========
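#! /bin/sh
# Run the real rpcinfo in the background, passing all arguments through.
/usr/bin/rpcinfo ${1+"${@}"} &
pid="${!}"
# Watchdog: after 9 seconds, kill rpcinfo if it is still running.
(sleep 9; kill -0 "${pid}" && kill "${pid}") 2>/dev/null &
# Collect rpcinfo's exit status (e.g. 143 if the watchdog killed it).
wait "${pid}"
# Report only 0 or 1 to xymonnet.
if [ $? -eq 0 ]
then
        exit 0
else
        exit 1
fi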
======== cut here ========
For a "dialup" host (actually a VM that wasn't running), the result was
reasonable: a clear status, with this output:
Sat Feb 18 15:04:09 2017 rpc ok, Service unavailable

Dialup host or service


Could not connect to the portmapper service
Command: /export/home/xymon/server/bin/rpcinfo -p 192.168.0.56 2>&1

/export/home/xymon/server/bin/rpcinfo[6]: wait: 13244: Terminated


Still, as I said, this is rather disgusting, and I'd hope that not running
external tests like rpcinfo or ntp when the conn test failed would be an
option in the future.
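
For what it's worth, a possibly tidier variant of the wrapper above, as a sketch only and not tested against xymonnet: it uses the same background-watchdog idea, but detaches the watchdog from the wrapper's output and stops it as soon as the command returns. The function name run_with_timeout is made up for illustration; the 9-second limit in the usage note is the one from the original wrapper.

```shell
#!/bin/sh
# Hypothetical helper: run a command, killing it if it exceeds a time limit.
# Usage: run_with_timeout SECONDS COMMAND [ARGS...]
# Returns 0 if the command succeeded, 1 otherwise (including a timeout).
run_with_timeout() {
    limit="$1"
    shift
    "$@" &
    cmd_pid=$!
    # Watchdog: kill the command once the limit expires. Its output is
    # discarded so that it does not hold the caller's output pipe open.
    ( sleep "$limit"; kill "$cmd_pid" ) >/dev/null 2>&1 &
    watchdog_pid=$!
    # Collect the command's status (e.g. 143 if the watchdog killed it).
    status=0
    wait "$cmd_pid" || status=$?
    # The command is done, so stop the watchdog and reap it quietly.
    kill "$watchdog_pid" 2>/dev/null || :
    wait "$watchdog_pid" 2>/dev/null || :
    # Collapse every failure, including death by signal, to 1.
    if [ "$status" -eq 0 ]
    then
        return 0
    else
        return 1
    fi
}
```

In a wrapper this would be called as run_with_timeout 9 /usr/bin/rpcinfo "$@" on the last line. Killing the watchdog subshell can leave its sleep running until the limit expires, but that leftover process is harmless since its output goes to /dev/null.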


