Xymon Mailing List Archive

"Discarding timed-out partial msg" Error Messages

4 messages in this thread

Matt Vander Werf · Tue, 24 Nov 2015 09:29:43 -0500
Hey all,

Lately, I've been seeing quite a few error messages show up in xymond
indicating that it was discarding a timed-out partial message from some
machine.

i.e.

Latest error messages:
Discarding timed-out partial msg from X.X.X.X

They seem to be happening sporadically but more often than usual as of
late. Maybe one or two every couple of days or so. They don't seem to be
coming from the same machine/machines either.

Is this something I should be worried about? Are there any side-effects
from this happening too much?

What are the causes of this happening? Any way to make it not happen as
much?


Any ideas or advice is greatly appreciated!

Thanks!!

--
Matt Vander Werf
Japheth Cleaver · Tue, 24 Nov 2015 12:17:50 -0800
Broadly speaking, this is the result of the entire message not arriving
within the time allotted by xymond, which is 10s by default. It could be the
result of network congestion or packet loss, slow sender performance, or
slow Xymon server performance.

A quick fix might be to increase the --timeout= option to xymond to
something like 15 or 20s.
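
To illustrate what the timeout means in practice, here is a minimal Python
sketch of a receiver that gives each connection a fixed deadline and drops
whatever has arrived if the sender has not finished by then. It is not
xymond's code (xymond is written in C); the function name read_message and
its parameters are made up for illustration, and it assumes the sender
signals end-of-message by closing its side of the connection.

```python
import socket
import time


def read_message(conn, timeout_s=10.0, bufsize=4096):
    """Read until the peer closes; return None if the deadline passes first.

    Hypothetical helper: illustrates a per-connection deadline, not
    xymond's actual implementation.
    """
    deadline = time.monotonic() + timeout_s
    chunks = []
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None  # deadline passed: discard the partial message
        conn.settimeout(remaining)
        try:
            data = conn.recv(bufsize)
        except socket.timeout:
            return None  # nothing arrived before the deadline
        if not data:
            # Peer closed its side: the message is complete.
            return b"".join(chunks)
        chunks.append(data)
```

A sender that stalls partway through leaves the receiver holding an
incomplete message. Raising the deadline gives slow senders more room, but
it also keeps stalled connections open for longer.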

If a netstat shows tons of simultaneous connections, you could also
increase --lqueue= to 768 or 1024.
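
One way to get a rough count without eyeballing netstat output is to tally
connection states for the Xymon port from /proc/net/tcp. The sketch below
assumes Linux, IPv4 only (IPv6 sockets are listed separately in
/proc/net/tcp6), and the standard Xymon port 1984; count_states is a
hypothetical helper name.

```python
from collections import Counter

# State codes used in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(proc_net_tcp_text, port=1984):
    """Count sockets per TCP state whose local port matches `port`."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # fields[1] is the local address as HEXIP:HEXPORT.
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port == port:
            counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts
```

Feed it the file contents, for example
count_states(open("/proc/net/tcp").read()). A large and growing SYN_RECV or
ESTABLISHED count on the Xymon port is the sort of thing that would point
toward the listen queue.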

Are there any patterns on the clients/senders that are affected? Unusually
huge messages being sent over slow connections?
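
To look for patterns among the senders, the error lines can be tallied per
address. A small sketch, assuming the log lines contain the message text
exactly as shown in the original post; tally_senders is a made-up name, and
the log location depends on the installation.

```python
import re
from collections import Counter

# Matches the error text quoted in the original post and captures the sender.
PATTERN = re.compile(r"Discarding timed-out partial msg from (\S+)")


def tally_senders(log_lines):
    """Return a Counter of sender address -> number of discarded messages."""
    counts = Counter()
    for line in log_lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open xymond log file to tally_senders() and calling
most_common() on the result shows whether a handful of hosts account for
most of the timeouts.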

If there isn't a network issue per se, and there are no local network
errors (or you're seeing the reports about messages from all over the
place), then it's time to look at network performance tuning on the xymon
box. Consider the various tcp* options via sysctl (recycle and reuse in
particular). If xymonnet is running on the same system (and you're doing
high-concurrency testing), be sure to increase your ip_local_port_range for
outbound connections.

http://www.lognormal.com/blog/2012/09/27/linux-tcpip-tuning/ is a nice
resource for that.
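
Before changing anything, it helps to see what the box is currently using.
The sketch below reads a few of the relevant settings from /proc/sys and
works out how many ephemeral ports the current ip_local_port_range allows.
The helper names are made up for illustration. Two caveats worth knowing:
net.ipv4.tcp_tw_reuse only affects outgoing connections, and
net.ipv4.tcp_tw_recycle is known to break clients behind NAT and was removed
in Linux 4.12, so it may not exist on a newer kernel (the helper returns
None in that case).

```python
from pathlib import Path


def read_sysctl(name, root="/proc/sys"):
    """Return a sysctl value as a string, or None if it cannot be read."""
    path = Path(root) / name.replace(".", "/")
    try:
        return path.read_text().strip()
    except OSError:
        return None


def ephemeral_port_count(port_range_text):
    """Number of ports in an ip_local_port_range value such as '32768 60999'."""
    low, high = (int(part) for part in port_range_text.split())
    return high - low + 1


def report(names=("net.ipv4.ip_local_port_range",
                  "net.ipv4.tcp_tw_reuse",
                  "net.ipv4.tcp_tw_recycle")):
    """Map each sysctl name to its current value (None when unavailable)."""
    return {name: read_sysctl(name) for name in names}
```

Calling report() on the Xymon server gives a baseline to compare against
after tuning.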


HTH,

-jc
Matt Vander Werf · Fri, 4 Dec 2015 06:49:48 -0500
Hi J.C.,

Thanks for the e-mail and advice!

A couple of questions:

What's the default --lqueue value that Xymon uses? (Is there a way to see
what it's using?)

What exactly is your definition of "tons of simultaneous connections" here?
Can you give me a number or range that you think would warrant increasing
the --lqueue value?

Could it be from clients/senders with longer than usual process listings?
Or other clientlog statistics? (But still under the max client message
value.)

How would I be able to tell if there are long messages being sent in if the
long messages are being discarded?


The clients/senders are all different and there doesn't seem to be a
pattern that I can see. Some hosts are showing up more than once though.
All the connections to these machines SHOULD be coming in at the same
connection speed....

I'm not seeing any network issues over any of our switches. And I'm not
seeing any significant or unusual network traffic on the machines in
question around the time of the time-out error messages (although it could
be a very brief spike in traffic that isn't being seen since the status
message is being discarded).

I'll definitely look into the possibility of doing some TCP tuning on the
Xymon server machine!

Thanks again!

--
Matt Vander Werf

Japheth Cleaver · Tue, 8 Dec 2015 07:50:06 -0800
Hi Matt,

Sorry for the delay. Had some unexpected time away from the keyboard this
weekend.

Responses inline.

On Fri, December 4, 2015 3:49 am, Matt Vander Werf wrote:
> Hi J.C.,
>
> Thanks for the e-mail and advice!
>
> A couple of questions:
>
> What's the default --lqueue value that Xymon uses? (Is there a way to see
> what it's using?)
>
> What exactly is your definition of "tons of simultaneous connections" here?
> Can you give me a number or range that you think would warrant increasing
> the --lqueue value?

The default is 512, which is compiled in. This really won't need to be
increased unless xymond is being bogged down with lots of *literally*
simultaneous waiting connections. It can be increased, but there's
probably another sort of problem happening: either slow connectivity, high
CPU load, or "backpressure" from too many channel workers causing xymond
itself to be unable to keep up. I'm trying to think back and I don't think
I had cause to increase it until SN was regularly hitting the 2500 msgs/s
range, and it was lowered back down once other performance bottlenecks and
some packet loss were identified.

Try stracing xymond and seeing what it's doing. If there's a lot of
waiting happening for network reading, that might be a sign that lqueue
increasing could help. 768 or 1024 should be more than sufficient.
Anything more than that except at bursts means there's some other backlog.
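
One thing to check before raising --lqueue: on Linux the backlog passed to
listen(2) is silently capped at net.core.somaxconn, so a larger value only
helps if that sysctl is at least as large. A small sketch of the check,
assuming --lqueue ends up as the listen() backlog; effective_backlog and
read_somaxconn are hypothetical helpers.

```python
from pathlib import Path


def effective_backlog(requested, somaxconn):
    """Backlog the kernel will actually honour for listen(fd, requested)."""
    return min(requested, somaxconn)


def read_somaxconn(path="/proc/sys/net/core/somaxconn"):
    """Return the current net.core.somaxconn, or None if it cannot be read."""
    try:
        return int(Path(path).read_text())
    except (OSError, ValueError):
        return None
```

For example, with somaxconn at 128, effective_backlog(1024, 128) is 128, so
--lqueue=1024 would make no difference until the sysctl is raised as well.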

> Could it be from clients/senders with longer than usual process listings?
> Or other clientlog statistics? (But still under the max client message
> value.)

It's possible, but unless you're bandwidth-restricted somewhere, senders
should generally still be able to complete in the default time frame. If
you *are* bandwidth-restricted, then that's definitely something to
consider, especially if the machines you're having problems with have a
lot of burst network activity. (Speaking of burst network activity, try
commenting out the 'netstat' output in the client if you don't have any
port checks against the host.)

> How would I be able to tell if there are long messages being sent in if the
> long messages are being discarded?

Yeah, this should probably be added in. Truncated messages have their
first line displayed, but it's not so much a 'discard' here as it is a
network timeout first and foremost.

An strace with the -s 4096 (or some high number) might be able to catch
the first bit of a read from the client if you're lucky there...


HTH,
-jc