Xymon Mailing List Archive

how to learn what is crashing my rrd handler?

John Thurston
Wed, 27 Aug 2014 11:10:16 -0800
Message-Id: <user-7f0de384325a@xymon.invalid>

On 8/27/2014 9:27 AM, John Thurston wrote:
On 8/26/2014 11:16 AM, J.C. Cleaver wrote:
On Tue, August 26, 2014 11:04 am, John Thurston wrote:
I'm having difficulty with my RRD handlers crashing and leaving gaps in
my databases.

I mentioned this back in April, 2014 but received no responses:
http://lists.xymon.com/pipermail/xymon/2014-April/039547.html
- snip -
I _suspect_ it is another client sending me empty messages, but how do I
find it now that I have several hundred clients sending "data" messages?
--
As an initial step, run xymond_rrd in --debug mode... You can send the
pid a -USR2 signal to toggle this setting without bouncing the process
itself (be sure you're signalling xymond_rrd and not its xymond_channel
parent).
Ahh. I figured out what I was doing wrong. I was placing --debug in the
wrong place on the line. When I put it at the end, I get debug output
from the 'data' handler rather than the parent.

When I do so, the log contains:
2014-08-27 09:02:07 Peer not up, flushing message queue
2014-08-27 09:02:08 Peer not up, flushing message queue
2014-08-27 09:02:09 Peer not up, flushing message queue
2014-08-27 09:02:10 Peer not up, flushing message queue
2014-08-27 09:02:10 Peer not up, flushing message queue
2014-08-27 09:02:16 Peer not up, flushing message queue
2014-08-27 09:02:17 Peer not up, flushing message queue
4595 2014-08-27 09:02:19 Opening file /opt/xymon/server/etc/rrddefinitions.cfg
4595 2014-08-27 09:02:19 Want msg 1, startpos 0, fillpos 0, endpos -1, usedbytes=0, bufleft=528383
4595 2014-08-27 09:02:19 Got 230 bytes
4595 2014-08-27 09:02:19 xymond_rrd: Got message 2103
@@data#2103/soapsgdc02.soa.alaska.gov|1409158937.159587|10.210.36.22||soapsgdc02.soa.alaska.gov|trends||ETS/MsgDir
- snip -
4595 2014-08-27 09:02:19    Exp.len : 3
4595 2014-08-27 09:02:19    Exp.ofs : 0
4595 2014-08-27 09:02:19    Flags   : 1
4595 2014-08-27 09:02:19    Port    : 22
4595 2014-08-27 09:02:19  Name      : telnet
4595 2014-08-27 09:02:19 2014-08-27 09:02:21 Child process 4595 died: Signal 6
2014-08-27 09:02:24 Peer at 0.0.0.0:0 failed: Broken pipe
2014-08-27 09:02:24 Peer not up, flushing message queue
2014-08-27 09:02:24 Peer not up, flushing message queue
and the stack from the core file (using pstack)
 fee5ebd4 _lwp_kill (6, 0, 0, fee3e0f0, ffffffff, 6) + 8
 fedd29f0 abort    (0, 1, 6666c, ffb04, feed5518, 0) + 110
 0003bb94 sigsegv_handler (b, 0, ffbfb588, 1, 0, 544a8) + 30
 fee5b00c __sighndlr (b, 0, ffbfb588, 3bb64, 0, 1) + c
 fee4f6bc call_user_handler (b, 0, 0, 0, fed32a00, ffbfb588) + 3b8
 fee4f8a4 sigacthandler (b, 0, ffbfb588, 20, 0, 0) + 60
 --- called from signal handler with signal 11 (SIGSEGV) ---
 fedc2d50 strlen   (53e37, ffbfc804, ffbfbdf9, 0, 0, 0) + 50
 fee319d4 vfprintf (71990, 53e28, ffbfc800, 0, a0afc, fee314d4) + ec
 0002f24c dbgprintf (53e28, 0, e9768, 6f800, 71990, 6d000) + a0
 0003347c dump_tcp_services (53e88, 53ea0, 53eb8, c0, a0, 67f98) + a0
 00033d70 init_tcp_services (168a78, 620, 67f98, 54060, 600, 168430) + 848
 0002f858 rrd_setup (98906, 6d000, 6d000, 80808080, 6d000, 0) + 164
 0002fc4c find_xymon_rrd (988f4, 492e8, 53fe08c6, 53fe08c6, 988c2, 2e) + 4
 00048cb0 main     (98907, ffbfdba4, 988fc, 68800, 3, 49528) + 728
 00015d2c _start   (0, 0, 0, 0, 0, 0) + 5c
Which, if I'm reading it correctly, makes me think the application tried
to read off the end of a string.
And I think the thing which ran off the end of the string was the debug
code itself :p I'm still looking at the source (and my C is very, very
bad), but I suspect the debug print routine is choking on an attribute
that is absent from protocols.cfg.
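
If the suspicion above is right, the failure mode would be a NULL char * reaching a %s conversion. That is undefined behaviour in C: some C libraries print "(null)", while others hand the pointer straight to strlen(), which would match the top of the stack trace. A minimal sketch of a guard, assuming a hypothetical svc_entry record and send_text field (neither is taken from the Xymon source):

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical record, loosely modelled on the fields in the debug dump;
   this is not Xymon's actual structure. */
struct svc_entry {
    const char *name;       /* service name, e.g. "telnet" */
    const char *send_text;  /* optional; NULL when the config omits it */
    int port;
};

/* Substitute a visible placeholder for a missing string. */
static const char *safe_str(const char *s)
{
    return s ? s : "(null)";
}

/* Format one entry into buf; returns whatever snprintf returns. */
static int format_entry(char *buf, size_t size, const struct svc_entry *e)
{
    assert(buf != NULL && size > 0 && e != NULL);
    return snprintf(buf, size, "Name: %s, Send: %s, Port: %d",
                    safe_str(e->name), safe_str(e->send_text), e->port);
}
```

With an entry such as { "telnet", NULL, 22 }, format_entry() would produce "Name: telnet, Send: (null), Port: 22" instead of passing a NULL pointer to the C library.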

When I turn debug off, my rrd handlers are working much better.
-- 
    Do things because you should, not just because you can.

John Thurston    XXX-XXX-XXXX
user-ce4d79d99bab@xymon.invalid
Enterprise Technology Services
Department of Administration
State of Alaska