how to learn what is crashing my rrd handler?
list John Thurston
I'm having difficulty with my RRD handlers crashing and leaving gaps in my databases. I mentioned this back in April, 2014 but received no responses: http://lists.xymon.com/pipermail/xymon/2014-April/039547.html Since that time, I stumbled across a client which was occasionally sending me empty messages of the "data" type. When I stopped that behavior, the behavior was corrected. The behavior has returned and I am not able to figure out how to find the cause. I _suspect_ it is another client sending me empty messages, but how do I find it now that I have several hundred clients sending "data" messages? -- Do things because you should, not just because you can. John Thurston XXX-XXX-XXXX user-ce4d79d99bab@xymon.invalid Enterprise Technology Services Department of Administration State of Alaska
list Japheth Cleaver
▸
On Tue, August 26, 2014 11:04 am, John Thurston wrote:
I'm having difficulty with my RRD handlers crashing and leaving gaps in my databases. I mentioned this back in April, 2014 but received no responses: http://lists.xymon.com/pipermail/xymon/2014-April/039547.html Since that time, I stumbled across a client which was occasionally sending me empty messages of the "data" type. When I stopped that behavior, the behavior was corrected. The behavior has returned and I am not able to figure out how to find the cause. I _suspect_ it is another client sending me empty messages, but how do I find it now that I have several hundred clients sending "data" messages? --
As an initial step, run xymond_rrd in --debug mode... You can send the pid a -USR2 signal to toggle this setting without bouncing the process itself (be sure you're signalling xymond_rrd and not its xymond_channel parent). Ideally that can help you see what exactly it was processing before it went down. If it's a full on crash, a backtrace from a core file would be helpful for debugging after a segfault or something similar. HTH, -jc
list John Thurston
▸
On 8/26/2014 11:16 AM, J.C. Cleaver wrote:
On Tue, August 26, 2014 11:04 am, John Thurston wrote:I'm having difficulty with my RRD handlers crashing and leaving gaps in my databases. I mentioned this back in April, 2014 but received no responses: http://lists.xymon.com/pipermail/xymon/2014-April/039547.html
- snip -
▸
I _suspect_ it is another client sending me empty messages, but how do I find it now that I have several hundred clients sending "data" messages? --
As an initial step, run xymond_rrd in --debug mode... You can send the pid a -USR2 signal to toggle this setting without bouncing the process itself (be sure you're signalling xymond_rrd and not its xymond_channel parent).
Ahh. I figured out what I was doing wrong. I was placing --debug in the wrong place on the line. When I put it at the end, I get debug output from the 'data' handler rather than the parent. When I do so, the log contains:
2014-08-27 09:02:07 Peer not up, flushing message queue 2014-08-27 09:02:08 Peer not up, flushing message queue 2014-08-27 09:02:09 Peer not up, flushing message queue 2014-08-27 09:02:10 Peer not up, flushing message queue 2014-08-27 09:02:10 Peer not up, flushing message queue 2014-08-27 09:02:16 Peer not up, flushing message queue 2014-08-27 09:02:17 Peer not up, flushing message queue 4595 2014-08-27 09:02:19 Opening file /opt/xymon/server/etc/rrddefinitions.cfg 4595 2014-08-27 09:02:19 Want msg 1, startpos 0, fillpos 0, endpos -1, usedbytes=0, bufleft=528383 4595 2014-08-27 09:02:19 Got 230 bytes 4595 2014-08-27 09:02:19 xymond_rrd: Got message 2103 @@data#2103/soapsgdc02.soa.alaska.gov|1409158937.159587|10.210.36.22||soapsgdc02.soa.alaska.gov|trends||ETS/MsgDir 4595 2014-08-27 09:02:19 startpos 230, fillpos 230, endpos -1 4595 2014-08-27 09:02:19 Transport setup is: 4595 2014-08-27 09:02:19 xymondportnumber = 1984 4595 2014-08-27 09:02:19 xymonproxyhost = NONE 4595 2014-08-27 09:02:19 xymonproxyport = 0 4595 2014-08-27 09:02:19 Recipient listed as '146.63.81.42' 4595 2014-08-27 09:02:19 Standard protocol on port 1984 4595 2014-08-27 09:02:19 Will connect to address 146.63.81.42 port 1984 4595 2014-08-27 09:02:19 Connect status is 0 4595 2014-08-27 09:02:19 Sent 16 bytes 4595 2014-08-27 09:02:19 Read 8192 bytes 4595 2014-08-27 09:02:19 Read 32767 bytes 4595 2014-08-27 09:02:19 Read 28440 bytes 4595 2014-08-27 09:02:19 Closing connection 4595 2014-08-27 09:02:19 Opening file /opt/xymon/server/etc/analysis.cfg 4595 2014-08-27 09:02:19 Opening file /opt/xymon/server/etc/protocols.cfg 4595 2014-08-27 09:02:19 Opening file /opt/xymon/server/etc/protocols.d/soaprotocols.cfg 4595 2014-08-27 09:02:19 Service list dump 4595 2014-08-27 09:02:19 Name : ftp 4595 2014-08-27 09:02:19 Sendtext: quit\r\n 4595 2014-08-27 09:02:19 Sendlen : 6 4595 2014-08-27 09:02:19 Exp.text: 220 4595 2014-08-27 09:02:19 Exp.len : 3 4595 2014-08-27 09:02:19 Exp.ofs : 0 4595 2014-08-27 09:02:19 Flags : 1 4595 2014-08-27 09:02:19 Port : 21 4595 2014-08-27 09:02:19 Name : ftps 4595 2014-08-27 09:02:19 Sendtext: quit\r\n 4595 2014-08-27 09:02:19 Sendlen : 6 4595 2014-08-27 09:02:19 Exp.text: 220 4595 2014-08-27 09:02:19 Exp.len : 3 4595 2014-08-27 09:02:19 Exp.ofs : 0 4595 2014-08-27 09:02:19 Flags : 5 4595 2014-08-27 09:02:19 Port : 990 4595 2014-08-27 09:02:19 Name : ssh 4595 2014-08-27 09:02:19 Sendtext: SSH-2.0-OpenSSH_4.1\r\n 4595 2014-08-27 09:02:19 Sendlen : 21 4595 2014-08-27 09:02:19 Exp.text: SSH 4595 2014-08-27 09:02:19 Exp.len : 3 4595 2014-08-27 09:02:19 Exp.ofs : 0 4595 2014-08-27 09:02:19 Flags : 1 4595 2014-08-27 09:02:19 Port : 22 4595 2014-08-27 09:02:19 Name : ssh1 4595 2014-08-27 09:02:19 Sendtext: SSH-2.0-OpenSSH_4.1\r\n 4595 2014-08-27 09:02:19 Sendlen : 21 4595 2014-08-27 09:02:19 Exp.text: SSH 4595 2014-08-27 09:02:19 Exp.len : 3 4595 2014-08-27 09:02:19 Exp.ofs : 0 4595 2014-08-27 09:02:19 Flags : 1 4595 2014-08-27 09:02:19 Port : 22 4595 2014-08-27 09:02:19 Name : ssh2 4595 2014-08-27 09:02:19 Sendtext: SSH-2.0-OpenSSH_4.1\r\n 4595 2014-08-27 09:02:19 Sendlen : 21 4595 2014-08-27 09:02:19 Exp.text: SSH 4595 2014-08-27 09:02:19 Exp.len : 3 4595 2014-08-27 09:02:19 Exp.ofs : 0 4595 2014-08-27 09:02:19 Flags : 1 4595 2014-08-27 09:02:19 Port : 22 4595 2014-08-27 09:02:19 Name : telnet 4595 2014-08-27 09:02:19 2014-08-27 09:02:21 Child process 4595 died: Signal 6 2014-08-27 09:02:24 Peer at 0.0.0.0:0 failed: Broken pipe 2014-08-27 09:02:24 Peer not up, flushing message queue 2014-08-27 09:02:24 Peer not up, flushing message queue
and the stack from the core file (using pstack)
fee5ebd4 _lwp_kill (6, 0, 0, fee3e0f0, ffffffff, 6) + 8 fedd29f0 abort (0, 1, 6666c, ffb04, feed5518, 0) + 110 0003bb94 sigsegv_handler (b, 0, ffbfb588, 1, 0, 544a8) + 30 fee5b00c __sighndlr (b, 0, ffbfb588, 3bb64, 0, 1) + c fee4f6bc call_user_handler (b, 0, 0, 0, fed32a00, ffbfb588) + 3b8 fee4f8a4 sigacthandler (b, 0, ffbfb588, 20, 0, 0) + 60 --- called from signal handler with signal 11 (SIGSEGV) --- fedc2d50 strlen (53e37, ffbfc804, ffbfbdf9, 0, 0, 0) + 50 fee319d4 vfprintf (71990, 53e28, ffbfc800, 0, a0afc, fee314d4) + ec 0002f24c dbgprintf (53e28, 0, e9768, 6f800, 71990, 6d000) + a0 0003347c dump_tcp_services (53e88, 53ea0, 53eb8, c0, a0, 67f98) + a0 00033d70 init_tcp_services (168a78, 620, 67f98, 54060, 600, 168430) + 848 0002f858 rrd_setup (98906, 6d000, 6d000, 80808080, 6d000, 0) + 164 0002fc4c find_xymon_rrd (988f4, 492e8, 53fe08c6, 53fe08c6, 988c2, 2e) + 4 00048cb0 main (98907, ffbfdba4, 988fc, 68800, 3, 49528) + 728 00015d2c _start (0, 0, 0, 0, 0, 0) + 5c
Which, if I'm reading it correctly, makes me think the application tried to read off the end of a string.
▸
--
Do things because you should, not just because you can.
John Thurston XXX-XXX-XXXX
user-ce4d79d99bab@xymon.invalid
Enterprise Technology Services
Department of Administration
State of Alaska
list John Thurston
▸
On 8/27/2014 9:27 AM, John Thurston wrote:
On 8/26/2014 11:16 AM, J.C. Cleaver wrote:On Tue, August 26, 2014 11:04 am, John Thurston wrote:I'm having difficulty with my RRD handlers crashing and leaving gaps in my databases. I mentioned this back in April, 2014 but received no responses: http://lists.xymon.com/pipermail/xymon/2014-April/039547.html- snip -I _suspect_ it is another client sending me empty messages, but how do I find it now that I have several hundred clients sending "data" messages? --As an initial step, run xymond_rrd in --debug mode... You can send the pid a -USR2 signal to toggle this setting without bouncing the process itself (be sure you're signalling xymond_rrd and not its xymond_channel parent).Ahh. I figured out what I was doing wrong. I was placing --debug in the wrong place on the line. When I put it at the end, I get debug output from the 'data' handler rather than the parent. When I do so, the log contains:2014-08-27 09:02:07 Peer not up, flushing message queue 2014-08-27 09:02:08 Peer not up, flushing message queue 2014-08-27 09:02:09 Peer not up, flushing message queue 2014-08-27 09:02:10 Peer not up, flushing message queue 2014-08-27 09:02:10 Peer not up, flushing message queue 2014-08-27 09:02:16 Peer not up, flushing message queue 2014-08-27 09:02:17 Peer not up, flushing message queue 4595 2014-08-27 09:02:19 Opening file /opt/xymon/server/etc/rrddefinitions.cfg 4595 2014-08-27 09:02:19 Want msg 1, startpos 0, fillpos 0, endpos -1, usedbytes=0, bufleft=528383 4595 2014-08-27 09:02:19 Got 230 bytes 4595 2014-08-27 09:02:19 xymond_rrd: Got message 2103 @@data#2103/soapsgdc02.soa.alaska.gov|1409158937.159587|10.210.36.22||soapsgdc02.soa.alaska.gov|trends||ETS/MsgDir
- snip -
▸
4595 2014-08-27 09:02:19 Exp.len : 3 4595 2014-08-27 09:02:19 Exp.ofs : 0 4595 2014-08-27 09:02:19 Flags : 1 4595 2014-08-27 09:02:19 Port : 22 4595 2014-08-27 09:02:19 Name : telnet 4595 2014-08-27 09:02:19 2014-08-27 09:02:21 Child process 4595 died: Signal 6 2014-08-27 09:02:24 Peer at 0.0.0.0:0 failed: Broken pipe 2014-08-27 09:02:24 Peer not up, flushing message queue 2014-08-27 09:02:24 Peer not up, flushing message queueand the stack from the core file (using pstack)fee5ebd4 _lwp_kill (6, 0, 0, fee3e0f0, ffffffff, 6) + 8 fedd29f0 abort (0, 1, 6666c, ffb04, feed5518, 0) + 110 0003bb94 sigsegv_handler (b, 0, ffbfb588, 1, 0, 544a8) + 30 fee5b00c __sighndlr (b, 0, ffbfb588, 3bb64, 0, 1) + c fee4f6bc call_user_handler (b, 0, 0, 0, fed32a00, ffbfb588) + 3b8 fee4f8a4 sigacthandler (b, 0, ffbfb588, 20, 0, 0) + 60 --- called from signal handler with signal 11 (SIGSEGV) --- fedc2d50 strlen (53e37, ffbfc804, ffbfbdf9, 0, 0, 0) + 50 fee319d4 vfprintf (71990, 53e28, ffbfc800, 0, a0afc, fee314d4) + ec 0002f24c dbgprintf (53e28, 0, e9768, 6f800, 71990, 6d000) + a0 0003347c dump_tcp_services (53e88, 53ea0, 53eb8, c0, a0, 67f98) + a0 00033d70 init_tcp_services (168a78, 620, 67f98, 54060, 600, 168430) + 848 0002f858 rrd_setup (98906, 6d000, 6d000, 80808080, 6d000, 0) + 164 0002fc4c find_xymon_rrd (988f4, 492e8, 53fe08c6, 53fe08c6, 988c2, 2e) + 4 00048cb0 main (98907, ffbfdba4, 988fc, 68800, 3, 49528) + 728 00015d2c _start (0, 0, 0, 0, 0, 0) + 5cWhich, if I'm reading it correctly, makes me think the application tried to read off the end of a string.
And I think the thing which ran off the end of the string was the debug process :p I'm still looking at source (and my C is very, very bad), but I suspect the debug print process is choking on the absence of an attribute in the procols.cfg When I turn debug off, my rrd handlers are working much better.
▸
--
Do things because you should, not just because you can.
John Thurston XXX-XXX-XXXX
user-ce4d79d99bab@xymon.invalid
Enterprise Technology Services
Department of Administration
State of Alaska
list Jeremy Laidman
▸
On 28 August 2014 05:10, John Thurston <user-ce4d79d99bab@xymon.invalid> wrote:
And I think the thing which ran off the end of the string was the debug process :p I'm still looking at source (and my C is very, very bad), but I suspect the debug print process is choking on the absence of an attribute in the procols.cfg
The "Sendtext" line has only the date string, so that's probably the data
that caused the crash. I'm guessing you haven't re-defined the
protocols.cfg entry for [telnet], so I'm struggling to imagine where the
bad data could have come from. But perhaps worth removing the "telnet"
test for that host and see if the crashing stops.
J
list John Thurston
▸
On 8/27/2014 4:08 PM, Jeremy Laidman wrote:
On 28 August 2014 05:10, John Thurston <user-ce4d79d99bab@xymon.invalid
<mailto:user-ce4d79d99bab@xymon.invalid>> wrote:
And I think the thing which ran off the end of the string was the
debug process :p I'm still looking at source (and my C is very, very
bad), but I suspect the debug print process is choking on the
absence of an attribute in the procols.cfg
The "Sendtext" line has only the date string, so that's probably the
data that caused the crash. I'm guessing you haven't re-defined the
protocols.cfg entry for [telnet], so I'm struggling to imagine where the
bad data could have come from. But perhaps worth removing the "telnet"
test for that host and see if the crashing stops.The telnet test is defined for only one host (out of 600) in my hosts.cfg, and it is never the host identified in the remnants of the crash. It looks like (with 4.3.17), in debug mode, lines 324-340 of netservices.c dump the currently defined service list. They correctly send the contents of the service list for: ftp ftps ssh ssh1 ssh2 and they print the name for: telent and then they die when they are asked to print the Sendtext for "telnet" My protocols.cfg contains:
[ssh|ssh1|ssh2] send "SSH-2.0-OpenSSH_4.1\r\n" expect "SSH" options banner port 22 [telnet] options banner,telnet port 23
There is no "send" attribute defined. I don't think there has ever been such an attribute defined. The files in the source directories (~/xymon-4.3.17/xymonnet/protocols.cfg) do not contain a "send" attribute. The "send" attribute is defined for most, but not all of the other protocols in the file. Has anyone running 4.3.17 successfully run their rrd_data handler in debug mode? I don't think it can do it. This looks to me like a code defect, wherein the dump_tcp_services function is not checking for a null string prior to finding its length and passing the information to dbgprintf. I'm just a lowly sysadmin whose last exposure to C was more than 25 years ago. Can someone with some C experience confirm or deny my hypothesis?
▸
--
Do things because you should, not just because you can.
John Thurston XXX-XXX-XXXX
user-ce4d79d99bab@xymon.invalid
Enterprise Technology Services
Department of Administration
State of Alaska