xymon hostdata module going rogue
On Wed, June 10, 2015 10:01 am, Scot Kreienkamp wrote:
Hi everyone,

I have a xymon server running 4.3.21 that seems to be accumulating processes like these:

hobbit 28430 0.0 0.0 0 0 ? Z 12:50 0:00 [xymond_hostdata] <defunct>
hobbit 28435 0.0 0.0 0 0 ? Z 12:50 0:00 [xymond_hostdata] <defunct>
hobbit 28440 0.0 0.0 0 0 ? Z 12:50 0:00 [xymond_hostdata] <defunct>
hobbit 28444 0.0 0.0 0 0 ? Z 12:50 0:00 [xymond_hostdata] <defunct>
hobbit 28449 0.0 0.0 0 0 ? Z 12:50 0:00 [xymond_hostdata] <defunct>
hobbit 28452 0.0 0.0 0 0 ? Z 12:50 0:00 [xymond_hostdata] <defunct>

It seemed related to drop messages, so I did a test.

[hobbit@retv6100 temp]$ xymon 127.0.0.1 "drop amds7101_na_lzb_hq" ; ps auxw |grep xymond_hostdata |wc -l
161
[hobbit@retv6100 temp]$ xymon 127.0.0.1 "drop amds7101_na_lzb_hq" ; ps auxw |grep xymond_hostdata |wc -l
162
[hobbit@retv6100 temp]$ xymon 127.0.0.1 "drop amds7101_na_lzb_hq" ; ps auxw |grep xymond_hostdata |wc -l
163
[hobbit@retv6100 temp]$ xymon 127.0.0.1 "drop amds7101_na_lzb_hq" ; ps auxw |grep xymond_hostdata |wc -l
164
[hobbit@retv6100 temp]$ xymon 127.0.0.1 "drop amds7101_na_lzb_hq" ; ps auxw |grep xymond_hostdata |wc -l
165
[hobbit@retv6100 temp]$ xymon 127.0.0.1 "drop amds7101_na_lzb_hq" ; ps auxw |grep xymond_hostdata |wc -l
166
[hobbit@retv6100 temp]$ xymon 127.0.0.1 "drop amds7101_na_lzb_hq" ; ps auxw |grep xymond_hostdata |wc -l
167

So every time I send a drop message I get a defunct process hanging out. Bug in Xymon? This is on RHEL5, xymon 4.3.21.

Thanks!
Scot,

Some background: When doing a full drop on a host, xymond_hostdata (and xymond_history, IIRC) forks to perform the recursive directory removal of history files and whatnot in the background, then exits out. That's why it corresponds to those events.

Looks like xymond_hostdata.c is missing a SIGCHLD registration, which is causing the defunct processes to stack up. Strangely, I haven't observed this behavior on RHEL6 at all though, even though we're dropping hosts all the time. Odd.

The following patch should fix the issue for you, I believe.

Regards,
-jc
Attachments (1)