Xymon Mailing List Archive

increasing no. of hobbitd zombie's

4 messages in this thread

Michael Heinecke · Tue, 25 Oct 2005 11:57:57 +0200
Hi,
 
Once per "checkpoint-interval", I get a new hobbitd zombie process.
 
#> ps auxwww 
----snip ---
hobbit   25559  0.0  0.0     0    0 ?        Z    11:06   0:00 [hobbitd] <defunct>
hobbit   25917  0.0  0.0     0    0 ?        Z    11:16   0:00 [hobbitd] <defunct>
hobbit   26283  0.0  0.0     0    0 ?        Z    11:26   0:00 [hobbitd] <defunct>
hobbit   26648  0.0  0.0     0    0 ?        Z    11:36   0:00 [hobbitd] <defunct>
----snip ---
 
hobbitlaunch.cfg: [hobbitd] section:
 
CMD hobbitd --debug --pidfile=$BBSERVERLOGS/hobbitd.pid --restart=$BBTMP/hobbitd.chk --checkpoint-file=$BBTMP/hobbitd.chk --checkpoint-interval=600 --log=$BBSERVERLOGS/hobbitd.log --admin-senders=127.0.0.1,$BBSERVERIP

The last zombie appeared at 11:36. This is the hobbitd.log (debug mode) from around that time; it looks good, I think ;-)
 
...
2005-10-25 11:36:11 Sending heartbeat to pid 25202
2005-10-25 11:36:17 Sending heartbeat to pid 25202
2005-10-25 11:36:17 -> check_purple_status
2005-10-25 11:36:17 <- check_purple_status
2005-10-25 11:36:17 -> generate_stats
2005-10-25 11:36:17 <- generate_stats
2005-10-25 11:36:17 -> get_hts
2005-10-25 11:36:17 <- get_hts
2005-10-25 11:36:17 ->handle_status
2005-10-25 11:36:17 posting to status channel
2005-10-25 11:36:17 -> posttochannel
2005-10-25 11:36:17 Dropping message - no readers
2005-10-25 11:36:17 <-handle_status
2005-10-25 11:36:23 Sending heartbeat to pid 25202
2005-10-25 11:36:25 -> save_checkpoint
2005-10-25 11:36:25 <- save_checkpoint
2005-10-25 11:36:29 Sending heartbeat to pid 25202
2005-10-25 11:36:35 Sending heartbeat to pid 25202
...
 
This happens on Debian Linux 3.1 'Sarge'. An analogous config on several Solaris 8 SPARC boxes shows no problem.

On the Debian box, I have DISABLED bbdisplay in hobbitlaunch.cfg, because this box should only act as a kind of LAN probe (only bb-net and forwarding of LAN client statuses to a central bbdisplay).
No big issue (my next step: increase the checkpoint interval ;-), but maybe someone else will run into trouble.

:-) Michael

Michael Heinecke
CAX / UNIX&VoD

HanseNet Telekommunikation GmbH
Überseering 33 a, 22297 Hamburg
Telefon: +49 (0)40 23726-2768
Telefax: +49 (0)40 23726-3485

http://www.alice-dsl.de, http://www.hansenet.de

Michael Heinecke · Tue, 25 Oct 2005 12:45:40 +0200
Hi Arnoud,

Ahhh... OK.
Then, I guess, it is not caused exclusively by my local setup. BTW, raising '--checkpoint-interval' to 3600 seconds changes the behavior to the expected 'one zombie per hour'.

user-ae9b8668bcde@xymon.invalid : FYI ...

:-) Thanks, Michael


-----Original Message-----
From: Arnoud Post [mailto:user-714cc8ca2b55@xymon.invalid]
Sent: Tuesday, 25 October 2005 12:13
To: Heinecke, Michael HTK
Subject: Re: [hobbit] increasing no. of hobbitd zombie's


Same here on a default install on FreeBSD 5.3.

Best Regards,
Arnoud Post

Henrik Størner · Tue, 25 Oct 2005 12:50:49 +0200
quoted from Michael Heinecke
On Tue, Oct 25, 2005 at 11:57:57AM +0200, user-c9ab24a501e1@xymon.invalid wrote:
Hi,
 
Once per "checkpoint-interval", I get a new hobbitd zombie process.
 
#> ps auxwww 
----snip ---
hobbit   25559  0.0  0.0     0    0 ?        Z    11:06   0:00 [hobbitd] <defunct>
hobbit   25917  0.0  0.0     0    0 ?        Z    11:16   0:00 [hobbitd] <defunct>
hobbit   26283  0.0  0.0     0    0 ?        Z    11:26   0:00 [hobbitd] <defunct>
hobbit   26648  0.0  0.0     0    0 ?        Z    11:36   0:00 [hobbitd] <defunct>

You're right that it is related to the checkpointing: hobbitd forks a
child process to save the checkpoint file.

What I don't understand is why it isn't cleaned up afterwards. Could you
do a "ps -lw -u hobbit" ? I'm curious to see what the PPID is for these
zombies.
quoted from Michael Heinecke
This happens on Debian Linux 3.1 'Sarge'. An analogous config on several Solaris 8 SPARC boxes shows no problem.

On the Debian box, I have DISABLED bbdisplay in hobbitlaunch.cfg, because this box should only act as a kind of LAN probe (only bb-net and forwarding of LAN client statuses to a central bbdisplay).

In that case you don't need hobbitd running at all.

Hmm - perhaps this happens because there are no messages sent to this
hobbitd instance. I think that's the cause - looking over the code it
seems that if no messages arrive, the code to clean up the child
processes is never reached.

The attached patch should fix it, although it is of course a non-issue
if you stop hobbitd on this box.

Regards,
Henrik

-------------- next part --------------
--- hobbitd/hobbitd.c	2005/09/13 08:02:50	1.183
+++ hobbitd/hobbitd.c	2005/10/25 10:48:11
@@ -3337,6 +3340,7 @@
 		/*
 		 * First attend to the housekeeping chores:
 		 * - send out our heartbeat signal;
+		 * - pick up children to avoid zombies;
 		 * - rotate logs, if we have been asked to;
 		 * - re-load the bb-hosts configuration if needed;
 		 * - check for stale status-logs that must go purple;
@@ -3358,6 +3362,9 @@
 			kill(parentpid, SIGUSR2);
 		}
 
+		/* Pickup any finished child processes to avoid zombies */
+		while (wait3(&childstat, WNOHANG, NULL) > 0) ;
+
 		if (logfn && dologswitch) {
 			freopen(logfn, "a", stdout);
 			freopen(logfn, "a", stderr);
@@ -3666,9 +3673,6 @@
 				conntail->next = NULL;
 			}
 		}
-
-		/* Pickup any finished child processes to avoid zombies */
-		while (wait3(&childstat, WNOHANG, NULL) > 0) ;
 	} while (running);
 
 	/* Tell the workers we to shutdown also */
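
The reasoning behind the patch can be sketched in isolation. What follows is a minimal, hypothetical C example, not hobbitd's actual code: a loop that forks one short-lived child per iteration (standing in for the checkpoint writer) and reaps finished children as an unconditional housekeeping step, so cleanup does not depend on any message arriving. The names save_checkpoint_in_child, reap_children and run_idle_loop are invented for illustration. The sketch uses POSIX waitpid(-1, ..., WNOHANG) where the patch uses wait3(..., WNOHANG, NULL); for this purpose the two calls behave the same way.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical stand-in for the checkpoint writer: fork a child that
 * exits right away. Until the parent waits for it, the exited child
 * stays in the process table as a zombie (<defunct>). */
static void save_checkpoint_in_child(void)
{
    pid_t pid = fork();

    if (pid < 0) {
        perror("fork");
        exit(EXIT_FAILURE);
    }
    if (pid == 0) {
        /* Child: a real daemon would write its state to disk here. */
        _exit(0);
    }
    /* Parent: return immediately; the child is reaped later. */
}

/* Collect every child that has already exited, without blocking.
 * Returns the number of children reaped. */
static int reap_children(void)
{
    int status;
    int reaped = 0;

    while (waitpid(-1, &status, WNOHANG) > 0)
        reaped++;
    return reaped;
}

/* Run `rounds` iterations of an idle main loop (no messages arrive).
 * Reaping is the first housekeeping step of every iteration, so it is
 * always reached. Returns the total number of children reaped. */
int run_idle_loop(int rounds)
{
    struct timespec pause = { 0, 10 * 1000 * 1000 }; /* 10 ms */
    int status;
    int total = 0;

    assert(rounds >= 0);
    for (int i = 0; i < rounds; i++) {
        total += reap_children();    /* housekeeping comes first */
        save_checkpoint_in_child();  /* one checkpoint per iteration */
        nanosleep(&pause, NULL);     /* idle: nothing else to do */
    }

    /* On shutdown, block until the remaining children are collected. */
    while (waitpid(-1, &status, 0) > 0)
        total++;
    return total;
}
```

Called from a main() as run_idle_loop(3), this forks three children and returns 3, since each child is collected exactly once, either by the non-blocking reap or by the final blocking wait. If the reap_children() call were only reached on a code path that handles an incoming message, an idle instance would accumulate one zombie per checkpoint, which matches the behavior reported above.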
Michael Heinecke · Tue, 25 Oct 2005 13:52:51 +0200
Hi Henrik,

Here is the output of "ps -lw -u hobbit":

IVRP01:/opt/monitoring/hobbit/server/etc# ps -lw -u hobbit
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
1 S  1014 28281     1  0  60   0 -   380 nanosl ?        00:00:00 hobbitlaunch
0 S  1014 28282 28281  0  69   0 -   856 select ?        00:00:00 hobbitd
0 S  1014 28283 28281  0  69   0 -   384 select ?        00:00:00 bbproxy
0 S  1014 28288 28281  0  69   0 -   508 semtim ?        00:00:00 hobbitd_channel
0 S  1014 28289 28288  0  69   0 -   521 select ?        00:00:00 hobbitd_history
1 Z  1014 30216 28282  0  69   0 -     0 exit   ?        00:00:00 hobbitd <defunct>
0 S  1014 30243     1  0  77   0 -   578 wait4  ?        00:00:00 sh
0 S  1014 30246 30243  0  76   0 -   392 nanosl ?        00:00:00 vmstat


OK Henrik, I have now stopped hobbitd on this box and everything looks fine. Great!
I will also try out your patch soon.

:-)) Thanks! 
Michael 