Xymon Mailing List Archive

hobbit_rrd stops working after about 1 hour

list Olivier Beau
Tue, 23 Aug 2005 08:28:46 +0200
Message-Id: <user-c59f2f7ab3b5@xymon.invalid>

Hi,

It happened a third time for me last night (3 times in 3 weeks). Symptoms: hobbitd seems to slow down and stops graphing.


I think Naeem and I are hitting a bug.


I looked closer last night and saw that hobbitd_rrd was running at 100% on
the CPU it was on. I tried to strace the process, but strace wouldn't give me any output!
I finally killed hobbitd_rrd, and everything went back to normal.
hobbitd.log has : Task rrdstatus terminated, status 1
rrd_status.log has : Worker process died with exit code 1, terminating


During normal running, vmstat shows an I/O wait of 25%.
My problems always happened at night, exactly at the time Legato starts.


-> Something strange is happening with hobbitd_rrd when the server is under
very heavy I/O.
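
One way to tell whether a process such as hobbitd_rrd is spinning on the CPU or blocked on disk is to read its state letter from /proc/<pid>/stat on Linux. The following is only a minimal sketch, not part of Hobbit; the function names and the STATE_NAMES table are made up for illustration.

```python
# Hypothetical helper, not part of Hobbit: report the scheduler state of a process.
STATE_NAMES = {
    "R": "running (using CPU)",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep (usually waiting on I/O)",
    "T": "stopped",
    "Z": "zombie",
}

def parse_proc_stat(line):
    """Return (command name, state letter) from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST closing parenthesis.
    open_idx = line.index("(")
    close_idx = line.rindex(")")
    comm = line[open_idx + 1:close_idx]
    state = line[close_idx + 1:].split()[0]
    return comm, state

def describe_pid(pid):
    """Read /proc/<pid>/stat (Linux only) and describe the process state."""
    with open(f"/proc/{pid}/stat") as fh:
        comm, state = parse_proc_stat(fh.read())
    return f"{comm}: {state} ({STATE_NAMES.get(state, 'other')})"
```

Calling describe_pid() with the PID of hobbitd_rrd a few times in a row should show mostly "R" for a process that is busy in user space and mostly "D" for one that is stuck waiting on the disk.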


Henrik, could this be an OS issue, or is it more of a hobbitd_rrd problem?


Olivier


Quoting user-04efa50aa241@xymon.invalid:
Well, as nobody has suggested anything for my problem, I guess that I'm the
only one having this issue. I have managed to find the root cause. The
hobbitd_rrd process was showing up in "uninterruptible sleep" state most
of the time with high iowait associated with the CPU it was running on. I
suspected that the problem may be due to disk IO while updating rrds for
the 2000 hosts.
I created a tmpfs filesystem and copied the rrd directory into it. Since
then (48 hours ago) my rrd graphs have been updating continuously. I do
however need to write back to disk periodically to avoid loss of data after
a reboot.
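
The periodic write-back from the tmpfs copy to disk could be sketched roughly as below. This is only an illustration under assumptions: the function name write_back and both directory arguments are hypothetical, and it makes no attempt to coordinate with hobbitd_rrd, so a file copied in the middle of an update may be slightly stale or inconsistent.

```python
import os
import shutil

def write_back(src_dir, dst_dir):
    """Copy files from src_dir (e.g. the tmpfs copy) to dst_dir (on disk)
    when they are missing there or have a newer mtime. Returns the number
    of files copied."""
    copied = 0
    for root, _dirs, files in os.walk(src_dir):
        rel = os.path.relpath(root, src_dir)
        target_root = os.path.normpath(os.path.join(dst_dir, rel))
        os.makedirs(target_root, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(target_root, name)
            if (not os.path.exists(dst)
                    or os.stat(src).st_mtime_ns > os.stat(dst).st_mtime_ns):
                # Copy to a temporary name first, then rename, so a crash
                # never leaves a half-written file under the real name.
                tmp = dst + ".tmp"
                shutil.copy2(src, tmp)  # copy2 also preserves the mtime
                os.replace(tmp, dst)
                copied += 1
    return copied
```

Run from cron every few minutes, something like this limits the data lost on a reboot to the interval between runs; the tmpfs copy would still have to be repopulated from disk at boot before hobbit starts.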

This is OK as a temporary fix but I would like to have a permanent
solution. I would like to hear from other hobbit users who have more than
1000 hosts monitored. What type of servers and disk subsystems are they
using? Perhaps my problem has to do with the RedHat and Dell server combination.
Perhaps I need to stripe over multiple spindles.

-Naeem


Naeem Maqsud/SYBASE
08/18/2005 05:02 PM
To: user-ae9b8668bcde@xymon.invalid
cc:
Subject: hobbit_rrd stops working after about 1 hour


Hi,

I'm testing out hobbit 4.1.1 for possible migration from big brother (with
bbgen). I suspected scalability issues with BB as my rrd graphs were
updated intermittently. However, hobbit is exhibiting similar problems.
About 1 hr after restarting hobbit, the rrd graphs stop updating except
for the cpu utilization for the hobbit server itself.

The hobbit server is running RedHat Linux AS 3.0. It has 2 x 2.4 GHz Xeon
processors and 1GB of memory. About 800 servers are sending updates to the
hobbit server. Another 1200 servers are getting remote tests.

Load average has stayed below 1 most of the time. CPU usage has been low
with 75% idle. 4 CPUs show up due to hyperthreading and I've noticed that
after the restart of hobbit server, hobbitd_rrd process stays on CPU3 with
100% utilization for the one hour that it is busy.

I hope someone can shed some light on this.

Thanks,
Naeem

--
Olivier Beau