Xymon Mailing List Archive

hobbit_rrd stops working after about 1 hour

9 messages in this thread

list Naeem Maqsud · Thu, 18 Aug 2005 17:02:20 -0700 ·
Hi,

I'm testing out hobbit 4.1.1 for possible migration from Big Brother (with
bbgen). I suspected scalability issues with BB as my rrd graphs were
updated intermittently. However, hobbit is exhibiting similar problems.
About an hour after restarting hobbit, the rrd graphs stop updating, except
for the CPU utilization graph of the hobbit server itself.

The hobbit server is running RedHat Linux AS 3.0. It has 2 x 2.4 GHz Xeon
processors and 1GB of memory. About 800 servers are sending updates to the
hobbit server. Another 1200 servers are getting remote tests.

Load average has stayed below 1 most of the time. CPU usage has been low
with 75% idle. 4 CPUs show up due to hyperthreading and I've noticed that
after the restart of hobbit server, hobbitd_rrd process stays on CPU3 with
100% utilization for the one hour that it is busy.

I hope someone can shed some light on this.

Thanks,
Naeem
list Naeem Maqsud · Mon, 22 Aug 2005 12:28:18 -0700 ·
Well, as nobody has suggested anything for my problem, I guess that I'm the
only one having this issue. I have managed to find the root cause. The
hobbitd_rrd process was in the "uninterruptible sleep" state most
of the time, with high iowait on the CPU it was running on. I
suspected that the problem might be due to disk IO while updating rrds for
the 2000 hosts.
I created a tmpfs filesystem and copied the rrd directory into it. Since
then (48 hours ago) my rrd graphs have been updating continuously. I do
however need to write back to disk periodically to avoid loss of data after
a reboot.
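
The symptoms described above (a process stuck in uninterruptible sleep, with high I/O wait) can be checked from the shell. What follows is only a minimal sketch, assuming a Linux system with /proc mounted; the function name proc_state is made up for this example and is not part of Hobbit.

```shell
#!/bin/sh
# Print the one-letter scheduler state of a process, read from /proc.
# "D" means uninterruptible sleep (usually waiting on disk I/O),
# "R" running or runnable, "S" interruptible sleep.
# proc_state is a hypothetical helper name, not part of Hobbit.
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status"
}

# Example: show the state of the current shell.
echo "state of PID $$: $(proc_state $$)"
```

Against the real daemon one would pass the PID of hobbitd_rrd instead of $$ and sample it several times. If the state is mostly "D" and the "wa" column of "vmstat 5" is high at the same time, that points to disk I/O rather than CPU.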

This is OK as a temporary fix but I would like to have a permanent
solution. I would like to hear from other hobbit users who have more than
1000 hosts monitored. What type of servers and disk subsystems are they
using? Perhaps my problem has to do with the RedHat and Dell server combination.
Perhaps I need to stripe over multiple spindles.

-Naeem


list Olivier Beau · Mon, 22 Aug 2005 22:30:52 +0200 ·
Hi Naeem,

I have over 18000 rrd files being updated every 5 minutes, and haven't seen any problems with them.
I'm running hobbit on a 2 x 3 GHz Compaq server with RedHat 3.0.


But I do have heavy I/O due to hobbitd_rrd, and it is becoming a problem for me.
I'm planning to add an array card with 256 MB of cache in 1 or 2 days to lower the I/O wait.
I have the feeling that hobbitd_rrd could cause performance issues for large sites and may not be fully optimized... Henrik?


Concerning your problem, I posted this earlier this month:
"hobbitd just slows down dramatically, causing the transmission of bbtest's results to take over 250s instead of 20s;
the rrd files aren't being updated anymore, and some requests to CGIs say the event is not available.
Notifications are being sent, though, and external scripts don't seem to be affected.

Doing a stop/start of hobbit solved the problem for now."

This has happened twice for me; bbtest went yellow, I got called and restarted hobbit.


Is everything nice and green for your bigbrother server itself (bbtest, bbgen, hobbitd)?
Have their execution times really changed between before and after the problem?
Do you have any interesting logs?
Do the graphs for the bigbrother server itself (or the first server in your bb-hosts file) have "holes"?


--
Olivier Beau
list Henrik Størner · Mon, 22 Aug 2005 22:56:43 +0200 ·
Naeem.Maqsud wrote:
> hobbitd_rrd process was showing to be in "uninterruptible sleep" state
> most of the time with high iowait associated with the CPU it was running
> on.  I suspected that the problem may be due to disk IO while updating
> rrds for the 2000 hosts.  I created a tmpfs filesystem and copied the
> rrd directory into it.

You might want to look at your disk hardware and the software setup.
I have 2000 hosts myself, with a total of just over 18000 RRD-files that
are updated every 5 minutes. vmstat tells me this system spends about
15% of its time in I/O wait.

This is a Debian/Linux system on Sun hardware - two SCSI disks in a
raid-1 config (Linux software raid mirror) with a reiserfs filesystem.

On Mon, Aug 22, 2005 at 10:30:52PM +0200, Olivier Beau wrote:
> i have the feeling that hobbitd_rrd could cause performance issue for
> large site and may not be fully optimized... henrik ?

It might become a problem - I agree with that.

The solutions are probably going to be the same ones you would use with any
kind of application that has a high I/O load. E.g. mail- and
news-servers face similar problems. So your choice of filesystem
and mount-options becomes important.

For Linux systems you'd definitely want to use one of the better
performing filesystems, e.g. Reiserfs or JFS. ext2/3 - in my experience,
there are tons of benchmarks pointing in whatever direction you like - 
is slower. Using the "noatime,nodiratime" mount options is also
recommended, as is the reiserfs "notail" option.
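
As an illustration of the mount options named above, an /etc/fstab entry might look like the fragment below. This is only a sketch: the device name /dev/sdb1 is a placeholder, and putting the RRD directory on its own reiserfs partition (at the path used elsewhere in this thread) is an assumption, not something described here.

```
# /etc/fstab (fragment). /dev/sdb1 is a placeholder device name.
/dev/sdb1  /usr/local/bbvar/rrd  reiserfs  noatime,nodiratime,notail  0  2
```

On a filesystem that is already mounted, the atime options can usually be applied without a reboot using "mount -o remount,noatime,nodiratime <mount point>".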

For Solaris ufs filesystems, I am told that enabling journaling (the "logging"
mount option) will boost performance significantly, although I have never
tried it myself.


Since all of the hobbitd_rrd disk activity is done by the rrdtool
library, there's not a whole lot that Hobbit can do to boost throughput -
at least not as long as we stick with rrdtool as the back-end for
graphs. And I have no intention of changing that.


Regards,
Henrik
list Olivier Beau · Tue, 23 Aug 2005 08:28:46 +0200 ·
Hi,

It happened a third time for me last night (3 times in 3 weeks). Symptoms: hobbitd seems to slow down and stops graphing.


I think Naeem and I are hitting a bug.


I looked closer last night, and I saw that hobbitd_rrd was running at 100% on
the CPU it was on. I tried to strace the process, but strace wouldn't give me any output!
I finally killed hobbitd_rrd, and everything went back to normal.
hobbitd.log has: Task rrdstatus terminated, status 1
rrd_status.log has: Worker process died with exit code 1, terminating


During normal running, vmstat shows an I/O wait of 25%.
My problems have always happened at night, exactly at the time Legato starts.


-> Something strange is happening with hobbitd_rrd when the server is under
very heavy I/O.


Henrik, could this be an OS issue, or is it more of a hobbitd_rrd problem?


Olivier


--

Olivier Beau
list Henrik Størner · Tue, 23 Aug 2005 09:16:00 +0200 ·
On Tue, Aug 23, 2005 at 08:28:46AM +0200, Olivier Beau wrote:
> i've looked closer this night, and i saw that hobbitd_rrd was running at 100% on
> the cpu it was on; i tried to strace the process, but strace wouldn't give me any output !
> i finally killed hobbitd_rrd, and everything went back to normal.

Your description sounds as if hobbitd_rrd goes into some loop. That would explain why strace shows nothing (strace only shows system call activity; if an application is looping in user mode, it won't show any system call activity).

The next time it happens, could you kill it with "kill -ABRT"? That should cause it to dump core, and that might give a clue as to where
it is looping.

> henrik, could this be a OS issue or more a hobbitd_rrd problem ?

hobbitd_rrd, I'm afraid.

Regards,
Henrik
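
A minimal sketch of the "kill -ABRT" approach suggested above, using a throwaway busy loop as a stand-in for a hobbitd_rrd process that is spinning in user mode. The stand-in and the variable names are assumptions for illustration. Whether a core file is actually written, and where, depends on the core size limit and on the kernel's core dump settings.

```shell
#!/bin/sh
# Allow core files in this shell and its children. The limit is inherited,
# so for a real daemon it must be raised before the daemon is started.
ulimit -c unlimited 2>/dev/null || echo "could not raise core limit" >&2

# Stand-in for a process looping in user mode (makes no system calls,
# so strace would show nothing for it either).
sh -c 'while :; do :; done' &
pid=$!

sleep 1
kill -ABRT "$pid"      # SIGABRT: terminate and dump core if permitted
wait "$pid"
status=$?
echo "exit status: $status"   # 134 = 128 + 6 (SIGABRT)
```

For the real daemon, the signal would go to the PID of hobbitd_rrd, and the resulting core could then be loaded with "gdb <path to hobbitd_rrd binary> <core file>" followed by "bt" to see where it was looping. Both paths are installation-specific.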
list Naeem Maqsud · Tue, 23 Aug 2005 11:34:51 -0700 ·
Olivier,

Why don't you try putting your rrd files in a tmpfs
filesystem? This seems to have resolved my rrd problem. At least you can
see whether this resolves your issue, and then you will know for sure that it is
related to disk IO. This is what I did:

1. mkdir /usr/local/bbvar/rrd_orig
2. mv /usr/local/bbvar/rrd /usr/local/bbvar/rrd_orig
3. mkdir /usr/local/bbvar/rrd
4. Add the following line to /etc/fstab:
      tmpfs /usr/local/bbvar/rrd tmpfs mode=755,rw,size=2G 0 0

5. mount /usr/local/bbvar/rrd; chown <id of bb user> /usr/local/bbvar/rrd

6. cp -pr /usr/local/bbvar/rrd_orig/rrd/* /usr/local/bbvar/rrd

7. Start hobbit

If you want to keep this as a permanent solution, then you will need to
set up a cron job to periodically copy the rrd files from the tmpfs
filesystem back to disk. This is because if you unmount the tmpfs FS all
data will be lost. You can put a line in crontab as shown below to run at
8:30 PM daily:

      30 20 * * *  rsync -av /usr/local/bbvar/rrd /usr/local/bbvar/rrd_orig

Remember that every time you reboot, you will need to copy the files from
disk to the tmpfs filesystem. You can put a line in /etc/rc.local to do
this for you.
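
The reboot and write-back steps above could be wrapped in one small script. This is a sketch under assumptions: the function names and the restore/save arguments are invented for illustration, "cp -pr" stands in for the rsync call, and the default paths are the ones from the steps above.

```shell
#!/bin/sh
# Sketch: keep a tmpfs-backed RRD directory and its on-disk copy in step.
# RRD_LIVE and RRD_DISK default to the paths used in the steps above;
# the function names are hypothetical.
RRD_LIVE=${RRD_LIVE:-/usr/local/bbvar/rrd}
RRD_DISK=${RRD_DISK:-/usr/local/bbvar/rrd_orig/rrd}

# Copy the on-disk files into the (freshly mounted, empty) tmpfs directory.
restore_rrd() {
    cp -pr "$RRD_DISK"/. "$RRD_LIVE"/
}

# Copy the live files back to disk so that a reboot loses at most the
# updates made since the last save.
save_rrd() {
    mkdir -p "$RRD_DISK" && cp -pr "$RRD_LIVE"/. "$RRD_DISK"/
}

case "${1:-}" in
    restore) restore_rrd ;;
    save)    save_rrd ;;
esac
```

With a script like this, /etc/rc.local would call it with "restore" after the tmpfs is mounted and before hobbit starts, and the cron entry would call it with "save".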

Hope this helps.

-Naeem


list Terry Rossi · 24 Aug 2005 23:53:12 GMT ·
Olivier,

Why don't you try the  approach of putting your rrd files in a tmpfs
filesystem? This seems to have resolved my rrd problem. At least you can
try to see if this resolves your issue and then you know for sure it is
related to disk IO. This is what I did:

1. mkdir /usr/local/bbvar/rrd_orig
2. mv /usr/local/bbvar/rrd /usr/local/bbvar/rrd_orig
3. mkdir /usr/local/bbvar/rrd
4. Add the following line to /etc/fstab:
      tmpfs /usr/local/bbvar/rrd tmpfs mode=755,rw,size=2G 0 0

5. mount /usr/local/bbvar/rrd; chown <id of bb user> /usr/local/bbvar/rrd

6. cp -pr /usr/local/bbvar/rrd_orig/rrd/* /usr/local/bbvar/rrd

7. Start hobbit

If you want to keep this as a permanent solution, then you will need to
set up a cron job to periodically copy the rrd files from the tmpfs
filesystem back to disk. This is because if you unmount the tmpfs FS all
data will be lost. You can put a line in crontab as shown below to run at
8:30 PM daily:

      30 20 * * *  rsync -av /usr/local/bbvar/rrd /usr/local/bbvar/rrd_orig

Remember that every time you reboot, you will need to copy the files from
disk to the tmpfs filesystem. You can put a line in /etc/rc.local to do
this for you.

Hope this helps.

-Naeem


list Terry Rossi · 25 Aug 2005 00:24:06 GMT ·
Hi,

I'm testing out hobbit 4.1.1 for possible migration from big brother (with
bbgen). I suspected scalability issues with BB as my rrd graphs were
updated intermittently. However, hobbit is exhibiting similar problems.
After about 1 hr of restarting hobbit, the rrd graphs stop updating except
for the cpu utilization for the hobbit server itself.

The hobbit server is running RedHat Linux AS 3.0. It has 2 x 2.4 GHz Xeon
processors and 1GB of memory. About 800 servers are sending updates to the
hobbit server. Another 1200 servers are getting remote tests.

Load average has stayed below 1 most of the time. CPU usage has been low
with 75% idle. 4 CPUs show up due to hyperthreading and I've noticed that
after the restart of hobbit server, hobbitd_rrd process stays on CPU3 with
100% utilization for the one hour that it is busy.

I hope someone can shed some light on this.

Thanks,
Naeem