New Hobbit stuff: Scalability and H/A work

10 messages in this thread

list Henrik Størner · Tue, 31 Oct 2006 17:03:34 +0100 ·

A couple of weeks ago, I was asked if our Hobbit system at work could
handle monitoring of one more customer. Of course, I said - no problem.
Well, there was one gotcha: This customer has 1100+ servers that need 
to be monitored. Which means my Hobbit installation is about to double 
in the number of hosts monitored. Hmm ...

This will be interesting to watch. I am fairly confident that Hobbit can
handle it, with one exception: The disks on my Hobbit server will be
overloaded. It already spends about 50% of its time in I/O wait, so
doubling the number of hosts with cpu/memory/disk etc. graphs will
probably crash it.

So something needs to be done - fast: These hosts should go into our
Hobbit before Christmas. That is why I currently may seem a bit absent 
from the mailing list.

The way I plan to handle it will be by distributing the load of RRD
updates onto several servers, each handling a subset of the total set of
hosts; Hobbit will automatically detect which of the 3 RRD-servers
handles a specific host and direct rrd-updates to that server. New
hosts are distributed across all of the RRD servers in a weighted 
round-robin fashion.
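
A minimal sketch of the assignment scheme described above, assuming a "sticky" host-to-server table and integer weights. This is an illustration only, not Hobbit's implementation; the class name, server names and weights are all hypothetical.

```python
from itertools import cycle


class RrdServerAssigner:
    """Sticky host-to-server mapping with weighted round-robin for new hosts.

    Hypothetical illustration only; not Hobbit's actual implementation.
    """

    def __init__(self, weights):
        # weights: dict of server name -> positive integer weight.
        # Expand into a repeating schedule, e.g. {"a": 2, "b": 1} -> a, a, b, a, a, b, ...
        schedule = [name for name, w in weights.items() for _ in range(w)]
        if not schedule:
            raise ValueError("at least one server with a positive weight is required")
        self._next_server = cycle(schedule)
        self._assigned = {}  # host -> server; a host never moves once assigned

    def server_for(self, host):
        # Known hosts keep their server, so their RRD files stay in one place.
        if host not in self._assigned:
            self._assigned[host] = next(self._next_server)
        return self._assigned[host]


assigner = RrdServerAssigner({"rrd1": 2, "rrd2": 1, "rrd3": 1})
for host in ["web1", "web2", "db1", "db2", "mail1"]:
    print(host, "->", assigner.server_for(host))
```

With these example weights, rrd1 receives two of every four new hosts. The simple schedule hands a server its whole share consecutively; a smoother interleaving is possible but not needed to show the idea.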

The way this is going to be done means that it can be used not only for
distributing the load of the RRD file updates, but also for distributing
the other hobbitd_* modules (alerting, history logs, client data
processing etc). In other words, this will be a major win for Hobbit in
large installations.

It also has one more benefit: I think this can be evolved to handle
automatic failover, so you can run multiple Hobbit servers that process
the same data - meaning all of the on-disk data will be identical across all
of the Hobbit servers. This should make it possible to set up a group of
Hobbit servers for very high availability of the monitoring system. I 
haven't worked out all of the implementation details yet, but I think it
is possible.


Regards,
Henrik

list Gildas le Nadan · Tue, 31 Oct 2006 16:48:34 +0000 ·

▸ quoted from Henrik Størner

Henrik Stoerner wrote:

A couple of weeks ago, I was asked if our Hobbit system at work could
handle monitoring of one more customer. Of course, I said - no problem.
Well, there was one gotcha: This customer has 1100+ servers that need to be monitored. Which means my Hobbit installation is about to double in the number of hosts monitored. Hmm ...

This will be interesting to watch. I am fairly confident that Hobbit can
handle it, with one exception: The disks on my Hobbit server will be
overloaded. It already spends about 50% of its time in I/O wait, so
doubling the number of hosts with cpu/memory/disk etc. graphs will
probably crash it.

Strange that I/O seems to be an issue for you. What kind of system do you run the Hobbit server on?

I have ~3300 RRD files updated here on a blade with an old 40 GB, 5400 rpm, 2.5" hard drive, and it is almost idle.

Cheers,
Gildas

list Pnixon · Tue, 31 Oct 2006 11:53:10 -0500 ·

This is off topic, but how do you guys measure your disk I/O?

--Pat


list Gildas le Nadan · Tue, 31 Oct 2006 16:57:41 +0000 ·

▸ quoted from Pnixon

user-c102b8958c7a@xymon.invalid wrote:

While off topic, but how do you guys measure your IO to disk?

--Pat

I use vmstat or iostat
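
For readers without those tools at hand, here is a rough sketch of the I/O wait figure that vmstat reports, computed from two samples of /proc/stat. It assumes a Linux system (it will not work on Solaris), and the helper names are made up for this example.

```python
import os
import time


def parse_cpu_line(stat_text):
    """Return the aggregate CPU counters from /proc/stat contents as a list of ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(v) for v in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Share of CPU time spent in I/O wait between two samples, in percent."""
    deltas = [b - a for a, b in zip(before, after)]
    # Use only the first 8 fields (user .. steal); guest time is already
    # counted inside user/nice, so adding it would double-count.
    total = sum(deltas[:8])
    # Field order: user, nice, system, idle, iowait, ... so iowait is index 4.
    return 100.0 * deltas[4] / total if total else 0.0


if os.path.exists("/proc/stat"):  # Linux only
    with open("/proc/stat") as f:
        first = parse_cpu_line(f.read())
    time.sleep(1)
    with open("/proc/stat") as f:
        second = parse_cpu_line(f.read())
    print(f"iowait: {iowait_percent(first, second):.1f}%")
```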

Gildas

list Henrik Størner · Tue, 31 Oct 2006 18:05:55 +0100 ·

▸ quoted from Gildas le Nadan

On Tue, Oct 31, 2006 at 04:48:34PM +0000, Gildas Le Nadan wrote:

Henrik Stoerner wrote:

This will be interesting to watch. I am fairly confident that Hobbit can
handle it, with one exception: The disks on my Hobbit server will be
overloaded. It already spends about 50% of its time in I/O wait, so
doubling the number of hosts with cpu/memory/disk etc. graphs will
probably crash it.

Strange that I/O seems to be an issue for you. What kind of system do you
run the hobbit server on?

The server is an oldish Sun E220 with two new 72 GB SCSI disks, 10K rpm.

▸ quoted from Pnixon

I have ~3300 RRD files updated here on a blade with an old 40 GB, 5400 rpm, 2.5"
hard drive, and it is almost idle.

I have about 25000 RRD files :-) That's about 80 updates every second.
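
The arithmetic behind that figure can be checked quickly. The 5-minute update interval is an assumption here (it is not stated in the message); the file count is the one quoted above.

```python
RRD_FILES = 25_000          # figure quoted in the message
UPDATE_INTERVAL_S = 5 * 60  # assumption: each file is updated once per 5 minutes

updates_per_second = RRD_FILES / UPDATE_INTERVAL_S
print(f"{updates_per_second:.1f} updates/second")  # prints "83.3 updates/second"
```

Under that assumption the result is consistent with "about 80 updates every second".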


Regards,
Henrik

list Francesco Duranti · Tue, 31 Oct 2006 18:08:37 +0100 ·

It will be a really nice addition :D

If it also allows creating custom programs for RRD creation, with a
config file or Makefile for linking against the Hobbit library, it will be
fantastic :D

Francesco


list T.J. Yang · Tue, 31 Oct 2006 12:15:41 -0600 ·

▸ quoted from Francesco Duranti

From: user-ce4a2c883f75@xymon.invalid (Henrik Stoerner)
Reply-To: user-ae9b8668bcde@xymon.invalid
To: user-ae9b8668bcde@xymon.invalid
Subject: [hobbit] New Hobbit stuff: Scalability and H/A work
Date: Tue, 31 Oct 2006 17:03:34 +0100

A couple of weeks ago, I was asked if our Hobbit system at work could
handle monitoring of one more customer. Of course, I said - no problem.
Well, there was one gotcha: This customer has 1100+ servers that need
to be monitored. Which means my Hobbit installation is about to double
in the number of hosts monitored. Hmm ...

This will be interesting to watch. I am fairly confident that Hobbit can
handle it, with one exception: The disks on my Hobbit server will be
overloaded. It already spends about 50% of its time in I/O wait, so
doubling the number of hosts with cpu/memory/disk etc. graphs will
probably crash it.

So something needs to be done - fast: These hosts should go into our
Hobbit before Christmas. That is why I currently may seem a bit absent
from the mailing list.

The way I plan to handle it will be by distributing the load of RRD
updates onto several servers, each handling a subset of the total set of
hosts; Hobbit will automatically detect which of the 3 RRD-servers
handles a specific host and direct rrd-updates to that server. New
hosts are distributed across all of the RRD servers in a weighted
round-robin fashion.

So this is load balancing: each Hobbit server keeps a portion of all the
hosts' RRD files.

▸ quoted from Francesco Duranti

The way this is going to be done means that it can be used not only for
distributing the load of the RRD file updates, but also for distributing
the other hobbitd_* modules (alerting, history logs, client data
processing etc). In other words, this will be a major win for Hobbit in
large installations.

It also has one more benefit: I think this can be evolved to handle
automatic failover, so you can run multiple Hobbit servers that process
the same data - meaning all of the on-disk data will be identical across all
of the Hobbit servers. This should make it possible to set up a group of
Hobbit servers for very high availability of the monitoring system. I
haven't worked out all of the implementation details yet, but I think it
is possible.

Please check out FSTHA (R1), the open-source cluster/HA software for Solaris.
Then the only part you need to work on is the "Distributed BB messages
distribution".

R1: http://www.fstha.com/compare.html

Regards

tj


Regards,
Henrik


list Beau Olivier · Thu, 2 Nov 2006 08:51:20 +0100 ·

Oh, this looks very interesting :)

I have 2 questions:
- If the heavy work (RRD updates) is distributed across 2 or more dedicated "RRD servers",
  the data won't be redundant across those servers, right?
- What would happen if an "RRD server" crashes?


I recently replaced my 2 Hobbit servers, which now have 2 GB of RAM each.
Would it make sense for Hobbit to evolve to take full advantage of that memory?


olivier


list Shawn Saunders · Thu, 02 Nov 2006 09:18:51 -0800 ·

I have created an 'ext' script that is set in the clientlaunch.cfg to run on an interval of 7 days "INTERVAL 7d"

After it has run, going green as it should, it goes purple in about 15 minutes and will go red shortly thereafter.  But it is not scheduled to run again for 7 days.

I even tried putting an entry on the server-side:
 hobbit-clients.cfg:     sysdb   8d

This was to no avail.
What should I do so that this only goes red or purple if it fails the next time it is run?  It appears that every time the Hobbit client reports, it views this weekly scheduled event as a failure.
-- 

Shawn Saunders MIST Team member
Jet Propulsion Laboratory XXXX Oak Grove Drive Pasadena, CA XXXXX MS XXX-310 Office: (XXX) XXX-XXXX / 230-310A Mobile: (XXX) XXX-XXXX
FAX: (XXX) XXX-XXXX

Any fool can make things bigger, more complex, and more violent. It takes a touch of genius-and a lot of courage-to move in the opposite direction.
Albert Einstein

list Rich Smrcina · Thu, 02 Nov 2006 15:05:40 -0600 ·

When you send your status message, use 'status+7d'.

The exact syntax is on the man page for the bb command.
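
As a rough illustration of what such a message looks like, here is a small sketch that composes a status line with a validity period. The helper function, host name and message text are hypothetical; writing hostname dots as commas follows the usual BB convention and is an assumption about this setup. Check the bb man page for the authoritative syntax.

```python
def build_status_message(host, test, color, text, lifetime=None):
    """Compose a Hobbit/BB status message, optionally with a validity period."""
    command = "status" if lifetime is None else f"status+{lifetime}"
    # BB convention (assumed here): dots in the hostname are written as commas.
    host_field = host.replace(".", ",")
    return f"{command} {host_field}.{test} {color} {text}"


msg = build_status_message("db1.example.com", "sysdb", "green",
                           "weekly check OK", lifetime="7d")
print(msg)  # prints "status+7d db1,example,com.sysdb green weekly check OK"
```

An ext script would typically hand the resulting string to the bb command. A lifetime slightly longer than the run interval (for example 8d for a 7d interval) may leave some margin before the status turns purple.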

--


Rich Smrcina
VM Assist, Inc.
Phone: XXX-XXX-XXXX
Ans Service:  XXX-XXX-XXXX
user-61add9955ef9@xymon.invalid

Catch the WAVV!  http://www.wavv.org
WAVV 2007 - Green Bay, WI - May 18-22, 2007
