Xymon Mailing List Archive

Loadbalancing Hobbit Server

7 messages in this thread

list Ram Prasad · Mon, 12 Feb 2007 14:02:16 -0500 ·
Hi

I would like to implement hobbit collecting data from a lot of hosts.
Considering that, is there a way to load balance or distribute the
monitoring across multiple hobbit servers?

- Ram
list Henrik Størner · Mon, 12 Feb 2007 22:22:44 +0100 ·
quoted from Ram Prasad
On Mon, Feb 12, 2007 at 02:02:16PM -0500, Prasad, Ram (GE, Corporate, consultant) wrote:
> I would like to implement hobbit collecting data from a lot of hosts.
> Considering that, is there a way to load balance or distribute the
> monitoring across multiple hobbit servers?

Not in the current (4.2.0) version, but some work has been done for the
next release. Hobbit *is* designed to have only one server that has full
knowledge of the current status of each system being monitored, but you
can distribute various components of Hobbit onto other servers:

* Network tests can be distributed among multiple servers - typically
  you will have a network test server that handles a specific part of
  your network topology. E.g. one network test server for the DMZ
  systems, and another for your internal-only servers. This is available
  in 4.2.0.

* History logs and RRD files - i.e. the Hobbit modules that need to
  store data on disk - can be distributed among multiple servers.
  Hobbit will automatically send updates to the correct server, and
  fetch data from the server holding it when generating webpages.
  This also applies to the client-data logs that are stored when
  a critical event occurs. (4.3.0)

* Client data analysis can be performed on any server. Hobbit can
  feed the client data to any server capable of doing this. (4.3.0)

BTW, the stuff listed as present in the next (4.3.0) release is already
in the current snapshot, and at least the RRD load balancing is in
production use now.

But before you start planning to deploy large farms of Hobbit servers to
handle all of your hosts, do a trial installation and see what the load
on your Hobbit servers will be. At work, I have single servers handling 
over 3000 hosts - and a 5 year old Hobbit server handles it just fine.
The only task we've split off is the RRD updates, since the disk system
on our Hobbit server couldn't handle the flood of RRD file updates (we
have about 25000 RRD files being updated every 5 minutes).
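
As a rough back-of-the-envelope illustration of that RRD load (simple
arithmetic on the figures above, not a measurement, and assuming the
updates are spread evenly over the 5-minute cycle):

```python
# Figures taken from the message above.
rrd_files = 25000
interval_seconds = 5 * 60

# Average write rate if updates were spread evenly over the interval.
updates_per_second = rrd_files / interval_seconds
print(f"{updates_per_second:.1f} RRD updates per second on average")
```

In practice the updates tend to arrive in bursts, so the peak rate the
disk system sees can be well above this average.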


Regards,
Henrik
list Scott Walters · Mon, 19 Feb 2007 15:41:47 -0500 ·
quoted from Henrik Størner
> * History logs and RRD files - i.e. the Hobbit modules that need to
>   store data on disk - can be distributed among multiple servers.
>   Hobbit will automatically send updates to the correct server, and
>   fetch data from the server holding it when generating webpages.
>   This also applies to the client-data logs that are stored when
>   a critical event occurs. (4.3.0)

Is this leading the way to a hot-cold or hot-hot HA setup?  I understand how
one server could distribute jobs to a farm.  But what if the central server
goes down?  If data is sync'ed throughout the environment how could the
'freshest' be guaranteed through failures?

I am looking to implement a HA Hobbit solution where 10 minutes of recovery
is acceptable while preserving historical data.

Are you just writing 'load sharing logic' or do you plan on developing
failover/recovery logic as well?

Scott Walters
-PacketPusher
list Henrik Størner · Mon, 19 Feb 2007 23:33:42 +0100 ·
Hi Scott,

you're always asking interesting questions :-)
quoted from Scott Walters

On Mon, Feb 19, 2007 at 03:41:47PM -0500, Scott Walters wrote:
>> * History logs and RRD files - i.e. the Hobbit modules that need to
>>   store data on disk - can be distributed among multiple servers.
>>   Hobbit will automatically send updates to the correct server, and
>>   fetch data from the server holding it when generating webpages.
>>   This also applies to the client-data logs that are stored when
>>   a critical event occurs. (4.3.0)
>
> Is this leading the way to a hot-cold or hot-hot HA setup?  I understand how
> one server could distribute jobs to a farm.  But what if the central server
> goes down?  If data is sync'ed throughout the environment how could the
> 'freshest' be guaranteed through failures?
>
> I am looking to implement a HA Hobbit solution where 10 minutes of recovery
> is acceptable while preserving historical data.
>
> Are you just writing 'load sharing logic' or do you plan on developing
> failover/recovery logic as well?

The immediate need I had was load sharing. But I believe it can be used
to implement failover as well. Let me explain.

A Hobbit server consists of one core daemon which has all of the current
state information, and a bunch of more-or-less stateless "task"
handlers. There's an "update the RRD files" task, an "analyze the client
data" task, a "send out the alerts" task, and a "run the network tests"
task. Plus some more, but you get the picture.

My plan is that you can have multiple servers running each of these
tasks, and you can duplicate the tasks so they run on multiple servers.
When each task is initialized, it tells Hobbit that "hey, I'm here and I
can do alerts" - and then it basically just goes to sleep until it is 
notified that now it should actually do something. So, whenever the
Hobbit server needs to hand off some action to a task, it checks what
servers can handle it and just picks one that is available.

The information about what servers are available for handling the
various tasks is contained in a small daemon running on the Hobbit
server; think of it as a kind of "Hobbit-DNS" except that it is updated
automatically.
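
A minimal sketch of the registration idea described above, in Python.
The class and method names are hypothetical and not part of Hobbit; the
sketch only shows task handlers announcing themselves and the core
picking one that is available:

```python
from collections import defaultdict


class TaskRegistry:
    """Hypothetical in-memory record of which servers offer which tasks."""

    def __init__(self):
        # task name -> servers, in the order they registered
        self._servers = defaultdict(list)
        self._down = set()

    def register(self, task, server):
        """Called by a task handler: 'hey, I'm here and I can do <task>'."""
        if server not in self._servers[task]:
            self._servers[task].append(server)
        self._down.discard(server)

    def mark_down(self, server):
        """Record that a server stopped responding."""
        self._down.add(server)

    def pick(self, task):
        """Return the first available server for the task, or None."""
        for server in self._servers.get(task, []):
            if server not in self._down:
                return server
        return None


registry = TaskRegistry()
registry.register("client", "server-a")
registry.register("client", "server-b")
registry.mark_down("server-a")
print(registry.pick("client"))
```

How the real daemon detects that a server is down is not described in
this thread; `mark_down` simply stands in for that step.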

Some tasks can run on any of the available servers. E.g. analyzing the
client data can be done on any server running the hobbitd_client module;
so it doesn't matter which of the available "client task" servers is
invoked. (Obviously, the config files must be kept in sync on the
servers, but that's why we have tools like rsync).

Some tasks store data - e.g. the RRD files. Those tasks can run on
multiple servers, BUT: For any given host, there will be only one server
holding the data. It's no good feeding the RRD updates to server A at
10:00 AM, and server B at 10:05 - because that would break the RRD data.
So if the RRD files for "www.foo.com" live on server A, and that server
crashes, then you will lose access to the RRD files for www.foo.com -
but RRD files for hosts on the other servers will still be available.
History logs are handled like RRD files. Now, you can argue that it
would be nice if you could replicate the RRD- or history-updates to
multiple servers so you would have a complete failover where you
wouldn't lose access to some of the data. If there are enough requests
it can be added - there's nothing in the design that prevents it. But
perhaps it would just be simpler to mirror those files between the
servers at regular intervals through some other program.
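
To illustrate the "one server per host" rule, here is a small Python
sketch. The thread does not say how Hobbit decides which server holds a
given host's data, so the least-loaded placement below is purely an
assumption, and the class name is hypothetical. The point is only that
the choice, once made, never changes for that host:

```python
class StickyAssigner:
    """Hypothetical host -> storage server map; a host never moves."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._owner = {}  # hostname -> server holding its RRD/history data

    def owner(self, hostname):
        """Return the server holding this host's data, assigning one if new."""
        if hostname not in self._owner:
            # Assumed policy: place a new host on the server with the fewest
            # hosts so far (ties go to the first server listed).
            counts = {server: 0 for server in self._servers}
            for server in self._owner.values():
                counts[server] += 1
            self._owner[hostname] = min(self._servers, key=lambda s: counts[s])
        return self._owner[hostname]


assigner = StickyAssigner(["server-a", "server-b"])
first = assigner.owner("www.foo.com")
# Later updates for the same host go to the same server.
assert assigner.owner("www.foo.com") == first
```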

There are some tasks that can only run on one server at a time: E.g.
the "send out alerts" task is one you wouldn't want to duplicate. So 
for this type of task, Hobbit will initially pick one server to handle 
it, and only if that server fails will it switch to another server.
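
A sketch of that "one active server at a time" behaviour, again in
Python with hypothetical names. The `is_alive` callable stands in for
whatever health check is used, which the thread does not specify:

```python
class SingleActiveTask:
    """Hypothetical selector for a task that must run on one server only."""

    def __init__(self, candidates):
        self._candidates = list(candidates)
        self._active = self._candidates[0] if self._candidates else None

    def active(self, is_alive):
        """Return the server that should run the task right now."""
        # Stay with the current server for as long as it is alive.
        if self._active is not None and is_alive(self._active):
            return self._active
        # Only when it has failed do we switch to another candidate.
        for server in self._candidates:
            if is_alive(server):
                self._active = server
                return server
        self._active = None
        return None
```

Note that in this sketch the task does not move back when the original
server recovers. That avoids alerts bouncing between servers, but it is
a design choice of the sketch rather than something stated above.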


So now there's a mechanism in place for having fail-over servers for the
critical tasks, and load-balancing tasks among multiple servers. The
missing piece is to duplicate the core Hobbit server, and replicate the
information that is stored there (the current state of the system, and
the what-servers-run-what-tasks info). That's the part I haven't quite
worked out yet.

Replicating the data is fairly straight-forward. Hobbit already has a
mechanism in place for saving the current state in a "checkpoint" file,
so it can restart without losing the current state info. So replication
can be done by putting in some method for requesting the checkpoint
data. Sure, you'd lose a few minutes worth of updates in case of a
failover - depending upon how often you update the standby-servers'
data from the checkpoint - but since Hobbit updates everything every 5
minutes, I don't think that will be a major issue.

The tricky part is deciding when to do the failover. My current plan is
to have a "standby" option for the backup Hobbit daemon where it just 
loads and picks up the checkpoint data from the master server at regular
intervals; once that fails it goes on-line and starts behaving like a 
regular Hobbit daemon. That would suffice for a 2-server/hot-cold setup, 
and makes matters a lot less complicated (eg I won't have to deal with 
the issue of deciding who has the most recent data).
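
The standby behaviour could be sketched roughly as follows. This is
Python for illustration only: `fetch_checkpoint` is a hypothetical
callable that retrieves the master's checkpoint data and raises OSError
when the master cannot be reached, and `interval` and `max_failures`
are assumed tuning knobs, not existing Hobbit options:

```python
import time


def run_standby(fetch_checkpoint, interval=60.0, max_failures=1, sleep=time.sleep):
    """Poll the master until it fails, then return the last good checkpoint.

    The caller would go on-line as a regular daemon using the returned state.
    """
    last_good = None
    failures = 0
    while failures < max_failures:
        try:
            last_good = fetch_checkpoint()
            failures = 0  # master answered, so reset the failure counter
        except OSError:
            failures += 1
        if failures < max_failures:
            sleep(interval)
    return last_good
```

With `max_failures=1` the standby takes over on the first failed fetch,
as described above. A higher value would tolerate a brief network
glitch at the cost of a slower failover.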


There are still a couple of murky details, like how do you get the
clients to send their data to the server that is up? One way would be
to send them a list of the available Hobbit servers whenever they send
their client reports, so they always (except the first time) have a list
of the current servers. If sending data to the first server fails, they
must try the next server in the list - if that works, then they'll get
a new list back with the new Hobbit server as the first one to try.
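
On the client side, that idea might look like the following Python
sketch. The function name and the `send` callable are hypothetical;
`send(server, report)` is assumed to return the server list handed back
by the Hobbit server, or to raise OSError if the server is unreachable:

```python
def send_report(report, servers, send):
    """Try each server in order; return (server_used, updated_server_list)."""
    for server in servers:
        try:
            updated = send(server, report)
        except OSError:
            continue  # this server is down, try the next one in the list
        # Keep the old list if the server did not supply a new one.
        return server, (list(updated) if updated else list(servers))
    # Nobody answered; keep the same list and retry on the next cycle.
    return None, list(servers)
```

The client would store the returned list and use it for its next
report, so after a failover the new Hobbit server is tried first.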


Those are my ideas. Feedback is very welcome from anyone; this is a
relatively new area for me to be working with (at least from a 
programmer perspective), so any input will be appreciated.


Regards,
Henrik
list Scott Walters · Mon, 19 Feb 2007 23:36:08 -0500 ·
On 2/19/07, Henrik Stoerner <user-ce4a2c883f75@xymon.invalid> wrote:
> Hi Scott,
>
> you're always asking interesting questions :-)

Thanks.  I changed majors to Philosophy since returning to college to finish
my undergraduate degree.  I'm glad to see it's paying off ;)
quoted from Henrik Størner
> Some tasks can run on any of the available servers. E.g. analyzing the
> client data can be done on any server running the hobbitd_client module;
> so it doesn't matter which of the available "client task" servers is
> invoked. (Obviously, the config files must be kept in sync on the
> servers, but that's why we have tools like rsync).

Hmmm...since you already have a "task master" it might be convenient to make
it the "config master" as well.  Similar to the hobbit-client.cfg?
quoted from Henrik Størner


> Some tasks store data - e.g. the RRD files. Those tasks can run on
> multiple servers, BUT: For any given host, there will be only one server
> holding the data. It's no good feeding the RRD updates to server A at
> 10:00 AM, and server B at 10:05 - because that would break the RRD data.
> So if the RRD files for "www.foo.com" live on server A, and that server
> crashes, then you will lose access to the RRD files for www.foo.com -
> but RRD files for hosts on the other servers will still be available.
> History logs are handled like RRD files. Now, you can argue that it
> would be nice if you could replicate the RRD- or history-updates to
> multiple servers so you would have a complete failover where you
> wouldn't lose access to some of the data. If there are enough requests
> it can be added - there's nothing in the design that prevents it. But
> perhaps it would just be simpler to mirror those files between the
> servers at regular intervals through some other program.

Yes, that's the million dollar question:  Should HA with integrity of
RRD/history files be part of Hobbit?  Even if you do put the
history-updates to multiple servers, you still have the nightmare of how to
sync things up when the "dead" server comes back up.
quoted from Henrik Størner
> Those are my ideas. Feedback is very welcome from anyone; this is a
> relatively new area for me to be working with (at least from a
> programmer perspective), so any input will be appreciated.

Because of the complexity of HA solutions and data integrity, I am not sure
the hobbit code is the right place for the logic.  Similar to the database
backend, you'll open yourself up to a lot of potential debugging.  I am a
keep it simple stupid kinda guy and I am reminded of a saying, "A man with
one watch always knows what time it is."

I'd rather see the hobbit tool improve monitoring, reports, and other
features that really matter.  Let the HA happen outside of hobbit.

I also believe you should only cluster/load-balance when one box can't do
the job.  Introducing those complexities to increase availability is
usually counterproductive -- you end up taking your system down because it's
so hard to configure/maintain.  And then it usually doesn't work anyway when
it's supposed to.

Scott Walters
-PacketPusher
list Henrik Størner · Tue, 20 Feb 2007 07:53:18 +0100 ·
quoted from Scott Walters
On Mon, Feb 19, 2007 at 11:36:08PM -0500, Scott Walters wrote:
> Because of the complexity of HA solutions and data integrity, I am not sure
> the hobbit code is the right place for the logic.  Similar to the database
> backend, you'll open yourself up to a lot of potential debugging.  I am a
> keep it simple stupid kinda guy and I am reminded of a saying, "A man with
> one watch always knows what time it is."
>
> I'd rather see the hobbit tool improve monitoring, reports, and other
> features that really matter.  Let the HA happen outside of hobbit.

I do try to keep it as simple as possible. The loadbalancing stuff had
almost no impact on the existing code, and if at all possible I'll
isolate this in a separate module so a "normal" single-site setup won't
have to deal with it.

But I do sympathize with your point. You could build a HA Hobbit setup
today using standard tools - shared storage and standard failover
software like the Linux-HA tools - and perhaps that is the best way for
this.
quoted from Scott Walters
> I also believe you should only cluster/load-balance when one box can't do
> the job.

That is the problem I was facing recently, so there was no way to avoid
that.
quoted from Scott Walters
> Introducing those complexities to increase availability is
> usually counterproductive -- you end up taking your system down because it's
> so hard to configure/maintain.  And then it usually doesn't work anyway when
> it's supposed to.

*grin* yes, this was clearly demonstrated in an incident we had last week at 
work.


Regards,
Henrik
list Tom Kauffman · Tue, 20 Feb 2007 11:48:37 -0500 ·
From where I sit, an active hobbit server with a running hot standby
seems to be fairly easy to implement now. I haven't tried to set it up,
but I've looked at the requirements.

I'm currently running this config; I use the hot standby for initial
server testing on new releases and for checking out different bb-hosts
layouts for cosmetic appeal. All my systems know both server addresses.
And the hot standby does everything except network tests and alerting.

All that would need to happen in the event the primary hobbit server
failed would be to update the hobbitlaunch.cfg to enable the network
testing module and the alerting module. And move the IP address of the
webserver. This should be doable with the currently available HA toolset
for Linux (I'll know more in a few weeks -- I've been promised a new
pair of hobbit servers to implement this on).

This does require suitable network bandwidth to run the data to both
systems, and it will require playing with the hobbit checkpoint file so
the failover system will know the proper enable/disable/ack statuses on
restart.
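
As an illustration of the hobbitlaunch.cfg step, a failover script
could re-enable the relevant tasks along these lines. This is only a
Python sketch under assumptions: it assumes tasks are switched off by a
line reading DISABLED inside their [section], and the section names and
CMD lines in the example are placeholders, so check them against your
own hobbitlaunch.cfg before relying on anything like this:

```python
def enable_tasks(cfg_text, sections):
    """Return cfg_text with DISABLED lines removed from the named sections."""
    out = []
    current = None
    for line in cfg_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            current = stripped[1:-1]  # entering a new [section]
        if current in sections and stripped == "DISABLED":
            continue  # drop the line that switches this task off
        out.append(line)
    return "\n".join(out) + "\n"


# Placeholder config; real section names and commands may differ.
example = (
    "[nettest-placeholder]\n"
    "\tDISABLED\n"
    "\tCMD example-network-test\n"
    "\n"
    "[other-task]\n"
    "\tDISABLED\n"
    "\tCMD example-other\n"
)
print(enable_tasks(example, {"nettest-placeholder"}))
```

The HA software would run something like this on the standby, in
addition to moving the webserver's IP address.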

Tom Kauffman
NIBCO, Inc