failover?

11 messages in this thread

list Trever Noggle · Fri, 01 Dec 2006 20:28:14 -0600 ·

Does hobbit support fail over yet?  When I last looked at it the support 
was planned but not implemented.

list Henrik Størner · Sun, 3 Dec 2006 00:10:03 +0100 ·

▸ quoted from Trever Noggle

On Fri, Dec 01, 2006 at 08:28:14PM -0600, Trever noggle wrote:

Does hobbit support fail over yet?  When I last looked at it the support 
was planned but not implemented.

It's not there yet. Some of the stuff I've been working on lately
provides part of the solution for this, but it isn't complete.

Besides, "fail over" means a lot of different things. For a true fail over
setup, you'll need some hardware support on top of Hobbit - providing
a virtual IP for your resilient hosts, and probably some sort of shared
storage. Most of that is handled outside Hobbit.

So what exactly do you have in mind ?


Regards,
Henrik

list Trever Noggle · Sat, 02 Dec 2006 21:30:04 -0600 ·

▸ quoted from Henrik Størner

Henrik Stoerner wrote:

On Fri, Dec 01, 2006 at 08:28:14PM -0600, Trever noggle wrote:

Does hobbit support fail over yet?  When I last looked at it the support was planned but not implemented.

It's not there yet. Some of the stuff I've been working on lately
provides part of the solution for this, but it isn't complete.

Besides, "fail over" means lot of different things. For a true fail over
setup, you'll need some hardware support on top of Hobbit - providing
a virtual IP for your resilient hosts, and probably some sort of shared
storage. Most of that is handled outside Hobbit.

So what exactly do you have in mind ?


Regards,
Henrik

I would like to do what you can do with BB.  The master and the backups
will be on completely different networks.  Basically, I have a collection
of servers and devices to monitor in two different locations.  I want a
main monitor server at location 1 monitoring devices at both locations,
and I want location 2 to take over if location 1 goes down.

list Anton Burkhalter · Sun, 03 Dec 2006 09:29:42 +0100 ·

▸ quoted from Trever Noggle

Henrik Stoerner wrote:

On Fri, Dec 01, 2006 at 08:28:14PM -0600, Trever noggle wrote:

Does hobbit support fail over yet?  When I last looked at it the support 
was planned but not implemented.

It's not there yet. Some of the stuff I've been working on lately
provides part of the solution for this, but it isn't complete.

Besides, "fail over" means lot of different things. For a true fail over
setup, you'll need some hardware support on top of Hobbit - providing
a virtual IP for your resilient hosts, and probably some sort of shared
storage. Most of that is handled outside Hobbit.

So what exactly do you have in mind ?


Regards,
Henrik

Hi
I use two independent Hobbit servers; each client reports to both
servers. The question is how to synchronize the two
servers after an outage of a server.
Regards,
Toni

list Ralph Mitchell · Sun, 3 Dec 2006 10:15:36 -0600 ·

▸ quoted from Anton Burkhalter

On 12/3/06, Anton Burkhalter <user-0fe67fd59d68@xymon.invalid> wrote:

I use two independent Hobbit servers; each client reports to both
servers. The question is how to synchronize the two
servers after an outage of a server.

How important is it to get the servers synchronised??  I know that
where I work, we're not really using the availability reports, because
there are other tools that track outages and work out the SLA up/down
times.  So, for me, it's not very important to have the Hobbit servers
kept in sync.  If one goes down, when it comes back up it will show
purple dots almost immediately for just about everything, but those
dots change very quickly as the checkout scripts send in fresh
reports.

The thing that concerns me is that I can't be running the same checks
from two servers at the same time.  People around here get irritated
when their webserver stats are artificially inflated by the checkout
scripts... :)  I'm looking at heartbeat (from http://www.linux-ha.org)
to manage that.

Ralph Mitchell

list Mike Rowell · Mon, 4 Dec 2006 10:05:20 -0000 ·

I have a basic ext script that checks whether bbd is up and running on
the primary server; if it is not, it fails over to the secondary
server.  It is basic, but it does what I need it to.

The process is as follows:

Secondary runs the hobbit-redundant check via ext every 5 minutes
If primary server is up, then do nothing
Else, if primary server is down, then copy hobbit-alerts.cfg.live into place

This relies on data being sent to both servers (so either a proxy
sending to both servers, or the clients configured to send data to both
bb servers).
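
The process above can be sketched roughly as follows. This is a minimal illustration, not Mike's actual script: it assumes the primary's bbd listens on TCP port 1984 (the usual BB/Hobbit port), and the function names, host name and file paths are hypothetical.

```python
import shutil
import socket

BBD_PORT = 1984  # assumption: default BB/Hobbit daemon port


def primary_is_up(host, port=BBD_PORT, timeout=5.0):
    """Return True if a TCP connection to the primary's bbd port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable or timed out: treat the primary as down.
        return False


def fail_over_if_needed(primary_up, live_cfg, active_cfg):
    """Copy the live alert config into place when the primary is down.

    Returns True if the copy was made, False if nothing was done.
    """
    if primary_up:
        return False  # primary is healthy: do nothing
    shutil.copyfile(live_cfg, active_cfg)
    return True
```

Run from the secondary every 5 minutes, this would amount to calling `fail_over_if_needed(primary_is_up("hobbit-primary.example.com"), live_path, active_path)` with the local paths of hobbit-alerts.cfg.live and hobbit-alerts.cfg. Switching back once the primary returns is not part of the process described above and is left out here.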

Regards,

Mike Rowell


list Daniel J McDonald · Mon, 04 Dec 2006 07:09:56 -0600 ·

▸ quoted from Henrik Størner

On Sun, 2006-12-03 at 00:10 +0100, Henrik Stoerner wrote:

On Fri, Dec 01, 2006 at 08:28:14PM -0600, Trever noggle wrote:

Besides, "fail over" means lot of different things. For a true fail over
setup, you'll need some hardware support on top of Hobbit - providing
a virtual IP for your resilient hosts, and probably some sort of shared
storage. Most of that is handled outside Hobbit.

So what exactly do you have in mind ?

I'd like hobbit-alerts to only run on one box at a time.  Displays and
tests can all run independently.

-- 
Daniel J McDonald, CCIE # 2495, CISSP # 78281, CNX
Austin Energy
http://www.austinenergy.com

list Henrik Størner · Mon, 4 Dec 2006 17:03:06 +0100 ·

▸ quoted from Daniel J McDonald

On Sun, Dec 03, 2006 at 12:10:03AM +0100, Henrik Stoerner wrote:

Besides, "fail over" means lot of different things. For a true fail over
setup, you'll need some hardware support on top of Hobbit - providing
a virtual IP for your resilient hosts, and probably some sort of shared
storage. Most of that is handled outside Hobbit.

So what exactly do you have in mind ?

I'm replying to my own mail to pick up all the responses that have come
about this.

Trever Noggle:

▸ quoted from Trever Noggle

I would like to do like you can with BB..  The master and the backups
will be on completely different networks [...] I want to have a main 
monitor server at location 1 monitoring devices at both locations.  
I then want location 2 to take over if location 1 goes down.

Anton Burkhalter:

▸ quoted from Anton Burkhalter

I use two independent Hobbit servers; each client reports to both
servers. The question is how to synchronize the two servers after 
an outage of a server.

Ralph Mitchell:

▸ quoted from Ralph Mitchell

The thing that concerns me is that I can't be running the same checks
from two servers at the same time.  People around here get irritated
when their webserver stats are artificially inflated

Daniel J McDonald:

▸ quoted from Daniel J McDonald

I'd like hobbit-alerts to only run on one box at a time.  Displays and
tests can all run independently


For myself, I might add that having access to the historical data - both
graphs and history logs - is also a requirement.

The simple "do it like BB does" is inadequate - it cannot handle keeping
the historical data up-to-date on both servers, and it also fails to
carry over the current alerts that are active: If the master server sent
out an alert for something before it crashed, and the next alert should
go out 12 hours later, then this repeat-setting isn't transferred to the
slave server. So when the master server drops off the net and the slave
server takes over, it will immediately start by sending out alerts for
everything that is down. Not good.


The current state of a Hobbit server can easily be shared among servers.
The checkpoint files that go into ~hobbit/server/tmp/ can be copied
across to another server, and if you do that often enough then starting
up Hobbit on the other server will pick up all of the current status.
So that part is easy - for convenience I might want to implement some
sort of internal Hobbit protocol for distributing the checkpoint files,
but you can already today just use scp, rsync or similar to copy those
files over. 
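
As a rough sketch of the rsync approach just described: the helper below builds and runs one rsync invocation for the checkpoint directory. The standby host name and the absolute path standing in for ~hobbit/server/tmp/ are assumptions, the function names are hypothetical, and it presumes rsync over ssh is already set up between the two servers.

```python
import subprocess


def build_rsync_command(src_dir, dest_host, dest_dir):
    """Build an rsync command copying the contents of src_dir to dest_host:dest_dir."""
    # Trailing slashes make rsync copy the directory contents,
    # not the directory itself.
    src = src_dir.rstrip("/") + "/"
    dest = f"{dest_host}:{dest_dir.rstrip('/')}/"
    return ["rsync", "-a", src, dest]


def copy_checkpoints(src_dir, dest_host, dest_dir):
    """Run rsync once; raises CalledProcessError if rsync exits non-zero."""
    subprocess.run(build_rsync_command(src_dir, dest_host, dest_dir), check=True)
```

Calling `copy_checkpoints("/home/hobbit/server/tmp", "standby.example.com", "/home/hobbit/server/tmp")` from a frequent cron job would be one way to "do that often enough"; how often is enough depends on how stale a status the standby may start from.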

The downside of this of course is that something has to recognize when
the primary site is gone, and start up Hobbit on the secondary server.
That is not very attractive; I would rather have Hobbit running on both
servers all the time - this would require some work. But let's assume
for now that this is possible.


Then there are the on-disk files: History logs and graphs. Something has
happened here recently, since it is now possible to distribute these
over multiple servers - and also to have more than one site perform all
of the updates of those files. So instead of periodically copying the
files from a master server to a slave, you just copy them once and then
mirror all of the updates to the relevant servers. The code for this is
in the current snapshots; it isn't documented yet, and hasn't had much
testing. I use it currently for another purpose: Load-sharing of the
updates.


Finally there are the various Hobbit tasks: The display, the network
tests, the alerts. 

The display tasks are very easily distributed to multiple servers. It
is somewhat inconvenient that static webpages are built for the
overview pages; I want to eliminate those and have all of the webpages
generated dynamically. But the web display does not have to be on the same
physical server as the rest of Hobbit, so doing failover for the web
interface is relatively simple.

Alerts - the code is *almost* ready. It is based on the same principle as
what is used for distributing the history- and RRD-files across multiple
servers; the hobbitd_alert module runs on all of the servers - so it
keeps track of the repeat times etc - but it only actually sends the
alerts from one of the servers at any time.
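
The principle in the paragraph above can be illustrated with a small sketch. This is not the hobbitd_alert code: the class, its fields and the alert key are hypothetical, and how the sending server is chosen is reduced to a plain flag because the thread does not describe it.

```python
from dataclasses import dataclass, field


@dataclass
class AlertTracker:
    """Tracks repeat times on every server; sends only when this server is active."""

    repeat_seconds: int
    is_active_sender: bool
    next_due: dict = field(default_factory=dict)  # alert key -> next send time
    sent: list = field(default_factory=list)      # (time, key) sent from this server

    def handle(self, key, now):
        """Process a failing status at time `now`; return True if an alert is due."""
        due = now >= self.next_due.get(key, now)
        if due:
            # Every server advances the schedule, whether or not it sends.
            self.next_due[key] = now + self.repeat_seconds
            if self.is_active_sender:
                self.sent.append((now, key))
        return due
```

Because the passive server has also recorded the repeat time, flipping `is_active_sender` to True after the master disappears does not trigger an immediate burst of alerts; the next one goes out only when the repeat interval has elapsed.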

Network tests - I've heard arguments going both ways as to whether one
should run network tests on all servers ("it is interesting to see if
the site is down when tested from all of our locations, or only from the
primary location"), or on just a single server ("we want to minimize
traffic from the monitoring systems towards the webservers"). I'm still
thinking about how to handle this - if they run on all Hobbit servers,
there has to be some way of choosing which test result should be used;
if they run only on a single server I will probably use the same method
for choosing which server runs the tests as I use to decide who gets to
send out alerts.


Regards,
Henrik

list Trever Noggle · Mon, 4 Dec 2006 11:45:45 -0600 (CST) ·

▸ quoted from Henrik Størner

On Mon, December 4, 2006 10:03 am, Henrik Stoerner wrote:

For myself, I might add that having access to the historical data - both
graphs and history logs - is also a requirement.

Personally, for what I want to do, I do not care about history data.
It would be nice, but it is not required.  I simply want the
box at the remote site to take over on all of the display, testing and
paging if the master server (or network) is down.  This way I will still
get alerted if there is a problem at site 1.  And since site 2 will also
be monitored by site 1 I will be alerted if there is a problem at either
site.

It would be nice in the future to be able to have the historical data in
sync on both boxes but that is not something that is important to me at
this point.

-Trever

list T.J. Yang · Mon, 04 Dec 2006 13:52:28 -0600 ·

I hope 4.2.1 can include the all-in-one patch, plus the following:

1. The 4.2.0 source currently uses "-Ax", which doesn't work on HP-UX 10.20.
In hobbitclient-hp-ux.sh it should be as follows:

UNIX95=1 ps -Al -o pid,ppid,user,stime,state,pri,pcpu,time,vsz,args

2. Solaris 2.6 doesn't have kstat, so the ifstat metric needs to be reworked for 2.6.

Regards

T.J. Yang


list Joe Sloan · Mon, 04 Dec 2006 17:11:04 -0800 ·

▸ quoted from Trever Noggle


Henrik Stoerner wrote:

Alerts - the code is *almost* ready. It is based on the same principle as
what is used for distributing the history- and RRD-files across multiple
servers; the hobbitd_alert module runs on all of the servers - so it
keeps track of the repeat times etc - but it only actually sends the
alerts from one of the servers at any time.

This would be good news - our bb servers are limping along, held together with
scotch tape, and we would love to move to hobbit. Only the lack of a
failover capability of the sort implemented in bb is holding us back.

Please grant us a happy new year!

Joe
