failover?

list Trever Noggle
Mon, 4 Dec 2006 11:45:45 -0600 (CST)
Message-Id: <user-960a304ea1c9@xymon.invalid>

On Mon, December 4, 2006 10:03 am, Henrik Stoerner wrote:

On Sun, Dec 03, 2006 at 12:10:03AM +0100, Henrik Stoerner wrote:

Besides, "fail over" means a lot of different things. For a true fail
over setup, you'll need some hardware support on top of Hobbit -
providing a virtual IP for your resilient hosts, and probably some sort
of shared storage. Most of that is handled outside Hobbit.

So what exactly do you have in mind ?

I'm replying to my own mail to pick up all the responses that have come
in about this.

Trever Noggle:

I would like to do what you can with BB.  The master and the backups
will be on completely different networks [...] I want to have a main
monitor server at location 1 monitoring devices at both locations. I
then want location 2 to take over if location 1 goes down.

Anton Burkhalter:

I use two independent Hobbit servers; each client reports to both
servers. The question is how to synchronize the two servers after an
outage of a server.

Ralph Mitchell:

The thing that concerns me is that I can't be running the same checks
from two servers at the same time.  People around here get irritated when
their webserver stats are artificially inflated.

Daniel J McDonald:

I'd like hobbit-alerts to only run on one box at a time.  Displays and
tests can all run independently.


For myself, I might add that having access to the historical data - both
graphs and history logs - is also a requirement.

The simple "do it like BB does" is inadequate - it cannot handle keeping
the historical data up-to-date on both servers, and it also fails to carry
over the current alerts that are active: If the master server sent out an
alert for something before it crashed, and the next alert should go out 12
hours later, then this repeat-setting isn't transferred to the slave
server. So when the master server drops off the net and the slave server
takes over, it will immediately start by sending out alerts for everything
that is down. Not good.


The current state of a Hobbit server can easily be shared among servers.
The checkpoint files that go into ~hobbit/server/tmp/ can be copied
across to another server, and if you do that often enough then starting up
Hobbit on the other server will pick up all of the current status.
So that part is easy - for convenience I might want to implement some
sort of internal Hobbit protocol for distributing the checkpoint files, but
you can already today just use scp, rsync or similar to copy those files
over.
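
A minimal sketch of that copy step, assuming rsync and ssh are available
on both machines and that the standby uses the same directory layout.
Only the ~hobbit/server/tmp/ path comes from the text above; the helper
names and the host name in the usage note are made up for illustration.

```python
import os
import subprocess

# Directory holding the checkpoint files, as named in the text above.
CHECKPOINT_DIR = os.path.expanduser("~hobbit/server/tmp/")


def build_rsync_command(src_dir, dest_host, dest_dir):
    """Return the rsync argument list that mirrors src_dir to dest_host:dest_dir."""
    # A trailing slash on the source makes rsync copy the directory contents.
    src = src_dir.rstrip("/") + "/"
    dest = "{}:{}".format(dest_host, dest_dir.rstrip("/") + "/")
    # -a keeps permissions and timestamps; -e ssh sends the copy over ssh.
    return ["rsync", "-a", "-e", "ssh", src, dest]


def copy_checkpoints(dest_host, src_dir=CHECKPOINT_DIR, dest_dir=CHECKPOINT_DIR):
    """Run one copy; raises CalledProcessError if rsync exits non-zero."""
    subprocess.run(build_rsync_command(src_dir, dest_host, dest_dir), check=True)
```

Calling copy_checkpoints("standby.example.com") from cron every minute or
so would be one way to do the copy "often enough"; the interval bounds how
stale the standby's view of the current status can get.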

The downside of this of course is that something has to recognize when
the primary site is gone, and start up Hobbit on the secondary server. That
is not very attractive; I would rather have Hobbit running on both servers
all the time - this would require some work. But let's assume for now that
this is possible.


Then there are the on-disk files: History logs and graphs. Something has
happened here recently, since it is now possible to distribute these over
multiple servers - and also to have more than one site perform all of the
updates of those files. So instead of periodically copying the files from
a master server to a slave, you just copy them once and then mirror all of
the updates to the relevant servers. The code for this is in the current
snapshots; it isn't documented yet, and hasn't had much testing. I use it
currently for another purpose: Load-sharing of the updates.


Finally there are the various Hobbit tasks: The display, the network
tests, the alerts.

The display tasks are very easily distributed to multiple servers. It
is somewhat inconvenient that static webpages are built for the
overview pages; I want to eliminate those and have all of the webpages
generated dynamically. But the web display does not have to be on the
same physical server as the rest of Hobbit, so doing failover for the web
interface is relatively simple.

Alerts - the code is *almost* ready. It is based on the same principle as
what is used for distributing the history- and RRD-files across multiple
servers; the hobbitd_alert module runs on all of the servers - so it
keeps track of the repeat times etc - but it only actually sends the
alerts from one of the servers at any time.
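
To illustrate the idea only (this is not the hobbitd_alert code), here is
a small sketch in which every server advances the repeat schedule but only
the one currently marked active delivers anything. The class, its method
names and the is_active callback are hypothetical.

```python
class AlertTracker:
    """Tracks repeat times on every server; only the active one sends."""

    def __init__(self, repeat_seconds, send, is_active):
        self.repeat_seconds = repeat_seconds
        self.send = send            # callable(key) that delivers the alert
        self.is_active = is_active  # callable() -> True on the sending server
        self.next_due = {}          # status key -> time the next alert is due

    def status_down(self, key, now):
        """Handle a 'down' status seen at time `now` (in seconds)."""
        due = self.next_due.get(key)
        if due is not None and now < due:
            return False            # repeat interval has not expired yet
        # Every server advances the schedule, whether or not it sends.
        self.next_due[key] = now + self.repeat_seconds
        if self.is_active():
            self.send(key)
            return True
        return False

    def status_recovered(self, key):
        """Forget the schedule once the status is back to normal."""
        self.next_due.pop(key, None)
```

Because the standby has been keeping its own copy of the schedule, taking
over as the active sender does not trigger a burst of alerts: an alert
recorded at time 0 with a 12-hour repeat is still not due an hour later.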

Network tests - I've heard arguments going both ways as to whether one
should run network tests on all servers ("it is interesting to see if the
site is down when tested from all of our locations, or only from the
primary location"), or on just a single server ("we want to minimize
traffic from the monitoring systems towards the webservers"). I'm still
thinking about how to handle this - if they run on all Hobbit servers,
there has to be some way of choosing which test result should be used; if
they run only on a single server I will probably use the same method for
choosing which server runs the tests as I use to decide who gets to send
out alerts.
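
If the tests do run on all servers, choosing a result could look something
like the sketch below. It is purely illustrative: the function name and
the two policies are assumptions, not anything Hobbit implements, and only
the green/yellow/red colors are considered.

```python
# Severity order used for comparison; higher means worse.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def combine_results(results, policy="best"):
    """Pick one color from a dict of server name -> color.

    policy="best":  least severe result wins, so a host is red only if
                    it is red when tested from every location.
    policy="worst": most severe result wins, so a failure seen from any
                    single location is reported.
    """
    if not results:
        raise ValueError("no test results to combine")
    pick = min if policy == "best" else max
    return pick(results.values(), key=SEVERITY.__getitem__)
```

With one location reporting red and another green, "best" gives green and
"worst" gives red; which of the two is appropriate depends on which of the
arguments quoted above matters more at a given site.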


Regards,
Henrik

Personally, for what I want to do, I do not care about history data.
It would be nice but is not required.  I simply want the
box at the remote site to take over on all of the display, testing and
paging if the master server (or network) is down.  This way I will still
get alerted if there is a problem at site 1.  And since site 2 will also
be monitored by site 1 I will be alerted if there is a problem at either
site.

It would be nice in the future to be able to have the historical data in
sync on both boxes but that is not something that is important to me at
this point.

-Trever