Xymon swarm proposal

5 messages in this thread

list Henrik Størner · Sat, 28 Nov 2015 12:37:11 +0100 ·

Hi,

the recent talk on xymon-developer about rewriting xymonproxy to support TLS, IPv6 and other good stuff made me think about other ways of scaling Xymon across large installations.

Which led me to the idea of having multiple independent Xymon servers - a swarm, because no one Xymon server depends on the others, but they can cooperate.

Simply put, you have a number of independent Xymon installations. Each of them handles a group of servers - it could be one in each of your datacentres, one for each organisational unit, one in each network segment, or just because you have such a large installation that a single Xymon server cannot cope with the load (and that would be a really big installation, judging by the numbers I hear). This all works just like the Xymon you have today.

The only thing needed to make all of these independent Xymon servers show up as a single (virtual) Xymon installation is for xymongen to generate a set of webpages showing the status of all of the Xymon servers in the swarm. When you click through to a detailed status log, you are transparently sent to the Xymon server that holds the data about that host (the URL points to the Xymon server handling the particular host you want to check on).

The nice thing about this is that I think it can be implemented fairly easily, i.e. without having to change anything fundamental in the way the various Xymon programs work. Which means it will also be easy to adapt into an existing Xymon installation, and with a good chance of not introducing difficult-to-troubleshoot bugs (difficult because bugs involving remote systems are always a headache to reproduce).

There are of course a few nitty-gritty details, e.g. "Find host" really should be able to search across all of the servers in the swarm. But those cases are few and isolated enough that they should not be too much of a headache.

Multiple independent Xymon servers

* Each site runs just like today.
* A new sites.cfg file lists the other sites (just a site ID and how
to contact xymond there)
* Each site UI (the static webpages from xymongen) merges data from
all sites
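
The proposal does not define a syntax for sites.cfg, so the following is only a minimal sketch of how such a file could be read. The line format (`SITE-ID host[:port]`), the `Site` class and the `parse_sites_cfg` name are all assumptions made for illustration; the only fact taken from Xymon itself is that xymond listens on port 1984 by default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    site_id: str
    host: str
    port: int


def parse_sites_cfg(text: str) -> dict[str, Site]:
    """Parse a hypothetical sites.cfg with one 'SITE-ID host[:port]' entry per line."""
    sites: dict[str, Site] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding blanks
        if not line:
            continue
        site_id, address = line.split()
        host, _, port = address.partition(":")
        # Fall back to 1984, the default xymond port, when none is given.
        sites[site_id] = Site(site_id, host, int(port) if port else 1984)
    return sites


EXAMPLE = """
# site-id   xymond address (made-up hostnames)
dc-east     xymon-east.example.com
dc-west     xymon-west.example.com:1985
"""

sites = parse_sites_cfg(EXAMPLE)
print(sites["dc-west"])
```

A real format would also have to cope with IPv6 literals, which this `host:port` split does not.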

Advantages

* More resilient - if one site dies, the others will remain operational
* Less cross-site traffic (local data remain local except when needed)
* Less load on each site (updates only go to one Xymon server)
* Horizontally scalable

Limitations

* Hostnames must be unique globally. Probably not a significant problem.
* Functions that fetch data directly from disk-files cannot be
cross-site (rrd-files, history-logs), unless you can retrieve the
data via a network request. In a standard Xymon installation that
would be:
o Availability reports
o Event log reports (but see below)
o Multi-host graphs, unless all of the hosts are local
* Alerts are always handled locally

xymongen

* hosts.cfg file for the page layout must be merged from all sites.
Can be a simple append-one-after-the-other (built-in) or perhaps
allow for an externally generated hosts.cfg - if you want to have
servers from multiple locations on one page.
* How do we handle non-unique pagenames? Transparently prefix them
with the remote site-ID?
* xymondboard data is fetched from multiple sites and combined
(appended) - handled in sendmessage()
* CGI URLs are generated with a prefix of /SITE/ - no change
otherwise. The local webserver then proxies /SITE/ requests to the
remote site.
* Should there be both a local and a global "all non-green" page?
Maybe even a full set of local and global webpages? That would be
easy by running xymongen twice - one for the local and one for the
global set of pages.
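
Two of the xymongen points above are simple string transformations, sketched here under stated assumptions: page names from remote sites get the site ID as a prefix, and CGI URLs for remote hosts get a /SITE/ prefix that the local webserver can proxy. The function names, the underscore separator and the choice to leave local names and URLs unprefixed are illustrative guesses, not part of the proposal.

```python
def qualify_page(site_id: str, page: str, local_site: str) -> str:
    """Prefix page names from remote sites so they cannot clash with local ones."""
    return page if site_id == local_site else f"{site_id}_{page}"


def cgi_url(site_id: str, cgi_path: str, local_site: str) -> str:
    """Prefix CGI URLs for remote hosts with /SITE so the local webserver can proxy them."""
    return cgi_path if site_id == local_site else f"/{site_id}{cgi_path}"


LOCAL = "dc-east"  # hypothetical ID of the site running xymongen
status_path = "/xymon-cgi/svcstatus.sh?HOST=web1&SERVICE=cpu"

print(qualify_page("dc-west", "webservers", LOCAL))
print(cgi_url("dc-west", status_path, LOCAL))
print(cgi_url("dc-east", status_path, LOCAL))
```

The webserver side would then need one proxy rule per site ID, which could be generated from the same site list.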

sendmessage() function

* No changes for sending status- or data-updates (status, combo,
extcombo, client, data, modify)
* Option to fetch data from multiple sites. This is already in place
for sending to multiple Xymon servers, so we just need to combine
the output response from multiple sites.
* When processing host-related requests, we learn where the host is
located. Cache this for use by various tools. Must be disk-based
(e.g. SQLite file) so it can be shared.
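
As a rough illustration of the disk-based host location cache, here is a sketch using Python's sqlite3 module. The table name, column names and helper functions are assumptions; the proposal only says that the cache should live in something like an SQLite file so that several tools can share it.

```python
import sqlite3
from typing import Optional


def open_cache(path: str) -> sqlite3.Connection:
    """Open (and if needed create) the hypothetical host-to-site cache."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS host_site ("
        " hostname TEXT PRIMARY KEY,"
        " site_id  TEXT NOT NULL)"
    )
    return conn


def remember(conn: sqlite3.Connection, hostname: str, site_id: str) -> None:
    """Record which site answered for a host, replacing any older entry."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO host_site (hostname, site_id) VALUES (?, ?)",
            (hostname, site_id),
        )


def lookup(conn: sqlite3.Connection, hostname: str) -> Optional[str]:
    """Return the cached site ID for a host, or None if it is unknown."""
    row = conn.execute(
        "SELECT site_id FROM host_site WHERE hostname = ?", (hostname,)
    ).fetchone()
    return row[0] if row else None


# ":memory:" keeps the demo self-contained; a shared cache would use a file path.
conn = open_cache(":memory:")
remember(conn, "web1.example.com", "dc-west")
print(lookup(conn, "web1.example.com"))
print(lookup(conn, "unknown.example.com"))
```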

xymond

* hostinfo requests should only answer for the local hosts. No need to
consult the SQLite cache - no changes.

CGI programs

* "Find host" must be cross-site
* Ack-alert: Suggest making it local-only. Since alerts are only
generated locally, it makes sense to also only ack the local alerts.
* Enable/disable only on the local site? Use the "info" page
enable/disable (automatically local). Global enable/disable needs
some more looking into.
* Critical systems - would probably be nice to be able to do both a
local and a global version.
* Eventlog - would be nice to have both local and global, even though
that means fetching a (large) remote logfile. Will probably require
a new "eventlog" CGI interface for retrieving a remote logfile. It
is probably not something we want to do on every
critical-systems/all-nongreen webpage update. So those could keep
the local eventlog display (as-is), and then the eventlog CGI could
have the option of combining logs from all sites (or maybe a
selection of sites).
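
Combining event logs from several sites is essentially a merge of streams that are each already ordered by time. The sketch below shows that idea with heapq.merge; the `Event` record and its fields are a simplified stand-in chosen for illustration, not the actual Xymon event log format.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Event(NamedTuple):
    timestamp: int  # seconds since the epoch
    site_id: str
    hostname: str
    test: str
    color: str


def merge_eventlogs(*logs: Iterable[Event]) -> Iterator[Event]:
    """Lazily merge per-site event streams that are each sorted by timestamp."""
    return heapq.merge(*logs, key=lambda event: event.timestamp)


east = [
    Event(1448700000, "dc-east", "db1", "disk", "yellow"),
    Event(1448700300, "dc-east", "db1", "disk", "green"),
]
west = [
    Event(1448700100, "dc-west", "web1", "http", "red"),
]

combined = list(merge_eventlogs(east, west))
for event in combined:
    print(event.timestamp, event.site_id, event.hostname, event.test, event.color)
```

Because the merge is lazy, a CGI could stop reading once it has enough events to fill the requested page, which matters if the remote logfile is large.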

xymon commands

_Commands re. specific hosts_
First check via the hostinfo cache (see below) if we know where the host is (performance optimization). If not, simply broadcast the message to all sites and combine any data that is returned - there will only be data from one server.

* notify
* disable
* enable
* query
* xymondlog, xymondxlog, clientlog
* hostinfo - sendmessage() will fetch the data for us, whether from
the local xymond or from the SQLite cache.
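
The lookup-then-broadcast rule for host-specific commands could look roughly like this. It is a sketch only: `route_host_command`, the `send` callback and the plain dict standing in for the cache are hypothetical, and stale cache entries (a host that has moved to another site) are deliberately not handled.

```python
from typing import Callable, Dict, List

# (site_id, message) -> response text; an empty string means "no data".
Sender = Callable[[str, str], str]


def route_host_command(message: str, hostname: str, sites: List[str],
                       cache: Dict[str, str], send: Sender) -> str:
    """Send to the cached site if known, otherwise broadcast and combine replies."""
    cached = cache.get(hostname)
    if cached is not None:
        return send(cached, message)
    replies = {site_id: send(site_id, message) for site_id in sites}
    answered = {site_id: text for site_id, text in replies.items() if text}
    if len(answered) == 1:
        # Exactly one site knows the host: remember it for next time.
        cache[hostname] = next(iter(answered))
    return "".join(answered.values())


calls: List[str] = []


def fake_send(site_id: str, message: str) -> str:
    """Stand-in for the network call; only dc-west knows the host."""
    calls.append(site_id)
    return "green all ok\n" if site_id == "dc-west" else ""


cache: Dict[str, str] = {}
sites = ["dc-east", "dc-west"]
first = route_host_command("query web1.cpu", "web1", sites, cache, fake_send)
second = route_host_command("query web1.cpu", "web1", sites, cache, fake_send)
print(first, calls, cache)
```

The first call reaches both sites, the second only the one that answered. Commands that return no data (such as disable) never populate the cache in this sketch, so they would keep being broadcast until a data-returning request has located the host.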

_Commands that collect data on multiple hosts_

* xymondboard, xymondxboard - option from user whether to fetch local
or global info. Handled in sendmessage()

_Commands that only work locally_

* ghostlist
* drop
* rename
* schedule. If done via web i/f it becomes automatically transparent,
but not for scripts. Probably only used for
disable/enable/drop/rename so makes most sense to do it locally.
Doing global would have to parse the message to detect which host it
is about.

Comments are very welcome.

Regards,
Henrik

list Japheth Cleaver · Sat, 28 Nov 2015 10:04:53 -0800 ·

Hi,

I think the proposal has a lot of merit.

There are a few bits I feel that might be able to be solved rather easily
with the xymond_locator tech you'd put in a few years ago, but it also
brings up some interesting philosophical questions of how much metadata
gets distributed throughout and by what mechanism.


Some orgs might want almost the exact inverse of swarming (sharding)
whereby xymond is fully replicated for HA and reporting purposes, a la
MySQL. For others, sharding solves network and performance issues and a
unified query for CGIs only, or CLI tools only, makes perfect sense. Still
others might want all xymond's to have a full hosts.cfg reference and
perhaps some basic state data (eg, status metadata, including line1), but
wouldn't want/need the full status going over the wire... or could do with
stachg updates only, with other metadata resync'd on intervals.


For xymond_locator, IIRC histlogs, hostdata, and RRD are working now. The
only things not working were per-hostsvc event history (still done with file
reads) and historical hostdata snapshots. With the sites/swarm proposal,
we could simply be pre-specifying what's happening where instead of having
various services checking in for assignment. (Can they check in with
multiple locators now?) You could probably do that now by simply writing
out a static locator.hosts.chk file with all of the 'sticky' fields set
and -- presto... unified dispatch server!

(Of course, given that xymongen is interval-based, you could almost read
that in directly... Seems like a global hosts.cfg and a global
locator.hosts do make things easier.)


If the incoming reports never make it to the distinct xymond's, that's
fine, but xymond_locator's communicating with each other could agree on
the "swarm state" and give you some de facto cluster management outside of
grabbing hostinfo from the xymond's (or really making xymond do anything
else that relies on hitting the disk or network).


Speaking of eventlog.cgi, this feels like an opportunity to consider
separating the CGI from the query method as well. As you say, the current
storage method really forces a "everyone respond to this query and I'll
re-assemble into the response for the user" sort of method.

For larger sites, event reporting is really important; so much so that
sending it off to a central DB makes a lot of sense, and that just takes
writing a pretty trivial stachg channel listener to do. The problem has
been that there's no easy CGI for querying that data unless you write your
own too. (Also missing has been a pure clientlog snapshot browser,
distinct from specific status-changing events.)


When you start thinking in terms of a "reporting server" where the slow
stuff happens in response to arbitrary queries, it almost makes sense to
make all of these new message types something that *any* xymond server can
handle -- but shouldn't!

(That is, "Once you start scaling, make a replication/query server and
point general queries to that instead so that live processing remains
unimpeded.")


As far as what it would take to make this happen? One way to encourage
experimentation would be to provide an arbitrary two-way message mechanism
for xymond. The 'usermsg ID' channel works but is one-way, of course...
Perhaps we could create an 'extmsg ID' format, with a locally-configured
ID->TCP:port dispatch in xymonserver.cfg which xymond proxies the
communication off to (non-blockingly, of course), sending anything
received back to the original sender. People could create all sorts
of custom data backends while still going through a single xymon query
mechanism.
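
To make the 'extmsg ID' idea a little more concrete, here is a sketch of the dispatch lookup alone. Everything about it is hypothetical: the `ID=host:port` setting syntax, the `extmsg ID payload` message layout and the function names are invented for illustration, and the proxying itself (forwarding the payload and relaying the answer without blocking) is left out.

```python
from typing import Dict, Tuple

Backend = Tuple[str, int]  # (host, port)


def parse_dispatch(spec: str) -> Dict[str, Backend]:
    """Parse whitespace-separated 'ID=host:port' pairs into a dispatch table."""
    table: Dict[str, Backend] = {}
    for item in spec.split():
        ext_id, _, address = item.partition("=")
        host, _, port = address.rpartition(":")
        table[ext_id] = (host, int(port))
    return table


def resolve(message: str, table: Dict[str, Backend]) -> Tuple[Backend, str]:
    """Return (backend, payload) for a message of the form 'extmsg ID payload'."""
    keyword, ext_id, payload = message.split(" ", 2)
    if keyword != "extmsg":
        raise ValueError("not an extmsg message")
    return table[ext_id], payload


# Hypothetical setting value, as it might appear in xymonserver.cfg.
table = parse_dispatch("eventdb=127.0.0.1:5001 snapshots=127.0.0.1:5002")
backend, payload = resolve("extmsg eventdb host=web1 since=1448700000", table)
print(backend, payload)
```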


As for everything else, my only other concern was with integrating the
gathering at the sendmessage() vs the application and/or the server level.

It puts the logic for handling issues where one or more of the servers is
down, unreachable, or slow at a very low level, and there might be cause
for having more finely-tuned (or administrator-set) control there over
retries, timeouts, any vs. all's, etc. for xymongen vs svcstatus vs a
xymonproxy type thing.
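
One way to picture that kind of per-caller control is a small policy object handed to the fetch routine, so that xymongen, a CGI and a proxy can each choose their own timeout and any-vs-all behaviour. This is a sketch under assumptions: `FetchPolicy`, `fetch_from_sites` and the `fetch` callback are hypothetical names, retries are omitted, and `cancel_futures` needs Python 3.9 or later.

```python
from concurrent.futures import ThreadPoolExecutor, wait
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class FetchPolicy:
    timeout: float     # seconds to wait for all sites together
    require_all: bool  # True: fail unless every site answers


def fetch_from_sites(sites: List[str], fetch: Callable[[str], str],
                     policy: FetchPolicy) -> Dict[str, str]:
    """Query all sites in parallel and apply the caller's any-vs-all policy."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(sites)))
    futures = {pool.submit(fetch, site_id): site_id for site_id in sites}
    done, not_done = wait(futures, timeout=policy.timeout)
    pool.shutdown(wait=False, cancel_futures=True)  # do not block on stragglers
    results: Dict[str, str] = {}
    failed = [futures[f] for f in not_done]  # sites that timed out
    for future in done:
        if future.exception() is None:
            results[futures[future]] = future.result()
        else:
            failed.append(futures[future])  # sites that raised an error
    if failed and policy.require_all:
        raise RuntimeError(f"no answer from: {sorted(failed)}")
    return results


def fake_fetch(site_id: str) -> str:
    """Stand-in for a network request; dc-west is pretended to be down."""
    if site_id == "dc-west":
        raise ConnectionError("site unreachable")
    return f"board data from {site_id}\n"


lenient = FetchPolicy(timeout=5.0, require_all=False)
partial = fetch_from_sites(["dc-east", "dc-west"], fake_fetch, lenient)
print(partial)
```

With `require_all=True` the same call raises instead of returning partial data, which might suit a report that must not silently omit a site.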


Regards,

-jc

list Paul Root · Sat, 28 Nov 2015 21:57:33 +0000 ·

I’m pretty much the customer for this. I have 3 xymon servers, 1 primary and 2 backup/proxy servers. With lots of firewalling between them. And with those, I have clients that are behind further firewalls.

Like J.C. related, I really want HA. Both proxies can get to all the same things. The primary is a separate network.

I could adapt to this swarm though. It would work pretty well, I think.

I would definitely want local and global non-green screens. Maybe something along the lines of the summary links that we currently have.

What about alerts.cfg? Would each machine have its own alerts.cfg, or could there be one centralized one?

Currently, I have one of the proxies monitor the primary, and if connectivity (or prog status etc.) is lost, it takes over the full alerts configuration.

Paul.

list Galen Johnson · Mon, 30 Nov 2015 01:24:03 +0000 ·

It's not clear, but I only have unidirectional access to most of my monitored systems so I have to use xymonfetch. Would this still work where each datacenter server could be managed by/communicate with a local instance that handles the alerts (for example)? I'm guessing this is so but just want to confirm. Sounds like I'm in a similar situation as Paul...


=G=

list Matt Vander Werf · Mon, 30 Nov 2015 10:50:14 -0500 ·

This is probably something that I wouldn't use, as I don't really need it
with my current setup (which is perfectly fine).

I assume that if this functionality was put in place that people who wanted
to keep doing everything local (i.e. on one Xymon server machine) would be
able to continue doing that with minimal necessary changes to the current
setup? In other words, this would be parallel functionality that could be
implemented if desired, but wouldn't necessarily affect existing setups that
are all local-based (or at least wouldn't affect them very much).

Is my assumption correct here?

That's my only concern with this, as it seems to be changing *a lot* of
things. This would definitely be the preferred way if this functionality was
implemented, especially for new users of Xymon who don't require this
functionality and for users with smaller setups. Even some changes that
would need to be made would be okay, as long as those necessary changes are
explained.

Thanks.

--
Matt Vander Werf
