On 30 January 2014 09:55, Lists <user-bd60c1f964ce@xymon.invalid> wrote:
Recently, we had a publicly visible outage as a result of one of our
load balancers exceeding the IOPS capability of its system drives.
Ouch!
We'd like to extend xymon (currently installed on CentOS6 /32 with
defaults) so that it can monitor IOPS for all servers.
I like this idea. I looked into this quite a while ago, but really only
scratched the surface.
Specifically, we'd like to see wrqm/s and probably %util. What's the most
straightforward way to accomplish this? The other alternative is to create
some form of internal script, which is doable but not preferable if there's
an off-the-shelf tool available.
Whether an add-on or a new Xymon feature, this would almost certainly
require a new section in the client data. There's already an [iostatdisk]
section used by Solaris and an [iostat] section used by "larrd", although
the format for the latter is a bit funky. So you could replicate either of
these for Linux by adding this into xymonclient-linux.sh:
nohup sh -c "iostat -x 300 2 1>$XYMONTMP/xymon_iostatdisk.$MACHINEDOTS.$$ 2>&1; mv $XYMONTMP/xymon_iostatdisk.$MACHINEDOTS.$$ $XYMONTMP/xymon_iostatdisk.$MACHINEDOTS" </dev/null >/dev/null 2>&1 &
if test -f $XYMONTMP/xymon_iostatdisk.$MACHINEDOTS; then echo "[iostatdisk]"; cat $XYMONTMP/xymon_iostatdisk.$MACHINEDOTS; rm -f $XYMONTMP/xymon_iostatdisk.$MACHINEDOTS; fi
We might want "-kx" rather than "-x", depending on potential uses, but
that doesn't matter for %util and wrqm/s. Adding "-N" (for translating
device names to LVM mappings) might also be useful.
The Xymon parsing code has support only for Solaris. That means it isn't
readily extensible. For other client data sections, the parsing code
typically has a case statement that selects the OS and then parses
according to that. Not the case for iostatdisk or iostat.
In fact, the function that does the parsing - do_iostatdisk_rrd() - is
never called anywhere. So there's a fair bit of work required within Xymon
to get it to work. I'd suggest we get the client side going, then write
a server-side ext script to emulate the parsing code (feeding into a trends
message for RRD), and then start work on core support for iostatdisk within
xymond.
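
As a rough sketch of the server-side ext script idea: the shell functions
below pull wrqm/s and %util out of the last report in "iostat -x 300 2"
output, finding the columns by header name because their positions differ
between sysstat versions. The function names, the linuxio.<device>.rrd file
name and the DS names are made up for illustration (deliberately not the
existing iostat ones), and the [file.rrd] plus DS-definition layout is only
my understanding of what a trends data message looks like, so check it
against your Xymon version before relying on it.

```shell
# parse_iostat: read `iostat -x INTERVAL 2` output on stdin and print
# "device wrqm/s %util" for the LAST report only (the first report is
# the since-boot average, which is not what we want to graph).
parse_iostat() {
    awk '
        $1 ~ /^Device/ {
            # New report: drop the previous one and re-map columns by name
            n = 0; wcol = 0; ucol = 0; in_dev = 1
            for (i = 1; i <= NF; i++) {
                if ($i == "wrqm/s") wcol = i
                if ($i == "%util")  ucol = i
            }
            next
        }
        NF == 0 { in_dev = 0; next }     # blank line ends the device table
        in_dev && wcol && ucol {
            n++; dev[n] = $1; wrqm[n] = $wcol; util[n] = $ucol
        }
        END {
            for (i = 1; i <= n; i++) print dev[i], wrqm[i], util[i]
        }
    '
}

# to_trends: turn "device wrqm util" lines into RRD sections.
# ASSUMPTION: file name and DS names are hypothetical, and the layout is
# what I believe a trends data message expects.
to_trends() {
    while read -r dev wrqm util; do
        printf '[linuxio.%s.rrd]\n' "$dev"
        printf 'DS:wrqm:GAUGE:600:0:U %s\n' "$wrqm"
        printf 'DS:util:GAUGE:600:0:100 %s\n' "$util"
    done
}
```

An ext script would feed the [iostatdisk] section of the client data into
parse_iostat | to_trends and send the result to the Xymon server as a data
message for that host.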
It's probably a bit more complicated than that. Henrik may have a vision
for universal support of I/O statistics which may be incompatible with what
I'm proposing. We would also probably want to maintain compatibility with
the existing [iostat] graph.cfg definition (the only one that uses the
iostat/iostatdisk results), which means creating RRD files that are
consistent with the DS names and purposes already in use. We may then
find that the metrics we want to graph are inconsistent with those already
defined for the existing Solaris case. Finally, we'd need to define a new
graph to show the numbers you're interested in, because the [iostat] graph
only shows active/wait service times and %busy. I think %busy is analogous
to %util.
Implementing this kind of thing in such a way that it supports the
majority of OSes, without too much effort, and without significant
conflicts, is quite a challenge. I suspect that's the reason we don't have
anything in the way of I/O usage in Xymon. I've often wondered if using
"sar" is a better way to go, because the output is more (but not
completely) consistent across platforms, and so the parsing code would be
simpler and smaller. Sar is now available on more OSes than ever before,
so we're more likely to see support from hosts we monitor. Clients would
just run a few standard "sar" commands to create client data sections (e.g.
[sar-d], [sar-b], or even [sar-A] for all available output) and Xymon would
implement a small handful of standardised "sar" parsers. Just an idea.
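
To make the client side of the "sar" idea concrete, here is a minimal
sketch of a helper that wraps any command's output in a client data
section. The emit_section name is hypothetical, and the intervals in the
commented examples are arbitrary; a real client would more likely run the
sampling in the background, as the iostat fragment above does.

```shell
# emit_section NAME CMD [ARGS...]
# Print a client data section header, then the output of the command
# (stderr included, so a missing tool shows up in the client data).
emit_section() {
    name=$1
    shift
    echo "[$name]"
    "$@" 2>&1
}

# Hypothetical use in a client script, if sar is installed:
#   emit_section sar-d sar -d 1 1
#   emit_section sar-b sar -b 1 1
```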
J