top ten list of servers wrt cpu load

4 messages in this thread

list Steve Holmes · Mon, 18 Mar 2013 15:55:55 -0400 ·

Before I go inventing something I want to find out if anyone already has
done this.

We have a lot of virtual linux hosts (VMs on an ESX farm). We monitor all
of them with Xymon. When there is a widespread problem (as there was this
past weekend) the virtualization team would like to have a report on which
VMs top the list of, for example, CPU load from Xymon historical data. (Yes,
there are ESX-based tools, but they have not spent the $$ to put them on
all of the servers.) I pointed the team manager to the metrics report in
Xymon and he was impressed, but doesn't want to have to look at a graph
containing plots for a few hundred hosts to find the top 10.

So, I'm looking at writing a script to mine the rrd or history data from
the Xymon server to produce the list he wants. He is also interested in the
top disk I/O numbers, too, but I'm focusing on load average for now.

He says he just wants an average for each host over the 48 hours of the
weekend, which is when we usually see problems.

Has anyone done this or something like it? I don't see anything built into
Xymon that gets close, so I was looking at rrdtool fetch. However, this is
cumbersome and, frankly, I'm not understanding the data I'm getting back
(for example, 1.22749483e+03 seems to be 12.27... when I compare it to the
graphs, so the e+03 seems to really mean *10^1, right?)

But I ramble. Thanks for any help.

Steve Holmes
Purdue

list Jeremy Laidman · Tue, 19 Mar 2013 09:27:34 +1100 ·

▸ quoted from Steve Holmes

On 19 March 2013 06:55, Steve Holmes <user-ec1bf77b1b44@xymon.invalid> wrote:

So, I'm looking at writing a script to mine the rrd or history data from
the Xymon server to produce the list he wants. He is also interested in the
top disk I/O numbers, too, but I'm focusing on load average for now.

Sounds useful.  I've not seen anything that does this already.

▸ quoted from Steve Holmes

close so I was looking at rrdtool fetch. However, this is cumbersome and,
frankly I'm not understanding the data I'm getting back (for example
1.22749483e+03 seems to be 12.27... when I compare it to the graphs, so the
e+03 seems to really mean *10^1, right?)

Nope, 1.227e+03 means 1227.  However, sometimes an adjustment is applied
that's not always obvious.  For example, my understanding is that
the load average (in la.rrd) is recorded after multiplying by 100, which is
an artefact of the BigBrother legacy, because floating-point comparisons
were difficult to implement in a generic shell script that had to run on
any *nix platform.  The BigBrother data collector would chop everything
after two decimal places, then strip the dot out, thus providing a load
average factored up by 100.  You can tell this is what's happening in Xymon
by looking at the [la] entry in graphs.cfg, or to save you looking it up:

        DEF:avg=la.rrd:la:AVERAGE
        CDEF:la=avg,100,/

So the graphs.cfg entry scales it back down before graphing.
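To make the scaling concrete, here is a minimal Python sketch that reads text in the layout `rrdtool fetch` prints (a header naming the data sources, a blank line, then `timestamp: value` rows) and divides by 100 the way the CDEF above does. The function names and the sample rows are made up for illustration; only the 1.22749483e+03 value comes from this thread.

```python
import math

# la.rrd holds load average * 100, per the [la] CDEF in graphs.cfg.
LA_SCALE = 100.0


def parse_fetch_output(text):
    """Turn `rrdtool fetch` text output into (timestamp, value) pairs.

    Only the first data-source column is read. Rows whose value is NaN
    (no data recorded for that interval) are skipped.
    """
    samples = []
    for line in text.splitlines():
        stamp, sep, rest = line.partition(":")
        if not sep or not stamp.strip().isdigit():
            continue  # header line naming the data sources, or a blank line
        fields = rest.split()
        if not fields:
            continue
        value = float(fields[0])  # "1.2274948300e+03" parses as 1227.49483
        if math.isnan(value):
            continue
        samples.append((int(stamp), value))
    return samples


def to_load_average(raw):
    """Undo the *100 scaling so the number matches the Xymon graph."""
    return raw / LA_SCALE


# Illustrative sample only; the timestamps and the last row are invented.
SAMPLE = """\
                             la

1363636800: 1.2274948300e+03
1363637100: nan
1363637400: 8.5000000000e+01
"""

for stamp, raw in parse_fetch_output(SAMPLE):
    print(stamp, round(to_load_average(raw), 2))
```

So the e+03 is ordinary scientific notation, and the extra factor of 100 is what makes 1227.49 in the file show up as 12.27 on the graph.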

Similar adjustments are made for things like interface load and TCP/IP
stats, where bytes-per-second are converted to bits-per-second.  Again, the
graphs.cfg file often gives you a clue as to what's going on.
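The CDEF expressions in graphs.cfg are written in rrdtool's RPN (reverse Polish) notation, so `avg,100,/` reads as "push avg, push 100, divide". The sketch below is a toy evaluator for that style of expression, handling only the four basic arithmetic operators (rrdtool itself supports many more). The `octets,8,*` expression is a hypothetical bytes-to-bits example, not a line copied from graphs.cfg.

```python
import operator

# Only the four basic arithmetic operators; real rrdtool RPN has many more.
_OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def eval_rpn(expr, variables):
    """Evaluate a comma-separated RPN expression such as 'avg,100,/'.

    `variables` maps names used in the expression to numeric values.
    """
    stack = []
    for token in expr.split(","):
        if token in _OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(_OPS[token](left, right))
        elif token in variables:
            stack.append(float(variables[token]))
        else:
            stack.append(float(token))  # a numeric literal
    if len(stack) != 1:
        raise ValueError("malformed RPN expression: " + expr)
    return stack[0]


# The [la] scaling from graphs.cfg, applied to the value from this thread.
print(eval_rpn("avg,100,/", {"avg": 1227.49483}))

# Hypothetical bytes-per-second to bits-per-second conversion.
print(eval_rpn("octets,8,*", {"octets": 125.0}))
```

Reading the CDEF for a given graph this way tells you which factor to apply to the raw rrd values before comparing them with what the graph shows.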

J

list Steve Holmes · Mon, 18 Mar 2013 21:39:33 -0400 ·


Wherever you go, there you are.

Ah, yes, of course!
Thanks
Steve.

list Steve Holmes · Tue, 26 Mar 2013 11:18:11 -0400 ·

Attached is the Perl script I came up with. It serves my purpose and might
be useful to someone. $fudge will have to be expanded for other measures. I
may do that if someone here at Purdue requires it. Otherwise have at it.

Steve Holmes
Purdue
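The attached Perl script is not reproduced in this archive, so the following is not that script. It is a rough Python sketch of the same idea under stated assumptions: raw samples per host have already been pulled for the 48-hour window (for example with rrdtool fetch), NaN marks intervals with no data, and a `FUDGE` table plays the role of `$fudge`. Only the `la` divisor of 100 comes from this thread; the host names and values are invented.

```python
import heapq
import math

# Divisor per measure, analogous to $fudge. Only "la" is taken from the
# thread; other measures would need entries derived from graphs.cfg.
FUDGE = {"la": 100.0}


def window_average(samples, divisor=1.0):
    """Mean of the non-NaN samples, scaled down by `divisor`.

    Returns None when the window holds no usable data.
    """
    valid = [v for v in samples if not math.isnan(v)]
    if not valid:
        return None
    return sum(valid) / len(valid) / divisor


def top_hosts(raw_by_host, metric="la", n=10):
    """Return up to `n` (host, average) pairs, highest average first."""
    divisor = FUDGE.get(metric, 1.0)
    averages = {}
    for host, samples in raw_by_host.items():
        avg = window_average(samples, divisor)
        if avg is not None:
            averages[host] = avg
    return heapq.nlargest(n, averages.items(), key=lambda item: item[1])


# Invented sample data: raw values as stored in each host's la.rrd.
raw = {
    "vm-a": [1227.49483, 1100.0],
    "vm-b": [85.0, float("nan"), 95.0],
    "vm-c": [float("nan")],
    "vm-d": [300.0, 500.0],
}

for host, avg in top_hosts(raw, n=2):
    print(f"{host:10s} {avg:6.2f}")
```

Hosts with no data in the window are left out rather than reported as zero, so a VM that was down all weekend does not appear at the bottom of the list as if it were idle.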

Attachments (1)

attachment.obj application/octet-stream · 5.7 KB
